key: cord-0214189-xuf4hjwt
authors: Xia, Fei; Li, Bin; Weng, Yixuan; He, Shizhu; Liu, Kang; Sun, Bin; Li, Shutao; Zhao, Jun
title: LingYi: Medical Conversational Question Answering System based on Multi-modal Knowledge Graphs
date: 2022-04-20
journal: nan
DOI: nan
sha: 0bac63c484a9d78964f7c5a06a513e59f32cc2f5
doc_id: 214189
cord_uid: xuf4hjwt

The medical conversational system can relieve the burden on doctors and improve the efficiency of healthcare, especially during the pandemic. This paper presents a medical conversational question answering (CQA) system based on a multi-modal knowledge graph, namely "LingYi", which is designed as a pipeline framework to maintain high flexibility. Our system covers automated medical procedures including medical triage, consultation, image-text drug recommendation, and medical record generation. To conduct knowledge-grounded dialogues with patients, we first construct a Chinese Medical Multi-Modal Knowledge Graph (CM3KG) and collect a large-scale Chinese Medical CQA (CMCQA) dataset. Compared with other existing medical question-answering systems, our system adopts several state-of-the-art technologies, including medical entity disambiguation and medical dialogue generation, which makes it friendlier in providing medical services to patients. In addition, we have open-sourced our code, which contains the back-end models and front-end web pages, at https://github.com/WENGSYX/LingYi. The datasets, including CM3KG at https://github.com/WENGSYX/CM3KG and CMCQA at https://github.com/WENGSYX/CMCQA, are also released to further promote future research.

Conversational question answering (CQA) is an emerging research topic; it is the natural evolution of the traditional question answering (QA) paradigm (Gao et al., 2018; Ghazarian et al., 2021), allowing more natural conversational interaction between patients and systems (Zaib et al., 2021). CQA can improve the patient experience by offering conversational interaction (Zhou et al., 2021). It can be applied to many scenarios such as the electric power business (Meng et al., 2021), medical healthcare, and personal assistants (Ugurlu et al., 2020). With the COVID-19 pandemic, building medical CQA systems has become significant: they help improve the efficiency of medical services and reduce the burden on doctors, with broad application prospects (Palanica et al., 2019). Recently, related medical service applications have become more and more popular, such as automatic diagnosis (Wang et al., 2021b; Moreira et al., 2019), symptom identification (Zheng et al., 2017), and medical image recognition (C et al., 2021). Most of these applications are equipped with external medical knowledge bases (KBs) for QA services and are required to respond with manually designed text templates. Constructing such templates takes substantial labor, and the fixed-template responses sometimes lead to a poor patient experience. How to design a CQA system with external knowledge therefore remains a great challenge.

In this paper, we present a medical conversational question answering system with a multi-modal knowledge graph, namely LingYi, which is designed in a pipeline manner for high flexibility. As shown in Figure 1, our system involves three main processes: before diagnosis, during consultation, and after diagnosis. The before-diagnosis phase consists of the triage module and entity recognition and disambiguation. The consultation phase includes the Central Record Memory (CRM), the symptom selection algorithm, and prompt-and-prefix generation. The after-diagnosis phase focuses on image-text recommendation and medical record summarization.
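To make the three-phase pipeline concrete, the sketch below shows one way such an orchestration could look in Python. All module names, method signatures, and state fields here are hypothetical illustrations rather than the actual LingYi code (which lives in the linked repository and may differ).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ConsultationState:
    """Hypothetical per-patient state shared across the three phases."""
    department: str = ""
    entities: Dict[str, dict] = field(default_factory=dict)       # entity -> KG attributes
    history: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, utterance) turns


class LingYiPipeline:
    """Illustrative orchestration of the before/during/after-diagnosis phases."""

    def __init__(self, triage, recognizer, disambiguator, crm,
                 generator, recommender, summarizer):
        self.triage, self.recognizer, self.disambiguator = triage, recognizer, disambiguator
        self.crm, self.generator = crm, generator
        self.recommender, self.summarizer = recommender, summarizer

    def before_diagnosis(self, state: ConsultationState, utterance: str) -> None:
        # Triage to a department, then recognize and disambiguate medical entities.
        state.department = self.triage.classify(utterance)
        mentions = self.recognizer.extract(utterance)
        state.entities.update(self.disambiguator.link(mentions))

    def during_consultation(self, state: ConsultationState, utterance: str) -> str:
        # Store entities in the CRM, reason over the knowledge graph, generate a reply.
        state.history.append(("patient", utterance))
        self.crm.store(state.entities)
        reasoned = self.crm.reason()  # symptom selection + entity knowledge reasoning
        reply = self.generator.generate(state.history, reasoned)
        state.history.append(("system", reply))
        return reply

    def after_diagnosis(self, state: ConsultationState):
        # Image-text drug recommendation and medical record summarization.
        drugs = self.recommender.recommend(self.crm.entities())
        record = self.summarizer.summarize(state.history, self.crm.entities())
        return drugs, record
```

Keeping each phase behind a thin orchestration layer is what makes such a pipeline flexible: any single component (e.g., the response generator) can be swapped out without touching the others.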
In summary, LingYi provides patients with natural conversational QA services, which is relatively rare (Wang et al., 2021b) but more meaningful (Liu et al., 2020a). LingYi has the following highlights:
1. LingYi implements automated medical procedures in a pipelined manner, providing patients with a friendlier and more helpful medical service through natural conversational QA.
2. Multiple functions, such as medical triage, consultation, image-text drug recommendation, and medical record generation, are integrated into LingYi. In addition, our framework combines the first Chinese medical multi-modal knowledge graph, CM3KG, with the large-scale medical conversation dataset CMCQA.
3. LingYi integrates advanced technologies such as entity disambiguation and medical response generation. It is competitive with other state-of-the-art (SOTA) systems in both automatic and manual evaluations.

LingYi aims at providing automated medical services for the majority of patients. Preliminary experiments have been performed at the Xiangya Hospital of Central South University (Changsha, China), which demonstrate the research prospects and practical applicability of the proposed system.

Figure 2 gives an overview of the main framework of our proposed system. The whole process is as follows:
1) LingYi's input is the conversational QA history, and the patient's entities are obtained through entity recognition and entity disambiguation.
2) The patient's entities are then sent to the Central Record Memory (CRM) module for storage.
3) The dynamic symptom selection algorithm and entity knowledge reasoning are used to obtain inference results and update the CRM.
4) The reasoned entities from the CRM and the conversational QA history are combined through a prompt (Jin et al., 2022) for response generation.
5) Finally, the response is generated by a bidirectional encoder and a prefix-based (Li and Liang, 2021) autoregressive decoder.

In this section, we introduce the entity disambiguation module, which consists of named entity recognition and contrastive pre-training. For named entity recognition, we implement the method of (Sarker et al., 2019) to recognize the medical entities in the utterance. We achieve an accuracy of 90.9% on the simple medical entity recognition dataset released by iFLYTEK, ranking in the top 3 of the competition. After obtaining the entities, entity disambiguation with contrastive learning is performed, as shown in Figure 3. We introduce a contrastive pre-training framework with SMedBERT (Zhang et al., 2021a) for medical entity disambiguation, which is our champion scheme in SDU@AAAI-22 Shared Task 2: Acronym Disambiguation (https://sites.google.com/view/sdu-aaai22/home). Specifically, we design a contrastive pre-training method that enhances the model's generalization ability by learning medical phrase-level contrastive distributions between the true meaning and ambiguous phrases. During pre-training, we mask the medical entities in the student model's input, then pull the student model's output closer to the meaning representation of the teacher model and push it away from other unrelated medical entities. Both models are initialized with the same parameters, and the parameters of the teacher model are frozen. For masking these medical entities, we adopt the expert medical dictionary THUOCL (Han et al., 2016).
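The teacher-student contrastive objective described above can be approximated with an InfoNCE-style loss. The snippet below is a minimal PyTorch sketch under our own assumptions (the function name, temperature, and negative-sampling scheme are not specified here), not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_pretraining_loss(student_vec: torch.Tensor,
                                 teacher_vec: torch.Tensor,
                                 negative_vecs: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Pull the student's representation of a masked medical entity toward the
    frozen teacher's representation of its true meaning, and push it away from
    unrelated medical entities (e.g., other THUOCL entries used as negatives)."""
    student = F.normalize(student_vec, dim=-1)      # (batch, dim), from masked input
    positive = F.normalize(teacher_vec, dim=-1)     # (batch, dim), teacher is frozen
    negatives = F.normalize(negative_vecs, dim=-1)  # (batch, num_neg, dim)

    pos_sim = (student * positive).sum(dim=-1, keepdim=True)   # (batch, 1)
    neg_sim = torch.einsum("bd,bnd->bn", student, negatives)   # (batch, num_neg)

    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # the positive is always at index 0
```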
After the entities are obtained, we adopt the pre-trained model to match the recognized entities against the medical entities in the knowledge graph. Finally, we map the ambiguous phrases to entities in the knowledge graph.

The Central Record Memory (CRM) has storage and reasoning functions, and its data is mainly organized as a dictionary of entity triples. First, the CRM maps the medical entities obtained from the disambiguation module to specific attributes in the knowledge graph and stores the past entities in the dictionary. In the next round of dialogue, the CRM not only maps the entities of the current round onto the graph but also updates the current state: the information of the current-round entities is appended to the dictionary. After that, the CRM sends the entities obtained by knowledge graph inference to the generation module for sentence generation.

To reduce redundant rounds of questioning as much as possible while ensuring an accurate diagnosis of the disease, we design a symptom selection algorithm based on dynamic programming, which solves the optimization problem without recursively solving all sub-problems in turn and thus avoids unnecessary computation. As shown in Algorithm 1, we regard each round of consultation as a sub-problem to be decided; that is, we only need to select the symptom that, in the current state, can rule out the most diseases at once. We traverse all the diseases in the knowledge graph, and if the intersection between a disease's symptoms and the patient's symptoms is non-empty, the disease is added to the list of suspected diseases. Once the symptoms of all suspected diseases are counted, the most frequently occurring symptom is selected as the next symptom to ask about (a toy sketch of this selection step is given later, after the CM3KG description). When disease reasoning is required, the CRM runs this algorithm until the patient's disease is finally confirmed.

For entity knowledge reasoning, LingYi strictly follows the actual consultation process (Ha and Longnecker, 2010): the first stage is symptom reasoning, the second stage is examination reasoning, and the third stage is drug reasoning. Our system first conducts repeated symptom consultation with the patient to ensure that complete patient symptom entities are sent to the CRM. After this, LingYi synthesizes all symptom entities from the CRM and reasons over the related entities in our CM3KG to obtain the patient's medical examination.

Table 1: Statistics of medical conversation datasets.
Dataset | Department | Entity | Symptom | Dialogue
COVID-19-CN | COVID-19 | / | / | /
MedDG (Liu et al., 2020b) | Gastroenterology | 160 | 12 | 17864
Chunyu (Lin et al., 2020) | / | 5682 | 15 | 12842
MedDialog-CN | 29 Departments | / | 172 | 3407494
M2MedDialog | 40 Departments | 4728 | 843 | 95408
CMCQA (Ours) | 45 Departments | 33615 | 8808 | 1294753

Finally, if the patient goes on to ask about drugs for treating the disease, LingYi performs drug recommendation with corresponding images based on CM3KG, so that the patient can find the right suggestions. To avoid misdiagnosis, in the symptom reasoning stage the LingYi system keeps asking the patient about symptoms until the patient's disease is confirmed.

We adopt entity prompt learning for training and prediction (Liu et al., 2021a). More precisely, we append the reasoned entities to the conversational QA history, forming a prompt for response generation.
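As a rough illustration of how the reasoned entities from the CRM and the conversational QA history might be fused into a single model input, consider the sketch below. The template wording, slot names, and example values are hypothetical; the exact prompt and prefix formats used by LingYi are not reproduced here.

```python
from typing import Dict, List, Tuple


def build_prompt(history: List[Tuple[str, str]], reasoned: Dict[str, List[str]]) -> str:
    """Fuse the conversational QA history (context) with the entities reasoned
    from the CRM (knowledge) into one input string for the generator; a
    stage-specific prefix can additionally steer the autoregressive decoder."""
    dialogue = " ".join(f"[{speaker}] {text}" for speaker, text in history)
    knowledge = " ; ".join(f"{slot}: {', '.join(values)}" for slot, values in reasoned.items())
    return f"knowledge: {knowledge} context: {dialogue} response:"


# Hypothetical usage mirroring the symptom-reasoning stage.
history = [("patient", "I have a stomachache and diarrhea.")]
reasoned = {"symptom": ["gastralgia", "diarrhea"], "suspected disease": ["gastritis"]}
print(build_prompt(history, reasoned))
```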
Moreover, we design prefix templates for autoregressive decoding. Specifically, we manually design templates for the different reasoning processes described in Section 2.4 to further increase the controllability of the generated responses. In this way, the prompt-and-prefix method fuses the context information with the reasoned entities from the CRM. As a result, the generated response is conditioned on the prompt and prefix, which improves the factual accuracy and controllability of the model.

Our LingYi system also covers other functional modules, including the triage, image-text drug recommendation, and medical record modules, which are introduced in turn. We implement the triage function with the SMedBERT (Zhang et al., 2021a) model, fine-tuned on the medical entity triage data provided by iFLYTEK. The resulting model achieves an F1 value of 90.37% for medical triage classification, and we apply it in our medical triage module. We utilize the CM3KG dataset and implement medical knowledge entity reasoning, linking drug entities to corresponding images to achieve image-text drug recommendation, which helps patients find drug information more easily. The last module is the medical record module. To make follow-up treatment more convenient and faster for patients, LingYi writes a medical record from the whole conversation after the patient finishes the consultation. Specifically, we process the unstructured conversation history with the CPT (Shao et al., 2021) model to generate a key summary of the patient's condition from the consultation. At the same time, this module also processes the structured information stored in the Central Record Memory, such as the department, examinations, drugs, and other information. Finally, the two kinds of information are concatenated in post-processing to generate the patient's medical record.

[Table residue: evaluation scores for baselines including HERD-Entity (Liu et al., 2020b), BertGPT-Entity, and CPM2-prompt (Zhang et al., 2021b); the column headers and some values are not recoverable.]

3 Experimental Details

CMCQA is a large conversational question-and-answer dataset for the Chinese medical field; the statistics of medical conversation datasets are shown in Table 1. It is collected from the Chinese medical conversational question answering website ChunYu and contains medical conversational material from 45 departments, such as andrology, stomatology, and gynaecology and obstetrics. Specifically, CMCQA has 1.3 million complete sessions, 19.83 million utterances, and 0.65 billion tokens. We fully open-source the data to promote the development of conversational question answering in the medical field.

CM3KG is an open-sourced multi-modal knowledge graph. We process the data crawled from the web and organize it into tables. For example, for the symptom of stomachache, the "disease" attributes include "gastritis", "gastric cancer", "gastric ulcer", and other diseases, while the "examination" attributes include "gastroscopy", "pathological biopsy of gastric mucosa", etc. We then search the Bing image database and link images to the entities in the knowledge graph. After construction, the authors manually correct the graph, eliminating about 20% of obviously erroneous information, and then submit it to expert doctors for final verification to ensure the accuracy of the multi-modal knowledge graph.
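To illustrate both the tabular organization of CM3KG described above and the greedy symptom selection used in the consultation module (Algorithm 1), here is a toy, self-contained sketch. The disease-to-symptom layout and all entity names are simplified assumptions, not the released data format.

```python
from typing import Dict, Optional, Set

# Toy fragment of the knowledge graph viewed as disease -> symptom set (simplified
# assumption; the released CM3KG also stores examination, drug, and image attributes).
KG: Dict[str, Set[str]] = {
    "gastritis": {"stomachache", "nausea", "acid reflux"},
    "gastric ulcer": {"stomachache", "night pain", "acid reflux"},
    "gastroenteritis": {"stomachache", "diarrhea", "fever"},
}


def next_symptom_to_ask(confirmed: Set[str]) -> Optional[str]:
    """Greedy selection: among diseases whose symptom sets intersect the patient's
    confirmed symptoms, count the still-unconfirmed symptoms and ask about the most
    frequent one, i.e. the question expected to rule out the most candidates."""
    suspected = [d for d, syms in KG.items() if syms & confirmed]
    counts: Dict[str, int] = {}
    for disease in suspected:
        for sym in KG[disease] - confirmed:
            counts[sym] = counts.get(sym, 0) + 1
    return max(counts, key=counts.get) if counts else None


print(next_symptom_to_ask({"stomachache"}))  # -> "acid reflux" in this toy graph
```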
We train the models with PyTorch (Paszke et al., 2019) and the Hugging Face (Wolf et al., 2020) framework. All the fine-tuned models are trained on the collected medical corpus. During training, we employ the AdamW optimizer (Loshchilov and Hutter, 2017), with the learning rate set to 1e-5 and warm-up (He et al., 2016). Four 3090 GPUs are used for all experiments.

To ensure correct medical entity information and fluent responses, we provide both automatic and manual evaluations. Specifically, we conduct experiments on the entity disambiguation dataset of SDU@AAAI 2021 and the medical dialogue generation dataset of CCKS. We adopt F1, BLEU (Papineni et al., 2002), and Dist. as evaluation metrics. The F1 score reflects the correctness of the medical entity knowledge, the BLEU score reflects the relevance of the generated responses, and the Dist. score represents the diversity of the generated sentences. For manual evaluation, we randomly pick 100 cases from the test dataset. Each generated sentence is scored by three independent annotators with a medical background. We adopt the same human evaluation metrics as the work (Liu et al., 2020b). The rating scale for each metric ranges from 1 to 5, where 1 represents the worst and 5 the best.

The experimental results are shown in Table 3: our method achieves the best results compared with other SOTA entity disambiguation methods. We also conduct evaluations on the dialogue generation test dataset, shown in Table 4; our method outperforms recent strong baselines and leads in accuracy, relevance, and diversity. We further provide a manual evaluation to compare the performance of the different methods. As shown in Table 5, our method is competitive with other SOTA methods in human evaluation. It should be noted that there is still a long way to go from the generated responses to real human responses. Moreover, the average pairwise Cohen's kappa (Randolph, 2005) scores between annotators range between 0.4 and 0.6 for all metrics, which represents moderate annotator agreement.
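For reference, the Dist. (distinct-n) diversity score used in the automatic evaluation can be computed roughly as follows. This is a minimal sketch; the tokenization and any per-sentence averaging used in the paper's evaluation may differ.

```python
from typing import List


def distinct_n(responses: List[List[str]], n: int = 2) -> float:
    """Distinct-n: ratio of unique n-grams to total n-grams over all generated
    responses; higher values indicate more diverse output."""
    total, unique = 0, set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Hypothetical usage on two tokenized responses.
print(distinct_n([["drink", "more", "water"], ["drink", "more", "tea"]], n=2))  # 0.75
```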
We present the application of LingYi on our website, with a snapshot shown in Figure 4. Figure 4 shows that if the patient says his stomach hurts, the system obtains the entity "gastralgia" from the entity disambiguation module. Afterwards, it obtains the entity "gastritis" from the knowledge graph through entity knowledge reasoning. The reasoned entity is sent to the generation module, which then recommends that the patient go to the hospital for diagnosis. Finally, if the patient urgently needs medication, the system recommends a proper drug through the knowledge graph. A medical record is generated after the consultation, which significantly facilitates the patient's follow-up treatment.

[Figure 4 example (excerpt): the patient reports a stomachache and diarrhea; LingYi attributes the symptoms to possible gastrointestinal dysfunction, advises a light diet, avoiding cold and spicy food, keeping the abdomen warm, taking probiotics, and buying a suitable over-the-counter drug, and finally generates a medical record: (1) chief complaint: diarrhea; (2) history of present illness: diarrhea and abdominal pain; (3) auxiliary examination: temporarily absent; (4) past history: unknown.]

In this paper, we introduce a conversational medical QA system, LingYi, which integrates the multi-modal knowledge graph CM3KG and the medical conversation dataset CMCQA. We design a pipeline to analyze patients' statements and achieve SOTA performance on both entity disambiguation and response generation tasks, providing patients with a full range of medical consulting services. We hope our system will alleviate the problem of worldwide medical resource scarcity during COVID-19 and provide a feasible direction for subsequent researchers to develop medical artificial intelligence.

References
Identification of malnutrition and prediction of BMI from
Neural approaches to conversational AI
DiSCoL: Toward engaging dialogue systems through conversational line guided response generation
Doctor-patient communication: a review
THUOCL: Tsinghua open Chinese lexicon
Deep residual learning for image recognition
Jiajun Zhang, and Chengqing Zong. 2022. Instance-aware prompt learning for language understanding and generation
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
SimCLAD: A simple framework for contrastive learning of acronym disambiguation
Prefix-tuning: Optimizing continuous prompts for generation
Graph-evolving meta-learning for low-resource medical dialogue generation
Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing
A review of medical artificial intelligence
Heterogeneous graph reasoning for knowledge-grounded medical dialogue system
MedDG: A large-scale medical consultation dataset for building medical dialogue system
RoBERTa: A robustly optimized BERT pretraining approach
Fixing weight decay regularization in Adam
Research on short text similarity calculation method for power intelligent question answering
Postpartum depression prediction through pregnancy data analysis for emotion-aware smart systems
Physicians' perceptions of chatbots in health care: Cross-sectional web-based survey
BERT-based acronym disambiguation with multiple training strategies
BLEU: a method for automatic evaluation of machine translation
PyTorch: An imperative style, high-performance deep learning library
Free-marginal multirater kappa (multirater K [free]): An alternative to Fleiss' fixed-marginal multirater kappa
An interpretable natural language processing system for written medical examination assessment
CPT: A pre-trained unbalanced transformer for both Chinese language understanding and generation
A smart virtual assistant answering questions about COVID-19
Pre-trained language models in biomedical domain: A systematic survey
COVID-19 literature knowledge graph construction and drug repurposing report generation
AD-BCMM: Acronym disambiguation by building counterfactuals and multilingual mixing
Transformers: State-of-the-art natural language processing
On the generation of medical dialogues for COVID-19
Conversational question answering: A survey
MedDialog: Large-scale medical dialogue datasets
SMedBERT: A knowledge-enhanced pre-trained language model with structured semantics for medical text mining
A machine learning-based framework to identify type 2 diabetes through electronic health records
Leveraging domain agnostic and specific knowledge for acronym disambiguation
CRSLab: An open-source toolkit for building conversational recommender system

Ethical Considerations

The constructed system aims to generate professional, fluent, and consistent medical responses. We also realize that, because the adopted pre-trained models learn from medical data collected from the Internet, the proposed approach may produce inappropriate text, such as offensive, racially or gender-sensitive responses. Meanwhile, although the proposed method covers the stages before, during, and after medical treatment, it may also be maliciously exploited, for example, to forge false medical reports. We have carefully considered the above issues and provide the following detailed explanations:
(1) All the medical data used is collected from the Internet, and it inevitably contains offensive, racially or gender-sensitive doctor-patient conversations. Due to limited space, we only briefly describe the characteristics and cleaning rules of the datasets; we delete the doctor-patient utterances that are offensive, racially, or gender-sensitive. The detailed process can be found in the README file at https://github.com/Wengsyx/LingYi.
(2) The quality of the processed datasets affects the credibility of the robustness evaluation. Compared with previous works, we adopt four types of criteria to evaluate the credibility of our system: offline metric evaluation (BLEU, Distinct, and F1), online patient evaluation, dialogue-round testing, and professional doctor evaluation. We hope to maximize the reliability and implementability of the system based on these evaluation benchmarks.
(3) LingYi is an advisory medical system that uses the knowledge graph to provide multi-modal medical responses. Our system may produce incorrect medical results; therefore, its responses are for reference only, and patients should not blindly act on them when seeking medical treatment.
(4) Our work does not contain identity information; the doctor responds only to the patient's condition, so it will not harm anyone and does not invade people's privacy.
(5) The medicines recommended by LingYi are over-the-counter medicines. For prescription drugs, patients need to consult their doctor for further confirmation before purchasing.
(6) Our system supports applications on different terminals.
In the future, we will adopt federated learning to capture the patient's condition more accurately and provide comprehensive protection, such that federated learning can provide privatized and personalized learning services for each patient. Finally, since the proposed method uses external knowledge graphs, the information sources of these knowledge graphs also suffer from issues such as risk and bias. Reducing these potential risks requires ongoing research.