Auto Response Generation in Online Medical Chat Services
Hadi Jahanshahi, Syed Kazmi, Mucahit Cevik
2021-04-26

Telehealth helps to facilitate access to medical professionals by enabling remote medical services for patients. These services have gradually become popular over the years with the advent of the necessary technological infrastructure. The benefits of telehealth have been even more apparent since the beginning of the COVID-19 crisis, as people have become less inclined to visit doctors in person during the pandemic. In this paper, we focus on facilitating the chat sessions between a doctor and a patient. We note that the quality and efficiency of the chat experience can be critical as the demand for telehealth services increases. Accordingly, we develop a smart auto-response generation mechanism for medical conversations that helps doctors respond to consultation requests efficiently, particularly during busy sessions. We explore over 900,000 anonymous, historical online messages between doctors and patients collected over nine months. We implement clustering algorithms to identify the most frequent responses by doctors and manually label the data accordingly. We then train machine learning algorithms using this preprocessed data to generate the responses. The considered algorithm has two steps: a filtering (i.e., triggering) model to filter out infeasible patient messages and a response generator to suggest the top-3 doctor responses for the messages that successfully pass the triggering phase. The method achieves a precision@3 of 83.28% and shows robustness to its parameters.

Online chat services have been used across various sectors for providing customer service, tech support, consultancy/advisory, sales support, and education. Compared to in-person and over-the-phone encounters, live chat provides the highest level of customer satisfaction [1]. As more people join online chat platforms, and with the use of smartphones and smartwatches as well as an increase in on-the-go communication, smart response generation has become an integral part of these platforms. Smart response suggestions have made businesses more productive as well. Since customer inquiries follow a predictable pattern, which is especially true for domain-specific businesses, smart replies allow for quick and accurate responses. The improved efficiency reduces customers' wait times and thereby results in service satisfaction. Smart response systems also enable employees to handle multiple chats simultaneously, and as a result, businesses can save on additional hiring costs as they grow. As healthcare moves towards online chat services, smart response systems play a prominent role in allowing smooth and effective doctor-patient interactions. According to the Association of American Medical Colleges (AAMC), the demand for physicians will exceed supply in the U.S. by 2032, leading to an approximate shortage of 46,900 to 121,900 full-time physicians [2]. Hawkins [3] reports that the average wait time for a physician appointment across 15 major metropolitan areas in the U.S. is 24.1 days, representing a 30% increase over 2014. Furthermore, the findings of Mehrotra et al. [4] suggest a 60% decline in the number of visits to ambulatory care practices, alongside rapid growth in telehealth usage during the COVID-19 pandemic.
Due to this high imbalance in the doctor-to-patient ratio and the increase in people's reluctance to visit doctors in person for various reasons (e.g., during the pandemic), telehealth has the potential to become an essential component of our daily lives. To facilitate patient-doctor e-conversations, we develop a smart response generation approach for an online doctor-patient chat service. We use historical, anonymous doctor-patient chats to develop a method applicable to any online doctor consultation service or app. There exist certain challenges regarding these types of datasets. First, in many cases, patients take multiple chat-turns to convey a message, and it needs to be manually determined which part of the chat must be used to match the corresponding doctor's reply. In addition, extensive data preprocessing is required to correct misspellings, punctuation misuse, and grammatical errors. In our response generation mechanism for medical chats, we consider various machine learning and deep learning models. Specifically, our algorithm has two steps: a triggering model to filter out infeasible patient messages and a response generator to suggest the top-3 doctor responses for the messages that successfully pass the triggering phase. We observe that response generation mechanisms benefit considerably from the high performance of deep learning models on natural language processing tasks at both phases.

The rest of the paper is organized as follows. In Section 2, we present the related literature and summarize our contributions with respect to previous studies. We define our problem and solution methodology for smart response generation in Section 3. Afterwards, we summarize our numerical results with the smart response generator using actual patient-doctor conversations in Section 4. The paper concludes with a summary, limitations, and future research suggestions in Section 5.

The effectiveness of smart response systems has made them popular in industries where user communication is deemed significant. The speed and convenience of simply selecting the most appropriate response make them suitable for high-volume and multitask settings, e.g., when an operator has to chat with multiple customers simultaneously. A diverse set of suggested options presents users with perspectives they might otherwise not have considered. The correct grammar and vocabulary in machine-generated responses enhance communication clarity and help users avoid confusion over the context of a message. These attributes can be crucial for businesses that rely on the speed and accuracy of information and, most importantly, for users who lack English proficiency. Additionally, smart reply systems mitigate risks associated with messaging while driving and some health concerns such as De Quervain's tenosynovitis [5]. Google's smart reply system for Gmail [6] serves as a means of convenience for its users. With an ever-increasing volume of emails exchanged along with the rise in smartphone use, generating responses on the go with a single tap of the screen can be very practical. One aspect of the end-user utility discussed in that work is the diversification of the suggested replies. To maximize usability, Google employs rule-based approaches to ensure that the responses contain diverse sentiments, i.e., covering both positive and negative intents.
The paper also suggests using a triggering mechanism to detect whether a reply needs to be generated beforehand, to avoid unnecessary computations and make the model scalable. Uber also devised a one-click chat model to address driver safety concerns associated with responding to customer texts while driving [7]. Their proposed algorithm detects only the intention of the user message and, using historical conversations, suggests the most likely replies. The replies are kept short to reduce the time spent reading, thereby maximizing safety and utility. Galke et al. [8] analyzed a similar problem of response suggestion where users of a digital library ask librarians for support regarding their search. They used information retrieval methods such as TF-IDF and word centroid distance instead of sequence-to-sequence models, noting that such algorithms are more accurate when the training data is limited.

While the models above are task-oriented and designed to accomplish industry-specific goals, they do not address user engagement issues and the motive to make conversations seem more natural. Microsoft XiaoIce used an empathetic computing module designed to understand users' emotions, intents, and opinions on a topic [9]. It can learn the user's preferences and generate interpersonal responses. Yan [10] proposed social chatbot models that serve the purpose of conversing with humans seamlessly and appropriately. Yan et al. [11] devised a conversation system involving two tasks: response ranking and next utterance suggestion. The response ranking aimed to rank responses from a set given a query, whereas the next utterance suggestion was to proactively suggest new content for a conversation about which users might not originally intend to talk. They used a novel Dual-LSTM Chain Model based on recurrent neural networks, allowing for simultaneous learning of the two tasks. Similarly, Yan and Zhao [12] designed a coupled context modeling framework for human-computer conversations, where responses are generated after performing relevance ranking using contextual information.

The studies mentioned above discuss the applicability of smart reply and other AI-enabled conversational models in various settings and domains. Smart reply models that are built specifically for online conversations have to adhere to distinct criteria. In online conversations, users may adopt words and sentence structures differently. The very intention of a user is often expressed and clarified over multiple chat-turns, and responses do not always immediately follow questions or inquiries. To overcome these challenges, Li et al. [13] extracted common sub-sequences in the chat data by pairwise dialogue comparisons, allowing the generative model to better capture common chat flows in the conversation. They then applied a hierarchical encoder to encode input information, where the turn-level RNN encodes the sequential word information while the session-level RNN encodes the sequential interaction information. Another challenge with regard to smart reply models built specifically for online conversational chats is scalability. Large-scale deployment of online smart reply models requires energy and resource efficiency. Kim et al. [14] presented the idea of using sentiment analysis to determine the underlying subject of a message, deciding between character vs. word vs. sentence level tokenization, and whether to limit queries to only nouns without affecting the quality of the model. Jain et al.
[15] discussed the idea of using conversational intelligence to reduce both the time and the number of messages exchanged in an online conversation. This includes presenting intelligent suggestions that would engage the user in a meaningful conversation and improve dialogue efficiency. Lastly, Lee et al. [16] proposed using human factors to enable smooth and accurate selection of the suggested replies.

Concerning doctor-patient conversation, there have been several studies in recent years to help doctors with artificial intelligence-based diagnostics and treatment recommendations [17, 18, 19, 20]. Nevertheless, to the best of our knowledge, there is no specific example of a smart response mechanism in the healthcare domain. Related studies focused on language models such as chatbots and not on real-time chat conversations. Oh et al. [21] proposed a chatbot for psychiatric counseling in mental healthcare services that uses emotional intelligence techniques to understand user emotions by incorporating conversational, voice, and video/facial expression data. In another study, Kowatsch et al. [22] analyzed the usage of a text-based healthcare chatbot for the intervention of childhood obesity. Their observations revealed a good attachment bond between the participants and the chatbot. As more emphasis is being placed on the quality of patient-physician communication [23], an AI-based communication model can facilitate direct and meaningful conversations. However, it is essential to consider the ethical issues related to AI-enabled care [24, 25] as well as the acceptability of AI-led chatbot services in the healthcare domain [26]. There is hesitancy to use this technology due to concerns about response accuracy, cyber-security, privacy, and lack of empathy.

Our study differs from the aforementioned works in the literature in various ways. For instance, messages exchanged on the Uber platform are typically short and average 4 to 5 words. The average length of messages exchanged on an online medical consultation service tends to be longer, e.g., 10 to 11 words on average, and can be up to 100 words, due to the necessity to clearly describe a certain medical condition. Similarly, the suggested responses created on Gmail are shorter. In terms of corpus size, Uber's and Google's general-purpose datasets might reach millions of instances, whereas a typical training dataset in our setting is substantially smaller (e.g., in the tens of thousands), especially for a start-up company or local clinics. This challenge adds to the complexity of our task, since the model must learn proper responses from a smaller dataset. Galke et al. [8] work with a domain-specific dataset consisting of 14k question-answer pairs and generate responses using retrieval-based methods. However, their model does not consider diversity and only generates one suggested response for every message. Our model prompts multiple responses with different semantic intents, resulting in better utility for the users. Zhou et al. [9], Yan [10], and Yan and Zhao [12] have developed models that are successful in making natural and human-like conversations. However, they are generic and not suitable for a domain-specific task such as ours, which should take into account the medical jargon. In this study, we develop an algorithm for smart response generation in online doctor-patient chats. Our analysis is aimed at addressing the challenge of generating smart responses in the medical domain with a limited and constantly evolving dataset
(e.g., due to the entry of new patients and diseases). We summarize the contributions of our study as follows.

• To the best of our knowledge, this is the first study to propose an auto-response suggestion mechanism for a medical chat service. As the conversations include medical jargon, we use medical word embeddings and retrain them on our large conversational corpus.
• Our method involves employing a novel clustering approach to create a canned response set for the doctors.
• Our detailed numerical study shows the effectiveness of the proposed methodology on a medical chat dataset. Moreover, our method demonstrates robustness to its parameters. Accordingly, our study provides an empirical analysis of smart auto-response generation mechanisms.
• The proposed method can be used to generate fast smart responses and can be easily integrated into the chat software.

AI-assisted tools have become increasingly prevalent in the medical domain over the years. As services such as appointment scheduling and doctor consultations move online, there is an increasing need for auto-reply generation methods to increase the overall system efficiency. However, as is the case in any domain-specific application, there are certain challenges in developing smart response mechanisms for online medical chat services. While particular challenges such as scalability and response quality have been addressed in previous works [6, 7], there has not been much focus on the speed of response generation and disorderly chat flows. We summarize these two issues within the context of a medical chat service as follows.

• Speed: As online chats between doctors and patients proceed at a rapid pace, the model must generate a response instantaneously (i.e., within a second) to be of practical use. This issue does not arise for systems that generate a reply in an offline setting.
• Disorderly chat flows: In a chat platform, a message may or may not be followed by an immediate response. There are instances where messages are exchanged over various turns with their order being completely random; a message may be replied to immediately or at a later turn. This issue does not apply to works that deal with email exchanges, as most emails are a direct response to the previous email, or they have a reply-to option to circumvent this challenge. However, the impact of disorderly chat flows might be magnified in doctor-patient chats, as the doctors, who might be overwhelmed by conversations, are typically slower than the patients.

We address both of these challenges through our comprehensive analysis. According to these challenges and the available data, we consider the following steps to construct our suggested response mechanism. In this study, we use a dataset obtained from Your Doctors Online (https://yourdoctors.online), an online application that connects patients with doctors. The dataset includes a collection of anonymized doctor-patient chats between October 6, 2019, and July 15, 2020. We extracted 38,135 patient-doctor conversations, consisting of 901,939 messages exchanged between them. Note that the in-depth exploratory data analysis, e.g., n-gram analysis, is excluded due to information sensitivity. Each chat between a patient and a doctor has two characteristics: the number of messages and the number of turns, i.e., the back-and-forth messages between them. The violin plot in the corresponding figure illustrates the distribution of these two characteristics. In the next step, we divide the chat into pairs of patient-doctor messages.
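For concreteness, one labeled pair produced by this step could be represented as in the sketch below. The field names and the example content are hypothetical, since the paper does not specify a storage schema; they merely illustrate what a single training record looks like.

```python
# Illustrative record for one labeled patient-doctor pair (hypothetical schema).
from dataclasses import dataclass

@dataclass
class PairedMessage:
    patient_message: str   # the (possibly multi-turn) patient text matched to the reply
    doctor_reply: str      # the doctor's response, later mapped to a canned-response cluster
    feasible: bool         # True if this pair should trigger a smart reply suggestion

# Invented example for illustration only.
example = PairedMessage(
    patient_message="I have had a mild headache and nausea since yesterday.",
    doctor_reply="How long have you had these symptoms?",
    feasible=True,
)
```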
Each paired message is manually labeled as "feasible" or "infeasible", indicating whether the pair should trigger a smart reply or not. We define the following filtering conditions to maintain high-quality messages.

• We remove any patient/doctor message that is longer than 200 words. That is, we choose not to trigger any responses for those long messages.
• As we face a chat environment, there is a plethora of idioms, abbreviations, and misspelled words. Therefore, we create a dictionary of abbreviations and replace each abbreviated word with its long form, e.g., "by the way" as a substitute for "BTW" or "do not know" in place of "dunno". Moreover, using the "pyspellchecker" package in Python, we generate a comprehensive dictionary of typos in the medical domain. This misspelling dictionary includes 30,295 words extracted from the chats and is used to clean the dataset. Nevertheless, not all misspelled words are retrievable: we are unable to suggest the proper replacement for some typos that are not similar to any words in the package's corpus.
• We keep stopwords in the dataset, as the final response needs to be grammatically correct; we examined our method with and without stopwords and found that keeping them enhances the quality of the replies. Similarly, we do not apply lemmatization because it is detrimental to syntactic comprehension.
• Other preprocessing steps include removing extra white spaces, deleting punctuation and URLs, converting all characters to lower case, and, finally, removing non-Unicode characters and line breaks.

As the response generation task requires labeled data, and considering that pairing patient and doctor messages is a tedious task, we select a portion of the data that captures the most significant characteristics of the desired output. Hence, we divide the work into two parts: first, we explore the similarity between doctors' messages and cluster them, and second, we find the patient pair for each doctor message in only the dense, frequent clusters. After data cleaning, we pinpoint the most frequent responses by doctors. However, we cannot do so by solely examining response occurrence, since many responses deliver the same message. For instance, "you're welcome", "happy to help", "no problem", and "my pleasure" are different possible answers to the same patient message. Therefore, we create semantic clusters of the responses and examine the total frequency of responses in each cluster. In other words, the model should only learn the messages most commonly sent to the patients. Figure 3 demonstrates the steps in the manual labeling process. As shown in Figure 3, we convert each textual message to a numeric vector through a weighted average of word embeddings: the TF-IDF value of each word serves as its weight, and its word embedding as its value. As there are many medical terms in the exchanged messages, we use the Wikipedia PubMed word embedding. This word2vec model is induced on a combination of the PubMed and PMC corpora, together with texts extracted from an English Wikipedia dump; therefore, it is suitable for both medical terms and daily language. We compared the performance of this word embedding with the GloVe embedding [27] and found that many terms missing from the GloVe vocabulary are covered by the Wikipedia PubMed embedding. Moreover, manual exploration of the generated responses points to a better performance of the Wikipedia PubMed embedding (i.e., an improvement of 3.4% in our case).
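A minimal sketch of this TF-IDF-weighted embedding average is given below. It assumes the Wikipedia PubMed vectors are loaded with gensim and the TF-IDF weights come from scikit-learn; the tool choices, the file name, and the 200-dimensional vector size are assumptions for illustration, not the authors' exact implementation.

```python
# Sketch: represent a message as the TF-IDF-weighted average of its word vectors.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical path to the pre-trained Wikipedia PubMed word2vec file (200-dimensional).
embedding = KeyedVectors.load_word2vec_format("wikipedia-pubmed-and-PMC-w2v.bin", binary=True)

def fit_tfidf(messages):
    """Fit TF-IDF weights on the preprocessed messages (stopwords are kept)."""
    vectorizer = TfidfVectorizer(lowercase=True)
    vectorizer.fit(messages)
    return vectorizer

def message_vector(message, vectorizer, dim=200):
    """Weighted average of word vectors, using the message's TF-IDF values as weights."""
    vocab = vectorizer.vocabulary_
    tfidf_row = vectorizer.transform([message])          # sparse row of TF-IDF values
    tokens = vectorizer.build_analyzer()(message)
    vectors, weights = [], []
    for tok in set(tokens):
        if tok in embedding and tok in vocab:
            w = tfidf_row[0, vocab[tok]]
            if w > 0:
                vectors.append(embedding[tok])
                weights.append(w)
    if not vectors:                                       # unseen text falls back to a zero vector
        return np.zeros(dim)
    return np.average(np.asarray(vectors), axis=0, weights=np.asarray(weights))
```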
After finding the proper embedding, we compute the weighted average word embedding for the doctors' messages and apply agglomerative clustering to the responses using cosine similarity. Next, using the average silhouette width, we found the optimal number of clusters to be 158. Among them, we chose the clusters whose densities are more than 80%. Unlike dense clusters, which include distinct message types, sparse clusters contain a high volume of irrelevant messages with little or no similarity; hence, we exclude the non-dense clusters.
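A minimal sketch of this clustering step is shown below, assuming scikit-learn's AgglomerativeClustering with cosine distance and silhouette_score for choosing the number of clusters. The candidate range of cluster counts and the function name are illustrative; the paper does not report the search range used to arrive at 158 clusters.

```python
# Sketch: agglomerative clustering of doctor-message vectors with cosine distance,
# selecting the number of clusters by average silhouette width.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster_doctor_responses(doctor_vectors, candidate_ks=range(50, 301, 10)):
    """doctor_vectors: (n_messages, embedding_dim) array of weighted-embedding vectors."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in candidate_ks:
        # In scikit-learn releases before 1.2 the `metric` argument is named `affinity`.
        model = AgglomerativeClustering(n_clusters=k, metric="cosine", linkage="average")
        labels = model.fit_predict(doctor_vectors)
        score = silhouette_score(doctor_vectors, labels, metric="cosine")
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```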
After obtaining the candidate clusters of doctor messages, we pair each doctor message with its related patient message. During the manual labeling process, we encounter some challenges in the chat context. First, not all messages are a response to their previous message. For instance, a doctor may give some information regarding possible drugs without being asked to, or in some cases, a response is too generic or too specific and cannot be considered feasible. In such cases, instead of finding the paired patient message, we mark them as "infeasible". Second, the message flow is not always in order. For instance, a patient may ask a question and then, in the following messages, give some additional information; the doctor then starts responding by asking something about the patient's first message. As the dataset does not include a "reply-to" option, we need to manually trace chats back to find the relevant patient message for each doctor reply. This is a cumbersome step in the labeling process that does not arise in previous works. Figure 4a and Figure 4b show examples of disorderly and correct flow, respectively. In some cases, a doctor's response may be relevant to something asked much earlier. For instance, in Figure 4c, the question "how dark is the urine" is related to a message sent earlier. Ultimately, the manual labeling process leads to a set of paired patient-doctor chats and some infeasible cases. In total, we obtain 31,407 paired messages, 23.1% of which are "infeasible".

After finding the appropriate responses and associating them with patient messages, we consider strategies for diversifying the generated responses. Based on the comprehensive rules that we identified, we generate a set of diverse canned messages. Note that determining such rules for response diversification requires domain expertise and interaction with the stakeholders (e.g., physicians and end-users). Table 1 shows an example of our rule-based response diversification. We diversify the response "You are welcome" based on some predefined rules; for instance, if the patient message implies the end of the conversation, we use "You are welcome. Take care. Bye." instead. As we consider a platform that suggests the top-3 responses to the doctors, our algorithm can benefit considerably from a more diverse set covering all possible situations. Otherwise, many irrelevant messages might pop up on the platform, all pointing to the semantically identical response.

In real-time chat conversations, unlike with a chatbot, it is typically not required to generate a response for every received message. Therefore, after preprocessing the messages, we define a triggering model that decides whether or not to trigger a reply for a given patient message. Triggering is a binary classification task based on the "feasible"/"infeasible" manual labeling explained in Section 3.1.3. If a patient message passes the triggering model with a prediction probability greater than a predetermined value p, then it enters the smart response generator phase; otherwise, we do not generate a reply for it. Figure 5 illustrates the triggering and response generation processes. The reply suggestion phase integrates different models to generate a proper suggested response. Since typical usage in practice involves recommending the top-k responses (e.g., k = 3), our main aim is to propose the most appropriate response within the first k suggestions.

In the triggering phase, we aim to determine the feasibility of response generation. If a patient's message is too specific (i.e., not applicable to other people), too generic (e.g., "OK" or "done"), or not seen in the training set (i.e., the chance of an irrelevant suggestion is high), then there is no need to trigger any smart response. Furthermore, the triggering model ignores messages that are too complex or lengthy. On the other hand, the system should facilitate a doctor's job since they might be busy with multiple chats. The triggering phase should pass a message to response generation only if a proper response suggestion is likely. We experiment with different binary classification methods to identify the most suitable model for the triggering phase. Accordingly, the value 0 for the dependent binary variable represents patient messages for which it is not ideal to generate a reply, and the value 1 indicates feasible patient messages. We use the preprocessed patient messages along with their length as the independent variables. The textual feature is converted to numeric values in different ways for each algorithm; therefore, we discuss the data conversion process within each model's explanation.

Unlike the triggering phase, which deals with a binary output, the response generation phase decides on the proper reply only for messages that pass the first phase. Hence, the diversity of responses is higher, reducing the accuracy of this multi-class classification task. In this phase, we use all possible replies that are manually labeled as the dependent variable, and only the patients' messages as the independent variables. We did not find any significant correlation between the length of a message and the generated response; therefore, we did not include the patients' message length as a feature in this phase.

Although there are many machine learning (ML) algorithms for text classification, we chose to experiment with those commonly used in different domains. Moreover, in our preliminary analysis, we experimented with other ML methods (e.g., Random Forest and Naive Bayes); however, we did not find them to outperform the methods summarized below.

XGBoost enhanced with weighted embedding. XGBoost, a scalable tree boosting system [28], builds an ensemble of weak trees by incrementally adding new trees that contribute the most to the learning objective. To accommodate the distributed text representation in numeric format, we average the embedding of each word per message, as proposed by Stein et al. [29], with some modifications. First, as we deal with medical conversations, we use the Wikipedia PubMed word embedding representation. Second, since simple averaging does not reflect the importance of each word, we use a weighted average where the TF-IDF values of the words serve as the weights. Through these slight adaptations, we ensure that unimportant words do not have an impact on the averaged output for a given message [30]. Finally, we append the length of the patient message as a new independent feature. Hence, the text representation along with its length contributes 201 independent attributes for each patient message. For XGBoost, while we include the message length in the triggering phase, we exclude it in the response generation phase. It is also important to note that we compared this approach with both a simple TF-IDF representation [31] and an unweighted word embedding average and found it to perform better.
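The following is a minimal sketch of fitting such a triggering classifier with the xgboost scikit-learn wrapper on the 201 features described above (200-dimensional weighted embeddings plus message length). The file names and hyperparameter values are hypothetical placeholders rather than the authors' configuration.

```python
# Sketch: XGBoost triggering classifier on weighted-embedding features + message length.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

embeddings = np.load("patient_message_embeddings.npy")   # hypothetical (n, 200) array
lengths = np.load("patient_message_lengths.npy")          # hypothetical (n,) array
y = np.load("feasible_labels.npy")                        # 1 = feasible, 0 = infeasible

X = np.hstack([embeddings, lengths.reshape(-1, 1)])       # 201 features per message
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```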
SVM enhanced with weighted embedding. Support Vector Machines (SVM) have been widely used for text categorization and classification in different domains [32, 33, 34]. An SVM identifies the support vectors (i.e., the data points closest to the hyperplane) to position a hyperplane that maximizes the classifier's margin. SVM learns independently of the dimensionality of the feature space, which eliminates the need for feature selection. It typically performs well for text classification tasks with less computational effort and hyperparameter tuning while also being less prone to overfitting [35]. Hence, we consider SVM as a baseline and compare it with other classification approaches.

Bi-directional LSTM enhanced with Wikipedia PubMed embedding. Long short-term memory (LSTM) units, as the name suggests, capture both long-term and short-term information through the input, forget, and output gates. Therefore, they have the ability to forget uncorrelated information while passing along the relevant information [36]. Since the patient messages consist of long sentences, such gates are ideal for minimizing information loss: they can retain message content stored as long-term memory inside the cell while keeping valuable information provided towards the end of a sentence. In our algorithm, we use Bi-directional LSTM (BiLSTM) units that learn information from both directions, enabling them to access both the preceding and succeeding contexts [37]. This way, equal weight is given to the beginning and the end of a sentence. BiLSTM units are an appropriate remedy for our problem since a patient message may contain useful information either at the beginning of a sentence or at the end. Using the attention mechanism, the BiLSTM disregards generic comments and concentrates on more pertinent information [38].

Seq2Seq enhanced with Wikipedia PubMed embedding. Sequence-to-sequence (Seq2Seq) models turn one sequence into another and are used primarily in text translation, caption generation, and conversational models. Therefore, we only apply this model to the reply suggestion phase, as it is not generalizable to the triggering phase. Our Seq2Seq model consists of an encoder, a decoder, and an attention layer [39]. The encoder encodes the complete information of a patient message into a context vector, which is then passed on to the decoder to produce an output sequence corresponding to a doctor's reply. Since our data consists of long sentences, we use an attention mechanism to assign more weight to relevant parts of the context vector to improve computational efficiency as well as accuracy [40]. To prepare our data for the Seq2Seq model, we tokenize both doctors' and patients' messages and pad them to match the length of the longest sentence in our data. Start and end tokens are added to each sequence. Furthermore, we use a pre-trained Wikipedia PubMed embedding layer to capture the text semantics. The encoder additionally uses a Bi-directional LSTM layer for enhanced learning of the encoded patient messages. We train the Seq2Seq model using the Adamax optimizer and sparse categorical cross-entropy to calculate the losses. We employ beam search [41] to retrieve the predicted outcomes of the model using a beam width of three and apply length normalization to avoid biases against lengthier doctor replies. We rank replies according to their beam scores and choose the top-k responses. As the model generates responses word by word, there is a tendency for it to suggest inappropriate or grammatically incorrect sentences. To overcome this issue, we apply cosine similarity to match the generated responses with our canned response set and select the ones with the highest cosine similarity score. Hence, we ensure that the proposed options have proper word choice and grammar. Nevertheless, when the final top-k suggested replies overlap, we iteratively cycle through their cosine scores and pick the next best response until we reach k unique suggestions.
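As an illustration of this architecture, the following is a minimal TensorFlow/Keras sketch of an encoder-decoder with a frozen pre-trained embedding, a BiLSTM encoder, an LSTM decoder, and dot-product (Luong-style) attention. The use of Keras, the layer sizes, and the teacher-forcing training setup are assumptions for illustration; the paper does not publish its implementation, and beam-search decoding is omitted here.

```python
# Sketch: Seq2Seq training model (teacher forcing) with a BiLSTM encoder and attention.
from tensorflow.keras import layers, Model
from tensorflow.keras.initializers import Constant

def build_seq2seq(vocab_size, embed_dim, embedding_matrix, latent_dim=512):
    # Encoder: frozen pre-trained embedding followed by a BiLSTM.
    enc_inputs = layers.Input(shape=(None,), dtype="int32")
    enc_emb = layers.Embedding(vocab_size, embed_dim,
                               embeddings_initializer=Constant(embedding_matrix),
                               trainable=False)(enc_inputs)
    enc_out, fh, fc, bh, bc = layers.Bidirectional(
        layers.LSTM(latent_dim, return_sequences=True, return_state=True))(enc_emb)
    state_h = layers.Concatenate()([fh, bh])
    state_c = layers.Concatenate()([fc, bc])

    # Decoder: LSTM over the (shifted) doctor reply, initialized with the encoder states.
    dec_inputs = layers.Input(shape=(None,), dtype="int32")
    dec_emb = layers.Embedding(vocab_size, embed_dim,
                               embeddings_initializer=Constant(embedding_matrix),
                               trainable=False)(dec_inputs)
    dec_out, _, _ = layers.LSTM(2 * latent_dim, return_sequences=True,
                                return_state=True)(dec_emb, initial_state=[state_h, state_c])

    # Dot-product (Luong-style) attention over the encoder outputs.
    context = layers.Attention()([dec_out, enc_out])
    dec_concat = layers.Concatenate()([dec_out, context])
    outputs = layers.Dense(vocab_size, activation="softmax")(dec_concat)

    model = Model([enc_inputs, dec_inputs], outputs)
    model.compile(optimizer="adamax", loss="sparse_categorical_crossentropy")
    return model
```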
We use 5-fold nested cross-validation and tune the most important parameters of the models by dividing the training dataset into training and validation sets. In each grid search over the hyperparameters, we identify the best models for text classification, and in the testing phase, we use the models that perform best on the validation set. We provide the final model configurations as follows.

• For our Seq2Seq model, we initiate our encoder with the Wikipedia PubMed embedding layer, followed by a bidirectional LSTM layer of size 1024. Next, we use an LSTM layer for our decoder, along with the Luong attention mechanism. We calculate the sparse categorical cross-entropy loss, and the model is trained for 15 epochs using the Adamax optimizer.

In this section, we first briefly define the performance metrics used to evaluate the different algorithms. Then, we report the performance of the smart response suggestion together with sample generated responses.

To investigate the performance of the two phases of the algorithm, we rely on two different sets of metrics. For the triggering phase, we use "accuracy", the ratio of correct predictions to the number of instances; "precision", the fraction of instances predicted as class c that actually belong to class c; "recall", the fraction of data points belonging to class c that are correctly identified; and "F1-score", the harmonic mean of precision and recall. These four performance metrics are threshold-dependent (i.e., model predictions constitute a probability distribution over class labels, and binary predictions are determined based on a probability threshold, e.g., 0.5). Therefore, we also utilize the area under the ROC curve (AUC-ROC) as a threshold-independent approach to mitigate the problems with threshold settings [42].

We employ different metrics to assess the performance of the response generation models. We mainly rely on the "precision@k" metric to report the accuracy of the suggestions [6]. If the actual doctor response is among the top k suggested responses, we call it a correct suggestion; otherwise, it is not a suitable suggestion. We take the number of generated responses as k = 3; therefore, if the model is adept enough to include the proper reply among the top 3, it will be considered an appropriate suggestion. We also report "precision@1" and "precision@5" to gain more insights regarding the models' performance. Another useful metric is the rank of the suggested response. If a model puts forward the correct reply at rank 4, there is a likelihood that it can be improved further by some parameter tuning; on the other hand, if the correct response is ranked 20th, the model is unlikely to suggest a proper response. Consequently, we report the Mean Reciprocal Rank (MRR), defined as MRR = (1/N) * sum_{i=1}^{N} (1/rank_i), where N is the total number of messages and rank_i is the rank of the correct response for message i. MRR ranges from 0 to 1, where 1 indicates optimal performance (all the suggestions are ranked first).
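For concreteness, the two ranking metrics can be computed as in the sketch below. Variable names are illustrative: `ranked_suggestions` holds, for each patient message, the model's canned responses sorted from most to least likely, and `true_responses` holds the doctor's actual (labeled) reply.

```python
# Sketch: precision@k and mean reciprocal rank (MRR) for ranked response suggestions.
def precision_at_k(ranked_suggestions, true_responses, k=3):
    hits = sum(1 for ranked, truth in zip(ranked_suggestions, true_responses)
               if truth in ranked[:k])
    return hits / len(true_responses)

def mean_reciprocal_rank(ranked_suggestions, true_responses):
    total = 0.0
    for ranked, truth in zip(ranked_suggestions, true_responses):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)   # ranks are 1-based
        # a reply that never appears in the ranking contributes 0 to the sum
    return total / len(true_responses)
```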
Accurate triggering is important, since an infeasible patient message passing this filter not only leads to an irrelevant suggestion but also increases the computational complexity. Consequently, an inferior model will reduce the quality of the suggested replies and degrade the performance of the overall response generation mechanism. Table 2 shows the performance of different models for the triggering phase. All the models outperform the baseline approach, which generates the response based on frequency. The reason for the relatively high accuracy of the frequency-based approach is the imbalanced ratio of feasible and infeasible cases; by over-predicting the majority class, it can still reach acceptable performance. However, when it comes to the threshold-independent metric, AUC-ROC, the frequency-based approach performs as poorly as a random guess, with almost 50% AUC-ROC. On the other hand, the other approaches show significant improvement over the frequency-based suggestion. We also note that BiLSTM provides the best performance when combined with its response suggestion algorithm; however, we do not provide a detailed analysis of the combined triggering and response generation algorithm for the sake of brevity.

After a message successfully passes the triggering filter, it enters the response suggestion model. Response suggestion aims to include proper messages within the top responses to facilitate the patient-doctor conversation. Here, we only concentrate on the doctors' response generation process. We compare the machine learning algorithms with the baseline, frequency-based suggestion. The baseline selects doctor responses from the canned messages based on their occurrence probability in the training set. As our response set includes certain frequent categories, the precision of the baseline might seem relatively high. However, the difference between the machine learning algorithms and the baseline is statistically significant. Table 3 summarizes these results. Previous studies report the accuracy of intent detection for the generated replies; however, our work mainly concentrates on the precision of the suggested responses and not just their intentions. In comparison to these previous studies, we can conclude that our adopted algorithm for response generation in online medical services is successful and surpasses the baseline (i.e., frequency-based response generation) by improving its precision more than 2.5 times (see Table 3).

We demonstrate the distribution of the actual frequency of the most frequent medical responses (i.e., excluding casual responses such as "You are welcome." and "Thanks.") in Figure 6. The ground-truth frequency is shown in black, while the predicted frequency is shown in grey. We observe that both the prediction and the ground truth follow a similar distribution. For the casual responses, which are excluded from the graph, the generated frequencies exhibit a similar pattern to the actual ones. The Seq2Seq model, despite having the suggested messages ranked first, loses its superiority shortly after. One reason for this performance drop is the beam search associated with the response selection: it does not have the option to diversify the message, and adding rule-based diversification increases its computation time.
Therefore, the Seq2Seq model fails to generate high-quality responses at the pace needed to output a reply. All the analyses highlight the advantage of using BiLSTM for automated doctor response recommendations.

One of the most significant parameters of the algorithm is the triggering threshold. As the triggering model outputs the probability of generating a response, it is important to determine how to convert that probability into a binary decision. As a rule of thumb, we round probabilities greater than or equal to 0.5 to 1 and smaller ones to 0. However, the question is whether the threshold of 0.5 provides the best-suggested replies. When the threshold is too small, the model tends to generate responses for most infeasible cases; on the other hand, when it is close to 1, the model becomes more conservative, as it avoids generating inappropriate responses.

We use actual conversations between patients and doctors coming from an online medical chat service. After exploratory data analysis, we clean the dataset and devise a canned response set. Using clustering techniques, we find the densest clusters of doctors' messages and extract frequent responses from them. Afterward, we match the patient and doctor messages, being aware of the complexity of disorderly exchanged chats, which results in 31,407 paired messages. Not all patient messages require smart replies; therefore, we also label the pairs as "feasible" or "infeasible". Our algorithm proceeds in two steps: predicting whether we need to trigger a smart reply, and suggesting the proper response given that a message passes the triggering phase. We explore different combinations of machine learning and deep learning algorithms to address each step. Furthermore, we tune the parameters and report the performance using 5-fold nested cross-validation. We assess each algorithm's performance using threshold-dependent and threshold-independent metrics and observe that the Bidirectional LSTM is the best method for the triggering phase. It has a balanced score for both majority and minority class labels, i.e., feasible and infeasible cases. In addition, its suggested replies are also the most appropriate in the response generation phase. Moreover, we tested its robustness to the triggering threshold and found it to be resilient to changes in this parameter.

A relevant avenue for future research would be to improve the method by including more data points (i.e., more labeled conversations). To the best of our knowledge, there is no publicly available dataset for medical conversations; therefore, we only apply the algorithm to our proprietary dataset. Besides, as a consequence of the COVID-19 pandemic, our dataset is continuously being updated. Specifically, we find constant changes in patient queries and doctor answers; for instance, among symptom-related questions, we observe that vaccine queries have become dominant. Accordingly, an automated mechanism could be developed to retrain the models as such changes arise. We note that the overall response generation mechanism becomes feasible by introducing enough paired messages and updating the model weights. Moreover, as manual labeling is a tedious task, we plan to investigate semi-supervised learning for semantic clustering and labeling of big datasets.

References:
[1] Consumers prefer live chat for customer service: stats.
[2] Physician supply and demand: a 15-year outlook, key findings.
[3] Survey of physician appointment wait times and Medicare and Medicaid acceptance rates.
[4] The impact of the COVID-19 pandemic on outpatient visits: A rebound emerges.
[5] Texting thumb.
[6] Smart reply: Automated response suggestion for email.
[7] OCC: A smart reply system for efficient in-app communications.
[8] A case study of closed-domain response suggestion with limited training data.
[9] The design and implementation of XiaoIce, an empathetic social chatbot.
[10] "Chitty-chitty-chat bot": Deep learning for conversational AI.
[11] Joint learning of response ranking and next utterance suggestion in human-computer conversation systems.
[12] Coupled context modeling for deep chit-chat: Towards conversations between human and computer.
[13] Enhancing response generation using chat flow identification.
[14] A picture is worth a thousand words: Improving mobile messaging with real-time autonomous image suggestion.
[15] Evaluating and informing the design of chatbots.
[16] SolutionChat: Real-time moderator support for chat-based structured discussion.
[17] Artificial intelligence in medicine.
[18] High-performance medicine: The convergence of human and artificial intelligence.
[19] The practical implementation of artificial intelligence technologies in medicine.
[20] Conversational agents in health care: Scoping review and conceptual analysis.
[21] A chatbot for psychiatric counseling in mental healthcare service based on emotional dialogue analysis and sentence generation.
[22] Text-based healthcare chatbots supporting patient and health professional teams: Preliminary results of a randomized controlled trial on childhood obesity.
[23] Measuring the quality of patient-physician communication.
[24] The potential for artificial intelligence in healthcare.
[25] AI-mediated communication: Definition, research agenda, and ethical considerations.
[26] Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: A mixed-methods study.
[27] GloVe: Global vectors for word representation.
[28] XGBoost: A scalable tree boosting system.
[29] An analysis of hierarchical text classification using word embeddings.
[30] ECNU: Using traditional similarity measurements and word embedding for semantic textual similarity estimation.
[31] The text classification of theft crime based on TF-IDF and XGBoost model.
[32] Comparing automated text classification methods.
[33] Hybrid feature selection for text classification.
[34] Class-indexing-based term weighting for automatic text classification.
[35] Text categorization with support vector machines: Learning with many relevant features.
[36] Novel efficient RNN and LSTM-like architectures: Recurrent and gated broad learning systems and their applications for text classification.
[37] Bidirectional LSTM with attention mechanism and convolutional layer for text classification.
[38] Revisiting LSTM networks for semi-supervised text classification via mixed objective function.
[39] Sequence to sequence learning with neural networks.
[40] Effective approaches to attention-based neural machine translation.
[41] Beam search algorithms for multilabel learning.
[42] A unified view of performance metrics: Translating threshold choice into expected classification loss.

The authors would like to thank Your Doctors Online for funding and supporting this research. This work was also funded and supported by Mitacs through the Mitacs Accelerate Program. The authors would also like to thank Gagandip Chane for his help with the data labeling.