Natural Language Engineering 1 (1): 1–23. Printed in the United Kingdom. © 2008 Cambridge University Press

Designing an Interactive Open-Domain Question Answering System

S. QUARTERONI
The University of York, York, YO10 5DD, United Kingdom

S. MANANDHAR
The University of York, York, YO10 5DD, United Kingdom

(Received 20 September 2007)

Abstract

Interactive question answering (QA), where a dialogue interface enables follow-up and clarification questions, is a recent although long-advocated field of research. We report on the design and implementation of YourQA, our open-domain, interactive QA system. YourQA relies on a Web search engine to obtain answers to both fact-based and complex questions, such as descriptions and definitions. We describe the dialogue moves and management model making YourQA interactive, and discuss the architecture, implementation and evaluation of its chat-based dialogue interface. Our Wizard-of-Oz study and final evaluation results show how the designed architecture can effectively achieve open-domain, interactive QA.

1 Introduction

Question answering (QA) systems can be seen as information retrieval systems that aim to respond to queries in natural language by returning concise answers rather than whole relevant documents. State-of-the-art QA systems often compete in the annual TREC-QA evaluation campaigns, where participating systems must find concise answers to a benchmark of test questions within a document collection compiled by NIST (http://trec.nist.gov).

A commonly observed behaviour is that users of information retrieval systems often issue queries not as standalone questions but in the context of a wider information need, for instance when researching a specific topic (e.g. "William Shakespeare"). In this case, efficient ways of entering successive related queries have been advocated to avoid users having to enter contextually independent queries (Hobbs 2002). Recent TREC-QA campaigns have approached the issue of context management by introducing "targets" in the question sets from TREC 2004 onwards. Here, questions are grouped according to a common topic, upon which different queries (requiring factoid, list, or "other" answer types) are formulated. Since TREC-QA 2004, queries can contain references (such as pronominal anaphora) to their targets without such targets being explicitly mentioned in the query texts. However, the current TREC requirements only address one aspect of the complex issue of context management: the problem of detecting that one query is related to a topic introduced by a previous one is artificially solved by the presence of an explicit target, which would not be specified in a real interaction context.

It has been argued that providing a Question Answering system with a dialogue interface would encourage and accommodate the submission of multiple related questions and handle the user's requests for clarification: the 2006 Interactive QA workshop aimed to set a roadmap for information-seeking dialogue applications of Question Answering (Webb and Strzalkowski 2006). Indeed, Interactive QA systems are often reported at an early stage, such as Wizard-of-Oz studies, or applied to closed domains (Bertomeu et al. 2006; Jönsson and Merkel 2003; Kato et al. 2006).

In this paper, we report on the design, implementation and evaluation of the dialogue interface for our open-domain, personalized QA system, YourQA (Quarteroni and Manandhar 2007).
The core QA component in YourQA is organized according to the three-tier partition underlying most state-of-the-art QA systems (Kwok et al. 2001): question processing, document retrieval and answer extraction. An additional component in YourQA is the User Modelling (UM) component, introduced to overcome the traditional inability of standard QA systems to accommodate the users' individual needs (Voorhees 2003).

This article is structured as follows: Sections 2 – 3 focus on the two main components of the system, i.e. a User Modelling component to provide personalized answers and the core QA module, which is able to provide both factoid and complex answers. Sections 4 – 7 discuss a dialogue model and dialogue manager suitable for interactive QA, and Section 8 describes an exploratory study conducted to confirm our design assumptions. The implementation and evaluation of the dialogue model are reported in Sections 9 – 10. Section 11 briefly concludes on our experience with open-domain QA dialogue.

2 User Modelling Component

A distinguishing feature of our model of QA is the presence of a User Modelling component. User Modelling consists in creating a model of some of the target users' characteristics (e.g. preferences or level of expertise in a subject), and is commonly deployed in information retrieval applications to adapt the presentation of results to the user characteristics (Teevan et al. 2005). It seemed natural to adopt User Modelling within QA in order to filter the documents in which answers are sought and to re-rank candidate answers based on their degree of match with the user's profile. Since the current application scenario of YourQA is a system to help students find information on the Web, we designed the following User Model (UM) parameters:

• Age range, a ∈ {7-11, 11-16, adult}; this matches the partition between primary school, secondary school and higher education age in Britain;
• Reading level, r ∈ {basic, medium, good}; its values ideally (but not necessarily) correspond to the three age ranges and may be further refined;
• Interests, i: a set of topic key-phrases extracted from webpages, bookmarks and text documents of interest to the user.

A detailed account of how the UM parameters are applied during answer presentation has been reported in (Quarteroni and Manandhar 2006; Quarteroni and Manandhar 2007). As the focus of this paper is on the dialogue management component of YourQA, the contribution of the UM to the core QA component is only briefly mentioned here. In the remainder of this paper, we assume an adult user able to read any document and an empty set of interests; hence, no UM-based answer filtering or re-ranking is performed in the reported experiments.

3 Core Question Answering Component

The core QA component, illustrated in Figure 1, carries out three Question Answering phases: question processing, document retrieval and answer extraction.

Fig. 1. Core QA architecture: question processing, retrieval, answer extraction
3.1 Question Processing and Document Retrieval

Question processing is centered on question classification (QC), the task that maps a question into one of k expected answer classes in order to constrain the search space of possible answers and to contribute towards selecting specific answer extraction strategies for each answer class. Answer classes generally belong to two types: factoid classes, seeking short fact-based answers (e.g. names, dates), and non-factoid classes, seeking descriptions or definitions. An ad hoc question taxonomy has been constructed for YourQA with particular attention to questions that require non-factoid answers, such as lists, descriptions and explanations. To compile it, we studied the questions in the TREC-8 to TREC-12 test sets, publicly available at http://trec.nist.gov. Based on these, we designed a coarse-grained question taxonomy, which consists of the eleven question types described in Table 1. While the six classes in Column 1 can be considered of the factoid type, the five in Column 2 are non-factoid; depending on this type, the answer extraction process is different, as described in Section 3.2.

Table 1. YourQA's eleven-class expected answer taxonomy

Question class   Expected answer                           Question class   Expected answer
PERS             human                                     LIST             list of items
LOC              geographical expression                   DEF              definition, description
ORG              collective, group                         HOW              procedure, manner
QTY              numerical expression                      WHY              cause
TIME             temporal expression                       WHY-F            salient facts
OBJ              generic entity (e.g. "famous for ...")

Most QC systems apply supervised machine learning, e.g. Support Vector Machines (SVMs) (Zhang and Lee 2003) or the SNoW model (Li and Roth 2005), where questions are represented using lexical, syntactic and semantic features. (Moschitti et al. 2007) extensively studied a QC model based on SVMs: the learning algorithm combined tree kernel functions to compute the number of common subtrees between two syntactic parse trees. As benchmark data, the question training and test sets available at l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/ were used, where the test set consists of the TREC 2001 test questions. Based on such experiments, which reached state-of-the-art accuracy (i.e. 86.1% in 10-fold cross-validation) using the question's bag-of-words and parse tree, we applied the same features to learn multiclassifiers for the 11-class YourQA taxonomy. The overall SVM accuracy on the dataset of 3204 TREC-8 – TREC-12 test questions, obtained using five-fold cross-validation, was 82.9%. We also tested the SNoW algorithm for YourQA, following (Li and Roth 2005). We found the most effective question features to be: 1) bag-of-words, bigrams and trigrams; 2) bag-of-Named Entities, extracted using Lingpipe (http://www.alias-i.com/lingpipe/); 3) Part-Of-Speech unigrams, bigrams and trigrams (a sketch of this feature extraction is given at the end of this section). In the YourQA task, we achieved an accuracy of 79.3%; the lower result when compared to SVMs is confirmed by the experiment with the same features on the previous task, where the accuracy reached 84.1%.

The second phase carried out by the core QA module is document retrieval, where documents relevant to the query are obtained via an information retrieval engine, then downloaded and analyzed. YourQA uses Google (http://www.google.com) to retrieve the top 20 Web documents for the query.
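To make the surface feature sets above concrete, the following is a minimal sketch in Java (consistent with the Java-based tools mentioned later in the paper) of how word and Part-Of-Speech n-gram features could be collected for a question. It is an illustration under our own naming, not YourQA's actual code: the classifier itself and the named-entity features are omitted, and tokens and POS tags are assumed to come from an external tokenizer and tagger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of the question classification feature sets named above:
 * word unigrams/bigrams/trigrams and POS unigrams/bigrams/trigrams.
 * The supervised classifier (SVM or SNoW) that consumes them is not shown.
 */
public class QuestionFeatures {

    /** Collects all n-grams of sizes 1..maxN over a token sequence, with a feature prefix. */
    static List<String> ngrams(String[] tokens, int maxN, String prefix) {
        List<String> feats = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                feats.add(prefix + String.join("_", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return feats;
    }

    /** Builds the feature list for one question; tokens and POS tags come from an external tagger. */
    static List<String> extract(String[] tokens, String[] posTags) {
        List<String> feats = new ArrayList<>();
        feats.addAll(ngrams(tokens, 3, "w:"));    // word unigrams, bigrams, trigrams
        feats.addAll(ngrams(posTags, 3, "pos:")); // POS unigrams, bigrams, trigrams
        return feats;
    }

    public static void main(String[] args) {
        String[] tokens = {"Who", "painted", "Guernica", "?"};
        String[] tags   = {"WP", "VBD", "NNP", "."};  // illustrative POS tags
        System.out.println(extract(tokens, tags));
    }
}
```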
3.2 Answer Extraction

Answer extraction takes as input the expected answer type, estimated during question classification, and the set of documents retrieved for the question during document retrieval. In this process, the similarity between the question and the document passages is computed in order to return the best passages in a ranked list. Each retrieved document D is split into sentences, which are compared one by one to the question; the sentence most similar to the question is selected as the most likely sentence from document D to answer the question.

For factoid answers – i.e. PERS, ORG, LOC, QTY, TIME, MONEY – the required factoid can be pinpointed down to the phrase/word level in each candidate answer sentence. For non-factoids, other criteria are adopted to compute the similarity between the original question and each candidate answer, as explained below. In both cases, the bag-of-word similarity bw(q,a) is computed between the question q and a candidate answer a. This is the number of matches between the keywords in q and a divided by |q|, the number of keywords in q:

bw(q,a) = Σ_{i<|q|, j<|a|} match(q_i, a_j) / |q|,

where match(q_i, a_j) = 1 if q_i = a_j, and 0 otherwise.

3.2.1 Factoid answers

Our primary focus is on non-factoid QA and the criteria we apply for factoids are simple. We distinguish between two cases: a) the expected type is a person (PERS), organization (ORG) or location (LOC), which correspond to the types of Named Entities (NEs) recognized by the NE recognizer (Lingpipe in our case); b) the expected answer type is QTY, TIME, MONEY.

PERS, ORG, LOC – In this case, NE recognition is performed on each candidate answer a. If a phrase p labelled with the required NE class is found, w_p, i.e. the minimal distance d (in terms of number of words) between p and the question keywords k found in a, is computed: w_p = min_{k∈a} d(p,k). In turn, ne(a) = min_{p∈a} w_p, i.e. the minimal w_p among all the NE phrases of the required class found in a, is used as a secondary ranking criterion for a (after the bag-of-words criterion).

QTY, TIME, MONEY – In this case, class-specific rules are applied to find factoids of the required class in each candidate answer a. These are manually written based on regular expressions and the candidate answer's POS tags (e.g. the ordinal number tag). The presence of a substring of a matching such rules is the second similarity criterion between the question q and a, after the bag-of-words criterion.

3.2.2 Non-factoid answers

We assign to the non-factoid group the WHY, HOW, WHY-F, DEF and LIST types, as well as the OBJ type, which is too generic to be captured using a factoid answer approach. In these cases, we aim at more sophisticated sentence similarity metrics than the ones applied for factoids. We compute a number of normalized similarity measures, each measuring the degree of match between a sentence and the question. The final similarity is a weighted sum of all such measures. Beyond the bag-of-word similarity, we compute the metrics below.

Bigram similarity – N-gram similarity is a function of the number of common keyword n-grams between q and a:

ng(q,a) = commonN(q,a) / |ngrams(q)|,

where commonN is the number of shared n-grams between q and a and ngrams(q) is the set of question n-grams. We adopt bigrams (n = 2) as Web data is very noisy and allows for different formulations using the same words, making it unlikely that matches of longer keyword sequences would be found.
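As an illustration of the two lexical metrics just defined, the following Java sketch computes bw(q,a) and the bigram overlap ng(q,a) over pre-extracted keyword lists. It is a simplified rendering of the formulas above (each question keyword is counted at most once), not YourQA's actual code; names are ours, and tokenization and stop-word removal are assumed to happen upstream.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch of the bag-of-words and bigram similarity metrics. */
public class LexicalSimilarity {

    /** bw(q,a): question keywords matched in the answer, divided by |q|. */
    static double bagOfWords(List<String> q, List<String> a) {
        Set<String> answer = new HashSet<>(a);
        int matches = 0;
        for (String keyword : q) {
            if (answer.contains(keyword)) matches++;
        }
        return q.isEmpty() ? 0.0 : (double) matches / q.size();
    }

    /** ng(q,a) with n = 2: shared keyword bigrams divided by the number of question bigrams. */
    static double bigrams(List<String> q, List<String> a) {
        Set<String> qBigrams = toBigrams(q);
        Set<String> aBigrams = toBigrams(a);
        if (qBigrams.isEmpty()) return 0.0;
        int common = 0;
        for (String bigram : qBigrams) {
            if (aBigrams.contains(bigram)) common++;
        }
        return (double) common / qBigrams.size();
    }

    private static Set<String> toBigrams(List<String> tokens) {
        Set<String> bigrams = new HashSet<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            bigrams.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return bigrams;
    }

    public static void main(String[] args) {
        List<String> q = Arrays.asList("when", "was", "shakespeare", "born");
        List<String> a = Arrays.asList("shakespeare", "was", "born", "in", "1564");
        System.out.println("bw = " + bagOfWords(q, a) + ", ng = " + bigrams(q, a));
    }
}
```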
Chunk similarity – Sentence chunks can be defined as groups of consecutive, semantically connected words in one sentence, which can be obtained using a shallow parser (in our case, the OpenNLP chunker). Compared to bigrams, chunks encode a more semantic type of information. Chunk similarity, ck(q,a), is defined as the number of common chunks between q and a divided by the total number of chunks in q:

ck(q,a) = commonC(q,a) / |chunks(q)|,

where commonC is the number of shared chunks between q and a and chunks(q) is the set of question chunks.

Head NP-VP-PP similarity – The idea behind this metric is to find matching groups consisting of a noun phrase (NP), verb phrase (VP) and prepositional phrase (PP) chunk in q and a. Head NP-VP-PP similarity is defined as:

hd(q,a) = µ × HNPmatch(q,a) + ν × VPmatch(q,a) + ξ × PPmatch(q,a).

For generalization, VPs are lemmatized and the semantically most important word in the NP (called "head NP") is used instead of the NP. In case q contains several VPs, we choose the VP for which hd(q,a) is maximal. Based on empirical observation of YourQA's results, we are currently using µ = ν = .4, ξ = .2.

WordNet similarity – This semantic metric is based on the WordNet lexical database (http://wordnet.princeton.edu) and the Jiang-Conrath word-level distance (Jiang and Conrath 1997). WordNet similarity is:

wn(q,a) = 1 − Σ_{i<|q|, j<|a|} jc(q_i, a_j) / |q|,

jc(q_i, a_j) being the Jiang-Conrath distance between q_i and a_j.

Combined similarity – The similarity formula combining the above metrics is:

sim(q,a) = α × bw(q,a) + β × ng(q,a) + γ × ck(q,a) + δ × hd(q,a) + ε × wn(q,a).

For efficiency reasons, we do not compute wn(q,a) at the moment. We have estimated α = .6, β = .2, γ = δ = .1 as suitable coefficients. The bag-of-word criterion has a higher impact than metrics which rely on word structures (i.e. bigrams or chunks) because of the noisy Web data we are processing.

(Moschitti et al. 2007) took the above criteria for non-factoid QA as a baseline and applied various combinations of features to learn SVM answer re-rankers. The experiments on 1309 YourQA answers to the TREC 2001 non-factoid questions showed that the baseline MRR of 56.21±3.18 was greatly improved by adding a combination of lexical, deep syntactic and shallow semantic features, reaching 81.12±2.12.

3.3 Answer presentation

From the preceding steps, YourQA obtains a list of answer sentences ranked by decreasing similarity to the query. Windows of up to 5 sentences centered around these sentences are then produced to be returned as answer passages. To present answers, we fix a threshold th for the maximal number of passages to be returned (currently th=5); these are ordered following the ranking described above. In case of a tie between two candidate answers, the Google ranks of their respective documents are compared and the answer whose source document has the higher Google rank obtains a higher position in the list (a sketch of this ranking is given at the end of this section).

The answer passages are listed in an HTML page where each list item consists of a document title and a result passage obtained as described above. In the passages, the sentence which best answers the query according to the similarity metric described above is highlighted. In case the expected answer is a factoid, the recognized factoids are highlighted in different colors based on their type. A link to the URL of the original document is also available if the user wants to read more (see Figure 2).

Fig. 2. YourQA: sample result (from http://www.cs.york.ac.uk/aig/aqua/).
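The passage ranking described in Section 3.3 can be summarized by the following Java sketch, which combines the weighted similarity score with the Google-rank tie-break and the threshold th on returned passages. This is an illustration under our own naming and assumptions, not the system's implementation; the individual metrics are assumed to be computed elsewhere (e.g. as sketched earlier).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch of combined scoring and passage ranking with a Google-rank tie-break. */
public class PassageRanking {

    /** Combined similarity with the coefficients reported in the text; wn is currently skipped (epsilon = 0). */
    static double sim(double bw, double ng, double ck, double hd) {
        return 0.6 * bw + 0.2 * ng + 0.1 * ck + 0.1 * hd;
    }

    static class Candidate {
        String passage;
        double similarity;  // sim(q,a) for the best sentence of the passage
        int googleRank;     // 1 = highest-ranked source document

        Candidate(String passage, double similarity, int googleRank) {
            this.passage = passage;
            this.similarity = similarity;
            this.googleRank = googleRank;
        }
    }

    /** Orders candidates by decreasing similarity, breaking ties by source document rank, and keeps at most th. */
    static List<Candidate> rank(List<Candidate> candidates, int th) {
        return candidates.stream()
                .sorted(Comparator
                        .comparingDouble((Candidate c) -> -c.similarity) // best similarity first
                        .thenComparingInt(c -> c.googleRank))            // then best Google rank
                .limit(th)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Candidate> ranked = rank(Arrays.asList(
                new Candidate("passage A", sim(0.8, 0.5, 0.2, 0.3), 4),
                new Candidate("passage B", sim(0.8, 0.5, 0.2, 0.3), 2),  // wins the tie with A
                new Candidate("passage C", sim(0.4, 0.0, 0.1, 0.1), 1)), 5);
        ranked.forEach(c -> System.out.println(c.passage));
    }
}
```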
Section 4 discusses the issues and design of a dialogue interface for YourQA to achieve interactive QA.

4 Modelling Interactive Question Answering

Interactive QA dialogue can be considered as a form of information-seeking dialogue where two roles are modelled: the inquirer (the user), looking for information on a given topic, and the expert (the system), interpreting the inquirer's needs and providing the required information.

We agree with (Dahlbaeck et al. 1993) that attempting to perfectly emulate human dialogue using a machine is an unrealistic and perhaps unimportant goal. On the other hand, we believe that an understanding of human dialogues can greatly facilitate building human-machine information-seeking dialogue systems. Hence, the design of task-oriented dialogue systems cannot happen without an accurate analysis of the conversational phenomena observed in human-human dialogue.

4.1 Salient Features of Human Information-seeking Dialogue

For the purpose of describing information-seeking dialogue, we focussed on the following aspects:

• Overall structure: as observed by (Sinclair and Coulthard 1975), human dialogues usually have an opening, a body and a closing. Based on actual human conversations, the authors elaborate a hierarchical discourse grammar representing dialogue as a set of transactions, composed of exchanges, in turn made of moves, whose elementary components are speech acts. In this framework, which has dominated computational approaches to dialogue to the present day, utterances are therefore considered as dialogue acts, as they aim at achieving an effect (obtaining information, planning a trip, etc.).

• Mixed initiative: initiative refers to who is taking control of the interaction. When one of the interlocutors is a computer system, the literature typically distinguishes between mixed-, user-, and system-initiative (Kitano and Ess-Dykema 1991). In mixed-initiative dialogue, the system must be able to take control in order to confirm given information, clarify the situation, or constrain user responses. The user may take the initiative for most of the dialogue, for instance by introducing information that has not been specifically asked for or by changing the subject and therefore the focus of the conversation, as often happens in human interaction (Hearst et al. 1999).

• Over-informativeness: dialogue participants often contribute more information than required by their interlocutors (Churcher et al. 1997). This usually enables dialogue to be more pleasant and time-efficient, as the interlocutors do not need to explicitly ask for all the desired information.

• Contextual interpretation: human interaction relies on the conversation participants sharing a common notion of context and topic (Grosz and Sidner 1986). Such common context is used by participants to issue and correctly interpret phenomena such as ellipsis and anaphora, and more complex phenomena such as reprise and sluice (see Section 4.2).

• Grounding: it has been observed that, to prevent or recover from possible misunderstandings, speakers engage in a collaborative, coordinated series of exchanges, instantiating new mutual beliefs and making contributions to the common ground of a conversation. This process is known as grounding (Cahn and Brennan 1999).

Section 4.2 underlines the fundamental issues implied by accounting for such phenomena when modelling information-seeking human-computer dialogue.
4.2 Issues in Modelling Information-Seeking Dialogue

Based on the observed features of human information-seeking dialogue, we summarize the main issues in modelling task-oriented human-computer dialogue, with an eye on the relevance of such issues to Interactive Question Answering.

Ellipsis – Ellipsis is an omission of part of the sentence, resulting in a sentence with no verb phrase. Consider the exchange: User: "When was Shakespeare born?", System: "In 1564.", User: "Where?". The interpretation and resolution of ellipsis requires an efficient modelling of the conversational context to complete the information missing from the text.

Anaphoric references – An anaphora is a linguistic form whose full meaning can only be recovered by reference to the context; the entity to which the anaphora refers is called the antecedent. The following exchange contains an example of anaphoric reference: User: "When was Shakespeare born?", System: "In 1564.", User: "Whom did he marry?", where "he" is the anaphora and "Shakespeare" is the antecedent. A common form of anaphora is third person pronoun/adjective anaphora, where pronouns such as "he/she/it/they" or possessive adjectives such as "his/her/its/their" are used in place of the entities they refer to: the latter can be single or compound nouns (such as William Shakespeare), or even phrases ("The Taming of the Shrew"). Solving an anaphora, i.e. finding its most likely referent, is a critical problem in QA as it directly affects the creation of a meaningful query. However, in information-seeking dialogue, resolution is simpler than in tasks such as document summarization (Steinberger et al. 2005), as the exchanged utterances are generally brief and contain fewer cases of anaphora.

Grounding and Clarification – While formal theories of dialogue assume complete and flawless understanding between speakers, there exists a practical need for grounding (Cahn and Brennan 1999). A typical Question Answering scenario where requests for confirmation should be modelled is the resolution of anaphora: User: "When did Bill Clinton meet Yasser Arafat in Camp David?", System: "In 2000.", User: "How old was he?". The user's question contains two named entities of type "person": hence, "he" can have two candidate referents, i.e. Bill Clinton and Yasser Arafat. Having resolved the anaphoric reference, the system should decide whether to continue the interaction by tacitly assuming that the user agrees with the replacement it has opted for (possibly "he = Bill Clinton") or to issue a grounding utterance ("Do you mean how old was Bill Clinton?") as a confirmation.

Turn-taking – According to conversation analysis, conversation is carried out in and through turns, or pairs of utterances often called adjacency pairs (Schegloff and Sacks 1973). Our dialogue management system encodes adjacency pairs, where participants speak in turns so that dialogue can be modelled as a sequence of 〈request, response〉 pairs. In natural dialogue, there is very little overlap between when one participant speaks and when the other does, resulting in a fluid discourse. To ensure such fluidity, the computer's turn and the human's turn must be clearly determined in a dialogue system.
While this is an important issue in spoken dialogue, where a synthesizer must output a reply to the user's utterance, it does not appear to be very relevant to textual dialogue, where system replies are instantaneous and system/user overlap is virtually impossible.

4.3 Summary of Desiderata for Interactive Question Answering

Based on the phenomena and issues observed in Section 4.2, we summarize the desiderata for Interactive Question Answering in the following list:

• context maintenance: maintaining the conversation context and topic to allow the correct interpretation of the user's utterances (in particular of follow-up questions and requests for clarification);
• utterance understanding in the context of the previous dialogue; this includes follow-up/clarification detection and the resolution of issues like ellipsis and anaphoric expressions;
• mixed initiative: users should be able to take the initiative during the conversation, for example to issue clarification requests and to quit the conversation when they desire to do so;
• follow-up proposal: an IQA system should be able to encourage the user to provide feedback about satisfaction with the answers received and also to keep the conversation with the user active until he/she has fulfilled their information needs;
• natural interaction: wide coverage of the user utterances to enable smooth conversation; generation of a wide range of utterances to encourage users to keep the conversation active.

5 A Dialogue Model for Interactive Question Answering

Several theories of discourse structure exist in the literature and have led to different models of dialogue. Among these, a widely used representation of dialogue is based on speech act theory, introduced by (Austin 1962), which focuses on the communicative actions (or speech acts) performed when a participant speaks. Based on speech act theory, several annotation schemes of speech acts – also called dialogue moves – have been developed for task-oriented dialogues. While the level of granularity of such schemes, as well as their range of moves, is mostly determined by the application of the dialogue system, as pointed out in (Larsson 1998) there are a number of generic common dialogue moves, which include:

• Core speech acts (e.g. TRAINS (Traum 1996)) such as "inform"/"request";
• Conventional (e.g. DAMSL (Core and Allen 1997)) or discourse management (e.g. LINLIN (Dahlbaeck and Jonsson 1998)) moves: opening, continuation, closing, apologizing;
• Feedback (e.g. VERBMOBIL (Alexandersson et al. 1997)) or grounding (e.g. TRAINS) moves: to elicit and provide feedback;
• Turn-taking moves (e.g. TRAINS), relating to the sub-utterance level (e.g. "take-turn", "release-turn").

Taking into account such general observations, we developed the set of user and system dialogue moves given in Table 2. In our annotation, the core speech acts are represented by the ask and answer moves. Amongst discourse management moves, we find greet and quit in both the user and system moves, and the followup (proposal) move from the system. The user feedback move is usrReqClarif, mirrored by the system's sysReqClarif move. A feedback move common to both user and system is ack, while the ground and clarify moves are only in the system's range. We do not include turn-taking moves in our annotation as these are at a sub-utterance level. The above moves are used in the dialogue management algorithm described below.

Table 2. User and System dialogue moves
User move       Description              System move     Description
greet           conversation opening     greet           conversation opening
ack             acknowledge system       ack             acknowledge user
ask(q)          ask (q=question)         answer(a)       answer (a=answer)
usrReqClarif    clarification request    sysReqClarif    clarification request
quit            conversation closing     quit            conversation closing
                                         followup        proposal to continue
                                         clarify(q)      clarify (q=question)
                                         ground(q)       ground (q=question)

1. An initial greeting (greet move), or a direct question q from the user (ask(q) move), opens the conversation;
2. q is analyzed to detect whether it is related to previous questions (clarify(q) move) or not;
3. (a) If q is unrelated to the preceding questions, it is submitted to the QA component;
   (b) If q is related to the preceding questions (i.e. a follow-up question) and is elliptic (e.g. "Why?"), the system uses the previous questions to complete q with the missing keywords and submits a revised question q' to the QA component (notice that no dialogue move occurs here as the system does not produce any utterance);
   (c) If q is a follow-up question and is anaphoric, i.e. contains references to entities in the previous questions, the system tries to create a revised question q'' where such references are replaced by their corresponding entities, then checks whether the user actually means q'' (move ground(q'')). If the user agrees, query q'' is issued to the QA component. Otherwise, the system asks the user to reformulate his/her utterance (move sysReqClarif) until a question is found which can be submitted to the QA component;
4. Once the QA component results are available, an answer a is provided (answer(a) move);
5. The system enquires whether the user is interested in a follow-up session; if this is the case, the user can enter a query (ask move) again. Else, the system acknowledges (ack);
6. Whenever the user wants to terminate the interaction, a final greeting is exchanged (quit move).

At any time the user can issue a request for clarification (usrReqClarif) in case the system's utterance is not understood. We now discuss the choice of a dialogue management model to implement such moves.

6 Previous Work on Dialogue Management

Broadly speaking, dialogue management models fall into two categories: pattern-based approaches or plan-based approaches (Cohen 1996; Xu et al. 2002). The following sections provide a brief critical overview of these, underlining their issues and advantages when addressing interactive QA.

6.1 Pattern-Based Approaches: Grammars and Finite-State

When designing information-seeking dialogue managers, Finite-State (FS) approaches provide the simplest methods for implementing dialogue management. Here, the dialogue manager is represented as a Finite-State machine, where each state models a separate phase of the conversation, and each dialogue move encodes a transition to a subsequent state (Sutton 1998); hence, from the perspective of a state machine, speech acts become state transition labels. When state machines are used, the system first recognizes the user's speech act from the utterance, makes the appropriate transition, and then chooses one of the outgoing arcs to determine the appropriate response to supply.

The advantage of state-transition graphs is mainly that users respond in a predictable way, as the system has the initiative for most of the time. However, an issue with FS models is that they allow very limited freedom in the range of user utterances.
Since each dialogue move must be pre-encoded in the models, there is a scalability issue when addressing open-domain dialogue. Moreover, the model typically assumes that only one state results from a transition; however, in some cases utterances are multifunctional, e.g. both a rejection and an assertion, and a speaker may expect the response to address more than one interpretation.

6.2 Information State and Plan-Based Approaches

Plan-based theories of communicative action and dialogue (Traum 1996) assume that the speaker's speech acts are part of a plan, and that the listener's job is to uncover and respond appropriately to the underlying plan, rather than just to the utterance. Within plan-based approaches, one approach to dialogue management is the Information State (IS) approach (Larsson and Traum 2000). Here the conversation is centered on the notion of information state, which comprises the topics under discussion and the common ground in the conversation, and is continually queried and updated by rules triggered by the participants' dialogue moves. The IS theory has been applied to a range of closed-domain dialogue systems, such as travel information and route planning (Bos et al. 2003).

6.3 Discussion

Although it provides a powerful formalism, the IS infrastructure was too complex for our Interactive QA application. We believe that the IS approach is primarily suited to applications requiring a planning component, such as closed-domain dialogue systems, and to a lesser extent to an open-domain QA dialogue system. Also, as pointed out in (Allen et al. 2000), there are a number of problems in using plan-based approaches in actual systems, including knowledge representation and engineering, computational complexity and noisy input. Moreover, the Interactive QA task is an information-seeking one where transactions are generally well-structured and not too complex to detect (see also (Jönsson 1993)); hence, the limited flexibility of pattern-based dialogue models does not appear to greatly impact the type of dialogue we are addressing. The ideal dialogue management module for Interactive QA seems to lie somewhere in between the FS and IS models. This is what we propose below.

7 Chatbot-based Interactive Question Answering

As an alternative to the FS and IS models, we studied conversational agents based on AIML (Artificial Intelligence Markup Language). AIML was designed for the creation of conversational robots ("chatbots") such as ALICE (http://www.alicebot.org/). It is based on pattern matching, which consists in matching the last user utterance against a range of dialogue patterns known to the system and producing a coherent answer following a range of "template" responses associated with such patterns. Pattern/template pairs form "categories"; an example is a greeting category matching the pattern WHO ARE YOU, sketched below. Designed for chatting, chatbot dialogue appears more natural than in FS and IS systems. Moreover, since chatbots support a limited notion of context, they offer the means to handle follow-up recognition and other dialogue phenomena not easily covered using standard FS models.
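For reference, an AIML category of the kind mentioned above takes the following form; the pattern is the one named in the text, while the template wording is only an illustrative guess of ours and not necessarily YourQA's actual response.

```xml
<category>
  <!-- Matches the user utterance "who are you" (AIML patterns are case-insensitive). -->
  <pattern>WHO ARE YOU</pattern>
  <!-- Illustrative template text: a QA-oriented self-introduction. -->
  <template>I am YourQA, a question answering system. Ask me a question and I will look for answers on the Web.</template>
</category>
```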
Chatbot dialogue seems particularly well suited to handle the dialogue phenomena introduced in Section 4.1; the way in which such phenomena can be handled by a chatbot dialogue management model is discussed in detail below:

• mixed initiative: as mentioned earlier, the system must be able to take control in order to confirm given information, clarify the situation, or constrain user responses. In the designed dialogue move set, the ground move is used to confirm that the system has correctly interpreted elliptic or anaphoric requests, while the sysReqClarif move is used to verify that the current user's utterance is an information request in ambiguous cases (see Section 9). The patterns used by the system are oriented towards QA conversation so that the user is encouraged to formulate information requests rather than engage in smalltalk: for instance, the pattern HELLO * triggers a greeting template that invites the user to ask a question. On the other hand, the user may take the initiative for most of the dialogue, for instance by ignoring the system's requests for feedback and directly formulating a follow-up question (e.g. User: "What is a thermometer?", System: "The answers are . . . Are you happy with these answers?", User: "How does it measure the temperature?"), triggering a new ask/answer adjacency pair with a new conversation focus. Moreover, the user can formulate a request for clarification at any time during the interaction.

• over-informativeness: providing more information than required is useful both from the system's viewpoint and from the user's viewpoint: this usually enables dialogue to be more pleasant, as there is no need to ask for all desired pieces of information. In the current approach, the user can respond to the system by providing more than a simple acknowledgement. For instance, the following exchange is possible: User: "How does it measure the temperature?", System: "Do you mean how does a thermometer measure the temperature?", User: "No, how does a candy thermometer measure the temperature?".

• contextual interpretation: contextual interpretation of the user's utterances is handled by a clarification resolution module designed to take care of ellipsis and anaphoric references, as described in Section 5.

• error recovery: the management of misunderstandings is possible thanks to the usrReqClarif and sysReqClarif moves. The sysReqClarif move is fired when the current user utterance is not recognized as a question according to the set of question patterns known to the system. For example, the pattern I NEED * (e.g. "I need information about Shakespeare") would trigger a template asking whether the user's utterance is intended as a question (a sketch of such a category is given at the end of this section). If the user confirms that his/her utterance is a question, the system will proceed to clarify it and answer it; otherwise, it will acknowledge the utterance. Symmetrically, the user can enter a request for clarification of the system's latest utterance at any time should he/she find the latter unclear.

It must be pointed out that chatbots have rarely been used for task-oriented dialogue in the literature. An example is Ritel (Galibert et al. 2005), a spoken chat-based dialogue system integrated with an open-domain QA system. However, the project seems to be at an early stage and no thorough description is available of its dialogue management model.
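To make the error-recovery case above concrete, a category for the I NEED * pattern could look as follows; again, the template text is our own illustration of the confirmation behaviour described in the text, and the <star/> element simply echoes the text matched by the wildcard.

```xml
<category>
  <pattern>I NEED *</pattern>
  <!-- Illustrative sysReqClarif-style template: check whether the utterance is meant as an information request. -->
  <template>Do you want to ask me a question about <star/>? If so, please reformulate it as a question.</template>
</category>
```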
8 A Wizard-of-Oz Experiment for the Dialogue Component

To assess the feasibility of chatbot-based QA dialogue, we conducted an exploratory Wizard-of-Oz (WOz) experiment, a procedure usually deployed for natural language systems to obtain initial data when a full-fledged prototype is not yet available (Dahlbaeck et al. 1993; Bertomeu et al. 2006). A human operator (or "Wizard") emulates the behavior of the computer system by carrying on a conversation with the user, while the latter believes they are interacting with a fully automated prototype.

Design – We designed six tasks, to be issued in pairs to six or more subjects so that each would be performed by at least two different users. The tasks reflected the intended typical usage of the system, e.g.: "Find out who painted Guernica and ask the system for more information about the artist", "Find out when Jane Austen was born", "Ask about the price of the iPod Shuffle and then about the PowerBook G4". Users were invited to test the supposedly completed prototype by interacting with an instant messaging platform, which they were told was the system interface. Since our hypothesis was that a conversational agent is sufficient to handle question answering, a set of AIML categories was created to represent the range of utterances and conversational situations handled by a chatbot. The role of the Wizard was to choose the appropriate category and utterance within the available set and type it into the chat interface; if none of these appeared appropriate to handle the situation at hand, he would create one to keep the conversation alive. The Wizard would ask if the user had any follow-up questions after each answer (e.g. "Can I help you further?").

To collect user feedback, we used two sources: chat logs and a post-hoc questionnaire. Chat logs provide objective information such as the average duration of the dialogues, the situations that fell outside the assumed coverage of the chatbot interface, how frequent the requests for repetition were, etc. The questionnaire, submitted to the user immediately after the WOz experiment, enquires about the user's experience. Inspired by the WOz experiment in (Munteanu and Boldea 2000), it consists of the questions numbered Q1 to Q6 in Table 3. Questions Q1 and Q2 assess the performance of the system and were rated on a scale from 1="Not at all" to 5="Yes, Absolutely". Questions Q3 and Q4 focus on interaction difficulties, especially relating to the system's requests to reformulate the user's question. Questions Q5 and Q6 relate to the overall satisfaction of the user. The questionnaire also contained a text area for optional comments.

8.1 Results

The WOz experiment was run over one week and involved one Wizard and seven users. These were three women and four men of different ages who came from different backgrounds and occupations and were regular users of search engines. The users interacted with the Wizard via a popular, free chat interface which all of them had used before. All but one believed that the actual system's output was plugged into the interface. The average dialogue duration was 11 minutes, with a maximum of 15 (2 cases) and a minimum of 5 (1 case).

From the chat logs, we observed that users preferred not to "play" with the system's chat abilities but rather to issue information-seeking questions. Users often asked two things at the same time (e.g.
"Who was Jane Austen and when was she born?"): to account for this in the final prototype, we decided to handle double questions, as described in Section 9. The sysReqClarif dialogue move proved very useful, with "system" clarification requests such as "Can you please reformulate your question?". Users seemed to enjoy "testing" the system and accepted the invitation to produce a follow-up question ("Can I help you further?") around 50% of the time.

The values obtained for the user satisfaction questionnaire show that users were generally satisfied with the system's performance (see Table 3, column WOz). None of them had difficulties in reformulating their questions when this was requested, and for the remaining questions satisfaction levels were high. Users seemed to receive system grounding and clarification requests well, e.g. "... on references to "him/it", pretty natural clarifying questions were asked."

     Question                                                        WOz      Init.    Stand.   Inter.
Q1   Did you get all the information you wanted using the system?    4.3±.5   3.8±.8   4.1±1    4.3±.7
Q2   Do you think the system understood what you asked?              4        3.8±.4   3.4±1.3  3.8±1.1
Q3   How easy was it to obtain the information you wanted?           4±.8     3.7±.8   3.9±1.1  3.7±1
Q4   Was it easy to reformulate your questions when you were
     invited to?                                                     3.8±.5   3.8±.8   N/A      3.9±.6
Q5   Overall, are you satisfied with the system?                     4.5±.5   4.3±.5   3.7±1.2  3.8±1.2
Q6   Do you think you would use this system again?                   4.1±.6   4±.9     3.3±1.6  3.1±1.4
Q7   Was the pace of interaction with the system appropriate?        N/A      3.5±.5   3.2±1.2  3.3±1.2
Q8   How often was the system slow in replying?
     (1="always" to 5="never")                                       N/A      2.3±1.2  2.7±1.1  2.5±1.2

Table 3. Questionnaire results for the Wizard-of-Oz experiment (WOz), the initial experiment (Init.) and the final experiment (standard and interactive version). Result format: average ± standard deviation.

9 Resulting Dialogue Component Architecture

The dialogue manager and interface were implemented based on the scenario in Section 4 and the outcome of the Wizard-of-Oz experiment.

9.1 Dialogue Manager

Chatbot dialogue follows a pattern-matching approach and is therefore not constrained by a notion of "state". When a user utterance is issued, the chatbot's strategy is to look for a pattern matching it and fire the corresponding template response. Our main focus of attention in terms of dialogue manager design was therefore directed to the dialogue tasks invoking external resources, such as handling double and follow-up questions, and to tasks involving the QA component.

Handling double questions – As soon as the dialogue manager identifies a user utterance as a question (using the question recognition categories), it tests whether it is a double question. Since the core QA component in YourQA is not able to handle multiple questions, these need to be broken into simple questions. For this, the system uses the OpenNLP chunker (http://opennlp.sourceforge.net/) to look for an occurrence of "and" which does not appear within a noun phrase. For instance, while in the sentence "When was Barnes and Noble founded?" the full noun phrase Barnes and Noble is recognized as a chunk, in "When and where was Jane Austen born?" the conjunction "and" forms a standalone chunk. If a standalone "and" is found, the system splits the double question in order to obtain the single questions composing it, then proposes to the user to begin answering the one containing more words, as this is more likely to be fully specified (a sketch of this test is given below).
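The following Java sketch illustrates the double-question test just described. Tokens and IOB chunk tags (e.g. B-NP, I-NP, O) are assumed to come from a shallow parser such as the OpenNLP chunker; the splitting heuristic, the tag values shown in the example and all names are our own illustration rather than YourQA's code.

```java
import java.util.Arrays;

/** Illustrative sketch of splitting a double question on a standalone "and". */
public class DoubleQuestionSplitter {

    /** Returns the two sub-questions if a standalone "and" is found, or null otherwise. */
    static String[] split(String[] tokens, String[] chunkTags) {
        for (int i = 0; i < tokens.length; i++) {
            // An "and" outside any noun-phrase chunk signals a coordination of questions,
            // as in "When and where was Jane Austen born?"; "Barnes and Noble" stays intact
            // because its "and" is tagged as part of the NP chunk.
            if (tokens[i].equalsIgnoreCase("and") && !chunkTags[i].endsWith("NP")) {
                String left  = String.join(" ", Arrays.copyOfRange(tokens, 0, i));
                String right = String.join(" ", Arrays.copyOfRange(tokens, i + 1, tokens.length));
                return new String[] { left, right };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative tokens and chunk tags; the longer sub-question would be answered first.
        String[] tokens = {"When", "and", "where", "was", "Jane", "Austen", "born", "?"};
        String[] tags   = {"B-ADVP", "O", "B-ADVP", "B-VP", "B-NP", "I-NP", "I-VP", "O"};
        System.out.println(Arrays.toString(split(tokens, tags)));
    }
}
```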
Handling follow-up questions – In handling QA dialogue, it is vital to apply an effective algorithm for the recognition of follow-up requests (De Boni and Manandhar 2005; Yang et al. 2006). Hence, the next task accomplished by the DM is the detection of follow-up questions. The types of follow-up questions which the system is able to handle are elliptic questions, questions containing third person pronoun/possessive adjective anaphora, and questions containing noun phrase (NP) anaphora (e.g. "the river" instead of "the world's longest river").

For the detection of follow-up questions, the algorithm in (De Boni and Manandhar 2005) is used, which achieved an 81% accuracy on TREC-10 data. The algorithm is based on the following features: presence of pronouns, absence of verbs, word repetitions, and similarity between the current and the 8 preceding questions (at the moment, the condition on semantic distance is not included for the sake of speed). If no follow-up is detected in the question q, it is submitted to the QA component; otherwise the following reference resolution strategy is applied:

1. If q is elliptic, its keywords are completed with the keywords extracted by the QA component from the previous question for which there exists an answer. The completed query is submitted to the QA component.
2. If q contains pronoun/adjective anaphora, the chunker is used to find the first compatible antecedent in the previous questions in order of recency. The antecedent must be an NP compatible in number with the referent.
3. If q contains NP anaphora, the first NP in the stack of preceding questions which contains all of the words in the referent is used in place of the latter.

In cases 2 and 3, when no antecedent can be found, a clarification request is issued by the system until a resolved query can be submitted to the QA component. Finally, when the QA process is terminated, a message directing the user to the HTML answer page is returned and the follow-up proposal is issued (see Figure 3).

9.2 Implementation

Following the typical design of an AIML-based conversational agent, we created a set of categories to fit the dialogue scenarios elaborated during dialogue design (Section 5) and enriched with the WOz experience (Section 8). We used the Java-based AIML interpreter Chatterbean and extended its original set of AIML tags (e.g. ,