title: Knowledge Graph Question Answering Leaderboard: A Community Resource to Prevent a Replication Crisis
authors: Perevalov, Aleksandr; Yan, Xi; Kovriguina, Liubov; Jiang, Longquan; Both, Andreas; Usbeck, Ricardo
date: 2022-01-20

Abstract. Data-driven systems need to be evaluated to establish trust in the scientific approach and its applicability. In particular, this is true for Knowledge Graph (KG) Question Answering (QA), where complex data structures are made accessible via natural-language interfaces. Evaluating the capabilities of these systems has been a driver for the community for more than ten years, during which different KGQA benchmark datasets have been established. However, comparing different approaches is cumbersome. The lack of existing and curated leaderboards leads to a missing global view of the research field and could inject mistrust into the results. In particular, the latest and most-used datasets in the KGQA community, LC-QuAD and QALD, fail to provide central and up-to-date points of trust. In this paper, we survey and analyze a wide range of evaluation results, covering 100 publications and 98 systems from the last decade. We provide a new central and open leaderboard for any KGQA benchmark dataset as a focal point for the community: https://kgqa.github.io/leaderboard. Our analysis highlights existing problems during the evaluation of KGQA systems. Thus, we point to possible improvements for future evaluations.

Question Answering (QA) is a rapidly growing field in research and industry. QA systems already deliver their potential in many real-world applications, e.g., (Mutabazi et al., 2021; Both et al., 2021; Diefenbach et al., 2021). These systems can be divided into two main paradigms (Jurafsky and Martin, 2018): IR-based QA, which works over unstructured data and is closely related to Machine Reading Comprehension and the Retriever-Reader architecture, and Knowledge-Based QA (KBQA), which works over structured data such as relational tables, specific data APIs, and knowledge graphs (KGs). In this regard, Question Answering over Knowledge Graphs (KGQA) is of particular interest to this work.

Many different benchmark datasets exist for evaluating KGQA systems. These datasets differ in the underlying knowledge graph (e.g., DBpedia (Auer et al., 2007) or Wikidata (Erxleben et al., 2014)), order of magnitude in size (Fu et al., 2020), question complexity (Saleem et al., 2017), multilingual support (Chandra et al., 2021), and many more dimensions. In the KGQA research community, several datasets have become a de facto standard for the evaluation of such systems, in particular the QALD and LC-QuAD (Dubey et al., 2019) benchmark dataset series. As more and more researchers introduce new evaluation results using these well-known datasets, it becomes increasingly challenging to follow the up-to-date state of the art in the KGQA field. Related research fields such as IR-based QA and the Knowledge Graph community already have well-established and maintained leaderboards of the best solutions, e.g., SQuAD (Rajpurkar et al., 2016) and OGB (Hu et al., 2020). However, this is not the case for KGQA. This lack, in particular of curated leaderboards, leads to a missing global view of the research field. In turn, it could inject mistrust into result tables within publications when they are incomplete or lack a comparison to certain systems, as often required by reviewers.
In particular, the latest and most-used datasets in the Semantic Web community, LC-QuAD and QALD, miss providing central points of trust such as leaderboards. In this paper, we analyze the publications on KGQA evaluations of the last decade. We evaluated 100 papers and 98 systems on 4 datasets, focusing on the LC-QuAD and QALD series. Our results show that the evaluation numbers are often consistent. Existing deviations stem from minor differences in the data (e.g., gAnswer (Hu et al., 2018) on QALD-9) that seem to be rounding errors or inconclusive behavior. Finally, we discuss the consequences of our findings and point to possible improvements for future evaluations. Our contributions are as follows:

• We present the first extensive evaluation analysis of the state of the research in KGQA.
• We provide a new central and open leaderboard for any KGQA benchmark dataset as a focal point for the community: https://github.com/KGQA/leaderboard.
• We provide an up-to-date overview of all available demos or Web services for KGQA at the point of publication.

These contributions should help the scientific community to foster replication and cross-evaluation in the future. In the following, we analyze related studies and approaches in Section 2. Afterward, we introduce the analyzed datasets and systems in Section 3. In Section 4, we describe our extensive state-of-the-art data and delve into its analysis. Next, we discuss possible interpretations and paths forward and end with a summary and outlook in Section 6.

There are multiple approaches to tracking the progress of any research field. In machine learning and NLP, these approaches can be subdivided into benchmarking frameworks and manual or semi-automatic reporting platforms. Today, benchmarking frameworks need to limit their scope to a subset of tasks to cover the necessary metrics and experiment types out-of-the-box. A general benchmarking framework that works without writing code does not exist. For KGQA, different benchmarking frameworks have been proposed. For example, GERBIL QA (Usbeck et al., 2019) can benchmark KGQA systems via their Web APIs in a FAIR way (Wilkinson et al., 2016). It also has an integrated leaderboard that displays a summary of all experiments run via the platform. At the same time, this is its biggest downside: only experiments run via the platform are integrated. Thus, a realistic view depends on the adoption of the platform. This adoption seems to be lacking due to missing developer resources to continuously update the available systems and datasets.
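To make the idea of benchmarking via Web APIs more concrete, the following is a minimal sketch of how a KGQA system could be wrapped in an HTTP endpoint so that an online platform or another researcher can query it. The endpoint path (/ask), parameter names, and response layout are illustrative assumptions that only loosely mirror QALD-style JSON; they are not the exact interface GERBIL QA expects.

```python
# A minimal, hypothetical HTTP wrapper around a KGQA system (illustrative only;
# this is NOT the exact interface expected by GERBIL QA or any other platform).
from flask import Flask, request, jsonify

app = Flask(__name__)

def answer_question(question: str, lang: str = "en") -> list:
    """Placeholder for the actual KGQA pipeline; returns answer URIs or literals."""
    return ["http://dbpedia.org/resource/Berlin"]  # dummy answer

@app.route("/ask", methods=["POST"])
def ask():
    payload = request.get_json(force=True)
    question = payload.get("question", "")
    lang = payload.get("lang", "en")
    answers = answer_question(question, lang)
    # Response loosely modeled on QALD-style JSON (question text plus result bindings).
    return jsonify({
        "questions": [{
            "question": [{"language": lang, "string": question}],
            "answers": [{
                "head": {"vars": ["result"]},
                "results": {"bindings": [{"result": {"value": a}} for a in answers]},
            }],
        }]
    })

if __name__ == "__main__":
    app.run(port=8080)
```

Keeping such an endpoint available over time is, of course, exactly the maintenance burden that many of the systems surveyed in this paper did not sustain.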
A different direction is followed by tools like https://github.com/AKSW/irbench or QALD-Gen (Singh et al., 2019), which provide command-line tools for benchmarking any KGQA system. However, the offline nature of these tools leads to offline results, i.e., the results might be used in papers but do not contribute to a trustworthy overview of the field of research.

Recently, reporting platforms have gained popularity. They allow quick access to results, but they are either curated manually by a community or updated semi-automatically. https://nlpprogress.com/ is a well-known community website launched by Sebastian Ruder. Regarding KGQA, the website's most recent information is three years old, possibly reflecting the disinterest of the NLP community in semantic tasks. https://paperswithcode.com/ is another reporting platform, run by Facebook AI Research, that allows openly editing papers, code, datasets, methods, and evaluation tables. While this is of tremendous help for reproducibility, its results for KGQA are sparse. There is only one result each for LC-QuAD 2 and QALD-9, and both concern relation extraction rather than Question Answering. A promising semi-automatic approach is the Open Research Knowledge Graph (ORKG) (Auer et al., 2020). It allows the community to persistently annotate papers via smart tools with meta and evaluation data, e.g., https://www.orkg.org/orkg/paper/R6386/R6393 for QALD-6 data. However, the current adoption in the community does not go beyond prototypes provided by the ORKG team. A change might come with the European Open Science Cloud (EOSC) and the Nationale Forschungsdateninfrastruktur für Data Science und Künstliche Intelligenz (German National Research Data Infrastructure for Data Science and AI). These publicly funded initiatives strive to foster ecosystems like ORKG in the long term. Finally, surveys can be viewed as reporting platforms. Different surveys have been published in the past decade focusing on a variety of topics, such as challenges in general KGQA (Höffner et al., 2017), challenges in complex KGQA (Fu et al., 2020), core techniques of KGQA (Diefenbach et al., 2018), or neural network-based KGQA systems (Chakraborty et al., 2021). However, these are outdated as soon as they are published or focus only on a narrow subtopic. Thus, there is a need for a central, dense, and open reporting platform focusing on KGQA that provides trustworthy insights.

We surveyed 14 DBpedia-based KGQA benchmark datasets that were published in the last decade (cf. Section 3.1). In this paper, we consider 4 KGQA datasets for an in-depth analysis. Requirements for selecting a dataset include: usage for the evaluation of different systems, availability in English, reliance on DBpedia (primarily) or Wikidata (knowledge bases that are still maintained), and more than 5 citations. Our goal was to make sure that the chosen QA datasets are up-to-date, close to a real-world setting, can be evaluated manually, and are widely studied. Note that we use the terms benchmark dataset and dataset synonymously.

We took 98 QA systems into consideration. They were collected manually from articles that include evaluation results on the considered benchmark datasets. The article search was conducted in two ways. First, we retrieved articles using a keyword search on Google Scholar. Specifically, the selection criteria were: published after 2019, and a title satisfying ['question answering' AND ('semantic web' OR 'data web' OR 'web of data')]. The second method was to extract all articles that cite the benchmark dataset on Google Scholar, either as a direct citation or as a URL to the location of the dataset. We removed duplicates and manually extracted the QA systems evaluated or referred to in the articles. This resulted in 100 analyzed papers. Note that some systems are evaluated on a subset of a dataset or on a dataset of which the benchmark dataset is just a subset. We indicated such differences in the leaderboard accordingly.
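To illustrate the first selection step, the sketch below applies the title criterion programmatically; the helper name and the candidate titles are hypothetical, and the year filter (published after 2019) is noted only as a comment.

```python
# Hypothetical helper illustrating the title-based selection criterion
# ['question answering' AND ('semantic web' OR 'data web' OR 'web of data')];
# the year filter (publication year > 2019) would be applied separately.
def title_matches(title: str) -> bool:
    t = title.lower()
    return "question answering" in t and any(
        phrase in t for phrase in ("semantic web", "data web", "web of data")
    )

# Made-up candidate titles:
candidates = [
    "A Question Answering System for the Semantic Web",            # matches
    "Neural Machine Translation from Natural Language to SPARQL",  # no match
]
print([c for c in candidates if title_matches(c)])
```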
The first dataset is QALD, a multilingual benchmark dataset and challenge series. QALD-8 contains 219 training question-answer pairs and 42 test data points. It was the first edition to use GERBIL QA as a benchmarking platform (Usbeck et al., 2019). The newest instance, QALD-9 (Usbeck et al., 2018), contains 558 questions grounded in the DBpedia knowledge base (https://www.dbpedia.org/), where for each question the following is given: a textual representation in multiple languages, the corresponding SPARQL query (over DBpedia), the answer entity URI, and the answer type. The QALD series has a growing number of questions per edition and thus grows continuously in its expressiveness. The dataset has become a staple for many research studies in QA (e.g., (Höffner et al., 2017; Diefenbach et al., 2018)).

The second and third datasets are LC-QuAD 1.0 and LC-QuAD 2.0. LC-QuAD 1.0 provides 5,000 complex questions over DBpedia, while LC-QuAD 2.0 (Dubey et al., 2019) contains 30,000 questions over Wikidata and DBpedia. The remaining surveyed datasets do not fulfill our current criteria and thus are not part of the initial version of the KGQA leaderboard. However, we encourage the community to help us update the leaderboard also for these datasets to prevent a replication crisis before it starts.

While there are decentralized collections of KGQA systems that are available as code or Web services, e.g., https://github.com/semanticsystems/NLIWOD/tree/master/qa.systems, there is no up-to-date and systematically curated collection as of now. Our analysis shows that 24 of the systems provide a URL to a repository and 16 even to an online demo or Web API. However, after inspection, only 8 demos or Web APIs are still functional. This is a first hint towards an upcoming replication crisis. For a full list of systems, their descriptions, and pointers to their Web services and demos, see https://github.com/KGQA/leaderboard/blob/gh-pages/systems.md#Systems.

We evaluated 100 papers and 98 systems focusing on 4 datasets, namely LC-QuAD version 1 and version 2 as well as QALD-8 and QALD-9 (all datasets released in 2017 or later). Figure 1 comprehensively summarizes the considered results of the leaderboard. Based on the results, it became clear that the evaluation values across the publications are often consistent. The results contain multiple values for some of the system-dataset combinations (e.g., WDAqua-core0 over LC-QuAD 1.0), reported by different publications. Figure 2 shows the evaluation values for a particular benchmark dataset grouped by the KGQA systems. For system-dataset combinations with multiple values, we calculated the standard deviation (std.). The std. values for systems such as QAKiS, TeBaQA, Elon, QASystem, gAnswer, and QAmp are not higher than 1%. This non-zero std. is probably caused by rounding errors. The only outliers in the evaluation values were observed for the WDAqua-core systems. For example, the paper (Zheng and Zhang, 2019) reports an F1 score of 38.7% for WDAqua-core0 over QALD-8, taking the result from the original publication of WDAqua-core0. Another paper reports an F1 score of only 33.0% for the same system-dataset combination; its authors computed this result themselves. The std. of both WDAqua-core versions reaches 9% on the LC-QuAD 1.0 dataset and 3% on QALD-8. Note that the high std. values do not depend on the datasets. Hence, the papers reporting significantly different results for WDAqua-core require further investigation. One explanation is that WDAqua-core provides a publicly accessible demonstrator and API, which enables researchers to re-run the evaluation. This naturally implies possible differences in the evaluation results. However, there is no such systematic tendency for the other results, as the majority of them were probably not reproduced but cited from the original publication.
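The consistency check described above boils down to a simple aggregation. The following sketch, using made-up system names and F1 values, shows how the standard deviation per system-dataset combination can be computed; the real numbers are in the leaderboard tables.

```python
import pandas as pd

# Made-up reported F1 values; the real numbers live in the leaderboard tables.
reported = pd.DataFrame([
    {"system": "System A", "dataset": "LC-QuAD 1.0", "f1": 0.30, "source": "paper 1"},
    {"system": "System A", "dataset": "LC-QuAD 1.0", "f1": 0.46, "source": "paper 2"},
    {"system": "System B", "dataset": "QALD-8",      "f1": 0.33, "source": "paper 3"},
    {"system": "System B", "dataset": "QALD-8",      "f1": 0.33, "source": "paper 4"},
])

# Standard deviation of the reported F1 scores per system-dataset combination;
# values near zero indicate consistent (or simply copied) numbers.
std_per_combination = (
    reported.groupby(["system", "dataset"])["f1"]
    .std(ddof=0)
    .reset_index(name="f1_std")
)
print(std_per_combination)
```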
Despite the consistency of the results, the F1 scores of the systems have a wide range of variance for a particular dataset (cf. Figure 3). Surprisingly, the number of papers from arXiv (preprints) in our leaderboard is higher than the number of peer-reviewed papers (54% vs. 46%). We observed that the peer-reviewed papers report significantly higher results w.r.t. the F1 score: 30.2% for preprints vs. 39.5% for peer-reviewed papers. The logical reason for this is that peer-reviewed papers typically report state-of-the-art results, while preprints might contain preliminary work.

Given the considered results, we observed that the authors of 72% of the papers did not include all the evaluation results from other publications in their comparison that were already available at that point in time. To obtain this number, the set of systems a publication compares against on a particular dataset was compared to the set of systems whose results had been published at least one year earlier. For example, a publication released in 2021 does not consider the results of the QAmp system (Vakulenko et al., 2019), which was published in 2019.
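The omission analysis behind the 72% figure can be sketched as a set comparison. The snippet below is a minimal sketch under the assumption of a hypothetical (system, dataset) to year index; all names and years are made up.

```python
# Hypothetical sketch of the omission check: compare the systems a paper cites
# in its evaluation table against the systems whose results on the same dataset
# were published at least one year earlier. All entries below are made up.
published = {
    # (system, dataset) -> year in which the result was first reported
    ("System A", "LC-QuAD 1.0"): 2018,
    ("System B", "LC-QuAD 1.0"): 2019,
    ("System C", "LC-QuAD 1.0"): 2021,
}

def omitted_baselines(compared_systems, dataset, paper_year):
    """Previously published systems on `dataset` missing from the paper's comparison."""
    available = {
        system
        for (system, ds), year in published.items()
        if ds == dataset and year <= paper_year - 1
    }
    return available - set(compared_systems)

# Example: a 2021 paper on LC-QuAD 1.0 that only compares against System A.
print(omitted_baselines({"System A"}, "LC-QuAD 1.0", 2021))  # {'System B'}
```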
The trustworthiness of scientific results strongly depends on their comparability and replicability. In the field of KGQA, one could assume that the existence of a large and rising number of QA datasets ensures comparability. Indeed, our analysis shows that the reported evaluations are overwhelmingly coherent. However, we observed several issues.

First, the main reason why most numbers are identical is that people refer to the results given in an original paper and its evaluation section. We could not find evidence that researchers actively tried to replicate results. A reason could be that only 16 percent of the systems are available as source code (or as a Web service/demo). However, even where an online demo exists (e.g., Diefenbach et al., 2020), the current state of the KGQA system does not seem to be re-evaluated. Second, our analysis indicates that researchers might have overlooked (best case) or omitted (worst case) relevant results that speak against their claims. For example, for (Wu et al., 2021) there are similar earlier works (Maheshwari et al., 2019; To and Reformat, 2020) that evaluated the same datasets and reported similar or even better results. However, we are well aware that researchers struggle to establish an up-to-date overview of current research due to the time-consuming nature of the process in the absence of a central overview of KGQA systems. Third, we see a strong need for improved evaluation methods. This demand can be covered by online evaluation methods, e.g., using platforms like GERBIL (Usbeck et al., 2019). However, we also observed a decreasing number of working online demos, suggesting that a new form of platform to which models themselves can be uploaded could be a future direction. Fourth, while developing new platforms and systems, we should also consider the rising critique of leaderboards regarding their utility for the NLP community at large (Ethayarajh and Jurafsky, 2020). Thus, we concur that evaluation protocols need to be published to foster transparency on leaderboards. Finally, the lack of open-source implementations could be a starting point for a replication crisis. While there is no replication crisis in the field of KGQA as of now, the community needs to leverage novel initiatives such as the European Open Science Cloud or the German National Research Data Infrastructure for Data Science and AI. Otherwise, models and source code might be lost, or results will become incomparable in the long term.

In this paper, we presented a novel community resource to track advances in the field of KGQA research. We foresee the need to maintain a KGQA-focused platform as long as approaches such as ORKG (Auer et al., 2020) are not widely used or developed far enough. Of course, we could have simply added our findings to existing reporting platforms. However, we believe that this publication provides a more valuable basis for discussion and reaches a wider audience than a silent upload. Additionally, since more than three years have passed since the QALD-9 evaluation campaign, we intend to establish a central leaderboard to keep the community on the same page. In the future, we are looking into automatic ways to synchronize various reporting platforms with the KGQA leaderboard. We plan to extend the evaluation of QA systems such that replicable evaluations and data collections are possible. Additionally, improved metrics (e.g., (Siciliani et al., 2021)) should be evaluated over models, source code, or via platforms to allow in-depth analyses of the capabilities of QA systems. We are aware of research on other KGQA datasets grounded in Wikidata, Freebase, WikiMovies, and EventKG and want to encourage the community to update the KGQA leaderboard with the corresponding numbers.

References

DBpedia: A nucleus for a web of open data
Improving access to scientific literature with knowledge graphs
A question answering system for retrieving German COVID-19 data driven and quality-controlled by semantic technology
Introduction to neural network-based question answering over knowledge graphs
WDAqua-core0: A question answering component for the research community
Core techniques of question answering systems over knowledge bases: A survey
Towards a question answering system over the semantic web
Wikibase as an infrastructure for knowledge graphs: The EU knowledge graph
LC-QuAD 2.0: A large dataset for complex question answering over Wikidata and DBpedia
Introducing Wikidata to the linked data web
Utility is in the eye of the user: A critique of NLP leaderboards
A survey on complex question answering over knowledge base: Recent advances and challenges
Survey on challenges of question answering in the semantic web
Answering natural language questions by subgraph matching over knowledge graphs (extended abstract)
Open Graph Benchmark: Datasets for machine learning on graphs
Speech and language processing (draft)
Learning to rank query graphs for complex question answering over knowledge graphs
A review on medical textual question answering systems based on deep learning approaches
CBench: Demonstrating comprehensive evaluation of question answering systems over knowledge graphs through deep analysis of benchmarks
CBench: Towards better evaluation of question answering over knowledge graphs
SQuAD: 100,000+ questions for machine comprehension of text
Question answering over linked data: What is difficult to answer? What affects the F scores?
MQALD: Evaluating the impact of modifiers in question answering over knowledge graphs
Why reinvent the wheel: Let's build question answering systems together
QALD-Gen: Towards microbenchmarking of question answering systems over knowledge graphs
LC-QuAD: A corpus for complex question answering over knowledge graphs
Joint proceedings of the 4th Natural Language Interfaces for the Web of Data workshop (NLIWOD-4) and 9th Question Answering over Linked Data challenge (QALD-9), co-located with the 17th International Semantic Web Conference
Message passing for complex question answering over knowledge graphs
The FAIR guiding principles for scientific data management and stewardship
Modeling global semantics for question answering over knowledge bases
Question answering over knowledge graphs via structural query patterns

Language Resource References

Farewell Freebase: Migrating the SimpleQuestions dataset to DBpedia
Constraint-based question answering with knowledge graph
Semantic parsing on Freebase from question-answer pairs
Large-scale simple question answering with memory networks
Large-scale semantic parsing via schema matching and lexicon extension
Multilingual compositional Wikidata questions
Question answering benchmarks for Wikidata
Beyond I.I.D.: Three levels of generalization for question answering on knowledge bases
TempQuestions: A benchmark for temporal question answering
Complex temporal question answering on knowledge graphs
FreebaseQA: A new factoid QA data set matching trivia-style question-answer pairs with Freebase
Measuring compositional generalization: A comprehensive method on realistic data
RuBQ: A Russian dataset for question answering over Wikidata
RuBQ 2.0: An innovated Russian question answering dataset
Question answering over temporal knowledge graphs
Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus
KQA Pro: A large diagnostic dataset for complex question answering over knowledge base
Event-QA: A dataset for event-centric question answering over knowledge graphs
On generating characteristic-rich question sets for QA evaluation
The web as a knowledge-base for answering complex questions
Benchmarking question answering systems
The value of semantic parse labeling for knowledge base question answering
Neural machine translating from natural language to SPARQL
Variational reasoning for question answering with knowledge graph
An interpretable reasoning network for multi-relation question answering
A Chinese multi-type complex questions answering dataset over Wikidata

Appendix

To ensure the replicability of KGQA systems and the trustworthiness of their evaluation results, we provide a leaderboard, available at https://kgqa.github.io/leaderboard/. It can be used to track progress and compare the capabilities of KGQA systems on the latest and most commonly used KGQA benchmark datasets. It includes the datasets, links, papers, and state-of-the-art (SOTA) results. At the time of writing, the leaderboard includes a total of 34 KGQA datasets across 5 knowledge graphs (i.e., DBpedia, Wikidata, Freebase, WikiMovies, and EventKG). As shown in Fig. 4, these KGQA datasets are grouped by their target KGs. Fig. 5 shows an example of the LC-QuAD v1.0 leaderboard. We will continuously add newly released datasets and their SOTA results, and we invite other researchers to contribute by adding new results based on these KGQA dataset overviews.
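As a sketch of how such leaderboard data can be consumed programmatically, the following hypothetical snippet selects the best reported F1 score per dataset from a small set of made-up entries; the field names are assumptions, not the leaderboard's actual schema.

```python
# Hypothetical leaderboard entries; field names and values are illustrative
# assumptions, not the leaderboard's actual schema.
entries = [
    {"dataset": "LC-QuAD 1.0", "system": "System A", "f1": 0.52, "paper": "https://example.org/a"},
    {"dataset": "LC-QuAD 1.0", "system": "System B", "f1": 0.47, "paper": "https://example.org/b"},
    {"dataset": "QALD-9",      "system": "System C", "f1": 0.43, "paper": "https://example.org/c"},
]

# Select the state-of-the-art (highest F1) entry per dataset.
sota = {}
for entry in entries:
    best = sota.get(entry["dataset"])
    if best is None or entry["f1"] > best["f1"]:
        sota[entry["dataset"]] = entry

for dataset, entry in sota.items():
    print(f"{dataset}: {entry['system']} (F1 = {entry['f1']:.2f})")
```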