Efficient Query Processing and Index Tuning using Proximity Scores Efficient Query Processing and Index Tuning using Proximity Scores Dissertation zur Erlangung des Grades des Doktors der Ingenieurwissenschaften der Naturwissenschaftlich-Technischen Fakultäten der Universität des Saarlandes Andreas Broschart Universität des Saarlandes Saarbrücken 2012 Dekan der Naturwissenschaftlich-Technischen Fakultät I Prof. Dr. Mark Groves Vorsitzender der Prüfungskommission Prof. Dr.-Ing. Thorsten Herfet Berichterstatter PD Dr.-Ing. Ralf Schenkel Berichterstatter Prof. Dr.-Ing. Gerhard Weikum Berichterstatter Prof. Torsten Suel, PhD Beisitzer Dr.-Ing. Klaus Berberich Tag des Promotionskolloquiums 09.10.2012 Acknowledgments I would like to to express my sincere gratitude to my supervisor, PD Dr.-Ing. Ralf Schenkel for guiding me from my Master’s to my PhD degree. He has always been a source of motivation and the door of his office has never been closed whenever I needed support. I would like to thank him for many interesting and fruitful discussions as well as for the scientific guidance he gave me. My special thanks go to Prof. Dr.-Ing. Gerhard Weikum for giving me the oppor- tunity to pursue my PhD studies in Saarbrücken, for his helpful comments, and for joining the reviewers board. I would like to also thank Prof. Torsten Suel, PhD for accepting my request to review my PhD thesis. Furthermore, I would like to thank Prof. Dr.-Ing. Thorsten Herfet for chairing the examination board and Dr.-Ing. Klaus Berberich for taking the minutes. Finally, I would like to thank my colleagues for the great atmosphere at the work place and many enjoyable moments. Eidesstattliche Versicherung Hiermit versichere ich an Eides statt, dass ich die vorliegende Arbeit selbständig und ohne Benutzung anderer als der angegebenen Hilfsmittel angefertigt habe. Die aus anderen Quellen oder indirekt übernommenen Daten und Konzepte sind unter Angabe der Quelle gekennzeichnet. Die Arbeit wurde bisher weder im In- noch im Ausland in gleicher oder ähnlicher Form in einem Verfahren zur Erlangung eines akademischen Grades vorgelegt. Saarbrücken, den 31.05.2012 (Unterschrift) Kurzfassung Angesichts wachsender Datenmengen stellt effiziente Anfrageverarbeitung, die gleich- zeitig Ergebnisqualität und Indexgröße berücksichtigt, zusehends eine Herausforderung für Suchmaschinen dar. Wir zeigen, wie man Proximityscores einsetzen kann, um An- fragen effektiv und effizient zu verarbeiten, wobei der Schwerpunkt auf eines der Ziele gelegt wird. Die Hauptbeiträge dieser Arbeit gliedern sich wie folgt: • Wir präsentieren eine umfassende vergleichende Analyse von Proximityscoremo- dellen sowie eine gründliche Analyse des Potenzials von Phrasen und passen ein führendes Proximityscoremodell für die Verwendung mit XML-Daten an. • Wir diskutieren für die präsentierten Proximityscoremodelle die Eignung zur Top- k-Anfrageverarbeitung und präsentieren einen neuen Index, der einen Inhalts- und Proximityscore kombiniert, um Top-k-Anfrageverarbeitung zu beschleunigen und die Güte zu verbessern. • Wir präsentieren ein neues, verteiltes Indextuningpaket für Term- und Termpaar- listen, das Tuningparameter mittels wohldefinierter Optimierungskriterien unter Größenbeschränkung bestimmt. Indizes können auf Effizienz oder Güte optimiert werden und sind bei hoher Güte performant. • Wir zeigen, dass gekürzte Indizes mit einem Merge Join-Ansatz Top-k Algorith- men mit ungekürzten Indizes bei hoher Güte schlagen. 
• Außerdem präsentieren wir eine hybride Indexstruktur, die Cold Cache-Effizienz verbessert. I Abstract In the presence of growing data, the need for efficient query processing under result quality and index size control becomes more and more a challenge to search engines. We show how to use proximity scores to make query processing effective and efficient with focus on either of the optimization goals. More precisely, we make the following contributions: • We present a comprehensive comparative analysis of proximity score models and a rigorous analysis of the potential of phrases and adapt a leading proximity score model for XML data. • We discuss the feasibility of all presented proximity score models for top-k query processing and present a novel index combining a content and proximity score that helps to accelerate top-k query processing and improves result quality. • We present a novel, distributed index tuning framework for term and term pair index lists that optimizes pruning parameters by means of well-defined optimiza- tion criteria under disk space constraints. Indexes can be tuned with emphasis on efficiency or effectiveness: the resulting indexes yield fast processing at high result quality. • We show that pruned index lists processed with a merge join outperform top-k query processing with unpruned lists at a high result quality. • Moreover, we present a hybrid index structure for improved cold cache run times. III Zusammenfassung Auf der Suche nach Information erwarten Leute qualitativ hochwertige Ergebnisse bei schnellen Antwortzeiten, zwei gegensätzliche Ziele. Angesichts ständig wachsen- der Datenmengen wird dabei effiziente Anfrageverarbeitung unter Berücksichtigung von Ergebnisgüte und Indexgröße zusehends eine Herausforderung für Suchmaschinen. Diese Arbeit beschäftigt sich mit dem wichtigen Problem, wie man Proximityscore- modelle einsetzen kann, um Anfrageverarbeitung gleichzeitig effizient und effektiv zu machen. Wir stellen neuartige Indexstrukturen vor, die Top-k-Anfrageverarbeitung er- lauben und auf eine Reihe von Proximityscoremodellen anwendbar sind. Ein neuartiger Algorithmus zur Indexoptimierung kann für Ergebnisgüte oder Effizienz unter Index- größenkontrolle optimiert werden. Der erste Teil dieser Arbeit widmet sich hauptsächlich Effektivitätsaspekten von Proximityscoremodellen. In einer groß angelegten Studie existierender Proximityscoremodelle klassifizieren wir diese in vier Kategorien: 1) Linearkombinationen eines Inhaltsscoremodells und eines Proximityscoremodells, 2) integrierte Scoremodelle, 3) Sprachmodelle mit Wort- abstandskomponenten und 4) Modelle, die Wortabstandsinformation verwenden und Scoremodelle mit maschinellen Lerntechniken erlernen. Wir präsentieren eine verglei- chende Effektivitätsanalyse für eine beträchtliche Menge von Proximityscoremodellen, die wir in einem gemeinsamen Framework mit Hilfe von vier Testumgebungen evaluie- ren. Wir führen einen systemorientierten Vergleich der erforderlichen Features jedes Scoremodells durch. Für jedes Scoremodell in jeder Testumgebung empfehlen wir Modellparameter, die eine hohe Ergebnisgüte erzielen. Zusätzlich evaluieren wir, wie in [Met06a] vorgeschlagen, die Robustheit jedes Modells bezüglich Modellparametern. Weiterhin führen wir einige Erweiterungen für Proximityscores in der Textsuche durch. Die Verwendung von Phrasen als hartes Filterkriterium für Ergebnisse ist eine weitere Möglichkeit, die Ergebnisqualität zu verbessern. 
Wir führen eine gründliche Analyse des Potenzials expliziter Phrasen für die Ergebnisqualität durch und verglei- chen sie mit der Ergebnisqualität eines der führenden Proximityscoremodelle. Eine Nutzerstudie untersucht, wie sehr Nutzer bei der Kennzeichnung von Phrasen für eine gegebene Anfrage übereinstimmen. Wir validieren die weit verbreitete Intuition, dass die Verwendung von Phrasen in Anfragen die Ergebnisqualität existierender Retrieval- modelle steigern kann. Jedoch ist die Wahl geeigneter Phrasen eine nichttriviale Auf- gabe und kann unter Umständen zu schwierig für Benutzer sein, die zudem häufig über V die Wahl geeigneter Phrasen uneins sind. Weiterhin kommt der Anordnung von Termen in Anfragen nicht immer eine semantische Bedeutung zu. Aufgrund der Verbreitung von XML-Dokumenten ist es nützlich, dass eine Such- maschine nicht nur unstrukturierte Textdokumente unterstützt, sondern auch semi- strukturierte XML-Daten. Wir passen eines der besten Proximityscoremodelle aus der Textsuche an, um Inhaltsanfragen auf XML-Daten zu unterstützen. Mit Hilfe zusätz- licher Abstände an den Elementgrenzen tragen wir der Dokumentstruktur für die XML- Elementsuche Rechnung, wenn wir die Distanz von Termauftreten berechnen. Der zweite Teil dieser Arbeit konzentriert sich auf Effizienzaspekte von Proximity- scoremodellen. Nachdem wir eine Einführung in Top-k und Nicht-Top-k Algorithmen gegeben haben, passen wir eines der führenden Proximityscoremodelle so an, dass wir es vorberechnen und in eine Indexstruktur gießen können. Weiterhin diskutieren wir für alle im ersten Teil vorgestellten Proximityscoremodelle die Anwendbarkeit der zuvor entwickelten Methoden. Wo diese anwendbar sind, leiten wir obere und untere Score- schranken für Kandidaten und Zwischenergebnisse in Top-k Algorithmen her und ent- wickeln passende Indizes. Wir zeigen weiterhin, dass bereits wenige Tausend gelesene Indexeinträge hinreichend sind, um eine Ergebnisgüte zu erzielen, die mit ungekürzten Indizes erreicht werden kann. Weil derart wenige Einträge gelesen werden müssen, eröffnet das die Möglichkeit, auf gekürzten Indexlisten einen Merge Join-basierten Ansatz zu verwenden. Das spart gleichzeitig zusätzliche Kosten der Top-k-Algorithmen und reduziert signifikant die Indexgröße. Wir erzielen beeindruckende Effizienzsteige- rungen um bis zu zwei Größenordnungen verglichen mit dem Lesen ungekürzter Listen mit Inhaltsscoreinformation. In einem ersten Ansatz wurden dabei alle Listen auf nicht- systematische Art gekürzt. Das ist die Stelle, an der unser Indexoptimierungspaket für Term- und Termpaar- indexlisten ins Spiel kommt. Wir schlagen einen systematischen Ansatz vor, der mit wohldefinierten Optimierungskriterien Parameter zum Kürzen von Indexlisten errechnet. Dazu entwickeln wir ein Indexoptimierungspaket, das Indexstrukturen für Terme und Termpaare für maximale Ergebnisgüte oder maximale Effizienz unter Güte- kontrolle und Indexgrößenbudget optimiert. Das Paket verwendet Hadoop, ein Open Source MapReduce-Paket und gestattet eine selektive Materialisierung von Termpaar- listen auf der Basis von Information aus einer Anfrageprotokolldatei. Wir zeigen, wie wir Indizes sowohl mit als auch alternativ ohne Bewertungen der Ergebnisrelevanz optimieren können. Die resultierenden gekürzten Indizes bieten verlässliche Anfrage- ausführungszeiten und eine Ergebnisgüte, die vergleichbar oder sogar besser als die ungekürzter Termindizes ist, welche die Ergebnisgüte des BM25-Bewertungsmodells liefern. 
Wir präsentieren eine hybride Indexstruktur, welche Term- und Termpaarindex- listen kombiniert, um weniger gelesene Listen gegen eine höhere gelesene Datenmenge einzutauschen, um Cold Cache-Laufzeiten zu verbessern. Wir zeigen experimentell, dass die resultierenden gekürzten Indizes Anfragen um fast eine Größenordnung gegenüber einem führenden Top-k Algorithmus bei vergleichbarer Ergebnisgüte beschleunigen. Wir führen ausgedehnte Experimente auf den Dokumentkollektionen GOV2 und ClueWeb09 sowie für den INEX Efficiency Track 2009 und den TREC Web Track 2010 durch. Summary When people search for information, they expect high quality results at fast processing times which are conflicting goals. In the presence of growing data, the need for efficient query processing under result quality and index size control increasingly becomes a challenge to search engines. This work addresses the important problem how to use proximity scores to make query processing effective and efficient at the same time. We present novel index structures for top-k query processing applicable to a number of proximity score models and a novel algorithm for index tuning that can be optimized for retrieval quality or efficiency under index size control. The first part of this thesis deals mainly with effectiveness aspects of proximity score models. In an extensive survey of existing proximity-enhanced score models, we put them into four categories: 1) linear combinations of a content score model and a proximity score model, 2) integrated score models, 3) language models with proximity compo- nents, and 4) models that incorporate proximity features and learn to rank by applica- tion of machine learning techniques. We present a comparative analysis of a significant set of proximity score models in a single evaluation framework with four test beds. We carry out a system-oriented comparison with the required features per score model. We give recommendations on how to set parameters for each combination of test bed and score model. In addition, we measure intercollection and intracollection generalization, entropy, and spread values as proposed in [Met06a]. Furthermore, we elaborate on some extensions to proximity scores in text retrieval. Usage of phrases as a hard filter criterion for results is a different means to improve retrieval quality. We carry out a rigorous analysis of the potential of explicit phrases for retrieval quality and compare it to the retrieval quality of a state-of-the-art proximity score model. A user study investigates the degree of user agreement about phrases in a query. We validate the common intuition that phrase queries can boost the performance of existing retrieval models, but choosing good phrases is a non-trivial task and might be too difficult for users as they frequently disagree on phrases in a query; furthermore, term order in queries does not always bear semantics. Due to the dissemination of XML documents, it is useful for a search engine to not only support unstructured text documents, but also semi-structured XML data. We adapt one of the best performing proximity score models from text retrieval to support content queries on XML data. By means of virtual gaps in XML documents, we take VII the document structure into account when computing the distance of term occurrences. The second part of this thesis concentrates on efficiency aspects of proximity score models. 
After giving an introduction into top-k and non-top-k algorithms, we show how to adapt a state-of-the-art proximity score model for top-k query processing and devise appropriate index structures that allow precomputation of the required features. Fur- thermore, we discuss the feasibility of all proximity score models presented in the survey for top-k query processing, give score bounds, and devise indexes where possible. We furthermore show that already a few thousand read entries are good enough to yield a retrieval quality comparable to reading unpruned index lists. As only that few entries have to be read, this opens the door to merge join processing on pruned index lists, saving on overhead costs of top-k query processing and index space requirements. We achieve impressive performance gains by up to two orders of magnitude compared to reading unpruned content score lists. However, all index lists have been pruned in a non-systematic, ad hoc style manner. That is the place where our index tuning framework for term and term pair index lists comes into play. We propose a systematic pruning approach with well-defined optimization criteria. To this end, we introduce a tunable indexing framework for term and term pair index structures for optimizing index parameters towards either maximal result quality or maximal query processing performance under result quality control, given a maximal index size. The index tuning framework is implemented on top of the Open Source MapReduce framework Hadoop and allows a selective materialization of term pair index lists based on information from a query log. We show how to perform index tuning both in the presence and, alternatively, in the absence of relevance assessments. The resulting indexes provide dependable query execution times while providing result quality comparable to or even better than unpruned term indexes that provide BM25 score quality. We present a hybrid index structure that combines the term and term pair index lists to trade in a reduced number of fetched lists for an increased number of read bytes to improve cold cache run times. Experimental results demonstrate that the resulting index configurations allow query processing that achieves almost one order of magnitude performance gain compared to a state-of-the-art top-k algorithm yielding results of comparable quality. We carry out extensive experiments on GOV2 and ClueWeb09, in the INEX 2009 Efficiency Track and for the TREC Web Track 2010. Contents 1 Introduction 3 1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Proximity Score Models 9 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.2 Model and Notation . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Unigram Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 BM25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.2 Lnu.ltc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.3 ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2.4 Language Models, Smoothing Methods, and KL-Divergence . . . 14 2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
15 2.4 Linear Combinations of Scoring Models . . . . . . . . . . . . . . . . . . 16 2.4.1 Rasolofo and Savoy . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.2 Büttcher et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.3 Uematsu et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.4 Monz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.5 Tao and Zhai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5 Integrated Score Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.1 De Kretser and Moffat . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.2 Song et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5.3 Mishne and de Rijke . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.6 Language Models with Proximity Components . . . . . . . . . . . . . . 31 2.6.1 Lv and Zhai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.6.2 Zhao and Yun . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.7 Learning to rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.7.1 General Introduction to Learning to Rank Approaches . . . . . . 35 2.7.2 Svore et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.7.3 Metzler and Croft . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.7.4 Cummins and O’Riordan . . . . . . . . . . . . . . . . . . . . . . 39 I 2.8 System-Oriented Comparison of Implementation Efforts per Scoring Model 42 3 Benchmarks 47 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2 Initiatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.1 The TREC Initiative and Selected Test Beds . . . . . . . . . . . 48 3.2.2 INEX and Selected Test Beds . . . . . . . . . . . . . . . . . . . . 51 3.2.3 Other Initiatives . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.1 Measures for Text/Document Retrieval . . . . . . . . . . . . . . 54 3.3.2 Measures for XML Retrieval . . . . . . . . . . . . . . . . . . . . 57 4 Evaluation for Selected Score Models 59 4.1 Results from the Original Papers . . . . . . . . . . . . . . . . . . . . . . 59 4.1.1 Linear Combinations of Scoring Models . . . . . . . . . . . . . . 59 4.1.2 Integrated Score Models . . . . . . . . . . . . . . . . . . . . . . . 61 4.1.3 Language Models with Proximity Components . . . . . . . . . . 63 4.1.4 Learning to Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2 Comparative Analysis for Selected Score Models . . . . . . . . . . . . . 65 4.2.1 Experimental Comparison of Scoring Models . . . . . . . . . . . 66 4.2.2 Individual Scoring Models . . . . . . . . . . . . . . . . . . . . . . 68 4.2.3 Intercollection and Intracollection Generalization Results . . . . 74 4.2.4 Sensitivity Charts . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5 Extensions 79 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.2 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.2.1 XML Background . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.2.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.2.3 Related Work by Beigbeder . . . . . . . . . . . . . . . . . . . . . 81 5.2.4 Proximity Scoring for XML . . . . . . . . . 
. . . . . . . . . . . . 82 5.2.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . 85 5.2.6 Additional Experiments for INEX 2008 . . . . . . . . . . . . . . 88 5.3 Phrases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.3.1 Evaluating the Potential of Phrases . . . . . . . . . . . . . . . . 90 6 Top-k Vs. Non-Top-k Algorithms 95 6.1 Top-k Algorithms from DB . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.1.1 Sorted and Random Accesses . . . . . . . . . . . . . . . . . . . . 97 6.1.2 No Random Accesses . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.1.3 Carefully Scheduled Random Accesses . . . . . . . . . . . . . . . 98 6.2 Top-k Algorithms from IR . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.2.1 Exact Top-k Algorithms from IR . . . . . . . . . . . . . . . . . . 101 6.2.2 Exact Top-k Algorithms from IR with a Term Proximity Com- ponent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.2.3 Approximate Top-k Algorithms from IR . . . . . . . . . . . . . . 106 6.3 Non-Top-k Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 7 Casting Proximity Scoring Models into Top-k Query Processing 109 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 7.2 Proximity Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.2.1 Proximity Scoring Models . . . . . . . . . . . . . . . . . . . . . . 110 7.2.2 Modification of Büttcher’s Scoring Model . . . . . . . . . . . . . 111 7.3 Indexing and Evaluation Framework . . . . . . . . . . . . . . . . . . . . 112 7.3.1 Precomputed Index Lists and Evaluation Strategies . . . . . . . 112 7.3.2 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 7.3.4 Results with Pruned Index Lists . . . . . . . . . . . . . . . . . . 116 7.3.5 Comparison: TopX(RR-LAST Mode) on Unpruned Lists vs. Merge Join on Pruned Lists . . . . . . . . . . . . . . . . . . . . . . . . . 122 7.3.6 Conclusion of the Experiments . . . . . . . . . . . . . . . . . . . 124 7.4 Feasibility of Scoring Models for Top-k Query Processing . . . . . . . . 125 7.4.1 Linear Combinations of Scoring Models . . . . . . . . . . . . . . 125 7.4.2 Integrated Score Models . . . . . . . . . . . . . . . . . . . . . . . 130 7.4.3 Language Models with Proximity Components . . . . . . . . . . 131 7.4.4 Learning to Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 7.4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 8 Index Tuning for High-Performance Query Processing 141 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 8.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 8.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 8.1.3 Outline of the Chapter . . . . . . . . . . . . . . . . . . . . . . . . 142 8.2 Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 8.3 Parameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 8.3.1 Tuning as Optimization Problem . . . . . . . . . . . . . . . . . . 147 8.3.2 Implementation of the Tuning Framework . . . . . . . . . . . . . 152 8.4 Log-Based Term Pair Pruning . . . . . . . . . . . . . . . . . . . . . . . . 154 8.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 8.5.1 Setup . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . 157 8.5.2 Index Tuning on GOV2 . . . . . . . . . . . . . . . . . . . . . . . 159 8.5.3 Query Processing with GOV2 . . . . . . . . . . . . . . . . . . . . 165 8.5.4 Log-based Pruning with GOV2 . . . . . . . . . . . . . . . . . . . 173 8.5.5 Summary of Conclusions and Limitations of the Approach . . . . 174 8.5.6 Results with ClueWeb09 . . . . . . . . . . . . . . . . . . . . . . . 175 8.5.7 Results with INEX 2009 . . . . . . . . . . . . . . . . . . . . . . . 177 8.6 Hybrid Index Structure for Efficient Text Retrieval . . . . . . . . . . . . 183 8.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 8.6.2 Hybrid Index Framework . . . . . . . . . . . . . . . . . . . . . . 185 8.6.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . 186 8.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 9 Conclusion and Outlook 189 9.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 9.2 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 A Retrieval Quality and Sensitivity 191 B TREC 209 C INEX 215 List of Figures 225 List of Tables 229 References 230 Index 241 1 List of Abbreviations C = {d1, . . . ,dN } : document corpus/collection consisting of N documents ctf(ti) = ∑ d∈C tf(ti,d) : collection term frequency of term ti in C df(ti) : document frequency of term ti in C dt : number of distinct terms in C dt(d) : number of distinct terms in document d idf(ti) : inverse document frequency of term ti in C idfj (ti) : inverse document frequency of term ti in C, variant j lC = ∑ d∈C ld : length of document collection C ld = |d| : length of document d le = |e| : length of element e K = k · [(1 − b) + b · ld avgdl ] : frequently occurring component in scoring models N : number of documents in the document corpus/collection C P ⊆ {1, . . . , ld} : subset of positions in document d Pd(t) = {i : pi(d) = t} ⊆ {1, . . . , ld} : set of positions in document d where term t occurs Pe(t) = {i : pi(e) = t} ⊆ {1, . . . , le} : set of positions in element e where term t occurs Pd(q) := ∪ti∈qPd(ti) : set of positions of all query terms in document d Pe(q) := ∪ti∈qPe(ti) : set of positions of all query terms in element e pi(dj ) : term occurring at position i of document dj pi(ej ) : term occurring at position i of element ej q = {t1, . . . , tn} : unordered query with query terms t1, . . . , tn Qadj,d(q) := {(i,j) ∈ Pd(q) × Pd(q) | (i < j) ∧ ∀k ∈ {i + 1, . . . ,j − 1} : k �∈ Pd(q)} : set of pairs of query terms in document d that are adjacent to each other Qadj,e(q) := {(i,j) ∈ Pe(q) × Pe(q) | (i < j) ∧ ∀k ∈ {i + 1, . . . ,j − 1} : k �∈ Pe(q)} : set of pairs of query terms in element e that are adjacent to each other Qall,d(q,dist) := {(i,j) ∈ Pd(q) × Pd(q) | (i < j) ∧ (j − i ≤ dist)} : set of pairs of query terms in document d within a window of dist positions Qall,d(q) : the same as Qall,d(q,dist), but employs a window size of dist = ld Qall,e(q,dist) := {(i,j) ∈ Pe(q) × Pe(q) | (i < j) ∧ (j − i ≤ dist)} : set of pairs of query terms in element e within a window of dist positions Qall,e(q) : the same as Qall,e(q,dist), but employs a window size of dist = le qtf(ti) : query term frequency of term ti in a query 2 Sq = (t1, . . . 
, tn) : ordered query Td(P) = {t| i ∈ P ∧ pi(d) = t} : terms located at the positions of P in document d Te(P) = {t| i ∈ P ∧ pi(e) = t} : terms located at the positions of P in element e tf(ti,d) : term frequency of ti in d V = {v1, . . . ,vm} : vocabulary, set of terms which occur in an index Wq = (qw(t1), . . . ,qw(tn)) ⊂ [0, 1]n : query term weights for terms in query q Chapter 1 Introduction There is a plethora of applications, on the Web, in XML retrieval, in Intranets, Digital Libraries, or Desktop search, where large document collections need to be queried. Users expect not only high quality answers but also require almost instant response times. To achieve these conflicting goals, index structures and algorithms have to be devised that index documents in a compact way that allows determining a ranking of the top matching documents without inspecting the entire index. In this thesis, we focus on retrieval models for proximity search, which go far beyond simple bag of words. Proximity score models are a means to improve the retrieval quality of results by exploiting term position information of query term occurrences in documents where positional distances consider contextual information. Clearly, a good proximity score model has also to be robust to model parameters. Phrases are a hard filter for documents that can be used to further improve retrieval quality, but may also be subject to deleting potentially relevant results if the phrase in the query is not exactly matched in a document. Proximity scores allow soft phrase querying without the requirement to specify phrases. The improvement in user-perceived result quality comes, however, in general at the price of a larger index size and higher query response times. As there is no need to exhaustively compute the score of all documents with respect to a query, as only the top ranked documents are shown to a user, we apply top-k algorithms which are an effective means to tackle efficiency issues by dynamic pruning/early termination. The key idea is to stop the query processing at a point where all potential top results have been inspected. In this context, it is important that a proximity score model can be cast into precomputed index lists to compute score bounds for result candidates, hence, allow early stopping. Devising compact index structures that can be efficiently queried and at the same time provide highly accurate results is the task we consider in this thesis. We show how proximity scores that enhance retrieval quality can be integrated into efficient top-k algorithms. We propose to extend the index with additional term pair lists that maintain prox- imity scores. However, an index with these lists can become prohibitively large. A naive approach would simply cut index lists or exclude complete lists already 3 4 1. Introduction during the indexing phase. However, it remains unclear where to cut index lists, hence the tradeoff between performance gains and loss in user-perceived result quality is rather ad-hoc and bears the risk to drastically favor one or the other extreme. To overcome this, in this thesis, we devise a number of techniques for limiting the index size. Occurrences within a large proximity distance have only a marginal contribution to the overall score, we propose a window-based pruning approach that only considers term pair occurrences in a text-window of fixed size. We heuristically limit the list length to a constant number of entries, usually in the order of a few thousand entries. 
Further list pruning with quality guarantees is applied. We show that pruned term and term pair lists provide a retrieval quality compara- ble to unpruned term lists. At the same time, this not only saves on disk space, but significantly accelerates query processing. We propose an index tuning framework that prunes term and term pair lists in a systematic fashion and we prune both list types by list length, term pair lists are additionally restricted to entries above a minimal proxim- ity score contribution. If the disk space is limited, control over the space consumption of index structures is necessary. It is desirable to opt between index optimization to- wards maximum efficiency and maximum effectiveness given an index size constraint. Our approach allows tuning pruning parameters by using a set of queries and their relevance assessments for the collection to be indexed or, alternatively, if relevance as- sessments are not available, by a result overlap approach. In addition, query logs can be used to select term pair lists to be materialized. Using lossless index compression, the index size can be further decreased. Although this thesis focuses on Web Retrieval scenarios for the evaluation of the pre- sented approaches, the developed techniques are not only applicable to Web Retrieval, but also to other domains such as book search over digital libraries or Intranet search for enterprises that keep track of various kinds of documents such as blueprints and patents. In fact, we make a proximity score feasible for XML element retrieval and show that we can apply our index tuning framework for indexes that support content queries for XML element retrieval. Beyond the technical contributions in the area of proximity indexing and search, this thesis provides a comprehensive survey that describes and experimentally compares a significant portion of proximity scoring models. 1.1 Contributions 1. We present a comprehensive comparative analysis of a significant set of proximity score models in a single evaluation framework with four test beds. We extensively present and classify existing proximity-enhanced score models in a joint notation; using one running example, we illustrate the various models and include a feature list to compare the required model features. We show how to adapt a state-of- the-art proximity scoring model to support content queries on XML data. 2. We carry out a rigorous analysis of the potential of explicit phrases for retrieval quality and compare it to the retrieval quality of a state-of-the art proximity score 1.2 Publications 5 model. A user study investigates the degree of user agreement about phrases in a query. 3. We propose a novel index structure that combines content and proximity scores. Processing that index structure together with a content score index improves query processing in top-k algorithms by up to two orders of magnitude through tighter score bounds and a better retrieval quality compared to processing content score lists only. We apply top-k query processing to several proximity score models and devise appropriate index structures. 4. We show that already a few thousand read entries on unpruned term and term pair lists are good enough to yield a retrieval quality comparable to reading unpruned index lists. This insight opens the door to a simple merge join-based approach with pruned index lists: we require less disk space and keep the performance improvements. 5. 
We propose a novel, distributed index tuning framework for term and term pair index lists that optimizes pruning parameters for retrieval quality or efficiency under index size control with well-defined optimization criteria. We allow a selec- tive materialization of term pair index lists based on information from a query log and show how to perform index tuning both in the presence and in the absence of relevance assessments. 6. We present a hybrid index structure for improved cold cache run times of small and medium-sized queries that reduces the number of fetched index lists. 1.2 Publications Various aspects of this thesis have been published in [SBH+07, BS08b, BS08a, BST08, BS09, BBS10, BS10, BS11, BS12]. Effectiveness-related contributions have been described in the following publications: in [BS08b], we have presented a proximity score model for content-only queries on XML data, enriched with additional experiments on a different test bed in [BST08]. [BS08b] Andreas Broschart and Ralf Schenkel. Proximity-Aware Scoring for XML Retrieval. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Poster. [BST08] Andreas Broschart, Ralf Schenkel, and Martin Theobald. Experiments with Proximity-Aware Scoring for XML Retrieval at INEX 2008. In Advances in Focused Retrieval, 7th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2008. 6 1. Introduction In [BBS10], we have rigorously analyzed the potential of phrases, compared the retrieval quality to proximity scores and carried out a user study. [BBS10] Andreas Broschart, Klaus Berberich, and Ralf Schenkel. Evaluating the Potential of Explicit Phrases for Retrieval Quality. In Advances in Infor- mation Retrieval, 32nd European Conference on IR Research, ECIR 2010, Poster. Efficiency-related contributions have been described in the following publications: in [SBH+07], we have shown how to accelerate top-k query processing by means of a content score index structure and a new index structure that incorporates a content and a proximity score. Furthermore, we have shown that a few thousand entries per list are enough to provide the same retrieval quality as on unpruned content score lists. In addition, pruning saves on index space. [SBH+07] Ralf Schenkel, Andreas Broschart, Seung-Won Hwang, Martin Theobald, and Gerhard Weikum. Efficient Text Proximity Search. In String Pro- cessing and Information Retrieval, 14th International Symposium, SPIRE 2007. In [BS08a], we have presented the merge join-based approach with pruned index lists to save on overhead costs of top-k query processing and to lower index space requirements. [BS08a] Andreas Broschart and Ralf Schenkel. Effiziente Textsuche mit Positions- information. In Grundlagen von Datenbanken, 2008. In [BS12], we have presented a novel, distributed index tuning framework which is a major part of this thesis (cf. Chapter 8) and supported it with extensive experiments especially for GOV2. Additional experiments with this tuning approach for more test beds have been released for the INEX 2009 Efficiency Track in [BS09] and for the TREC Web Track 2010 in [BS10]. [BS12] Andreas Broschart and Ralf Schenkel. High-Performance Processing of Text Queries with Tunable Pruned Term and Term Pair Indexes. In ACM Transactions on Information Systems 2012, Volume 30, Issue 1. [BS09] Andreas Broschart and Ralf Schenkel. Index Tuning for Efficient Proximity- Enhanced Query Processing. 
In Focused Retrieval and Evaluation, 8th In- ternational Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2009. [BS10] Andreas Broschart and Ralf Schenkel. MMCI at the TREC 2010 Web Track. In The Nineteenth Text REtrieval Conference Proceedings, TREC 2010. 1.3 Thesis Outline 7 In [BS11], we have presented a hybrid index structure for improved cold cache run times for pruned indexes from our index tuning framework that trades in a reduced number of fetched lists for an increased number of read bytes. [BS11] Andreas Broschart and Ralf Schenkel. A Novel Hybrid Index Structure for Efficient Text Retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Poster. 1.3 Thesis Outline Chapter 2 gives an extensive overview over proximity-enhanced score models that we put in four categories: after describing unigram models that serve as basis for proximity score models, we detail every proximity score model and carry out a system-oriented comparison of the implementation effort required per score model. Chapter 3 introduces two popular evaluation initiatives, namely TREC (text retrieval), and INEX (XML retrieval) and two less popular, niche initiatives. We present a choice of test beds for each of them and performance metrics for both text/document retrieval and for XML retrieval. Chapter 4 shows experimental results of the score models from the original papers surveyed in Chapter 2. As they usually compare only a few of the score models, we perform a comparative analysis of a significant portion of proximity score models in a single evaluation framework using four test beds. Chapter 5 proposes one of the first XML score models that uses proximity information, rigorously analyzes the potential of explicit phrases for retrieval quality, and compares it to a proximity score. Chapter 6 presents various top-k algorithms from both the database systems and the IR community as well as non-top-k algorithms. Chapter 7 describes a modification of Büttcher et al.’s score model that allows to use it in a top-k style with dynamic pruning techniques that not only improves retrieval effectiveness but also efficiency compared to standard top-k algorithms. We show that already a few thousand read entries per index list yield a good retrieval quality. This opens the door to using light-weight n-ary merge joins to save on processing overhead. Moreover, we discuss the feasibility of the remaining proximity score models surveyed in Chapter 2 for top-k query processing and propose appropriate index structures where possible. Chapter 8 introduces our index tuning framework for trading off index size and result quality given an index size constraint. Chapter 9 concludes this thesis and outlines possible future research directions. Chapter 2 Proximity Score Models 2.1 Introduction 2.1.1 Motivation In search engines scoring functions play an important role to rank results supposed to answer user queries. Therefore, the quality of the scoring function is decisive to user satisfaction and success of the search engine. Nowadays, many search engines rely on some form of BM25 [RW94, RWHB+95], a state-of-the-art content-based scoring model commonly used in probabilistic information retrieval. It incorporates tf values (term frequency, i.e., the number of a term’s occurrences in a document) and idf values (inverse document frequency, i.e., the inverse of the number of documents that con- tain a term) plus document length information. 
Content-based scoring models usually represent documents as "bags of words": they consider all query term occurrences, but ignore the information where these query terms occur. Models that ignore positional information thus forgo the chance to leverage term proximity information, i.e., to measure the distances between query term occurrences in a document and to aggregate them into a proximity score that ranks the document appropriately. If this valuable information is ignored, users might face unsatisfactory results. Suppose a user poses the query surface area of a triangular pyramid. Scoring functions that completely ignore proximity information may consider documents relevant that contain the query terms frequently, but in different paragraphs that are likely to treat different topics: in a document related to geometric objects like the one depicted in Figure 2.1, the first paragraph might elaborate on the "volume of a triangular prism", while the second talks about the "volume of a square pyramid", and the third about the "surface area of a cylinder". Each of the query terms will individually occur quite frequently, but not in the user-intended context. From a user's point of view, formulating her information need as a phrase query might seem to be the solution to prevent such results. As phrase queries are usually used as hard filters, documents that do not contain the phrase terms in the exact order (the terms might be interleaved by a different term or appear in a slightly different order) are ignored. Unfortunately, this comes at the expense of many discarded good results: documents carrying information about the "surface area of a pyramid composed of four triangular faces" would certainly be a good hit, but are excluded by the phrase query.

[Figure 2.1: Non-relevant document for the query surface area of a triangular pyramid: a sketched document whose first paragraph covers the volume of a triangular prism, whose second paragraph covers the volume of a square pyramid, and whose third paragraph covers the surface area of a cylinder.]

Proximity scores provide a solution that alleviates those effects by providing a kind of soft phrasing without requiring the user to specify phrase bounds.

This chapter gives an extensive overview of existing proximity-enhanced score models. We categorize them into the following four categories:

• linear combinations of a content score model and a proximity score model, described in Section 2.4 (e.g., Rasolofo and Savoy [RS03], Büttcher et al. [BC05, BCL06], Uematsu et al. [UIF+08], Monz [Mon04], and Tao and Zhai [TZ07]),
• integrated score models, described in Section 2.5 (e.g., Song et al. [STW+08], de Kretser and Moffat [dKM99, dKM04], and Mishne and de Rijke [MdR05]),
• language models with proximity components, described in Section 2.6 (e.g., Lv and Zhai [LZ09], and Zhao and Yun [ZY09]), and
• models that incorporate proximity features and learn to rank by application of machine learning techniques, described in Section 2.7 (e.g., Svore et al. [SKK10], Metzler and Croft [MC05], and Cummins and O'Riordan [CO09]).

An experimental study in Chapter 4 will investigate the retrieval quality for a selection of these approaches and compare them to the retrieval quality that can be achieved using BM25.
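Before the formal notation is introduced in the next subsection, the following minimal Python sketch makes the raw signal behind all of these models tangible. It is not part of the thesis: the tokenizer, the toy document, and the query are illustrative assumptions. The sketch extracts the positions of query terms in a document and lists the distances between occurrences of different query terms; the models surveyed below differ essentially in how they weight and aggregate exactly this kind of position information.

```python
from itertools import combinations

def query_term_positions(doc_tokens, query):
    """Map each query term to its (1-based) positions in the document."""
    positions = {t: [] for t in query}
    for i, tok in enumerate(doc_tokens, start=1):
        if tok in positions:
            positions[tok].append(i)
    return positions

def pair_distances(positions):
    """Distances between occurrences of two *different* query terms."""
    dists = {}
    for t1, t2 in combinations(sorted(positions), 2):
        dists[(t1, t2)] = [abs(i - j) for i in positions[t1] for j in positions[t2]]
    return dists

# Illustrative toy document (not the running example used later in this chapter).
doc = "the surface area of a pyramid with four triangular faces".lower().split()
query = ["surface", "area", "triangular", "pyramid"]
pos = query_term_positions(doc, query)
print(pos)                    # e.g. {'surface': [2], 'area': [3], ...}
print(pair_distances(pos))    # small distances indicate a shared context
```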
2.1.2 Model and Notation In order to describe the scoring models of this survey in a uniform manner, we first introduce some notation. 2.1 Introduction 11 Definition 2.1.1. (corpus, vocabulary, position-related notation) A corpus C = {d1, . . . ,dN } is a set of N documents where each document is considered a sequence of terms. The vocabulary V = {v1, . . . ,vm} is the set of terms which occur in an index. Given a document d with length ld, we denote the term occurring at position i of d by pi(d), 1 ≤ i ≤ ld; if the document is clear from the context, we simply write pi. For a term t, we capture the positions in document d where t occurs by Pd(t) = {i : pi(d) = t} ⊆ {1, . . . , ld}; if d is clear from the context, we write P(t). We write Pd(q) := ∪ti∈qPd(ti) for the positions of all query terms in document d, again omitting the suffix d if the document is clear from the context. Given a set of positions P ⊆ {1, . . . , ld} and a document d, we write Td(P) to denote the set of terms at the positions of P ⊆ {1, . . . , ld} in d. Precisely, Td(P) := {pi(d)|i ∈ P}. Definition 2.1.2. (document frequency, inverse document frequency) A term ti occurs in df(ti) documents in C, the document frequency of ti. The inverse document frequency idf(ti) measures a term’s importance in C by means of an inverse function of df(ti). In the literature (e.g., [BCL06, RS03, UIF+08]), the inverse document frequency idf(ti) is used in (slightly) different versions1, e.g., • idf1(ti) = log Ndf (ti) • idf2(ti) = max{0, log N−df (ti)df (ti) }, and • idf3(ti) = max{0, log N−df (ti)+0.5df (ti)+0.5 }. Definition 2.1.3. (term frequency, collection term frequency for terms and n-grams) Given a term ti, a corpus C, and a document d in C, the term frequency of ti in d, tf(ti,d), is the number of times term ti occurs in d. The term frequency of the n-gram (ti, . . . , ti+n−1) in d, tf((ti, . . . , ti+n−1),d), is the number of times the n-gram (ti, . . . , ti+n−1) occurs in d. The collection term frequency of ti is the total number of oc- currences of the term ti in C and defined as ctf(ti) = ∑ d∈C tf(ti,d). The collection term frequency of the n-gram (ti, . . . , ti+n−1) is the total number of occurrences of the n-gram (ti, . . . , ti+n−1) in C and defined as ctf((ti, . . . , ti+n−1)) = ∑ d∈C tf((ti, . . . , ti+n−1),d). Definition 2.1.4. (document and collection length, number of distinct terms in a document or collection) Given a corpus C and a document d in C, the document length of d corresponds to the number of term occurrences in d and is denoted by ld = |d|. The collection length corresponds to the number of term occurrences in C and is denoted by lC = ∑ d∈C ld. While dt(d) = |{t : tf(t,d) > 0}| stands for the number of distinct terms in document d, we use dt = |{t : ∃d ∈ C s.t. tf(t,d) > 0}| as an abbreviation for the number of distinct terms in C. 1Please note that, unlike e.g., [RS03, UIF+08], for idf2(ti) and idf3(ti), we have imposed a lower bound of zero to avoid negative score contributions of too frequent terms. 12 2. Proximity Score Models For the ease of presentation, as a default, we assume that each query term occurs just once per query such that we can use sets of terms to model issued queries. In some cases it may be necessary to deviate from this (e.g., if the order of query terms in the original query matters). Where applicable we will make additional remarks in the description of the affected scoring model. Definition 2.1.5. 
(query, query term frequency, unordered query, ordered query, query term weights) W.l.o.g. the user issues an (unordered) query $q' = \{t'_1, \ldots, t'_x\}$ which is supposed to represent her information need; the query processor evaluates only query terms from $V$, i.e., the evaluated query is $q = q' \cap V = \{t_1, \ldots, t_n\}$. Ordered queries are denoted by $S_q = (t_1, \ldots, t_n)$. The query term frequency, short $qtf(t)$, denotes the number of times a query term $t$ appears in a query $S_q$; for unordered queries, $qtf(t)$ is either 0 or 1. Query terms may be attributed query term weights $W_q = (qw(t_1), \ldots, qw(t_n)) \subset [0,1]^n$.

Definition 2.1.6. (set of pairs of adjacent query term occurrences, set of pairs of all query term occurrences) We denote pairs of query terms that are adjacent to each other (there might be non-query terms in between) in document d by
\[ Q_{adj,d}(q) := \{(i,j) \in P_d(q) \times P_d(q) \mid (i < j) \wedge \forall k \in \{i+1, \ldots, j-1\}: k \notin P_d(q)\}. \]
Pairs of query terms within a window of dist positions in document d are defined as
\[ Q_{all,d}(q, dist) := \{(i,j) \in P_d(q) \times P_d(q) \mid (i < j) \wedge (j - i \leq dist)\}. \]
Please note that in this case, the query terms need not occur consecutively in a document. $Q_{all,d}(q) := Q_{all,d}(q, l_d)$ denotes all query term pairs in the document.

2.2 Unigram Models

This section describes the unigram models that serve as a basis for proximity scores.

2.2.1 BM25

We start with the probabilistic content-scoring model BM25 [RW94, RWHB+95]. Robertson and Walker [RW94] define the relevance score of a document d for the query $q = \{t_1, \ldots, t_n\}$ as
\[ score_{BM25}(d,q) = \sum_{t_i \in q} \frac{(k_1+1) \cdot tf(t_i,d)}{k \cdot [(1-b) + b \cdot \frac{l_d}{avgdl}] + tf(t_i,d)} \cdot W^{RSJ}(t_i) \cdot \frac{(k_3+1) \cdot qtf(t_i)}{k_3 + qtf(t_i)}, \]
where the Robertson/Sparck Jones weight [RJ76] is defined as
\[ W^{RSJ}(t_i) = \log \frac{(r(t_i)+0.5)(N - R - df(t_i) + r(t_i) + 0.5)}{(df(t_i) - r(t_i) + 0.5)(R - r(t_i) + 0.5)}; \]
R denotes the number of relevant documents, and $r(t_i)$ the number of relevant documents which contain term $t_i$. Later publications substitute the Robertson/Sparck Jones weight by a form of idf s.t.
\[ score_{BM25}(d,q) = \sum_{t_i \in q} \frac{(k_1+1) \cdot tf(t_i,d)}{k \cdot [(1-b) + b \cdot \frac{l_d}{avgdl}] + tf(t_i,d)} \cdot idf(t_i) \cdot qtf'(t_i), \]
where idf and $qtf'$ determine the specific variant of BM25; idf is a variant of the inverse document frequency (as described in Definition 2.1.2), and $qtf'$ (cf. Definition 2.1.5) represents a function that incorporates $t_i$'s query term frequency qtf. In the BM25 scoring model, $k$, $k_1$, and $b$ are constants (where $k_1 = k$ in the original definition), and avgdl is the average document length in the collection, i.e., $avgdl = \frac{l_C}{N}$. A frequently used abbreviation is
\[ K = k \cdot [(1-b) + b \cdot \frac{l_d}{avgdl}]. \]
Table 2.1 shows the idf and $qtf'$ components and the tuning parameters $k$, $k_1$, $b$, and optionally $k_3$ as used in a follow-up [RWHB+95] to the original BM25 paper and in some proximity scores (the original BM25 paper [RW94] does not specify parameter choices, while [RWHB+95] gives typical values and ranges). A $k_3$ value of "none" indicates the absence of $k_3$ in the respective method. Additionally, the table contains a pointer to the section where the corresponding content/proximity scoring model is described.

method                | idf or $W^{RSJ}$                                    | $qtf'$                                        | b           | $k_1$      | k     | $k_3$ | Section
Robertson et al.      | $W^{RSJ}(t_i)$                                      | $\frac{(k_3+1) \cdot qtf(t_i)}{k_3+qtf(t_i)}$ | ∈ [0.6,0.75] | ∈ [1.0,2.8] | $k_1$ | 8     | 2.2.1
Rasolofo and Savoy    | $\max\{0, \log \frac{N-df(t_i)}{df(t_i)}\}$         | $\frac{qtf(t_i)}{k_3+qtf(t_i)}$               | 0.9         | 1.2        | 2     | 1,000 | 2.4.1
Uematsu et al.        | $\max\{0, \log \frac{N-df(t_i)+0.5}{df(t_i)+0.5}\}$ | 1                                             | 0.75        | 1.2        | 1.2   | none  | 2.4.3
Büttcher et al.       | $\log \frac{N}{df(t_i)}$                            | 1                                             | 0.5         | 1.2        | 1.2   | none  | 2.4.2
Tao and Zhai          | $\max\{0, \log \frac{N-df(t_i)+0.5}{df(t_i)+0.5}\}$ | $\frac{(k_3+1) \cdot qtf(t_i)}{k_3+qtf(t_i)}$ | optimal     | 1.2        | 1.2   | 1,000 | 2.4.5
Cummins and O'Riordan | $\max\{0, \log \frac{N-df(t_i)+0.5}{df(t_i)+0.5}\}$ | $qtf(t_i)$                                    | 0.75        | 0          | 1.2   | none  | 2.7.4

Table 2.1: Overview: BM25 variations.

In Tao and Zhai [TZ07], b is tuned for optimality on BM25, although its exact value is not reported. For our own experiments, we use the BM25 score as used by Büttcher et al. in [BC05, BCL06].

2.2.2 Lnu.ltc

Buckley et al. [BSM95] introduced the Lnu.ltc weighting scheme used by Monz [Mon04] (cf. Section 2.4.4) in a normalized version. Lnu specifies the document weight, which is determined by a logarithmically smoothed term frequency and a pivoted length normalization. Ltc relates to the query term weight, which is computed by a logarithmically smoothed query term frequency in combination with idf and a standard cosine normalization. The formulation for the Lnu weighting is
\[ lnu(d,t_i) = \frac{\frac{1+\log(tf(t_i,d))}{1+\log(avg_{t_j \in \{t: tf(t,d)>0\}} tf(t_j,d))}}{(1-slope) \cdot pivot + slope \cdot dt(d)}. \]
The slope value is fixed at 0.2; pivot is set to the average number of distinct terms per document in the collection. The ltc weighting scheme for queries is
\[ ltc(t_i) = \frac{(\log(qtf(t_i))+1) \cdot idf_1(t_i)}{\sqrt{\sum_{t_x \in q} [(\log(qtf(t_x))+1) \cdot idf_1(t_x)]^2}}. \]

2.2.3 ES

Cummins and O'Riordan employ a term weighting scheme learned in [CO07],
\[ score_{ES}(d,q) = \sum_{t_i \in T_d(P_d(q))} \frac{tf(t_i,d) \cdot qtf(t_i)}{tf(t_i,d) + 0.45 \cdot \sqrt{\frac{l_d}{avgdl}}} \cdot \sqrt{\frac{ctf(t_i)^3 \cdot N}{df(t_i)^4}}, \]
and linearly combine it with their proximity score combinations learned by Genetic Programming as described in [CO09]. Section 2.7.4 elaborates on the details of the learning process and specifies the learned proximity scores.

2.2.4 Language Models, Smoothing Methods, and KL-Divergence

Another group of models is formed by language models (LM), which have been employed in several areas of computer science such as speech recognition. Ponte and Croft [PC98] and Hiemstra [Hie98] were the first to use language models in Information Retrieval. Language models aim at modelling the query generation process. To this end, they rank documents according to the likelihood that a random sample of a document generates a given ordered query $S_q = (t_1, \ldots, t_n)$. This likelihood is captured by means of a document language model for each document. The most basic language model is the unigram language model that uses bag-of-words. It relies only on term distributions and does not use any context information:
\[ P_{unigram}(q|d) = \prod_{i=1}^{n} P(t_i|d), \]
where $P(t_i|d) = \frac{tf(t_i,d)}{l_d}$, which corresponds to the maximum likelihood model for the document-term probability. For completeness, we introduce bigram language models, which consider the previous term as context (and therefore already incorporate some proximity information) such that
\[ P_{bigram}(q|d) = P(t_1|d) \prod_{i=2}^{n} P(t_i|t_{i-1},d), \]
where $P(t_i|t_{i-1},d) = \frac{tf((t_{i-1},t_i),d)}{tf(t_{i-1},d)}$. The general form of n-gram language models considers the previous n-1 terms and defines probabilities analogously to the bigram language model. In [ZL04], Zhai and Lafferty survey different smoothing methods and compare their performance. Smoothing methods aim at adapting the maximum likelihood estimator such that data sparseness is compensated.
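As an illustration of the query-likelihood view above, the following minimal Python sketch is not taken from the thesis; the toy document and query are assumptions for illustration. It computes the unsmoothed maximum-likelihood unigram probability $P_{unigram}(q|d)$ and shows the data sparseness problem that the smoothing methods discussed next address: a single query term that does not occur in the document drives the whole product to zero.

```python
from collections import Counter

def p_unigram(query, doc_terms):
    """Unigram query likelihood P(q|d) = prod_i tf(t_i,d)/l_d, without smoothing."""
    tf = Counter(doc_terms)
    l_d = len(doc_terms)
    prob = 1.0
    for t in query:
        prob *= tf[t] / l_d        # becomes 0 as soon as one query term is unseen in d
    return prob

# Hypothetical toy document and query (not the running example of Section 2.3).
doc = "sea shell sea shell sing me a song".split()
print(p_unigram(["sea", "song"], doc))     # > 0: both terms occur in the document
print(p_unigram(["sea", "pyramid"], doc))  # 0.0: 'pyramid' is unseen, the product collapses
```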
Jelinek-Mercer smoothing [JM80] uses a linear interpolation of the maximum likelihood model for the document-term probability as foreground model and the collection-term model as background model. It uses a mixture parameter $\delta$ to control the influence of each model:
\[ P(t|d,C) = (1-\delta) \cdot P(t|d) + \delta \cdot P(t|C), \]
where $P(t|d) = \frac{tf(t,d)}{l_d}$ and $P(t|C) = \frac{ctf(t)}{l_C}$. For a term t not seen in d, tf(t,d) is 0, which would make the score of that document zero for any query containing t (as P(t|d) = 0); smoothing aims at fixing that flaw by introducing the background model. Another popular smoothing method surveyed in [ZL04] is the Dirichlet prior
\[ P(t|d,C) = \frac{tf(t,d) + \mu \cdot P(t|C)}{l_d + \mu}, \]
where $\mu$ is a smoothing parameter.

KL-divergence [LZ01] measures the difference of two probability distributions. In the case of language models, it compares a query language model and a document language model. The basic form of the KL-divergence model is defined as
\[ KL(f,g) = \sum_{x} f(x) \cdot \log \frac{f(x)}{g(x)}. \]
If f and g represent the same distribution, their KL-divergence value becomes 0; for larger values, the divergence is larger. Lv and Zhai (cf. Section 2.6.1) use KL-divergence to compare the similarity between a query language model and their positional language model that constructs a language model at each term position. The KL-divergence language model variant used by Tao and Zhai is defined as
\[ score_{KL}(d,q) = \sum_{t \in T_d(P_d(q))} \left( qtf(t) \cdot \ln \left(1 + \frac{tf(t,d)}{\mu \cdot p(t|C)} \right) \right) + |q| \cdot \ln \frac{\mu}{l_d + \mu}, \]
where $p(t|C) = \frac{ctf(t)}{l_C}$.

2.3 Example

As a running example, we will use a poem written by Amy Lowell (taken from "A Dome of Many-Coloured Glass") which is depicted in Figure 2.2. Superscripts represent term positions. Our query will be $q = \{sea, shell, song\}$ or, for order-aware scoring models, $S_q = (sea, shell, song)$. The query terms (with position information, ignoring case and punctuation) in the poem are located at {sea1, shell2, sea3, shell4, sea5, shell6, song10, song14, sea53, shell54, sea55, shell56}.

    Sea1 Shell2 Sea3 Shell,4 Sea5 Shell,6
    Sing7 me8 a9 song,10 O11 Please!12
    A13 song14 of15 ships,16 and17 sailor18 men,19
    And20 parrots21, and22 tropical23 trees,24
    Of25 islands26 lost27 in28 the29 Spanish30 Main31
    Which32 no33 man34 ever35 may36 find37 again,38
    Of39 fishes40 and41 corals42 under43 the44 waves,45
    And46 seahorses47 stabled48 in49 great50 green51 caves.52
    Sea53 Shell,54 Sea55 Shell,56
    Sing57 of58 the59 things60 you61 know62 so63 well.64

    Figure 2.2: A poem with position information.

2.4 Linear Combinations of Scoring Models

One category of text-proximity enhanced scoring models is based on linearly combining content and proximity scores. Such scoring models always attribute a relevance score of the following form to a given document d with respect to a query q:
\[ score(d,q) = \lambda \cdot cscore(d,q) + (1-\lambda) \cdot pscore(d,q), \quad \lambda \in (0,1). \]
While cscore denotes the content score, pscore denotes the proximity score. In this section, we will present several approaches that can be assigned to this class of scoring models, namely scoring approaches by Rasolofo and Savoy [RS03], Büttcher et al. [BC05, BCL06], Uematsu et al. [UIF+08], Monz [Mon04], and Tao and Zhai [TZ07]. Please note that the absolute scores computed in [RS03, BC05, BCL06, Mon04, UIF+08, TZ07] differ by a factor of two from the descriptions presented here to fit our framework.
Dividing the scores from the original papers by two however neither influences the ranking (as the order of scores is preserved) nor the ratio between scores attributed to documents. 2.4.1 Rasolofo and Savoy Rasolofo and Savoy [RS03] compute results of a query q by means of a two-stage algo- rithm: in stage one, the algorithm computes the top-100 documents from C according to the cscore which is a variant of Okapi BM25 described in Section 2.2.1. In stage two, it reranks these documents. To this end, for every such document, it computes the pscore. Reranking just the top-100 documents from stage one is motivated by efficiency needs and the main interest to improve the ranking of the top-ranked documents. The algorithm sequentially reads the query term positions within d and computes a weight for each pair of query term positions (i,j) ∈ Qall,d(q,dist) as tpi(i,j) = 1 (i − j)2 . 2.4 Linear Combinations of Scoring Models 17 The underlying assumption is that there is no semantic relationship between two key- words located in a text window with a width that exceeds dist. [RS03] sets dist to a value of 5. This means for the poem example and dist=5 that song10 only influences (and is influenced by) sea5, shell6, and song14 as these three occurrences are located within the text window of song10. We define the sum of tpi contributions of term pair (ti, tj ) within a text window of size dist as tpiaccd(ti, tj,dist) = ∑ (i,j)∈Qall,d(q,dist):pi=ti∧pj =tj tpi(i,j). As shell4 or sea53 are too distant from song10 in document d (i.e., they are not part of song10’s text window), the term pairs (shell4, song10) and (song10, sea53) do not influence tpiaccd(shell,song, 5) and tpiaccd(sea,song, 5), respectively. For our example the formula leads to the following tpiacc scores: tpiaccd(sea,sea, 5) = 1 (3 − 1)2 + 1 (5 − 1)2 + 1 (5 − 3)2 + 1 (55 − 53)2 = 0.625, tpiaccd(sea,shell, 5) = 1 (2 − 1)2 + 1 (4 − 1)2 + 1 (6 − 1)2 + 1 (3 − 2)2 + 1 (5 − 2)2 + 1 (4 − 3)2 + 1 (6 − 3)2 + 1 (5 − 4)2 + 1 (6 − 5)2 + 1 (54 − 53)2 + 1 (56 − 53)2 + 1 (55 − 54)2 + 1 (56 − 55)2 = 8.484, tpiaccd(sea,song, 5) = 1 (10 − 5)2 = 0.04, tpiaccd(shell,shell, 5) = 1 (4 − 2)2 + 1 (6 − 2)2 + 1 (6 − 4)2 + 1 (56 − 54)2 = 0.8125, tpiaccd(shell,song, 5) = 1 (10 − 6)2 = 0.0625, and tpiaccd(song,song, 5) = 1 (14 − 10)2 = 0.0625. The weight for a pair of query terms (ti, tj ) wd(ti, tj,dist) = tpiaccd(ti, tj,dist) · (k1 + 1) tpiaccd(ti, tj,dist) + K is structure-wise similar to the term frequency component of BM25, substituting tf(t,d) for tpiaccd(ti, tj ). Finally, the proximity scoring function for a document d on query q sums up the contributions of all pairs of query terms in document d. Hence, the formulation pscore(d,q,dist) = ∑ (ti,tj )∈q×q wd(ti, tj,dist) · min{qw(ti),qw(tj )}, where qw(ti) = idf2(ti) · qtf (ti)k3+qtf (ti) , which shrinks the influence of a query term to the importance of the least important term in the considered pair. The final score is defined as scoreRasolofo(d,q,dist) = 1 2 · cscore(d,q) + 1 2 · pscore(d,q). 18 2. Proximity Score Models 2.4.2 Büttcher et al. Büttcher et al. [BC05, BCL06] combine the baseline BM25 scoring function3 with a proximity score which we will describe in the following to compute document-level relevance scores. For any document d, they maintain for every query term tk an accu- mulator value denoted by accd(tk). This accumulator value can be summarized by the following formula: accd(tk) = ∑ (i,j)∈Qadj,d(q):pi �=pj ,pi=tk idf1(pj ) (j − i)2 + ∑ (i,j)∈Qadj,d(q):pi �=pj ,pj =tk idf1(pi) (j − i)2 . 
Büttcher et al. use adjacent query term occurrences to compute accumulator values. Adjacency is used in the broader sense here such that non-query terms might be located between adjacent query terms. It is obvious that the accumulator value increases the more, the less distant the occurrences of two distinct terms are and the less documents in the collection contain the adjacent term. For our example we demonstrate how to compute accd scores. To this end we have to consider Qadj,d(q)= {(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 10), (10, 14), (14, 53),(53, 54), (54, 55), (55, 56)} which contains the position information of adjacent query term oc- currences in the example poem. accd(t) considers idf scores of t’s adjacent query terms. We briefly explain how to compute accd(song): for this purpose we consider all query term occurrences adjacent to any occurrence of query term song in d. song14 is adjacent to sea53 and shell6 is adjacent to song10. accd(song) is increased by the idf scores of the adjacent terms of ’song’ but decreases with the square of increasing distance to the adjacent terms. Please note that accd(song) is not influenced by (song10,song14) as p10 = p14 = song. Consequently, accd(song) =[ idf1(sea) (53 − 14)2 ] + [ idf1(shell) (10 − 6)2 ]. The proximity score structurally resembles the BM25 scoring model presented in Section 2.2, substituting the accumulator values for the tf values: pscore(d,q) = ∑ t∈q min{1, idf1(t)} accd(t) · (k1 + 1) accd(t) + K . The document score for a document d structurally corresponds to the one formulated in Subsection 2.4.1: scoreBüttcher(d,q) = 1 2 · cscore(d,q) + 1 2 · pscore(d,q). In Büttcher et al.’s approach [BC05, BCL06] only adjacent query terms of the same document influence a query term’s aggregated proximity score. This varies the previous work by Rasolofo and Savoy [RS03] that considers all query terms within a given text window. Moreover, Büttcher et al. limit the term proximity score’s influence on the document score for terms occurring in just a few documents. They do so by restricting idf1(t) as a multiplier in pscore to one. 3with b = 0.5 and k = k1 = 1.2 2.4 Linear Combinations of Scoring Models 19 2.4.3 Uematsu et al. From a structural point of view, Uematsu et al.’s approach [UIF+08] is very similar to Büttcher et al.’s approach. Like Büttcher et al., they use a variant of BM25 (in a slightly different version) as cscore. Details can be found in Table 2.1. The proximity score structurally resembles the cscore, however substitutes the tf values by co-occurrence values of all query terms. Here coocc(d,q) counts the number of sentences where all query terms from q occur: pscore(d,q) = ∑ ti∈q coocc(d,q) · (k1 + 1) coocc(d,q) + K · idf3(ti). While the first sentence (positions 1 to 12) in our running example contains all query terms (sea, shell, and song) at least once, the second sentence (positions 13 to 52) only contains the query term song and the third sentence (positions 53 to 64) only sea and shell, but not song; hence coocc(d,q)=1. scoreUematsu(d,q) combines its cscore and pscore in the same way as scoreBüttcher : scoreUematsu(d,q) = 1 2 · cscore(d,q) + 1 2 · pscore(d,q). 2.4.4 Monz Monz [Mon04] uses a normalized version of Buckley’s Lnu.ltc weighting scheme [BSM95] as cscore. Monz normalizes the Lnu.ltc score scoreLnu.ltc(d,q) = ∑ ti∈q lnu(d,ti) · ltc(ti) with respect to the maximal similarity score of the query such that scoreLnu.ltc,norm(d,q) = scoreLnu.ltc(d,q) maxd∈C scoreLnu.ltc(d,q) . 
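For illustration, the Lnu and ltc weights from Section 2.2.2 that this content score is built from can be computed as in the following Python sketch (our own, simplified illustration; idf1 is passed in as a function and the pivot value is left to the caller):

import math
from collections import Counter

def lnu_weight(doc_terms, term, pivot, slope=0.2):
    """Lnu document weight (Section 2.2.2): log-smoothed tf with pivoted length normalization."""
    tf = Counter(doc_terms)
    if tf[term] == 0:
        return 0.0
    avg_tf = sum(tf.values()) / len(tf)   # average tf over terms occurring in d
    dt_d = len(tf)                        # number of distinct terms in d
    smoothed = (1 + math.log(tf[term])) / (1 + math.log(avg_tf))
    return smoothed / ((1 - slope) * pivot + slope * dt_d)

def ltc_weight(query_terms, term, idf1):
    """ltc query weight (Section 2.2.2): log-smoothed qtf times idf, cosine-normalized."""
    qtf = Counter(query_terms)
    norm = math.sqrt(sum(((math.log(qtf[t]) + 1) * idf1(t)) ** 2 for t in qtf))
    return (math.log(qtf[term]) + 1) * idf1(term) / norm

Monz's cscore then sums lnu(d,ti) · ltc(ti) over the query terms and divides the result by the maximum of this sum over the collection, as stated above.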
The model used for the pscore builds on the concept of a minimal matching span, which is the smallest text excerpt that contains all terms that occur both in the query and in the document. To capture the minimal matching span more formally, Monz defines the concept of matching spans. Given a document d and a query q, a matching span ms is a set of consecutive positions, where q ∩ Td({1, . . . , ld}) = q ∩ Td(ms). That means the consecutive document part represented by ms contains every query term that occurs in document d at least once. The length of a matching span is defined as length(ms) = max(ms) − min(ms) + 1, max(ms), min(ms) being the highest and lowest position in ms, respectively. A matching span for d and q with the shortest length is called minimal matching span mms(d,q); its length is denoted length(mms(d,q)). If multiple minimal matching spans exist, we can safely pick any of them (e.g., the one with the lowest min(ms)) 20 2. Proximity Score Models to compute the proximity score pscore: we can do that since the pscore only uses length(mms(d,q)) (multiple minimal matching spans have the same length) and the number of query terms in the minimal matching span (which is also equal for multiple matching spans since they contain all query terms that occur in the document). We illustrate the concepts of matching span and minimal matching span using our running example, showing again the query term occurrences in d: sea1,shell2,sea3,shell4,sea5,shell6,song10︸ ︷︷ ︸ mms(d,q) ,song14,sea53,shell54︸ ︷︷ ︸ ms ,sea55,shell56. Matching spans of d contain all terms that occur both in q and d. Therefore, {14, . . . , 54}, but also others like {14, . . . , 56} qualify as matching spans. A matching span with the smallest length, called minimal matching span, however, consists of {5, . . . , 10}. Note that in [Mon04] minimal matching spans have been defined ambigu- ously. Instead of defining the minimal matching span as the matching span with the shortest length in the given document, it has only been checked that a given matching span does not contain another matching span with lower length as a subset. However, this might result in multiple minimal matching spans of different lengths: in the exam- ple given above, employing the ambiguous definition of Monz [Mon04], {14, . . . , 54} and {6, . . . , 53} would qualify as minimal matching spans besides {5, . . . , 10} since all of them do not contain another matching span with lower length as subset. The span size ratio considers the proximity of matching terms and is defined as ssr(d,q) = |q ∩ Td({1, . . . , ld})| length(mms(d,q)) . It measures how large the document excerpt has to be in order to cover all possible distinct query terms in a document. In our example, length(mms(d,q)) = 10 - 5 + 1 = 6 and ssr(d,q) = 3 6 = 0.5. The matching term ratio mtr(d,q) = |q ∩ Td({1, . . . , ld})| |q| measures the fraction of covered query terms in a document which is 3 3 = 1. Span size ratio and matching term ratio are used to compute the pscore(d,q) = ssr(d,q)α · mtr(d,q)β. Here, α and β are additional weights for the span size ratio and the matching term ratio, respectively.4 The score of a document is then computed as scoreM onz(d,q) = { λ · cscore(d,q) + (1 − λ) · pscore(d,q) : |q ∩ Td({1, . . . , ld})| > 1 cscore(d,q) : else . If d contains only one query term (i.e., |q ∩ Td({1, . . . , ld})| = 1), the pscore is omitted. If |q ∩ Td({1, . . . 
, ld})| > 1 (i.e., the document d and query q have more than one query term in common), both cscore and pscore influence the final score. 4Monz uses α=0.125, β=1.0, and λ=0.4 for his experiments. 2.4 Linear Combinations of Scoring Models 21 2.4.5 Tao and Zhai Tao and Zhai [TZ07] linearly combine a baseline cscore with a proximity score. The baseline scores are 1) the KL-divergence model and 2) the Okapi BM25 model as de- scribed in Section 2.2. The authors outline five proximity distance functions which can be classified into span-based and distance aggregation measures. The first class computes proximity scores based on the length of a text segment that covers all query terms. The second class aggregates distances over pairs of query terms and is more local than the first one which takes all query terms into account. The authors use two different span-based measures: 1) Span is defined as the length of the document part that covers all query term oc- currences in a document, i.e., Span(d,q) = max(Pd(q)) − min(Pd(q)). 2) Min coverage (MinCover) uses the length of the shortest document part that covers each query term at least once in a document, i.e., MinCover(d,q) = min{max(P ′) − min(P ′) : Td(P ′) = Td(Pd(q))}, where P ′ is a set of positions in document d. Both span-based measures are normalized such that Spannorm(d,q) = max(Pd(q)) − min(Pd(q)) |Pd(q)| and MinCovernorm(d,q) = min{max(P ′) − min(P ′)|Td(P ′) = Td(Pd(q))} |Td(P ′)| . Distance aggregation measures come in three variants and are all based on the minimum distance between pairs of query terms ta and tb defined as mindist(ta, tb,d) = min{|i − j| : pi(d) = ta ∧ pj (d) = tb}. Those three variants encompass 1) Minimum pair distance (MinDist) which is the smallest distance over all query term pairs in document d, i.e., MinDist(d,q) = minta,tb∈Td(Pd(q)),ta �=tb{mindist(ta, tb,d)}. 2) Average pair distance (AvgDist) which is the average distance over all query term pairs in document d, i.e., AvgDist(d,q) = 2 n(n − 1) ∑ ta,tb∈Td(Pd(q)),ta �=tb mindist(ta, tb,d) with n being the number of unique matched query terms in d. 22 2. Proximity Score Models 3) Maximum pair distance (MaxDist) which is the maximum distance over all query term pairs in document d, i.e., MaxDist(d,q) = maxta,tb∈Td(Pd(q)),ta �=tb{mindist(ta, tb,d)}. For the case that document d contains just one kind of query term, MinDist(d,q), AvgDist(d,q), and MaxDist(d,q) are all defined as ld. The authors propose two constraints for a function that transforms the value of a proximity distance function δ(d,q) into a proximity score π(d,q) which is a func- tion of δ(d,q). While the first constraint (called proximity heuristic) attributes smaller proximity scores to larger δ(d,q), the second constraint suggests a convex-shaped trans- formation function that only rewards really close term occurrences. Both constraints lead to the definition of a proximity score π(d,q) = log(α + e−δ(d,q)), where α is a tuning parameter. The baseline retrieval models KL-divergence and BM25 are enriched with the proximity score such that R1(d,q) = 1 2 ·scoreKL(d,q) + 1 2 ·π(d,q) and R2(d,q) = 1 2 ·scoreBM25(d,q) + 1 2 ·π(d,q). 2.5 Integrated Score Models Another category of proximity-enhanced score models are integrated score models. Un- like the linear combination models presented in Section 2.4, integrated score models do not linearly combine cscore and pscore parts, but seek providing a holistic, integrated approach to rank. 
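Both the linear combination models above and the integrated models presented below operate on term position lists (cf. Section 2.8). As an illustration of the kind of computation involved, the following Python sketch (our own; the positions are taken from the running example, the function names are ours) computes the unnormalized span- and distance-based measures of Section 2.4.5:

from itertools import combinations

# query term positions in the example poem (Figure 2.2)
positions = {
    "sea":   [1, 3, 5, 53, 55],
    "shell": [2, 4, 6, 54, 56],
    "song":  [10, 14],
}

def span(positions):
    """Span: length of the document part covering all query term occurrences."""
    all_pos = [p for ps in positions.values() for p in ps]
    return max(all_pos) - min(all_pos)

def min_cover(positions):
    """MinCover: length of the shortest part covering each matched query term at least once."""
    hits = sorted((p, t) for t, ps in positions.items() for p in ps)
    best = None
    for i, (start, _) in enumerate(hits):
        seen = set()
        for pos, term in hits[i:]:
            seen.add(term)
            if len(seen) == len(positions):   # this window covers every matched term
                best = pos - start if best is None else min(best, pos - start)
                break
    return best

def mindist(pos_a, pos_b):
    """Minimum distance between occurrences of two distinct query terms."""
    return min(abs(i - j) for i in pos_a for j in pos_b)

pair_dists = [mindist(a, b) for a, b in combinations(positions.values(), 2)]
print(span(positions))                    # 56 - 1 = 55
print(min_cover(positions))               # 10 - 5 = 5
print(min(pair_dists))                    # MinDist: 1 (e.g., sea at 1, shell at 2)
print(sum(pair_dists) / len(pair_dists))  # AvgDist: (1 + 5 + 4) / 3 ≈ 3.33
print(max(pair_dists))                    # MaxDist: 5 (the sea-song pair, |5 - 10|)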
2.5.1 De Kretser and Moffat

De Kretser and Moffat [dKM99, dKM04] describe a model that does not make use of Okapi BM25 [RW94, RWHB+95], but relies exclusively on proximity scores of query terms in the text collection. It retrieves the exact point of maximum similarity to the query for any given document, not the document as a whole. The presentation of result snippets can benefit from the knowledge about the exact point of maximum similarity as this opens the option to show only relevant document parts to the user. De Kretser and Moffat's key assumption is that text regions having a high density of query terms are considered highly important, while isolated query terms in a document are considered less important. Thus, dense text regions are attributed high scores, while text regions consisting of isolated query terms generate lower scores. To this end, for each query term t, there is a contribution function ct which expresses the impact of t, occurring at a position l, on the score for position x. There are three main factors that influence contribution functions: shape, height, and spread.

The shape of the contribution function determines the region of influence of each appearance of t in d. De Kretser and Moffat implemented triangle, cosine, circle, and arc functions that are plotted in [dKM99]. Unfortunately, the plots depicted in [dKM99] for arc and circle do not match the formulas. Hence, we added two additional functions, named circle' and arc', that match the plots.

Figure 2.3: Plots of the contribution functions (triangle, cosine, circle, arc) according to the formulas in [dKM99]; the curves show ct(x,l) over the offset x−l from a query term occurrence, peaking at height ht and vanishing beyond spread ±st.

Figure 2.4: Arc and circle replaced by arc' and circle' to fit the plots in [dKM99, dKM04].

The corresponding contribution functions are listed below:

1. triangle: ct(x,l) = max(0, ht · (1 − |x−l|/st))
2. cosine: ct(x,l) = max(0, ht · (1 + cos(π · |x−l|/st))/2)
3. circle: ct(x,l) = max(0, ht · √(1 − (|x−l|/st)²))
4. arc: ct(x,l) = max(0, (ht/2) · (1 − |x−l|/st + √(1 − (|x−l|/st)²)))
5. circle': ct(x,l) = max(0, ht · (1 − √(1 − (1 − |x−l|/st)²)))
6. arc': ct(x,l) = max(0, (ht/2) · (1 − √(1 − (1 − |x−l|/st)²) + (1 − |x−l|/st))),

where |x − l| denotes the positional distance between an occurrence of query term t at position l and the position x we want to compute a score for. Furthermore, ht represents the height and st the spread of t's contribution function. The plots for the functions are depicted in Figure 2.3 and Figure 2.4, respectively.

Figure 2.5: Example: triangle-shaped contribution functions for the query term occurrences (sea, shell, song) in the example document.

The maximum height of the contribution function for a query term occurs at the position of each query term appearance. A query term t generates a contribution function using either a non-damped height

ht,non-damped = qtf(t) · lC/ctf(t)

or a damped height

ht,damped = qtf(t) · log_e(lC/ctf(t)),

which are alternatively used as ht in the contribution functions. (The usage of qtf(t) indicates that this approach allows for term repetitions in the same query.)

Figure 2.6: Example: aggregated score score_x(d,q) over all positions x of the example document.

The spread or width of the contribution function determines the distance from the query term appearance in which the query term exerts non-zero influence on the aggregated score.
A query term t influences proximity scores of terms within a radius of

st = (dt/lC) · (lC/ctf(t)) = dt/ctf(t).

The aggregated score for position x in document d and query q is the sum of the contribution function values:

score_x(d,q) = ∑_{t∈q} ∑_{l∈Pd(t)} ct(x,l).

Following our running example, Figure 2.5 depicts the individual non-aggregated triangle-shaped contribution functions for each query term occurrence. Figure 2.6 depicts the aggregated scores at all locations in the example document; positions with query term occurrences are marked with crosses.

To reduce computation costs, de Kretser and Moffat restrict the evaluation of aggregated proximity scores to locations where query terms appear. Please note that, for some documents d, the highest score_x(d,q) might be located at a non-query term location x not considered for efficiency reasons. For the example document, whose aggregated scores at various positions are shown in Figure 2.6, this issue does not arise since the highest score_x(d,q) is achieved at positions 4 and 5 where query terms occur. Figure 2.7 shows a scenario where the highest score_x(d,q) is located at a non-query term position. This example uses a cosine-shaped contribution function applied to an example document that contains the query term shell at position 4 and song at position 10. The highest score_x(d,q) value, however, is achieved at the non-query term position 6.

Figure 2.7: Example: highest aggregated score score_x located at a non-query term location (cosine-shaped contribution functions for shell at position 4 and song at position 10).

To obtain a ranking for documents, the authors describe two algorithms: for a given query q, both algorithms start off with retrieving for each document d in the corpus C the set of positions Pd(q) where query terms occur.

• The first algorithm computes for every document d and all positions x ∈ Pd(q) the score score_x(d,q) at position x in d; the scores from all documents are sorted in descending order. For each document d ∈ C the algorithm creates a document accumulator A[d] that keeps the document's score. The algorithm then greedily processes the scores from all documents and adds them to the corresponding accumulators until k documents have been seen. Those documents are returned as the top-k results.
• The second algorithm computes for each document d the maximum similarity score at any position x ∈ Pd(q) in d and returns the k documents with the highest scores.

As de Kretser and Moffat consider the first approach more effective, we use this one later for our experiments.

2.5.2 Song et al.

Song et al. [STW+08] describe an algorithm that partitions documents into groups of subsequent query term occurrences. By construction, the query terms in such a group, called espan (short for expanded span), are pairwise distinct. By means of the espans that contain a query term, the algorithm computes the query term's relevance contribution score (as a substitute for proximity scores) which is directly plugged into an Okapi BM25 ranking function. The following assumptions underlie the design of the algorithm: the closer appropriately chosen groups of query term occurrences lie in a document, the more likely that the corresponding document is relevant. The more espans a document contains, the more likely that the document is relevant.
The more query terms an espan of a document contains and the more important these terms are, the more likely that the document is relevant. The algorithm to detect espans is depicted in Figure 2.8 and proceeds as follows: given a document d and a query q, all query term occurrences form a sequence of (term, position) pairs that are ordered by ascending position; each such pair is called hit. We identify the jth query term occurrence in the given document by (aj,bj ), aj and bj being the query term and its position in the document, respectively. The algorithm distinguishes four cases while scanning the position-ordered sequence of hits: (1) If the distance between the current hit (aj,bj ) and the next hit (aj+1,bj+1) is larger than a user-defined threshold dmax (i.e., bj+1 −bj > dmax), a new espan starts with the next hit. This is covered in lines 6-9. (2) If the current hit (aj,bj ) and the next hit (aj+1,bj+1) represent the same query term (i.e., aj = aj+1), a new espan starts with the next hit which is described in lines 10-13. (3) If the next hit (aj+1,bj+1) represents a term aj+1 which is identical to a hit’s term in the current subchain currentEspan, it computes the distance between the current and the next hit as well as the distance between the existing hit and the current hit. The new espan begins at the bigger gap which is handled in lines 14-23. (4) Otherwise the algorithm scans the next hit in the sequence of (query term, position) pairs which is caught in lines 24 and 25. Please note that for (3) a tie-breaker is missing, if the distance between the current and the next hit equals the distance between the existing hit and the current hit. For this case our implementation always splits between the current and the next hit. 28 2. Proximity Score Models detectEspans(d, q) 1 termsAndPositions ← sortByPositionAscending(d, q) 2 length ← termsAndPositions.length() 3 espans ← {∅} 4 currentEspan ← ∅ 5 for (j = 1 to length-1) 6 if ((bj+1 − bj ) > dmax) 7 currentEspan ← currentEspan ∪ {(aj , bj )} 8 espans ← espans ∪ {currentEspan} 9 currentEspan ← ∅ 10 else if (aj+1 = aj ) 11 currentEspan ← currentEspan ∪ {(aj , bj )} 12 espans ← espans ∪ {currentEspan} 13 currentEspan ← ∅ 14 else if (∃(ax, bx) ∈ currentEspan s.t. bj+1 = bx) 15 dist1 ← bj+1 - bj 16 dist2 ← bj - bx 17 if (dist1 ≥ dist2) 18 currentEspan ← currentEspan ∪ {(aj , bj )} 19 espans ← espans ∪ {currentEspan} 20 currentEspan ← ∅ 21 else 22 espans ← espans ∪ {currentEspan} 23 currentEspan ← {(aj , bj )} 24 else 25 currentEspan ← currentEspan ∪ {(aj , bj )} 26 if (length �= 0) 27 currentEspan ← currentEspan ∪ {(alength, blength)} 28 espans ← espans ∪ {currentEspan} 29 return espans Figure 2.8: detectEspans pseudocode. For a query q, the set of all espans in document d is denoted as espans(d,q). We illustrate now how to compute all espans for a document with the help of our running example, assuming dmax=10. The query term occurrences are located at {sea1, shell2, sea3, shell4, sea5, shell6, song10, song14, sea53, shell54, sea55, shell56}. The first espan consists of {sea1, shell2} by application of (3) since sea3 is an identical hit to sea1. As the distance between sea1 and shell2 equals the distance between shell2 and sea3, our tie-breaker applies: it splits between the current hit shell2 and the next hit sea3. The second espan follows the same rule and consists of {sea3, shell4}. The next espan consists of {sea5, shell6, song10} by application of (2) as song10 is identical to song14. 
The distance between song14 and sea53 exceeds dmax. Hence, by application of (1), {song14} forms an espan. According to (3) the remaining two espans are {sea53,shell54} and {sea55,shell56}. Intuitively, for Song et al., the relevance contribution of an espan is a function of its density and the number of query terms occurring in the espan. The density of an espan is defined as density(espan) = #query terms(espan) width(espan) , 2.5 Integrated Score Models 29 where width(espan) = { maxpos(espan)-minpos(espan) + 1: #query terms(espan) > 1 dmax: else , maxpos(espan)=max{b|(a,b) ∈ espan}, and minpos(espan)=min{b|(a,b) ∈ espan}. Song et al. measure a term t’s relevance contribution given an espan that contains t by means of a function f(t,espan) = (density(espan))x · (#query terms(espan))y. If the given espan does not contain term t, f(t,espan) is set to zero. For all their ex- periments Song et al. set x=0.25 and y=0.3, respectively. Depending on the collection, they set b=0.3 and k1=0.4 (TREC-9 and 10) or b=0.45 and k1=2.5 (TREC-11). The relevance contribution of all occurrences of term t in espans(d,q) are accumu- lated to: rc(t,d) = ∑ espanj∈espans(d,q) f(t,espanj ). To compute the final score scoreSong, the authors employ Okapi BM25 and replace tf(ti,d) by rc(ti,d) with idf3 as idf score variant such that scoreSong(d,q) = ∑ ti∈q rc(ti,d) · (k1 + 1) rc(ti,d) + K · idf3(ti). In contrast to Okapi BM25, which attributes a fixed weight of one to each term occurrence, the weight in Song’s approach is dependent on the environment of the term occurrence and the density of the espan it has been assigned to. Although both the approaches proposed by Monz (cf. Section 2.4.4) and Song et al. rely on spans to compute relevance scores, they differ in some features. While Monz considers only the span of minimal length that contains all query terms, Song et al.’s final relevance score incorporates multiple expanded spans. Monz’ minimum matching spans contain all query terms that occur in the considered document, Song et al.’s espan may only contain a subset of them. There is a threshold dmax that limits the width of expanded spans and the relevance contribution of espans is directly plugged in the Okapi BM25 model. 2.5.3 Mishne and de Rijke In [MdR05], Mishne and de Rijke make use of a scoring model similar to the tf-idf model [SWY75] and additionally incorporate the coverage of query terms in the docu- ment to be scored. They use the ordered query Sq to construct all possible term-level n-grams that are part of the query following an ”everything-is-a-phrase approach”. Considering an ordered query Sq=(sea, shell, song), the corresponding 1-grams are (sea), (shell) and (song), 2-grams are (sea, shell) and (shell, song), while the only 3-gram is (sea, shell, song). 30 2. Proximity Score Models Every term-level n-gram (i.e., n consecutively occurring terms in Sq) derivable from the ordered query Sq forms a phrase, with n between 1 and the length of the ordered query. Proximity terms are term-level n-grams like phrases but the authors use two rewrit- ing methods, namely the fixed distance and variable distance mode. For fixed distance proximity terms, the length of the proximity term n and a tuning parameter k are used as input to a combining method (e.g., k + n) that determines the window size where proximity term occurrences in documents are considered (an example follows below). If the distance is m = k + n, all term occurrences in a window of size m or less in a document are attributed the same score. 
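The "everything-is-a-phrase" enumeration just illustrated is easy to state in code; the following short Python sketch (our own, purely illustrative) generates all term-level n-grams of an ordered query, which then serve as phrases or proximity terms:

def term_level_ngrams(ordered_query):
    """All term-level n-grams of an ordered query, for n = 1 .. len(query)."""
    q = list(ordered_query)
    return [tuple(q[i:i + n])
            for n in range(1, len(q) + 1)
            for i in range(len(q) - n + 1)]

print(term_level_ngrams(("sea", "shell", "song")))
# [('sea',), ('shell',), ('song',), ('sea', 'shell'), ('shell', 'song'), ('sea', 'shell', 'song')]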
For variable distance proximity terms, terms that are found in smaller windows than size m in the document are attributed a higher score: window sizes are decreased stepwise from m = k + n to 1 + n and matching proximity terms are counted in each step. This is equivalent to issuing a query that consists of multiple fixed distance proximity terms of varying size; the tf value of the n-gram can be increased by one for each window size in which the n-gram occurs. Phrases and proximity terms incorporate position information of query term occur- rences into the scoring model and can replace query terms. The score in its basic form where each ti represents a query term is defined as scoreM ishne(d,Sq) = ∑ ti in Sq √ qtf(ti) · idf(ti) norm(Sq) · √ tf(ti,d) · idf(ti) norm(d) · mtr(d,q) · weight(ti), where norm(Sq) = √ ∑ ti in Sq √ qtf(ti) · idf(ti)2, norm(d) = √ ld, mtr(d,q) = |q ∩ Td({1, . . . , ld})| |q| , and idf(ti) = 1 + idf1(ti). qtf(ti) counts the number of occurrences of query term ti in Sq such that for Sq=(sea, shell, song, sea), qtf(sea) would be 2. weight(ti) is used as a phrase weight proportional to the real term frequency of phrases in different fields of HTML documents (such as BODY, ANCHOR TEXT and TITLE) and seems to be dis- abled for most evaluation methods. We think that norm(Sq) should be rather√∑ ti in Sq √ qtf(ti)2 · idf(ti)2 to make it appear more similar to a cosine normal- ization. The basic form of the scoring model with ti representing a query term can be varied such that ti represents a phrase, a fixed distance proximity term or a variable distance proximity term. We illustrate the effects of variable as well as fixed distance proximity terms on the computation of tf(ti,d) values for the example that ti represents the 2-gram (shell, song). 2.6 Language Models with Proximity Components 31 The query terms shell and song occur at the following positions in the example document: shell2,shell4,shell6,song10,song14,shell54,shell56. While (shell, song) never occurs as a phrase in our example document (the most proximate occurrence of this term pair is (shell6,song10)), the term pair can still occur as a proximity term in the document if k is chosen large enough: if k is set to 4 and the combining method is m = k + n, a window size of m=6 is induced (as the proximity term (shell, song) is a 2-gram which means n = 2). Using variable distance proximity terms is equivalent to using some fixed distance proximity terms of varying size; for the example with a window size of m = 6, using variable distance proximity terms is equivalent to using four fixed distance proximity terms (with a distance of 6, 5, 4, and 3). As (shell6,song10) has one occurrence in a text window of 5 terms for our example document, in the fixed distance proximity mode it increases tf((shell,song),d) by one, while it increases tf((shell,song),d) by two in the variable distance proximity mode; for a window size of 6 and 5. Being 7 positions apart, (shell4,song11) does not influence tf(ti,d) = tf((shell,song),d). For efficiency reasons, document frequencies of phrases and proximity terms are estimated. To estimate the document frequency for a phrase or proximity term p=(ty, ty+1, . . . 
, tz) with length |p| = z − y + 1, Mishne and de Rijke use different heuristics for their estimations of idf values: • Sum: idf(p) = ∑z i=y idf(ti) = ∑z i=y(1 + idf1(ti)) • Minimum: idf(p) = mini∈{y,...,z} idf(ti) = mini∈{y,...,z}(1 + idf1(ti)) • Maximum: idf(p) = maxi∈{y,...,z} idf(ti) = maxi∈{y,...,z}(1 + idf1(ti)) • Arithmetic mean: idf(p) = 1|p| · ∑z i=y idf(ti) = 1 |p| · ∑z i=y (1 + idf1(ti)) • Geometric mean: idf(p) = ∏z i=y idf(ti) 1 |p| = ∏z i=y (1 + idf1(ti)) 1 |p| 2.6 Language Models with Proximity Components This section presents the language models by Lv and Zhai [LZ09] and by Zhao and Yun [ZY09] that exploit proximity information. 2.6.1 Lv and Zhai In contrast to most other works that deal with language models (LMs), Lv and Zhai [LZ09] do not use one general language model for each document, but one lan- guage model for each word position in a document coined positional language model (PLM) estimated based on position-dependent counts of words. In most existing work on LMs, the estimated document language models only consider the word counts in the document, but not the positions of words. PLMs implement two heuristics which are usually treated externally to LM approaches: 32 2. Proximity Score Models 1. the proximity heuristic that rewards documents which have closeby occurrences of query terms and 2. passage retrieval that scores documents mainly based on the best matching pas- sage. PLMs facilitate the optimization of combination parameters that combine proximity and passage retrieval heuristics on one side and language models on the other side. Furthermore, PLMs allow finding best-matching positions in a document, i.e., support soft passage retrieval. PLMs at a position of a document are estimated based on propagated word counts from the words at all other positions in the document: positions closer to a term occurrence in a document get a higher share of the impact than those farther away which captures the proximity heuristics. A similar approach has also been used by de Kretser and Moffat for text retrieval (cf. Section 2.5.1) and by Beigbeder for XML retrieval (cf. Section 5.2.3). The propagation of a term occurrence to other positions is accomplished by a proximity-density function. A PLM is a generalization of a standard document LM and a window passage LM. Documents can be scored using one PLM or by a combination of multiple PLMs. First, the authors build up a virtual document di for each position i in document d. di is a term frequency vector whose jth component contains the propagated count c′(tj, i) of occurrences of term tj in document d to position i. Thus, p(t|d,i) = c ′(t, i)∑ t′∈V c ′(t′, i) is a PLM at position i, where c′(t, i) = ld∑ j=1 c(t,j) · k(i,j). c(t,j) is the count of term t at position j which is 1 iff t occurs at j (0 otherwise), and k(i,j) (which can be any non-increasing function of |i−j|) serves as a discounting factor. The authors populate the discounting factor with one out of 5 different kernels that determine the influence of a term occurring at position j to position i: 1. Gaussian kernel: k(i,j) = exp[ −(i − j)2 2σ2 ] 2. Triangle kernel: k(i,j) = { 1 − |i−j| σ if |i − j| ≤ σ 0 otherwise 3. Cosine (Hamming) kernel: k(i,j) = { 1 2 [1 + cos( |i−j|·π σ )] if |i − j| ≤ σ 0 otherwise 2.6 Language Models with Proximity Components 33 4. Circle kernel: k(i,j) = { √ 1 − ( |i−j| σ )2 if |i − j| ≤ σ 0 otherwise 5. 
Passage kernel (=the baseline): k(i,j) = { 1 if |i − j| ≤ σ 0 otherwise The spread σ is a tuning parameter which is kept constant for all queries and query terms. De Kretser and Moffat’s approach described in Section 2.5.1 makes use of kernel functions named contribution functions; the authors employ triangle, cosine, and circle kernels as used by Lv and Zhai, but also an arc-shaped kernel which is not used here. Furthermore, Lv and Zhai use two standard smoothing methods, namely Dirichlet prior and Jelinek-Mercer smoothing (adapted to PLMs). Following the descriptions of smoothing methods for LMs in Section 2.2, application of Dirichlet prior smoothing to Lv and Zhai’s PLM leads to pDP (t|d,i) = c′(t, i) + μp(t|C) ( ∑ t′∈V c ′(t′, i)) + μ , and Jelinek-Mercer smoothing results in pJM (t|d,i) = (1 − λ)p(t|d,i) + λp(t|C) with p(t|C) = ctf (t) lC . Intuitively, p(t|d,i) describes the share of the impact of t to impacts of all terms at position i in d, the relative influence share of term t to position i in d. For each PLM, the authors adopt the KL divergence model to compute a position i-specific score S(d,q,i) = − ∑ t∈V p(t|q) · log p(t|q) p(t|d,i) , where p(t|d,i) can be either the non-smoothed, Dirichlet prior smoothed (pDP (t|d,i)) or Jelinek-Mercer smoothed pJM (t|d,i) variant; p(t|q) is the maximum likelihood estimate (MLE) for a query language model, i.e., p(t|q) = qtf (t)|q| or a result of a pseudo relevance feedback algorithm. Ranking options are as follows: • scoring all documents by the best position in that document: S(d,q) = maxi∈{1,...,ld}{S(d,q,i)} • scoring all documents by the average of the best k positions in that document: S(d,q) = 1 k · ∑ i∈top-k of all S(d,q,·) S(d,q,i) 34 2. Proximity Score Models • scoring all documents using a weighted score based on various spreads σ: S(d,q) = ∑ σ∈R βσ · maxi∈{1,...,ld}{Sσ(d,q,i)} with R being a predefined set of spreads and ∑ σ∈R βσ = 1. 2.6.2 Zhao and Yun Zhao and Yun [ZY09] propose a proximity language model that incorporates a so-called proximity centrality and uses Dirichlet prior smoothing. The proximity centrality is computed for every query term and expresses the query term’s importance for the proximity structure in document d relative to the query q = {t1, . . . , tn}. The score for a document d relative to a query q is defined as score(d,q) = ∑ tf (ti,d)>0,ti in q p(ti|θ̂q) log ps(ti|d,u) αd · p(ti|C) + log αd, where θ̂q represents the language model estimate for q (s.t. p(ti|θ̂q) = qtf (ti)|q| ), and u = (u1, . . . ,u|V |) are hyper-level parameters of the Dirichlet prior with ui = λProxB(ti). ps(ti|d,u) = θ̂Bd,ti = tf(ti,d) + ui + μp(ti|C) ld + ∑|V | i=1 ui + μ is the seen word probability of ti in document d wrt its proximity model. αd · p(ti|C) is the probability assigned to unseen words in d, where αd = μ ld + ∑|V | i=1 ui + μ and p(ti|C) = ctf(ti) lC . The authors implement three variants to compute the proximate centrality ProxB(ti) of a term ti in d: 1. minimum distance: ProxM inDist(ti) = f(mintj �=ti,tj in q {Dis(ti, tj,d)}) 2. average distance: ProxAvgDist(ti) = f( 1n−1 ∑ tj �=ti,tj in q Dis(ti, tj,d)), where n = |{tj in q : tf(tj,d) > 0}| 3. summed distance: ProxSumP rox(ti) = ∑ tj �=ti,tj in q f(Dis(ti, tj,d)) While Dis(ti, tj,d) is the minimum pairwise distance between occurrences of the terms ti and tj in document d, f is a non-linear monotonic function to transform a pair- wise distance dist into a term proximity score: f(dist) = x−dist, where x is a scaling parameter. 
If not both ti and tj occur at least once in d, Dis(ti, tj,d) is set to ld. 2.7 Learning to rank 35 2.7 Learning to rank 2.7.1 General Introduction to Learning to Rank Approaches Learning to rank approaches, a kind of supervised learning approaches, have become popular over the last decade. Supervised learning approaches rely on a training set which consists of a set of training topics, a document collection represented by feature vectors, and the corresponding relevance assessments. According to [Liu11], learning algorithms aim at learning a ranking model (i.e., how to combine the features) such that the learned ranking model can predict the ground-truth labels of the training set as accurately as possible where prediction accuracy is measured using a loss function. [Liu11] contains a comprehensive review of many contributions in the research area of learning to rank. 2.7.2 Svore et al. Svore et al. [SKK10] extend the work by Song et al. [STW+08] summarized in Sec- tion 2.5.2. They provide a measure how to determine the goodness of an espan and extend the espan feature set introduced in [STW+08]. The initial approach presented in [STW+08] used only the density of espans and the number of query terms to assess the goodness of an espan. In this method, the goodness gs of an espan s is defined as gs = ∑ f∈F αfvf,s, where f is a feature of s taken from a feature set F. αf denotes the weight of f and vf,s the value of f for s. The goodness score for document d that contains a set of espans S is defined as gd = ∑ s∈S ∑ f∈F αfvf,s = ∑ f∈F αf ( ∑ s∈S vf,s). The goal is to learn all feature weights αf . To this end, the sum of the document’s espans’ feature vectors is input into LambdaRank [BRL06]. Espan-based features used for the goodness score can be assigned to different cate- gories: basic query match features, formatting and linguistic features, and third-party phrase features extracted from Wikipedia titles and popular n-grams from search engine query logs. A detailed list of espan goodness features can be found in Table 2.2. Model-related features which concern (unigram/standard) BM25, a bigram version of BM25 as well as proximity match features are depicted in Table 2.3. They can be used as additional features to determine the goodness score of a document (substituting∑ s∈S vf,s). 2.7.3 Metzler and Croft In [MC05], Metzler and Croft design a framework to model term dependencies using Markov random fields (MRFs). In statistical machine learning, MRFs are used to 36 2. Proximity Score Models Query Match Features Espan contains ≥ 2 (≥ 4) query terms (both binary) Espan length (number of terms in espan) Count of query terms in espan and density of espan Formatting and Linguistic Features (F) Count of indefinite and definite articles in espan Count of stopwords in espan Espan contains a sentence (paragraph) boundary (binary) Espan contains only stopwords (all binary) Espan contains html markup (bold, italic, tags) (binary) Third-party Phrase Features (P) Espan contains an important phrase (binary) Count and Density of important phrases in espan Table 2.2: Espan goodness features. 
λBM25 Features Term frequency of query unigrams Document frequency of query unigrams Length of body content (number of terms) λBM25-2 Features Term frequency of query bigrams Document frequency of query bigrams Proximity Match Features Relevance contribution (per query term, rc in [STW+08]) Number of espans in the document Maximum, average espan length, maximum, average espan density Maximum, average count of query matches in espans Length of espan with highest term frequency Term frequency of espan with longest length, largest density Table 2.3: Model feature sets. model joint distributions. [MC05] models a joint distribution PΛ(Q,D) over random variables for queries Q and documents D (an estimate of the relevance of a document to a query), parameterized by Λ. Λ is estimated given user-defined relevance assessments. The model uses three kinds of features, namely single query terms, ordered phrases, and unordered phrases. An MRF is generated from an undirected graph G whose nodes represent random variables while edges carry dependence information between random variables. There are two types of nodes, one query node for each query term qi and one document node D. Dependent query terms are connected to each other by edges. All query term nodes are connected to the document node. There exist three variants of the MRF model: 2.7 Learning to rank 37 • Full independence variant (FI): query terms are considered independent given a document D which means that P(qi|D,qj �=i) = P(qi|D), an assumption that many retrieval models like bag-of-words and unigram language models are based on. • Sequential dependence variant (SD): adjacent query terms are considered depen- dent; i.e., P(qi|D,qj �=i) = P(qi|D) only for qj not adjacent to qi. This variant can represent biterm (non-order-aware occurrences of query terms in documents) and bigram (order-aware occurrences of query terms in documents) models. • Full dependence variant (FD): all query terms are dependent on each other. The corresponding graph is complete. sea shell song D sea shell song D sea shell song D Figure 2.9: Three variants of the MRF model for our running example query, i.e., Sq=(sea,shell,song). We depict (left) the full indepence (FI) variant, (middle) the sequential dependence (SD) variant, (right) the full dependence (FD) variant. Figure 2.9 illustrates the three variants of the MRF model for the query from our running example which means that Sq=(sea,shell,song). While the FI variant considers the three query terms as independently occurring in documents, the SD variant con- siders the query as ordered and all adjacent query terms in the query as related: sea and shell as well as shell and song are treated as dependent. The FD variant considers all query term pairs as related: sea and shell, sea and song, and shell and song are connected in the graph. To utilize the MRF model, in a first step, the graph G to represent all query term dependencies is constructed. In a second step, a set of potential functions ψ(·, Λ) over cliques in the graph is defined. Potential functions are parameterized as ψ(c, Λ) = exp(λcf(c)), where f(c) is a feature function over random variables in a clique c. The joint distribution over the random variables in G is defined as PΛ(Q,D) = 1 ZΛ ∏ c∈C(G) ψ(c; Λ), where Q = Sq = (q1, . . . ,qn), ZΛ = ∑ Q,D ∏ c∈C(G) ψ(c; Λ), and C(G) is the set of cliques in G. Metzler and Croft propose three kinds of potential functions that aim at 38 2. 
Proximity Score Models abstracting the idea of term co-occurrence which can be applied to different kinds of cliques: • 2-clique, one edge between document node D and query node qi (i.e., c = qi,D): ψT (c) = λT log P(qi|D) = λT log[(1 − αD) tf(qi,D) lD + αD ctf(qi) lC ], where P(qi|D) is a smoothed language modeling estimate which uses a mixture of a document foreground model for document D and a collection background model, αD = μ μ+lD the Dirichlet prior (cf. Section 2.6). The potential function measures how likely or well D is described by qi. • cliques with two or more query nodes (i.e., c = qi, . . . ,qi+k,D): ψO(c) = λO log P(#1(qi, . . . ,qi+k)|D) = λO log[(1 − αD) tf#1(qi,...,qi+k),D lD + αD ctf#1(qi,...,qi+k) lC ], where tf#1(qi,...,qi+k),D is the number of occurrences of the exact phrase qi, . . . ,qi+k in D. • an unordered window of size N, cliques with two or more query nodes (i.e., c = qi, . . . ,qj,D): ψU (c) = λU log P(#uwN(qi, . . . ,qj )|D) = λU log[(1 − αD) tf#uwN (qi,...,qj ),D lD + αD ctf#uwN (qi,...,qj ) lC ], where tf#uwN (qi,...,qj ),D is the number of ordered or unordered occurrences of the query terms qi, . . . ,qj in D within a window of size N. 2.7 Learning to rank 39 In a third step, documents are ranked according to PΛ(D|Q). PΛ(D|Q) = PΛ(Q,D) PΛ(Q) ∝ PΛ(Q,D) = 1 ZΛ ∏ c∈C(G) ψ(c; Λ) ∝ ∏ c∈C(G) ψ(c; Λ) ∝ ∑ c∈C(G) log ψ(c; Λ) = ∑ c∈C(G) log(exp(λcf(c))) = ∑ c∈C(G) λcf(c) = ∑ c∈T λT fT (c) + ∑ c∈O λOfO(c) + ∑ c∈O∪U λUfU (c) such that λT + λO + λU = 1. T is the set of 2-cliques representing one query term and a document D, O is the set of cliques with a document node and at least two continuously appearing query terms, and U a set of cliques with a document node and at least two non-contiguously appearing query terms. λT , λO, and λU need to be tuned such that the retrieval measure for a given test bed is maximized (the authors use the mean average precision value as retrieval measure). As the authors claim that the mean average precision curve has a near concave surface when plotted against the tuning parameters and due to the small number of tuning parameters, this makes tuning by simple hill climbing feasible (i.e., it is unlikely to run into a local maximum value). 2.7.4 Cummins and O’Riordan Cummins and O’Riordan [CO09] use some term-term proximity measures in a learning to rank framework. To give examples for the various measures, we use our running example, showing again the query term occurrences in d, sea1,shell2,sea3,shell4,sea5,shell6,song10,song14,sea53,shell54,sea55,shell56. For the ease of presentation, we restrict ourselves to the query term pair (sea,song) when we explain term-term proximity measures. Measures which explicitly capture proximity of query term occurrences in documents include 40 2. Proximity Score Models 1) the minimum distance between query terms as used in Tao and Zhai (cf. 
Section 2.4.5): mindist(ti, tj,d) = min{|i − j| : pi(d) = ti ∧ pj (d) = tj}, where mindist(sea,song,d) = |5 − 10| = 5, 2) the distance of average positions of ti and tj in d: diff avg pos(ti, tj,d) = | ∑ pi(d)=ti i tf(ti,d) − ∑ pj (d)=tj j tf(tj,d) |, where diff avg pos(sea,song,d) = |1+3+5+53+55 5 − 10+14 2 | = |117 5 − 24 2 | = 11.4, 3) the average distance between all occurrences of ti and tj in d: avg dist(ti, tj,d) = ∑ i∈Pd(ti) ∑ j∈Pd(tj ) |i − j| |Pd(ti)| · |Pd(tj )| , where avg dist(sea,song,d) = (9+13)+(7+11)+(5+9)+(43+39)+(45+41) 5·2 = 22.2, 4) the average of the shortest distance between all occurrences of the least frequently occurring term ti and any occurrence of the other term tj : avg min dist(ti, tj,d) = ∑ i∈Pd(ti) min{|i − j| : j ∈ Pd(tj )} |Pd(ti)| , where avg min dist(sea,song,d) = (10−5)+(14−5) 2 = 7, 5) the smallest average distance avg match dist(ti, tj,d) when each term occurrence has at most one matching distinct term occurrence while there may be two partner term occurrences ti for some j ∈ Pd(tj ) in avg min dist. For avg match dist, every occurrence of the least frequently occurring term of the term pair in the document has to be paired with a distinct occurrence of the more frequently occurring term of the term pair such that the total distance between the two terms is minimized. To calculate avg match dist(sea,song,d), either song10 or song14 can be paired with sea5, but not both of them; consequently, avg match dist(sea,song,d) = (10−5)+(14−3) 2 = 8 or avg match dist(sea,song,d) = (10−3)+(14−5) 2 = 8, and 6) the maximum distance between two adjacent occurrences of ti and tj , max dist(ti, tj,d) = max{j − i : (i,j) ∈ Qadj,d({ti, tj}) ∧ pi(d) �= pj (d)}, where max dist(sea,song,d) = 53 − 14 = 39. Another way to implicitly measure proximity uses term frequencies of ti and tj in d which comes in the variants 8) sumtf(ti, tj,d) = tf(ti,d) + tf(tj,d), where sumtf(sea,song,d) = 5 + 2 = 7, and 2.7 Learning to rank 41 9) prodtf(ti, tj,d) = tf(ti,d) · tf(tj,d), where prodtf(sea,song,d) = 5 · 2 = 10. High sumtf and prodtf values increase the probability of closer occurrences for the given term pair (ti, tj ) in document d. Other approaches capture information about the entire query (in our example, q = {sea,shell,song}) by 10) the length of the shortest document part that covers all query term occurrences (corresponds to Tao and Zhai’s Span(d,q) measure, cf. Section 2.4.5) FullCover(d,q) = max(Pd(q)) − min(Pd(q)) which is FullCover(d,{sea,shell,song}) = 56 − 1 = 55 in our example, or 11) the length of the shortest document part that covers each query term that occurs in d at least once (also employed by Tao and Zhai as described in Section 2.4.5), MinCover(d,q) = min{max(P ′) − min(P ′)|Td(P ′) = Td(Pd(q))}. In the example MinCover(d,{sea,shell,song}) = 10 − 5 = 5. Normalization measures in use include 12) the length of the document under view ld, and 13) the number of unique query terms in document d, qt(q,d) = |Td({1, . . . , ld}) ∩ q| which is 3 in our example. Cummins and O’Riordan use genetic programming (GP) to learn good scoring mod- els that combine a subset of the measures presented above. Poli et al. have published a guide to GP [PLM08] that presents an introduction to GP but also advanced techniques in the field. They describe that GP randomly creates an initial population of programs and evolves them from generation to generation using a set of primitive modification operations. 
All programs are executed and only the best fitting programs per gener- ation survive and are modified using genetic operations to form the candidate set for the next generation. In GP, the primitive modification operations are crossover (i.e., randomly chosen parts of two parent programs are combined), and mutation (i.e., a randomly chosen part of a parent program is randomly changed). When a solution is acceptable or a stopping criterion is reached (e.g., the number of generations exceeds a threshold), the so-far best program is returned as a solution. Solutions are repre- sented using trees. Each tree (genotype) consists of two types of nodes, operators (i.e., functions) or operands (i.e., terminals). Cummins and O’Riordan run GP six times with an initial population of 2,000 programs for 30 generations and use an elitist strategy which copies the best solution of a 42 2. Proximity Score Models generation to the next generation. They employ three constants for scaling ({1,10,0.5}) and seven functions (+,−, ·,/,√,square(), log()) during evolution with the goal to maximize the MAP metrics performance. Given an n-term query {t1, . . . , tn}, the authors represent documents as n × n ma- trices, where the diagonal entries are some tf-idf measure w(ti) per term ti, and the non-diagonal entries are proximity scores proxv(ti, tj ) for pairs of query terms (ti, tj ), where v denotes the proximity score variant: score(d,q) = ∑ ti∈Td(Pd(q)) ∑ tj∈Td(Pd(q)) { |w(ti)| if i = j |proxv(ti, tj )| if i �= j . The three best proximity functions (from the six runs) are coined prox2, prox5, and prox6 (due to double entries for each query term pair (i.e., there are entries in the document matrix for (ti, tj ) and (tj, ti)), the learned function is twice the value produced by the proximity function): 2 · prox2(ti, tj,d) =log( 10 mindist(ti, tj,d) ) + 5 · prodtf(ti, tj,d) avg dist(ti, tj,d) + √ 10 mindist(ti, tj,d) 2 · prox5(ti, tj,d) =(((( log(FullCover(d,q)) mindist(ti, tj,d)2 + 10 sumtf(ti, tj,d) ) · mindist(ti, tj,d) − 0.5) /mindist(ti, tj,d) 2 + log(0.5) + prodtf (ti,tj ,d) avg dist(ti,tj ,d) 0.5 )/mindist(ti, tj,d)) − 0.5 2 · prox6(ti, tj,d) =((3 · log( 10 mindist(ti, tj,d) ) + log(prodtf(ti, tj,d) + 10 mindist(ti, tj,d) ) + 10 mindist(ti, tj,d) + prodtf(ti, tj,d) sumtf(ti, tj,d) · qt(q,d) )/qt(q,d)) + prodtf(ti, tj,d) avg dist(ti, tj,d) · mindist(ti, tj,d) The authors use scoreES and a scoreBM 25 variant as baselines (cf. Section 2.2 for details); the term weighting scheme scoreES (cf. Section 2.2) is linearly combined with the learned proximity score. An additional proximity-enhanced baseline is scoreES combined with MinDist as proximity function as used by Tao and Zhai (cf. Sec- tion 2.4.5). 2.8 System-Oriented Comparison of Implementation Ef- forts per Scoring Model This subsection aims at reviewing the required implementation effort for each of the scoring models we have presented in Chapter 2. Table 2.4 gives just a rough overview of the components needed by each scoring model (with additional remarks in the cap- tion of the table). As content scores which do not use proximity information, BM25 variants, Lnu.ltc, ES, unigram LMs (non-smoothed, Jelinek-Mercer, and Dirichlet prior 2.8 System-Oriented Comparison of Implementation Efforts per Scoring Model 43 smoothed), and KL-divergence models do not need materialized term position lists. 
All presented proximity scoring models can be implemented with term position indexes ex- cept Uematsu et al.’s approach [UIF+08] that uses sentence-level term indexes. While all presented linear combination scoring models use avgdl and ld information, most non-linear combination scoring models only use ld. The presented learning to rank and language model approaches do not incorporate idf values, the remaining approaches employ some form of idf. ctf values are only used by Metzler and Croft [MC05] as well as de Kretser and Moffat [dKM99, dKM04]: while the first uses ctf for terms, phrases, and unordered windows, the latter uses ctf only for terms. Monz’ scoring model [Mon04] is the only model that makes use of the collection-related dt and avgdt values. Some variants of the scoring models presented in the original papers may require more features than the ones listed in Table 2.4. We will now provide more details for some scoring models but exclude tuning parameters from our descriptions as we consider them known after training the respective scoring model. While Rasolofo and Savoy’s approach [RS03] considers all query term occurrences in small text windows, Büttcher et al.’s approach [BC05, BCL06] only considers adjacent query term occurrences in unrestricted text windows. Determining those query term occurrences can be implemented using term position lists. For each term, Uematsu et al.’s sentence-level scoring model [UIF+08] needs to index the term occurrences on a sentence-level to compute the number of sentences in a document where all query terms co-occur. Monz’ approach [Mon04] can be implemented using term position lists that help to determine matching spans and minimal matching spans, respectively. De Kretser and Moffat [dKM99, dKM04] use term positions to determine a score for a given document at a given position: to this end, the positional distances between the scored position and positions of query term occurrences are taken into account. Song et al.’s approach [STW+08] relies on term position lists to segment documents into espans and uses dmax as a maximum width of espans. In some settings, Mishne and de Rijke [MdR05] employ a tag-related weight which is proportional to the number term occurrences within a certain element of an HTML document (e.g., BODY, ANCHOR TEXT or TITLE). Term positions are needed to determine phrases and proximity term occurrences in a complete document or a given tag scope of a document, respectively. Lv and Zhai’s approach [LZ09] and Zhao and Yun’s approach [ZY09] require term position information to compute kernel values for any position in a document and to compute the proximate centrality of query terms, respectively. The required implementation effort for the presented learning to rank approaches is highly dependent on the kind of features in use. Analogously to Song et al.’s approach [STW+08], Svore et al. [SKK10] segment documents into espans using term position lists. The authors can plug a wide choice of different features which influence the required implementation effort in their scoring model. If the model employs formatting features, information about sentence and para- graph boundaries, HTML markup information needs to be stored for each document. If the model uses third party phrase features, it requires lists of important phrases which 44 2. Proximity Score Models may not be publicly available. 
The required implementation effort for the presented learning to rank approaches is highly dependent on the kind of features in use. Analogously to Song et al.'s approach [STW+08], Svore et al. [SKK10] segment documents into espans using term position lists. The authors can plug a wide choice of different features into their scoring model, which influences the required implementation effort. If the model employs formatting features, information about sentence and paragraph boundaries as well as HTML markup needs to be stored for each document. If the model uses third-party phrase features, it requires lists of important phrases which may not be publicly available. Using λBM25 features requires knowledge about documents' body content lengths; λBM25-2 features require tf and df values for bigrams in documents.

The implementation effort for Metzler and Croft's approach [MC05] depends on the form of cliques required for scoring. For cliques handling phrases or unordered occurrences of query terms in windows of a given size, one needs to know tf and ctf values for phrases and for unordered occurrences of query terms in text windows of a given size, respectively. Deriving those values may be realized using term position indexes. Materializing tf and ctf values for phrases and unordered occurrences of query terms in text windows is usually only doable for tiny document collections or restricted sets of phrases; otherwise the required disk space may quickly become prohibitively large.

[Table 2.4 is a feature matrix and is not reproduced in full here. For each scoring model (BM25, Lnu.ltc (Monz), ES (Cummins and O'Riordan)(1), unigram LM, Jelinek-Mercer smoothing, Dirichlet prior smoothing, KL-divergence (Tao and Zhai), n-gram LM (n>1), Rasolofo and Savoy, Büttcher et al., Uematsu et al., Monz, Tao and Zhai (without KL/BM25), de Kretser and Moffat, Song et al., Mishne and de Rijke, Lv and Zhai(2), Zhao and Yun, Svore et al.(3), Metzler and Croft(4), Cummins and O'Riordan(5)) it marks the collection-related features (avgtf, avgdl, avgdt, lC, dt, idf(t), ctf(t), df(t)) and document-related features (term positions, sentence positions, tf(t,d), ld, dt(d)) in use.]

Table 2.4: Overview: Features used in each scoring model. Additional remarks: (1) needs also the number of documents N in the collection, (2) requires ctf and tf if Jelinek-Mercer or Dirichlet prior smoothing is used, (3) determines the set of features dependent on the employed setting (e.g., df and tf for unigrams/bigrams, respectively, plus lists of important phrases, etc.), (4) may use tf values not only for terms but also for n-grams and unordered occurrences of n-gram terms, and (5)'s set of features may differ dependent on the learned proximity score.

Chapter 3

Benchmarks

3.1 Introduction

When users look for information, they are driven by an information need. An information need of a user might be: information on whether she should consume black tea or coffee if she suffers from high blood pressure. To represent this information need, users try to formulate queries which usually contain keywords likely to occur in documents that may satisfy the information need, e.g., q={coffee, black, tea, effect, high, blood, pressure}. In order to compare the retrieval quality of different search engines that provide result lists as answers to queries that express users' information needs, various initiatives have developed test beds for different application scenarios.
A test bed consists of a document collection, a set of information needs expressed as topics, and a set of relevance assessments which maintain for each topic a list of items judged according to their relevance to the information need. Relevance assessments can be binary-level (i.e., a result is either relevant or non-relevant) or multi-level (e.g., a result is non-relevant, marginally relevant, mostly relevant, or definitely relevant). For classical text retrieval, the granularity of results is typically document-level and the relevance is thus assessed with respect to the complete document, while for XML retrieval, results may be parts of the document such as elements or passages whose relevance is also assessed. Relevance is always assessed with respect to the user's information need, not to a query. That means that a result is considered relevant iff it contains some information related to the user's information need. Mere occurrence of the keywords from the query in a result is not sufficient to render the result relevant. Retrieval results are evaluated using various retrieval quality metrics. In the following, we will describe two popular evaluation initiatives for text and XML retrieval and two less popular, niche representatives for Japanese language and medical search; all of them have in common that they provide the means to compare the performance of different systems. After that we will give a detailed description of performance metrics which express the performance of a system under consideration.

3.2 Initiatives

3.2.1 The TREC Initiative and Selected Test Beds

The U.S. NIST (National Institute of Standards and Technology) started their Text REtrieval Conference (TREC) efforts in 1991. The first TREC workshop took place in 1992, and the conference has been run annually since then. It provides a forum for IR researchers to compare their systems to those of others in various areas of IR. A report about the economic impact of the TREC Program can be found at http://trec.nist.gov/pubs/2010.economic.impact.pdf and contains much of the information briefly summarized in the following. When NIST started their TREC effort, they aimed at fixing two problems in IR, namely the lack of document collections and of methodologies to enable a standardized comparison of IR systems. They helped to put evaluation efforts into more realistic scenarios by creating many new, large test collections: the test collections used for the first TREC in 1992 already contained approx. 750,000 documents, compared to a size of 12,000 documents for the largest commonly used collection before. The collection sizes continuously increased to adjust to the growing Web and more powerful machines: in 2004 the GOV2 collection contained approx. 25 million documents (426GB), and the most recent ClueWeb09 collection used for the TREC Web Tracks 2009 and 2010 consisted of more than 1 billion documents (25TB). TREC helped to develop standardized IR evaluation methods by providing document collections, sets of topics, and relevance assessments (which documents are relevant to a given query) to compare IR systems in a standardized manner. Test collections are available not only for established tasks such as ad hoc retrieval but also for newer areas such as video retrieval and spam detection. TREC distributes research results and makes them available also to people not participating in TREC. Evaluation techniques and formats used in TREC inspired a number of other workshops and programs.
TREC runs multiple tracks dedicated to particular areas in IR. Past TREC Tracks are listed at trec.nist.gov/tracks.html and include the Blog Track (last run in 2010), Cross-Language Track (in 2002), Enterprise Track, Filtering Track (last run in 2002), Genomics Track (last run in 2007), HARD (High Accuracy Retrieval from Documents, last run in 2005), Interactive Track (last run as adjunct to the Web Track 2003), Million Query Track (last run in 2009), Novelty Track (last run in 2004), Question Answering (QA) Track (last run in 2007), Relevance Feedback Track, Robust Retrieval Track (discontinued after 2005), SPAM Track (last run in 2007), Terabyte Track (last run in 2006), Video Track (last run in 2002; starting 2003 there was an independent evaluation named TRECVID with a workshop taking place), and the former Web Track (last run in 2004). In 2011, tracks encompassed the Chemical IR Track, Crowdsourcing Track, Entity Track, Legal Track, Medical Records Track, Microblog Track, Session Track, and a new Web Track (started in 2009).

The Million Query Track used a large number of incompletely judged queries and aimed to find out whether this is better than the traditional TREC pooling approach. The Robust Track used difficult queries and focused on individual topics' retrieval quality rather than optimizing the average effectiveness. The Web Track used a web collection to perform search tasks on it: the Topic Distillation Task tried to find relevant pages desirable for inclusion in a list of key pages. The Large Web Task used 10,000 search log queries from Alta Vista and Electric Monk [Haw00]. The Ad Hoc Task and Small Web Task in TREC-8 used the same topic set to find out how web data differs from Ad Hoc data [VH00]. The Terabyte Track used a significantly larger collection than used for previous TREC evaluations and aimed to find out whether the evaluation scales.

Table 3.1 shows an overview of selected test beds in the context of TREC, which includes Tracks/Tasks in TREC, a reference to the employed document collection, and the topic sets. We will now describe some TREC collections that are either used in experiments later in this thesis or have been used in the original papers that introduced the methods in Chapter 2.

The TREC45 and TREC45-CR (a.k.a. TREC-8) collection: TREC Disk 4 contains about 30,000 documents (approx. 235MB, avgdl=1373.5) from the Congressional Record of the 103rd Congress (CR), about 55,000 documents (approx. 395MB, avgdl=644.7) from the Federal Register in 1994 (FR), and about 210,000 documents (approx. 565MB, avgdl=412.7) published in the Financial Times from 1992 to 1994 (FT). TREC Disk 5 contains about 130,000 documents provided by the Foreign Broadcast Information Service (approx. 470MB, avgdl=543.6) (FBIS) and about 130,000 randomly selected Los Angeles Times articles from 1989 and 1990 (approx. 475MB, avgdl=526.5) (LA Times). The information given here has been taken from http://www.nist.gov/srd/nistsd22.cfm (TREC Disk 4) and http://www.nist.gov/srd/nistsd23.cfm (TREC Disk 5), where the disks can also be ordered. While the TREC45 collection is approx. 2,140MB in size, the TREC45-CR (a.k.a. TREC-8) collection's size is only about 1.9GB as it does not contain the data from the Congressional Record of the 103rd Congress. Average document lengths for the subcollections have been mentioned in [VH00] and have been computed without term stemming and without stopword removal.
According to [VH97], TREC45-CR consists of 528,155 documents at an avgdl of 467.42, and after stopword removal of 263.65. While TREC Disks 4 and 5 have been used to process the Ad Hoc Topics in TREC-6, TREC45-CR has been used for the TREC-8 Web Track Ad Hoc Task and the TREC-12 and TREC-13 Robust Track. TREC Disks 4 and 5 (plus the complete TIPSTER collection) have been used in the QA Track in TREC-9 and TREC-10.

The AQUAINT collection: The information presented here can be found at http://www.ldc.upenn.edu/Catalog/docs/LDC2002T31/. The AQUAINT collection consists of newswire text data in English from three sources: the Xinhua News Service from China (January 1996–September 2000) (XIE), the New York Times News Service (NYT), and the Associated Press Worldstream News Service (June 1998–September 2000) (APW). All articles are SGML-tagged text data presenting the series of news stories. There is a single DTD available for all data files in the corpus. The corpus was prepared by the Linguistic Data Consortium (LDC) for the AQUAINT Project to be used by NIST for evaluations. The data files are about 3GB in size and contain approx. 375 million words. The AQUAINT collection has been used for the TREC-11 QA Track Main Task and for the TREC-14 Robust Track.

The TIPSTER collection: The TIPSTER collection (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93T3A) comes on three disks and contains articles from the Wall Street Journal (1987–1992) (WSJ87-92) (173,252 documents, approx. 81M words), Federal Register (1988 and 1989) (FR88-89), Associated Press (1988–1990) (AP88-90) (242,918 documents, approx. 114M words), information from the Computer Select disks copyrighted by Ziff-Davis (1989–1992) (ZIFF89-92) (approx. 112M words), San Jose Mercury News (1991) (approx. 45M words), U.S. Patents (1983–1991) (250MB in size), and Department of Energy abstracts (DOE) (approx. 28M words). In total it contains approx. 448 million words in over 700,000 documents with 2.1GB of text (http://www2.parc.com/istl/projects/ia/papers/sg-sigir93/sigir93.html). While TIPSTER Disks 1 and 2 have been used to evaluate the Ad Hoc Topics in TREC-1 to TREC-3, TIPSTER Disks 2 and 3 have been used to process the Ad Hoc Topics in TREC-4, and TIPSTER Disk 2 (plus TREC Disk 4) to process the Ad Hoc Topics in TREC-5. The complete TIPSTER collection (plus TREC Disks 4 and 5) has been used in the QA Track in TREC-9 and TREC-10.

The VLC2 collection: The VLC2 collection is about 100GB in size and contains 18.5 million web pages which are part of a web crawl from 1997 carried out by the Internet Archive. It has been described in detail in [HCT98] and has been used for the Large Web Task of the TREC-8 Web Track.

The WT2g collection: The WT2g collection is a 2GB-sized subset of the larger VLC2 collection and contains 250,000 documents. Details can be found in [HVCB99]. The collection has been used for the Small Web Task of the TREC-8 Web Track.

The WT10g collection: The WT10g collection (http://ir.dcs.gla.ac.uk/test_collections/wt10g.html) consists of 1,692,096 English web documents crawled from 11,680 servers and is about 10GB in size. It is the successor of the WT2g collection and contains 171,740 inter-server links (within the collection). Own experiments using the Galago parser have resulted in avgdl=599.41 and, after stopword removal, avgdl=393.74.
According to [BCH03a], which contains a lot of information about the construction of the WT10g corpus, the corpus was created to perform repeatable retrieval experiments which model web search better than any previously available test collection. It has been used as document collection for the Web Track in TREC-9 and TREC-10.

The .GOV collection: The .GOV collection is a TREC test collection (http://ir.dcs.gla.ac.uk/test_collections/govinfo.html) which is a crawl of .gov (U.S. governmental) web sites from early 2002. In total it contains 1,247,753 documents (of which 1,053,372 are HTML files) that have been truncated to a maximum size of 100KB each (reducing the size from 35.3GB to 18.1GB). This collection has been used for the Web Track, Topic Distillation Task from TREC-11 to TREC-13.

The GOV2 collection: The GOV2 collection is a TREC benchmark collection intended for use in the Terabyte Track (http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm). It was crawled using NIST hardware and network. This crawl from early 2004 of .gov (U.S. governmental) web sites has an uncompressed size of approximately 426GB and consists of 25,205,179 documents (out of which 23,111,957 are HTML, 2,030,339 PDF, 60,176 plain text, 2,253 MS-Word, and 454 postscript files) that have been truncated to a maximum size of 256KB each. Our experiments with the Galago parser have resulted in avgdl=886.03, and after stopword removal avgdl=633.08. This collection, which entails 20 times more documents than the .GOV collection, has been used for the Terabyte Track, Ad Hoc Task in TREC-13 to TREC-15, and for the Terabyte Track, Efficiency Task in TREC-14.

The ClueWeb09 collection: The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University (CMU) and consists of 1,040,809,705 web pages in 10 languages (http://boston.lti.cs.cmu.edu/Data/web09-bst/). It was crawled in January and February 2009 and encompasses 25TB of uncompressed data. The dataset is used by several tracks of the TREC conference. People participating in the TREC Web Track often restrict the collection first to the 503,903,810 English documents (http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Dataset+Information#Record_Counts), from which they choose the 50% of documents with the smallest probabilities of being spam according to the Waterloo Fusion spam ranking (http://plg.uwaterloo.ca/~gvcormac/clueweb09spam/). The resulting document set has an uncompressed size of about 6TB. In our own experiments with the remaining 251,664,804 documents and our own parser, after stemming and stopword removal, the avgdl value was 3021.49. The ClueWeb09 collection has been used as a document collection for the Web Track, Ad Hoc Task in TREC-18 to TREC-20.

3.2.2 INEX and Selected Test Beds

The second initiative we describe is the INitiative for the Evaluation of XML retrieval (INEX), which has existed since 2002. It is the leading workshop on XML retrieval and takes place annually. In contrast to TREC, where NIST is responsible for providing the test bed, in INEX only the document collection is provided and participants are asked to formulate topics and judge results for relevance. Like in TREC, the test collections employed in INEX have grown in size over the years to catch up with data growth in the real world and more powerful machines, and to pose new challenges to the participants.
While the initial INEX IEEE collection from 2002 consisted of about 12,100 articles with 8 million elements at a size of 494MB and was increased to approx. 16,800 articles and 11 million elements with 764MB in size in 2005, the change to a Wikipedia collection in 2006 brought more than 1.5 million XML documents in 8 languages (among them about 660,000 English documents) at a size of approx. 10GB. The current Wikipedia collection in use increased the size to more than 50GB with more than 2.6 million XML documents and 1.4 billion XML elements.

TREC     | Year | Track/Task                             | Collection                         | Topics
TREC-1   | 1992 | Ad Hoc Topics                          | TIPSTER Disks 1+2                  | 51-100
TREC-2   | 1993 | Ad Hoc Topics                          | TIPSTER Disks 1+2                  | 101-150
TREC-3   | 1994 | Ad Hoc Topics                          | TIPSTER Disks 1+2                  | 151-200
TREC-4   | 1995 | Ad Hoc Topics                          | TIPSTER Disks 2+3                  | 201-250
TREC-5   | 1996 | Ad Hoc Topics                          | TIPSTER Disk 2 + TREC Disk 4       | 251-300
TREC-6   | 1997 | Ad Hoc Topics                          | TREC45                             | 301-350
TREC-7   | 1998 | Ad Hoc Topics                          | TREC45                             | 351-400
TREC-8   | 1999 | Web Track, Ad Hoc and Small Web Topics | Ad Hoc: TREC45-CR, Small Web: WT2g | 401-450
TREC-8   | 1999 | Web Track, Large Web Task              | VLC2                               | 20001-30000
TREC-9   | 2000 | Web Track                              | WT10g                              | 451-500
TREC-10  | 2001 | Web Track, Ad Hoc Topics               | WT10g                              | 501-550
TREC-11  | 2002 | Web Track, Topic Distillation Task     | GOV                                | 551-600
TREC-12  | 2003 | Web Track, Topic Distillation Task     | GOV                                | TD1-TD50
TREC-13  | 2004 | Web Track, Topic Distillation Task     | GOV                                | 75 topics from WT04-1 to WT04-225
TREC-18  | 2009 | Web Track, Ad Hoc Task                 | ClueWeb09                          | wt09-1 to wt09-50
TREC-19  | 2010 | Web Track, Ad Hoc Task                 | ClueWeb09                          | 51-100
TREC-16  | 2007 | Million Query Track                    | GOV2                               | 1-10000
TREC-12  | 2003 | Robust Track                           | TREC45-CR                          | 100 topics from 303-650
TREC-13  | 2004 | Robust Track                           | TREC45-CR                          | 301-450 (Ad Hoc Topics TREC-6–TREC-8), 601-650 (new topics TREC-12 Robust Track), 651-700 (new topics TREC-13 Robust Track)
TREC-14  | 2005 | Robust Track                           | AQUAINT                            | 50 topics from 303-689
TREC-13  | 2004 | Terabyte Track, Ad Hoc Task            | GOV2                               | 701-750
TREC-14  | 2005 | Terabyte Track, Ad Hoc Task            | GOV2                               | 751-800
TREC-15  | 2006 | Terabyte Track, Ad Hoc Task            | GOV2                               | 801-850
TREC-14  | 2005 | Terabyte Track, Efficiency Task        | GOV2                               | 1-50000
TREC-9   | 2000 | QA Track                               | TIPSTER+TREC45                     | 201-893
TREC-10  | 2001 | QA Track, Main Task                    | TIPSTER+TREC45                     | 894-1393
TREC-11  | 2002 | QA Track, Main Task                    | AQUAINT                            | 1394-1893

Table 3.1: Some TREC test beds

INEX also offers multiple tracks (https://inex.mmci.uni-saarland.de/) to their participants. Past tracks include the Heterogeneous Collection (2004–2006), Relevance Feedback (2004–2006), Natural Language (2004–2006), XML Multimedia (2005–2007), Use Case Studies (only in 2006), XML Entity Ranking (2006–2009), Efficiency (2008 and 2009), Book Search (2007–2010, renamed to Book and Social Search in 2011), Ad Hoc Retrieval (2002–2010), and XML Mining (2007–2010) Tracks. In 2011, tracks encompass the Book and Social Search, Interactive, Relevance Feedback, Data-Centric, Question Answering (QA), Web Service Discovery, and the newly introduced Snippet Retrieval Track which replaces the former Ad Hoc Track. Mostly for the Ad Hoc Track, INEX uses two types of queries: CO (content-only) queries and CAS (content-and-structure) queries. While CO queries are keyword queries without structural information as used in text retrieval, CAS queries impose structural constraints which position keywords into a structural context. We will now describe the INEX collections used over the years for the Ad Hoc Track.
The INEX IEEE collection 2002–2004 and its extension from 2005: The initial INEX collection from 2002 consisted of 12,107 marked-up articles with 8 million elements, taken from IEEE journals between 1995 and 2002, 494MB in size, and described in [LT07], which we summarize here. The collection was extended in 2005 by 4,712 new IEEE articles published between 2002 and 2004, resulting in 16,819 articles with 11 million elements and a size of 764MB. A typical article consists of front matter, body, and back matter. The front matter contains metadata (e.g., title, author, publication information, and abstract). The body contains text embedded in its structural information: sections, sub-sections, and sub-sub-sections that start with a title element followed by paragraphs. The content is extended by references (citations, tables, and figures), item lists, and layout (e.g., emphasised, bold text). The back matter contains the bibliography and information about the authors. This collection has been used for the Ad Hoc Track, which was the only INEX track in 2002 and 2003, and for the Interactive, Relevance Feedback, and Natural Language Track in 2004.

The INEX Wikipedia collection 2006–2008: The INEX Wikipedia collection used for the INEX workshop from 2006 to 2008 consists of 1,535,355 Wikipedia-based XML documents in 8 languages with a total size of about 10GB. The English part consists of 659,388 English documents which are about 4,600MB in size. The average size of an English document is 7,261 bytes, the average depth of a node in an XML document tree is 6.72, and the average number of elements in a document is 161.35. More detailed information can be found in [DG06a] and [DG06b]. As the collection is highly irregular, there is no DTD available for this collection. This collection has been used for the Ad Hoc Track from 2006 to 2008, the INEX Efficiency Track 2008, and the Entity Ranking Track 2007 and 2008. The English part of this INEX Wikipedia collection with more than 300,000 images at approx. 60GB size has been used for the Multimedia Track in 2007, while the English part with tagged articles and a size of approx. 6GB has been used as Entity corpus for the Entity Ranking Track in 2007 and 2008.

The INEX Wikipedia collection from 2009: This Wikipedia collection has been newly introduced in 2009 (http://www.mpi-inf.mpg.de/departments/d5/software/inex/) and was created at the Max-Planck-Institute and Saarland University. It consists of 50.7GB of XML-ified Wikipedia articles, with 2,666,190 articles (which is four times the number of English articles in the former Wikipedia collection) and 1.4 billion elements. The collection is annotated with the 2008-w40-2 version of YAGO [SSK07]. Parsing the document collection using the Galago parser resulted in avgdl=565.83, and after stopword removal in avgdl=393.62. In 2009, the INEX Efficiency Track, the Link-the-Wiki Track, and the Question Answering Track have used this collection.

3.2.3 Other Initiatives

The IREX Project

The IREX (Information Retrieval and Extraction Exercise) project is an evaluation project for Information Retrieval and Information Extraction in Japanese. [SI00] reports on this project, briefly summarized below: the project lasted from May 1998 to September 1999 and ended with an IREX workshop held in Tokyo. More information including the data and tools used for the project can be found at http://nlp.cs.nyu.edu/irex/.
The IREX project had two tasks, namely the Information Retrieval task (IR) and the Named Entity task (NE). We omit the description of the NE task as the evaluation in related work presented in Chapter 4 only deals with the IR task. There were 30 topics in the IR task and participants were asked to submit their top-300 results for each topic. The employed IREX IR collection consists of about 212,000 Mainichi newspaper articles written in Japanese that were published in 1994 and 1995.

OHSUMED

A description of the OHSUMED test bed is given in [HBLH94] and at http://ir.ohsu.edu/ohsumed/ohsumed.html; its characteristics are briefly summarized in the following. The OHSUMED test collection is a subset of MEDLINE, a bibliographic database for medical publications maintained by the National Library of Medicine, and is about 400MB in size. MEDLINE consists of more than 7 million references starting in 1966 and grows by about 250,000 references per year. The OHSUMED collection contains 348,566 references, a subset of 270 medical journals covering the years 1987 to 1991. The generally short queries contain a brief statement about the patient and an information need. Relevance assessments distinguish two levels of relevance, namely definitely relevant (DR) and definitely or possibly relevant (D+PR). The test bed contains 101 queries generated by physicians during patient care, each with at least one document considered definitely relevant.

3.3 Measures

In order to assess the retrieval quality of search engines, the IR community has developed several measures, which can be classified into measures for text/document retrieval and measures for XML retrieval. Relevance is always assessed with respect to the user's information need, not to a query. That means that a document is considered relevant iff it contains some information related to the user's information need. Mere occurrence of the keywords from the query in a retrieved document is not sufficient to render the document relevant.

3.3.1 Measures for Text/Document Retrieval

Query processing in a search engine returns a ranked list of results that are assessed using various measures; two of the most prominent measures are precision and recall. For these measures, the order of (the first n) entries in the ranked list of query results does not influence the value of the measure, such that the ranked list can also be viewed as a result set.

Definition 3.3.1. (Precision at rank n) Given a set of the first n items In = {i1, . . . , in} retrieved as answer to query q, and the set of items R considered relevant to q, precision at rank n is defined as

P@n = |R ∩ In| / |In|

and measures which fraction of the retrieved items in In is actually relevant to the query.

Definition 3.3.2. (R-Precision) Given the set of items R considered relevant to a query q, and a set of retrieved items I|R| = {i1, . . . , i|R|} retrieved as answer to q, R-Precision is defined as

R-Precision = |R ∩ I|R|| / |I|R||

and measures which fraction of the first |R| retrieved items is actually relevant to q.

Definition 3.3.3. (Recall) Given a set of n items In = {i1, . . . , in} that represent a query result, and the set of items R considered relevant to q, the recall is defined as

recall = |R ∩ In| / |R|

and describes to which extent the items considered relevant have been retrieved.
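The following is a minimal sketch (our own illustration, not part of any official evaluation tool; class and method names are made up) of how the three set-based measures just defined can be computed from a ranked result list and the set of relevant items:

    import java.util.List;
    import java.util.Set;

    // Minimal sketch: set-based measures P@n, R-Precision, and recall.
    public final class SetMeasures {

        /** P@n: fraction of the first n retrieved items that are relevant. */
        public static double precisionAtN(List<String> ranked, Set<String> relevant, int n) {
            int cutoff = Math.min(n, ranked.size());
            int hits = 0;
            for (int i = 0; i < cutoff; i++) {
                if (relevant.contains(ranked.get(i))) hits++;
            }
            return cutoff == 0 ? 0.0 : (double) hits / cutoff;
        }

        /** R-Precision: precision at rank |R|, where R is the set of relevant items. */
        public static double rPrecision(List<String> ranked, Set<String> relevant) {
            return precisionAtN(ranked, relevant, relevant.size());
        }

        /** Recall: fraction of the relevant items that appear among the first n results. */
        public static double recallAtN(List<String> ranked, Set<String> relevant, int n) {
            int cutoff = Math.min(n, ranked.size());
            int hits = 0;
            for (int i = 0; i < cutoff; i++) {
                if (relevant.contains(ranked.get(i))) hits++;
            }
            return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
        }
    }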
Sometimes, for example in the area of question answering, it is good enough to know whether there is at least one relevant answer among the first n results. A simple measure that can be used for this purpose is answer-at-n.

Definition 3.3.4. (Answer-at-n (a@n)) Given a set of the first n items In = {i1, . . . , in} of a query result, and the set of items R considered relevant to q, the answer-at-n value is defined as

a@n = min(1, |In ∩ R|).

Therefore, a@n = 1 if there is at least one relevant result among the first n results and 0 otherwise.

While the evaluation measures just described ignore the order of the ranked result lists, the evaluation measures we describe in the following take the order of the ranked result lists into account. Thus, the order of the retrieved results has an impact on the value of the measures.

Definition 3.3.5. (AP (average precision)) Given a ranked result list RL of retrieval results for a query q and the set of results R = {d'1, . . . , d'x} in RL considered relevant, the average precision (AP) is defined as

AP(q, RL) = (1/|R|) · Σ_{d'i ∈ R} P@rank(d'i, RL),

where rank(d'i, RL) is the rank of d'i in RL.

The NDCG measure (normalized discounted cumulative gain) has been proposed in [JK02] and supports non-binary relevance assessments: relevance assessments have more than two relevance levels, such as non-relevant, marginally relevant, relevant, and highly relevant. Each relevance level is mapped to a relevance value, which is a number such as highly relevant→3, relevant→2, marginally relevant→1, and non-relevant→0. As the original work computes the NDCG value in an algorithmic way and we want to keep the definition compact, we adapt the variant presented in [MRS08] to our notation.

Definition 3.3.6. (NDCG at rank k) Let rel(q, d) be the relevance value attributed to document d for query q, RL the ranked list for query q, and RLk the kth result in RL. The NDCG value for q at rank k is defined as

NDCG(q, k) = Zq · Σ_{r=1}^{k} (2^{rel(q, RLr)} − 1) / log(1 + r),

where Zq is a normalization factor such that a perfect ranked result list's NDCG at rank k is 1.

The RR (reciprocal rank) measure quantifies when the first relevant document is encountered in a result list.

Definition 3.3.7. (RR (reciprocal rank)) Let RLq be a ranked list, and Rq the set of relevant items for query q. Then the reciprocal rank (RR) for q is defined as

RR(q) = 1 / minrank(RLq ∩ Rq),

where minrank(RLq ∩ Rq) is the minimum rank of a relevant document in RLq. If RLq ∩ Rq = ∅ (i.e., no relevant results are retrieved), 1/minrank(RLq ∩ Rq) is set to 0.

Common measures that are based on mean values for a set of topics (query load) are MAP (mean average precision) and MRR (mean reciprocal rank). While MAP averages over AP values, MRR averages over RR values:

Definition 3.3.8. (MAP (mean average precision)) Given a query load Q = {q1, . . . , qm} and a ranked result list of retrieved results for each query in Q, with RLj being the ranked list for query qj and Rj the set of relevant items for query qj, the mean average precision (MAP) for Q is defined as

MAP(Q) = (1/|Q|) · Σ_{j=1}^{m} AP(qj, RLj).

The MRR (mean reciprocal rank) averages over the reciprocal ranks of the first relevant retrieved result for each query in a query load.

Definition 3.3.9. (MRR (mean reciprocal rank)) Given a query load Q = {q1, . . . , qm} and a ranked result list of retrieved results for each query in Q, with RLj being the ranked list for query qj and Rj the set of relevant items for query qj, the mean reciprocal rank (MRR) for Q is defined as

MRR(Q) = (1/|Q|) · Σ_{j=1}^{|Q|} 1 / minrank(RLj ∩ Rj),

where minrank(RLj ∩ Rj) is the minimum rank of a relevant document in RLj. If RLj ∩ Rj = ∅ (i.e., no relevant results are retrieved), 1/minrank(RLj ∩ Rj) is set to 0.
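To make the rank-based definitions concrete, the sketch below (our own illustration; names are made up) computes AP, NDCG@k, and RR following the formulas above; MAP and MRR are then simply the means of AP and RR over the query load. The normalization factor Zq is obtained as the reciprocal of the DCG of an ideal (descending-relevance) ranking.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;

    // Minimal sketch of the rank-based measures defined above (AP, NDCG@k, RR).
    public final class RankMeasures {

        /** Average precision: mean of P@rank(d) over all relevant documents d in the list. */
        public static double averagePrecision(List<String> ranked, Set<String> relevant) {
            if (relevant.isEmpty()) return 0.0;
            int hits = 0;
            double sum = 0.0;
            for (int r = 1; r <= ranked.size(); r++) {
                if (relevant.contains(ranked.get(r - 1))) {
                    hits++;
                    sum += (double) hits / r;        // P@r at the rank of a relevant document
                }
            }
            return sum / relevant.size();
        }

        /** NDCG@k with gain 2^rel - 1 and discount log(1 + r); gains[r-1] holds rel(q, RL_r). */
        public static double ndcgAtK(double[] gains, int k) {
            double[] ideal = gains.clone();
            Arrays.sort(ideal);                      // ascending ...
            reverse(ideal);                          // ... then descending = perfect ranking
            double idealDcg = dcg(ideal, k);
            return idealDcg == 0.0 ? 0.0 : dcg(gains, k) / idealDcg;   // Z_q = 1 / idealDcg
        }

        private static double dcg(double[] gains, int k) {
            double sum = 0.0;
            for (int r = 1; r <= Math.min(k, gains.length); r++) {
                sum += (Math.pow(2.0, gains[r - 1]) - 1.0) / Math.log(1.0 + r);
            }
            return sum;
        }

        private static void reverse(double[] a) {
            for (int i = 0, j = a.length - 1; i < j; i++, j--) {
                double tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            }
        }

        /** Reciprocal rank: 1 / rank of the first relevant document, 0 if none is retrieved. */
        public static double reciprocalRank(List<String> ranked, Set<String> relevant) {
            for (int r = 1; r <= ranked.size(); r++) {
                if (relevant.contains(ranked.get(r - 1))) return 1.0 / r;
            }
            return 0.0;
        }
    }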
3.3.2 Measures for XML Retrieval

In [KPK+07], Kamps et al. describe the official retrieval effectiveness measures used for the Ad Hoc Track at INEX 2007. While in earlier years only XML elements were allowed for retrieval, INEX 2007 allowed arbitrary document parts, i.e., XML elements and passages. The Focused Task requires a ranked list of non-overlapping document parts (i.e., there is no document part in the ranked list which is enclosed by or partially overlaps with any other document part from the ranked list). Submitting organizations are asked to provide for each query q a ranked list of 1,500 non-overlapping document parts Lq that are supposed to be most focused and relevant.

The amount of relevant information retrieved is measured in terms of the length of relevant text retrieved. Since 2005, INEX uses highlighting to get relevance assessments for the topics. Therefore, the evaluation is based on the number of relevant highlighted characters, not documents. pr is the document part assigned to rank r in the ranked list Lq of document parts returned by a retrieval system for a topic q (the topic may be notationally omitted if it is clear from the context which topic we are considering). size(pr) is the total number of characters contained in pr and rsize(pr) the total number of characters in the highlighted relevant text part of pr. Trel(q) denotes the total number of characters in all highlighted relevant text for q.

The precision at rank r is defined as

P[r] = (Σ_{i=1}^{r} rsize(pi)) / (Σ_{i=1}^{r} size(pi))

and measures which portion of the retrieved characters is relevant. The recall at rank r is defined as

R[r] = (Σ_{i=1}^{r} rsize(pi)) / Trel(q)

and measures to which extent the retrieved characters cover characters from text considered relevant. Both precision and recall are similar to the earlier definitions but use characters instead of documents as evaluation units.

The interpolated precision at recall level x for query q is defined as

iP[x](q) = max{P[r] : R[r] ≥ x} if x ≤ R[|Lq|], and 0 otherwise,

where R[|Lq|] is the maximum recall over all retrieved documents Lq. It considers the maximum achievable precision after the returned results have achieved at least recall level x. If the recall level x exceeds the maximum recall level for Lq, the interpolated precision drops to 0.

The average interpolated precision measure for query q,

AiP(q) = (1/101) · Σ_{x ∈ SRL} iP[x](q),

builds the average over the interpolated precision at the 101 standard recall levels SRL = {0.00, 0.01, . . . , 1.00}. The mean average interpolated precision measure is defined as

MAiP = (1/#T) · Σ_{q ∈ T} AiP(q)

and expresses the performance across a set of topics T.
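The character-based measures follow the same pattern as their document-based counterparts; the sketch below (our own illustration, not the official INEX evaluation software) computes AiP for a single topic from per-rank character counts. MAiP is then just the mean of AiP over the topic set.

    // Minimal sketch of the character-based INEX measures defined above
    // (interpolated precision iP[x] and its average AiP over 101 recall levels).
    public final class InexMeasures {

        /**
         * @param rsize rsize[r-1]: highlighted (relevant) characters of the part at rank r
         * @param size  size[r-1]:  total characters of the part at rank r
         * @param trel  total number of highlighted relevant characters for the topic
         * @return AiP, averaged over the 101 standard recall levels 0.00, 0.01, ..., 1.00
         */
        public static double averageInterpolatedPrecision(long[] rsize, long[] size, long trel) {
            int n = rsize.length;
            double[] p = new double[n];
            double[] r = new double[n];
            long relSoFar = 0, charsSoFar = 0;
            for (int i = 0; i < n; i++) {
                relSoFar += rsize[i];
                charsSoFar += size[i];
                p[i] = charsSoFar == 0 ? 0.0 : (double) relSoFar / charsSoFar;   // P[r]
                r[i] = trel == 0 ? 0.0 : (double) relSoFar / trel;               // R[r]
            }
            double sum = 0.0;
            for (int level = 0; level <= 100; level++) {
                double x = level / 100.0;
                double best = 0.0;               // iP[x] = max P[r] with R[r] >= x, else 0
                for (int i = 0; i < n; i++) {
                    if (r[i] >= x) best = Math.max(best, p[i]);
                }
                sum += best;
            }
            return sum / 101.0;
        }
    }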
Chapter 4

Evaluation for Selected Score Models

In the first part of this chapter (Section 4.1) we present a lot of insightful experimental results from the original papers. However, they usually compare only a few of the proximity score models surveyed in Chapter 2. The second part of this chapter (Section 4.2) seeks to close this gap by performing a comparative analysis of a significant set of proximity score models in a single evaluation framework with four test beds.

4.1 Results from the Original Papers

In this section, we report the main results from the original papers and describe their experimental setups. When we describe the employed test beds, we first talk about the topics with the corresponding tracks/tasks and mention the employed document collection in brackets. More information about the test beds can be found in Chapter 3.

4.1.1 Linear Combinations of Scoring Models

Rasolofo and Savoy: Rasolofo and Savoy (cf. Section 2.4.1) use three Web Track Ad Hoc Task test beds from TREC-8 (TREC45-CR), TREC-9, and TREC-10 (both WT10g) with 125 multi-keyword queries without stopwords. They compare the retrieval quality of their proposed proximity-aware model to an Okapi BM25 baseline. Proximity scores help more with early (P@5) than with later precision. Average precision values for BM25 and the proposed model hardly differ, for which the authors give two reasons: 1) proximity scores only consider term pairs in a window size of five, which limits the number of documents whose scores are influenced by proximity scores, and 2) only the 100 documents with the highest BM25 scores are scored, which may rule out potentially relevant documents beyond the top-100. Sign tests at p<0.05 show that their approach significantly improves in AP over the baseline when evaluating over all queries from all three test beds. Considering single test beds, only for TREC-8 does their approach significantly improve over the baseline.

Büttcher et al.: Büttcher et al. (cf. Section 2.4.2) perform two rounds of experiments: 1) They evaluate 100 topics from the TREC 2003 Robust Track (TREC45-CR) and 50 topics from the TREC 2004 Terabyte Track, Ad Hoc Task (GOV2). Like Rasolofo and Savoy, they compare their proximity-enhanced model to a BM25 baseline: a paired t-test shows significant improvements on GOV2 (P@10 at p<0.02, P@20 at p<0.01), but fails on TREC45-CR. 2) They split the GOV2 collection into 100 random chunks which are combined to form 10%, 20%, . . . , 90% of the GOV2 documents (20 subcollections per size). Their test bed uses the 100 queries from the TREC Terabyte Track, Ad Hoc Tasks 2004 and 2005 (subcollections of GOV2). It turns out that the larger the document collection is, the more important the impact of term proximity scores gets for P@10 and P@20 values. The authors suspect that, in large collections, it is more likely to accidentally find non-relevant documents that contain query terms; term proximity may help to find relevant ones. As the relative gain of proximity scores is higher for stemmed than for unstemmed queries, term proximity may help to find stem-equivalent terms that represent the same semantic concept. Average document length and effectiveness of term proximity do not seem to be related.

Uematsu et al.: Uematsu et al. (cf. Section 2.4.3) use two test beds: 50 topics from the TREC-8 Web Track, Ad Hoc Task (TREC45-CR), and 30 IR Task Topics from IREX (IREX IR collection). They compare precision and average query processing times when evaluating queries with document-, word-, and sentence-level indexes and report index sizes. A document-level index contains only (docid, tf(term, docid)) pairs, and word-level indexes contain additional term position information. The proposed sentence-level index contains for each term t a list of docids and the number of sentences where t occurs plus sentence positions. This is used to determine, for each document, the number of sentences with co-occurrences of all query terms. Indexes are compressed by means of dgap and v-byte encoding: due to smaller dgaps, sentence-level indexes may compress better than word-level indexes.
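Variable-byte compression of docid gaps is a standard technique; the following sketch (our own illustration, independent of Uematsu et al.'s implementation) shows one common convention, with 7 payload bits per byte and the high bit marking the last byte of a value:

    import java.io.ByteArrayOutputStream;

    // Minimal sketch of dgap + v-byte (variable-byte) compression of a posting list.
    public final class VByte {

        /** Encodes a strictly increasing list of docids as v-byte-compressed dgaps. */
        public static byte[] encode(int[] sortedDocIds) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int previous = 0;
            for (int docId : sortedDocIds) {
                int gap = docId - previous;          // dgap: difference to the previous docid
                previous = docId;
                while (gap >= 128) {
                    out.write(gap & 0x7F);           // lower 7 bits, high bit clear: more bytes follow
                    gap >>>= 7;
                }
                out.write(gap | 0x80);               // last byte: high bit set
            }
            return out.toByteArray();
        }
    }

Smaller gaps need fewer bytes, which is exactly why sentence-level postings (with their denser, smaller gaps) can compress better than word-level postings.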
Sentence-level indexes lead to the highest early precision values for both test beds. Document-level indexes are smallest; sentence-level indexes are as effective as word-level indexes, but smaller. Without positional information, document-level indexes are not as effective as the other indexes as they can only be used to compute BM25 scores without term proximity contributions. To index the TREC45-CR collection, the sentence-level index requires 900GB (39% larger than the document-level, 26% smaller than the word-level index), for IREX IR 210GB (25% larger than the document-level, 12% smaller than the word-level index). For the TREC-8 test bed, document- and sentence-level indexes' query processing times are comparable and a bit faster than the word-level indexes'. For the IREX IR test bed, index granularities hardly influence the query processing speed.

Monz: Monz (cf. Section 2.4.4) uses test beds from the question answering (QA) Tracks of TREC-9, TREC-10 (both TIPSTER+TREC45), and TREC-11 (AQUAINT) to evaluate his minimum span weighting (msw) approach against the Lnu.ltc baseline. He compares the percentages of questions that have at least one relevant document among the top-n results (a@n) and shows that msw outperforms the baseline on all test beds, especially for low n. The precision values decline for the TREC-11 test bed as it contains more difficult questions and the average number of relevant documents is lower than for the TREC-9 and TREC-10 test beds. In relative terms, the precision values P@n show higher gains than the a@n values at all cutoffs. For both metrics, on every test bed and at all cutoffs, the performance of msw significantly outperforms the baseline at p<0.01. The author fails to show a correlation between query length and average precision.

Tao and Zhai: Tao and Zhai (cf. Section 2.4.5) employ five TREC test beds, namely the 50 TREC-1 Ad Hoc Topics (AP, FR), DOE queries (DOE), and 50 TREC-8 Web Track, Ad Hoc and Small Web Topics (TREC45-CR, WEB2g). The authors report average values of five individual proximity distance functions (considered in isolation) for relevant and non-relevant documents. Ideally, non-relevant documents should have higher distance values than relevant ones. It turns out that global measures (Span and MinCover) need a normalization as relevant documents tend to contain more query terms, which span wider than in non-relevant documents. Local measures (MinDist, AvgDist, and MaxDist) perform better than global measures; MinDist is likely to be the best proximity distance function on every test bed. Except for FR (maybe too few queries (21) are applicable, which tends to support the null hypothesis), MAP values improve significantly (Wilcoxon signed rank test, p<0.05) for R1 + MinDist and R2 + MinDist over the baselines KL-divergence and BM25, respectively. Early precision values are better for R1 + MinDist and R2 + MinDist than for the baselines. R1 + MinDist provides similar MAP values as the MRF approach used by Metzler and Croft [MC05] (cf. Section 2.7.3). Parameter sensitivity studies show that global proximity measures are less stable and accurate than local ones. Simple addition of KL-divergence and proximity measures cannot improve retrieval quality over KL-divergence.

4.1.2 Integrated Score Models

De Kretser and Moffat: De Kretser and Moffat (cf. Section 2.5.1) compare their locality-based retrieval models to traditional document retrieval models using the AP metric for k=1,000 documents.
They evaluate 150 long TREC-1-3 (TIPSTER Disks 1+2) and 49 short TREC-4 Ad Hoc Topics (TIPSTER Disks 2+3) on subcollections from TIPSTER Disk 2 (AP88, FR88, WSJ90-92, ZIFF89-90), the newspaper part AP88+WSJ90-92, and the non-newspaper part FR88+ZIFF89-90. As document retrieval baselines, they use 1) the standard cosine measure (baseline1) and 2) an approach which uses tf, qtf, pivoted document length normalization, and idf-normalization by maximum frequency, which achieves the best overall performance in [ZM98] (baseline2). For the locality retrieval models, the authors test four kernel shapes with damped and non-damped height.

For the short topics (7 distinct terms on average), with a few exceptions, their locality-based retrieval models improve over baseline1, and the arc-shaped, damped-height kernel outperforms baseline2 on 5 of 6 document collections. For the long topics (43 distinct terms on average in the description field), locality methods do not pay off except for FR88, which contains long documents where users save most time when pointed to passages by locality-based retrieval models.

Song et al.: Song et al. (cf. Section 2.5.2) compare early and average precision of their retrieval model (newTP) to BM25 and Rasolofo and Savoy's approach (OkaTP). To this end, they use the 50 topics from the TREC-10 Web Track, Ad Hoc Task (WT10g) and 50 topics from the TREC-11 Web Track, Topic Distillation Task (.GOV) as test beds. Song et al. first tune BM25 and then the newTP parameters using the TREC-9 Web Track, Ad Hoc Task (WT10g) test bed: newTP's term proximity scores use much larger text windows (size=45) than OkaTP (size=5). newTP significantly (paired t-test, p<0.05) outperforms BM25 in terms of P@5 and P@10 for both test beds. For P@5, OkaTP outperforms newTP, which indicates that OkaTP brings more documents with very close term pair occurrences to the top-5. For P@10, Song et al.'s approach outperforms OkaTP and BM25, which indicates that newTP can handle more distant term pairs better than OkaTP. newTP and OkaTP provide similar average precision values that both outperform BM25.

Mishne and de Rijke: Mishne and de Rijke (cf. Section 2.5.3) evaluate the impact of phrase and proximity terms and document structure on retrieval quality with two test beds: 50 queries from the TREC-12 and 75 queries from the TREC-13 Web Track, Topic Distillation Task (.GOV collection). They compare five approaches which use 1) single query terms from each topic as terms, no document structure (baseline), 2) all term-level n-grams from a topic as phrase terms (phrases), 3) phrase terms with weights proportional to term phrase frequencies in different fields (phrases-b), 4) all term-level n-grams from a topic as proximity terms with fixed distance length (proximity), and 5) with variable distance length (prox-v). The authors claim that, using a multiple field representation for each document, phrase and proximity terms can help effectiveness, and they confirm Mitra et al. [MBSC97] in that, for single field representations, given a good basic ranking model, phrases yield little or no improvement.

Phrase and proximity terms often help to provide higher effectiveness the less restrictive the variant in use is (i.e., prox-v often outperforms proximity, and proximity often outperforms phrases). phrases-b provides more stable results than phrases. Short queries often form linguistic phrases and are more likely to gain effectiveness from phrase and proximity terms than longer queries.
Those tend to consist of non-related sets of terms and may consequently suffer from topic drift. Effectiveness gains for short queries predominate: starting at a query length of four terms, the effectiveness drops.

4.1.3 Language Models with Proximity Components

Lv and Zhai: Lv and Zhai (cf. Section 2.6.1) use four TREC test beds to evaluate their positional language model (PLM) approach: 50 TREC-1 Ad Hoc Topics (AP88-89 and FR), and 50 TREC-8 Web Track, Ad Hoc and Small Web Topics (WT2g and TREC45-CR). For the best position strategy (BPS), they compare the effectiveness of proximity-based kernels. The KL-divergence model (with Dirichlet prior smoothing) returns initial results that are re-ranked with PLMs for 25 ≤ σ ≤ 300: σ ≥ 125 does best, and Gaussian kernels are usually preferable. Lv and Zhai claim that the Gaussian kernel is superior since it is the only kernel under view whose propagated count drops slowly for small distances |i−j| (dependent terms are not always adjacent in documents), fast for moderate distances (the boundary of a term's semantic scope is reached), and again slowly for large distances (all terms are only loosely associated). For the multi-position strategy with a single spread σ, a Gaussian kernel (with Dirichlet prior smoothing) does not yield noticeable improvements over BPS (k=1), such that, for one single σ, BPS can be considered a robust method for document ranking. For the multi-σ strategy, PLMs (σPLM flexible) and the document language model (σLM = ∞) are linearly combined using a coefficient γ. Interpolation helps PLMs to be more robust and effective: the authors claim that PLMs represent proximity well, although document-level retrieval heuristics are better represented by document LMs. The PLM approach performs best for small σPLM values (e.g., 25 or 75). For collections with larger avgdl values (i.e., WT2g and FR), PLMs need more weight (i.e., a larger γ) since their document LMs tend to be noisier.
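The kernel comparison is easiest to see in the propagated term counts themselves. The sketch below is our own illustration of count propagation under a Gaussian kernel (not Lv and Zhai's code; the class and method names are made up): each occurrence of a term at position j contributes exp(-(i-j)^2 / (2σ^2)) to the virtual count at position i.

    // Minimal sketch: propagated count of a term at position i under a Gaussian kernel,
    // c'(t, i) = sum over occurrences j of exp(-(i - j)^2 / (2 * sigma^2)),
    // as used conceptually in positional language models.
    public final class PositionalCount {

        public static double propagatedCount(int i, int[] termPositions, double sigma) {
            double count = 0.0;
            for (int j : termPositions) {
                double d = i - j;
                count += Math.exp(-(d * d) / (2.0 * sigma * sigma));   // Gaussian kernel k(i, j)
            }
            return count;
        }
    }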
Zhao and Yun: Zhao and Yun (cf. Section 2.6.2) use four test beds: the title fields of the 50 TREC-5 Ad Hoc Topics (AP88 and WSJ90-92), the title fields of 50 TREC-3 Ad Hoc Topics (WSJ87-92), and the OHSUMED Topics (OHSUMED). The compared approaches are 1) the KL-divergence model, 2) the KL-divergence model linearly combined with a proximity score model as used in [TZ07], and 3) the proposed proximity integrated language model (ProxLM) with different term proximity centrality measures. The authors compare the best achievable performance of the term proximity centrality measures. ProxSumProx performs similarly well as ProxMinDist, and both outperform ProxAveDist. 2) performs better than 1) on all test beds except for OHSUMED, whose queries are verbose. 3) outperforms 1) and 2) in terms of precision and MAP and can handle verbose queries very well: for OHSUMED, 3) always significantly (Wilcoxon signed rank test, p<0.05) outperforms 1) and 2).

The authors study how robust the approaches are if stopwords in queries are considered. To this end, they use the 23 Ad Hoc Topics from TREC-5 that contain at least one stopword on the collections AP88 and WSJ90-92. While, in the presence of stopwords, 1) is robust in terms of effectiveness, 2) fails for both collections. ProxSumProx used as centrality measure in 3) improves over 1), is robust to stopword occurrences, and is superior to ProxMinDist. The authors claim that stopwords occur frequently in documents, i.e., they are likely to occur close to other query terms, which may highly influence proximity centrality scores for ProxMinDist.

4.1.4 Learning to Rank

Metzler and Croft: Metzler and Croft (cf. Section 2.7.3) evaluate with four TREC test beds: the 150 Ad Hoc Topics from TREC-1 to TREC-3 (WSJ87-92 and AP88-90, respectively; both part of TIPSTER), 100 topics from the Web Track, Ad Hoc Task of TREC-9+10 (WT10g), and 50 topics from the Terabyte Track, Ad Hoc Task 2004 (GOV2). Documents are stemmed and stopwords removed during evaluation. The full independence (FI) setting (only cliques in T) serves as a baseline for the parameter-tuned, weighted MRF models. For the sequential dependence (SD) setting (cliques in T, U, and O), the authors evaluate MAP values for window sizes of 2, 8, 50, and ∞ and tune parameters separately for each window size. It seems that the window size only matters for the GOV2 collection: a size of 8 (which corresponds to the average length of English sentences) performs best and outperforms ∞-sized windows. Hill climbing is used for parameter tuning, which starts off with the FI setting (λT=1, λO=λU=0). The authors find that SD and FD significantly (paired t-test, p<0.05) improve MAP values for all test beds compared to FI variants.

Svore et al.: Svore et al. (cf. Section 2.7.2) study the effect of different feature sets on effectiveness in web retrieval using stemmed English queries (up to 10 terms) sampled from a commercial search engine's query log. Each query has on average 150–200 assigned documents with 5-level relevance assessments. While the training set consists of 27,959 queries (of which 20% are used for validation), the test set consists of 11,857 queries. The authors compare the impact of phrases and proximity terms on early NDCG values and compare ten different models: 1) BM25, 2) LambdaRank trained over BM25 features (λBM25), 3) Rasolofo and Savoy's approach, 4) a bigram version of 3), 5) Song et al.'s approach, 6) λBM25 with bigram features, 7) 6) with the additional espan-based rc value from 5), 8) an approach that uses all espan goodness features and model feature sets (Espan), 9) Espan without formatting features, and 10) Espan without 3rd-party phrase features.

Espan significantly (t-test, p<0.05) outperforms all other models: phrase features and, even more, formatting features are important for retrieval effectiveness. To consider query characteristics, the queries in the test set are split by 1) length and 2) popularity. For popular queries, Espan significantly outperforms all other models. For short queries, removing phrase span features has only a small impact. In a full ranking model, Espan outperforms the other models. The experiments are not repeatable as neither queries nor assessments are disclosed.

Cummins and O'Riordan: Cummins and O'Riordan (cf. Section 2.7.4) make use of the LA, FBIS, and FR collections from TREC Disks 4 and 5 as test data. For each collection, the corresponding topic set is evaluated in two variants with stemming and stopword removal: 1) short (title field) and 2) medium length queries (title plus description fields). Furthermore, 63 topics are evaluated with the OHSUMED collection (title plus description fields only). For each term-term proximity measure, for short and medium length queries separately, the average values for relevant and non-relevant documents are computed to see correlations.
While average values for min_dist, avg_min_dist, and avg_match_dist seem to be inversely correlated with relevance (i.e., larger average values for less relevant documents), qt, sumtf, and prodtf values seem to be directly correlated with relevance (i.e., larger average values for relevant documents). Genetic programming is used to find a combination of a subset of the 12 proposed term-term proximity measures that forms a learned proximity score. An FT subset of 69,500 documents from TREC Disk 4 and 55 topics (a subset of the Ad Hoc Topics from TREC-6+7) are used as training data. scoreES and a scoreBM25 variant are used as baselines and linearly combined with the learned proximity score: on the training data, for ES (MAP as fitness metric), the proximity score generates significant improvements for prox5 and prox6 (Wilcoxon signed rank test, p<0.05); for BM25, proximity scores do not significantly improve the MAP values. An additional proximity-enhanced baseline is scoreES combined with MinDist as proximity function as used by Tao and Zhai, which is not significantly better than scoreES on the training data. For most test collections, prox6 linearly combined with scoreES also significantly improves over the scoreES baseline, and prox5 still significantly improves for FBIS.

4.2 Comparative Analysis for Selected Score Models

In Section 4.1, we have seen that the original papers present a wealth of insightful experimental results. However, they usually compare the effectiveness of just a few of the proximity score models described in Chapter 2. Therefore, it is difficult to assess which of the scoring models provides the best retrieval quality. We seek to close this gap by performing a comparative analysis of a significant set of proximity score models in one single evaluation framework with four test beds.

In our experiments, we use an open-source implementation of the MapReduce framework, Hadoop in version 0.20 on Linux. Hadoop runs on a cluster of 10 servers in the same network, where each server has 8 CPU cores plus 8 virtual cores through hyperthreading, 32GB of memory, and four local hard drives of 1TB each. The implementation has been done completely in Java 1.6.

We evaluate the retrieval quality for various parameter combinations and a set of individual scoring models. The evaluation is accelerated, and partially only enabled, by the distributed evaluation in the Hadoop framework. For each scoring model, we have implemented a separate class which can be fully customized and plugged into our evaluation driver. Each scoring model class includes a list of all parameters that are evaluated. There is one file per test bed which contains all information that is necessary to perform the evaluation. This encompasses the document corpus, its characteristics (such as avgdl, N, and dt), the topic sets to be evaluated, and the relevance assessments. In addition, the file specifies query readers and document parsers plus optional Hadoop-related parameters to be used during evaluation; we employ the Galago Parser to parse the document collection.

The evaluation makes use of two jobs: in the map phase of the first job, we evaluate the query load for all documents and for each configuration. (A configuration consists of one scoring model with one parameter combination.) In the reduce phase of the first job, we aggregate, for each configuration and topic, retrieval quality statistics. The second job reconciles the per-topic, per-method, and per-metric results into averages per method and metric; the work is done exclusively in the reduce phase.
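To illustrate the structure of such an aggregation step, the sketch below shows a Hadoop job in the spirit of the second job described above. It is our own simplified illustration, not the thesis code: it is written against a newer Hadoop API than version 0.20, its map phase extracts the grouping key (whereas the thesis states its second job works exclusively in the reduce phase), and the tab-separated intermediate format "configuration, metric, topic, value" is an assumption.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal sketch: averaging per-topic retrieval quality values per
    // (scoring model configuration, metric) pair. Input lines are assumed to be
    //   <configuration> TAB <metric> TAB <topic> TAB <value>
    public class AverageQualityJob {

        public static class QualityMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\t");
                if (fields.length < 4) return;                 // skip malformed lines
                String configAndMetric = fields[0] + "\t" + fields[1];
                context.write(new Text(configAndMetric), new DoubleWritable(Double.parseDouble(fields[3])));
            }
        }

        public static class AverageReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text configAndMetric, Iterable<DoubleWritable> values, Context context)
                    throws IOException, InterruptedException {
                double sum = 0.0;
                int topics = 0;
                for (DoubleWritable v : values) { sum += v.get(); topics++; }
                context.write(configAndMetric, new DoubleWritable(sum / topics));  // e.g., MAP or mean NDCG@10
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "average retrieval quality");
            job.setJarByClass(AverageQualityJob.class);
            job.setMapperClass(QualityMapper.class);
            job.setReducerClass(AverageReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }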
We will now briefly describe the various test beds we use across this section. For the Web Track, we evaluate the retrieval quality for the Web Tracks in 2000 and 2001 on the WT10g collection: topics 451-500 denote evaluations with the Web Track Topics from TREC-9 (2000), and topics 501-550 denote evaluations with the Web Track, Ad Hoc Topics from TREC-10 (2001). For the Web Track, we additionally show the influence of limiting the evaluated topics to those that consist of at least two query terms, namely 451-500+2 and 501-550+2. This is intended to show the effects of proximity scores (which need at least two query terms to become effective). ALL encompasses all topics from both years (i.e., topics 451-550).

For the Robust Track, we evaluate the retrieval quality of the Ad Hoc Topics on TREC Disks 4 and 5 (without the Congressional Record data). While topics 301-350 denote evaluations with the Ad Hoc Topics from TREC-6 (1997), topics 351-400 denote evaluations with the Ad Hoc Topics from TREC-7 (1998). Topics 401-450 evaluate using the Web Track, Ad Hoc Topics from TREC-8 (1999), 601-650 represent the new topics from the TREC-12 Robust Track (2003), and 651-700 the new topics from the TREC-13 Robust Track (2004). ALL encompasses all topics from the five years, thereby considering the result quality values from topics 301-450 and 601-700 for each run.

For the Ad Hoc Tasks of the Terabyte Track, we evaluate the retrieval quality on the GOV2 collection. Topics 701-750, 751-800, and 801-850 denote evaluations with topics from TREC-13 (2004), TREC-14 (2005), and TREC-15 (2006), respectively. ALL encompasses all topics from the three years (topics 701-850).

For INEX, we evaluate the 68 Ad Hoc Track Topics from 2009 and the 52 Ad Hoc Track Topics from 2010 on the INEX Wikipedia collection from 2009. ALL encompasses the 120 topics from both years. In our evaluation, documents are considered relevant if they contain some characters marked as relevant.

4.2.1 Experimental Comparison of Scoring Models

For each test bed, we measure the retrieval quality using the NDCG@10, NDCG@100, P@10, P@100, and MAP retrieval metrics. Stop words are removed and query terms are stemmed. For each test bed, we compare the result quality of the scoring models by Büttcher et al., Rasolofo and Savoy, Zhao and Yun, Tao and Zhai, Lv and Zhai, Song et al., and de Kretser and Moffat. Furthermore, we evaluate LM with Dirichlet smoothing, ES, and BM25 as content scores. We vary the parameters for each scoring model. For the Terabyte Track test bed, we cannot evaluate Lv and Zhai's scoring model within a reasonable amount of time due to the collection size and the positional language models that need to be constructed for every position in each document.

Evaluation Using Web Track Test Beds

Figures A.1 to A.3 in Appendix A show the best NDCG, precision, and MAP values for the Web Track test beds. Song et al.'s and Büttcher et al.'s scoring models have the highest NDCG values; Tao and Zhai's approach often performs similarly well. Tao and Zhai's scoring model and Büttcher et al.'s scoring model provide the highest precision values. Song et al.'s and Büttcher et al.'s scoring models perform best for the MAP metric. De Kretser and Moffat's approach performs worse than its competitors.
Restricting the test beds involving topics 451-500 and 501-550 to those with at least two query terms (451-500+2, 501-500+2) yields a higher overall retrieval quality, but does not influence the order of result quality among the scoring models. Evaluation Using Robust Track Test Beds Figures A.4 to A.6 in Appendix A show the best NDCG, precision, and MAP values for the Robust Track test beds. The best performing scoring models on the Robust Track test beds are the ones by Büttcher et al., Tao and Zhai, and Song et al.; there is no clear winner among these three models, usually they achieve similar retrieval quality values. Like for the other test beds, de Kretser and Moffat’s approach falls behind the quality of the remaining scoring models. Evaluation Using Terabyte Track Test Beds Figures A.7 to A.9 in Appendix A show the best NDCG, precision, and MAP values for the Terabyte Track test beds. Büttcher et al.’s scoring model yields the highest retrieval quality for all test beds and retrieval metrics except for 751-800 with the MAP metric where it performs slightly weaker than Song et al.’s scoring model. For all other test beds Song et al.’s scoring model performs second best. BM25, Rasolofo and Savoy’s scoring model perform similarly, but still good. For the NDCG and precision metrics, the Dirichlet smoothed language model and Zhao and Yun’s scoring model yield similar retrieval quality, however often slightly weaker than BM25 and Rasolofo and Savoy’s scoring model. For the MAP metric, Zhao and Yun’s model falls behind the Dirichlet smoothed language model. De Kretser and Moffat’s scoring model is far weaker than all other scoring models. 68 4. Evaluation for Selected Score Models Evaluation Using INEX Test Beds Figures A.10 to A.12 in Appendix A show the best NDCG, precision, and MAP values for the INEX test beds. Büttcher et al.’s scoring model yields the highest retrieval quality on all test beds (2009, 2010, and ALL) for all employed retrieval metrics. Song et al.’s approach always performs second best. Song et al.’s model outperforms BM25, Rasolofo and Savoy, LM with Dirichlet smoothing, and Zhao and Yun’s approach; the latter four provide a similar retrieval quality and are a bit stronger than Lv and Zhai’s approach. Tao and Zhai’s scoring model and ES are similarly strong and a bit weaker than Lv and Zhai’s scoring model. De Kretser and Moffat’s approach is far behind. 4.2.2 Individual Scoring Models This subsection details the parameter settings which have been evaluated for the scoring models. To keep the evaluation manageable, in this subsection, we restrict ourselves to evaluating all topics available for the test beds (i.e., topic set ALL). In the result tables we abbreviate the four resulting test beds by WEB, ROBUST, TERABYTE, and INEX. BM25: For the (disjunctive) evaluation of the BM25 scoring model, we vary the parameters k1 and b: k1 ∈ {0.25, 0.4, 0.75, 1.0, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 2.0, 2.5} (12 vari- ants), b ∈ {0.25, 0.3, 0.5} (3 variants), and 2 variants of idf (idf1 and idf3). k is always set to k1. Therefore, in total 12 · 3 · 2 = 72 runs are evaluated per test bed. While for the MAP metric, smaller choices of b (0.25 and 0.3) work good for WEB and INEX, TERABYTE prefers larger b (0.3 and 0.5). For ROBUST, there does not seem to be a consistent best choice for b among the best parameter settings. k1 should be small or medium-valued: k1 ≤ 1 yields the best results for WEB and ROBUST, 0.75 ≤ k1 ≤ 1.2 for TERABYTE, and 0.4 ≤ k1 ≤ 1.0 for INEX. 
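For reference, the following sketch spells out the BM25 content score whose parameters k1, k, and b are varied here; it follows the formulation used elsewhere in this thesis, score(d,q) = sum over query terms of idf(t) * tf * (k1 + 1) / (tf + K) with K = k * ((1 - b) + b * dl/avgdl). The concrete idf implementation below is only a stand-in for the idf1 and idf3 variants evaluated above.

```java
import java.util.List;
import java.util.Map;

// Sketch of the BM25 content score with tuning parameters k1, k, and b
// (in our experiments, k is always set to k1).
public class Bm25 {
    final double k1, k, b;
    final long numDocuments;      // N
    final double avgDocLength;    // avgdl

    Bm25(double k1, double k, double b, long numDocuments, double avgDocLength) {
        this.k1 = k1; this.k = k; this.b = b;
        this.numDocuments = numDocuments; this.avgDocLength = avgDocLength;
    }

    // Stand-in idf; the thesis evaluates two concrete variants (idf1 and idf3).
    double idf(long documentFrequency) {
        return Math.log((numDocuments - documentFrequency + 0.5) / (documentFrequency + 0.5))
                / Math.log(2);
    }

    // score(d,q) = sum_t idf(t) * tf(d,t) * (k1 + 1) / (tf(d,t) + K),
    // with K = k * ((1 - b) + b * dl / avgdl).
    double score(Map<String, Integer> termFrequencies, int docLength,
                 List<String> queryTerms, Map<String, Long> documentFrequencies) {
        double bigK = k * ((1.0 - b) + b * docLength / avgDocLength);
        double score = 0.0;
        for (String t : queryTerms) {
            Integer tf = termFrequencies.get(t);
            if (tf == null || tf == 0) continue;          // disjunctive evaluation
            Long df = documentFrequencies.get(t);
            score += idf(df == null ? 0L : df) * tf * (k1 + 1.0) / (tf + bigK);
        }
        return score;
    }
}
```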
The choice of the idf variant does not seem to be important for the retrieval quality. For the NDCG@10 metric, results tend to be better if k1 is chosen larger, i.e., k1 ≥ 1.5 for WEB, 0.75 ≤ k1 ≤ 1.2 for ROBUST, k1 ≥ 1.2 for TERABYTE, and 0.75 ≤ k1 ≤ 1.5 for INEX. Good choices for b are similar as for MAP. optimize NDCG@10 optimize MAP Collection k1 k b idf NDCG@10 k1 k b idf MAP WEB 2.00 2.00 0.30 idf3 0.3436 0.75 0.75 0.25 idf3 0.2010 ROBUST 0.75 0.75 0.30 idf1 0.4329 0.75 0.75 0.30 idf1 0.2325 TERABYTE 1.60 1.60 0.50 idf1 0.5004 1.00 1.00 0.30 idf1 0.2973 INEX 1.00 1.00 0.30 idf1 0.6148 0.75 0.75 0.30 idf1 0.3389 Table 4.1: BM25: optimal tuning parameter setting with NDCG@10 and MAP values. Table 4.1 contains the optimal tuning parameter settings for BM25 with NDCG@10 and MAP values for all test beds. 4.2 Comparative Analysis for Selected Score Models 69 Büttcher et al.: As described in Section 2.4.2, the proximity score part of Büttcher et al.’s scoring function is defined as pscore(d,q) = ∑ t∈q min{1, idf1(t)} accd(t) · (k1 + 1) accd(t) + K . We evaluate the effects of shrinking the influence of the pscore in Büttcher et al.’s scoring function, substituting min{1, idf1(t)} by min{minidf,idf1(t)}, where minidf ∈ {0, 0.1, 0.2, . . . , 1.0, 1.5, 2.0, 10000} (14 variants). For the cscore part (i.e., a BM25 score variant), we evaluate the 72 BM25 tuning parameter combinations as described in the paragraph dealing with BM25. Thus, we evaluate 14 · 72 = 1, 008 runs in total per test bed. To achieve high MAP values, smaller values of b like 0.25 or 0.3 and relatively small values for k1 are preferable on all test beds: on TERABYTE and INEX 0.4 ≤ k1 ≤ 1.3 perform best, on ROBUST 0.4 ≤ k1 ≤ 1.0, and on INEX 0.25 ≤ k1 ≤ 1.0. To achieve high NDCG@10 values, smaller values of b usually work well: b=0.25 performs best for WEB and INEX, and b ∈ {0.25, 0.3} for ROBUST test beds. In contrast, TERABYTE achieves high NDCG@10 values for larger values of b, i.e., b ∈ {0.3, 0.5}. Medium-size choices of k1 perform well: 0.75 ≤ k1 ≤ 1.3 for WEB, 0.4 ≤ k1 ≤ 1.3 for ROBUST, 0.6 ≤ k1 ≤ 1.5 for INEX test beds. Best runs for the TERABYTE test bed tend to have larger choices of k1, i.e., k1 ≥ 1.0. Limiting the influence of the proximity score part makes sense, i.e., minidf = 10000 usually performs worse than values below 2: the best NDCG@10 runs use 0.5 ≤ minidf ≤ 1.0 for WEB and ROBUST, 0.6 ≤ minidf ≤ 1.5 for TERABYTE, and 0.6 ≤ minidf ≤ 2.0 for INEX. The best MAP runs use 0.4 ≤ minidf ≤ 2.0 for WEB, 0.5 ≤ minidf ≤ 1.5 for ROBUST, 0.9 ≤ minidf ≤ 2.0 for TERABYTE, and 0.7 ≤ minidf ≤ 2.0 for INEX. The choice of the idf-variant has practically no impact on the result quality for all test beds and both retrieval metrics. optimize NDCG@10 optimize MAP Collection k1 k b minidf idf NDCG@10 k1 k b minidf idf MAP WEB 1.00 1.00 0.25 0.7 idf1 0.3528 0.4 0.4 0.3 0.8 idf3 0.2131 ROBUST 0.75 0.75 0.30 0.9 idf1 0.4471 0.75 0.75 0.30 0.9 idf1 0.2469 TERABYTE 1.60 1.60 0.30 1.0 idf3 0.5199 0.75 0.75 0.30 1.5 idf3 0.3257 INEX 1.30 1.30 0.25 1.0 idf1 0.6398 0.75 0.75 0.25 2.0 idf1 0.3764 Table 4.2: Büttcher et al.’s scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. Table 4.2 contains the optimal tuning parameter settings for Büttcher et al.’s scoring model with NDCG@10 and MAP values for all test beds. Rasolofo and Savoy: For the evaluation of Rasolofo and Savoy’s scoring model, we use the same parameters as for BM25. 
In addition, dist ∈ {5, 10, 20, 10000} (4 variants) is varied which specifies the text window width where pairs of query term occurrences influence each other’s proximity contribution. Therefore, 72 · 4 = 288 parameter combinations are evaluated per test bed. 70 4. Evaluation for Selected Score Models For WEB and INEX, smaller choices of b (0.25 or 0.3) usually generate better NDCG@10 and MAP values. For NDCG@10, both WEB and TERABYTE work best with medium and larger-valued k1: 1.0 ≤ k1 ≤ 2.5. To yield good NDCG@10 values for ROBUST and INEX, k1 should be chosen not that large: 0.4 ≤ k1 ≤ 1.5 and 0.75 ≤ k1 ≤ 1.7 perform best. To achieve good MAP performance, k1 values should be chosen a bit smaller than for NDCG@10 values, i.e., 0.25 ≤ k1 ≤ 1.2 for ROBUST and INEX, 0.75 ≤ k1 ≤ 1.4 for TERABYTE, and k1 ≤ 1.7 for WEB. There is a high impact of the choice of dist, especially for the NDCG@10 metric. Unfortunately, it is not clear whether to choose high or low dist values. The highest peaks are generated for the TERABYTE test bed: if chosen wrong, the NDCG@10 value can drop from 50% to 40%. Only the MAP value on WEB and ROBUST is not influenced much by the dist parameter. Furthermore, the choice of the idf version has a similar impact as dist and there is no tendency which idf version to prefer. Consequently, Rasofolo and Savoy’s scoring model is difficult to tune. optimize NDCG@10 optimize MAP Collection k1 k b dist idf NDCG@10 k1 k b dist idf MAP WEB 2.00 2.00 0.30 10 idf3 0.3436 0.40 0.40 0.25 5 idf1 0.2059 ROBUST 0.75 0.75 0.50 10 idf1 0.4329 0.40 0.40 0.30 20 idf3 0.2276 TERABYTE 1.60 1.60 0.50 20 idf1 0.5004 0.75 0.75 0.30 10,000 idf3 0.2925 INEX 1.00 1.00 0.30 5 idf1 0.6148 0.40 0.40 0.25 5 idf1 0.3396 Table 4.3: Rasolofo and Savoy’s scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. Table 4.3 contains the optimal tuning parameter settings for Rasolofo and Savoy’s scoring model with NDCG@10 and MAP values for all test beds. Language Model with Dirichlet smoothing: For the Dirichlet smoothed language model we vary the smoothing parameter μ ∈ {500, 750, 1000, 1250, 1500} which leads to five evaluated runs per test bed. The evaluation shows that the spread in result quality is usually very small so that the choice of the tuning parameter does not have a large influence. Nevertheless, often smaller choices of μ yield small improvements, e.g., for the MAP value on ROBUST and INEX as well as the NDCG@10 value on ROBUST and TERABYTE. For MAP on WEB and NDCG@10 on INEX, larger choices of μ are often slightly better. There is no clear tendency whether to choose μ small or large for MAP on TERABYTE and NDCG@10 on WEB. optimize NDCG@10 optimize MAP Collection μ NDCG@10 μ MAP WEB 750 0.3233 1,500 0.1988 ROBUST 500 0.4831 500 0.2340 TERABYTE 500 0.4255 750 0.3021 INEX 1,500 0.5984 500 0.3359 Table 4.4: Language Model with Dirichlet smoothing: optimal tuning parameter setting with NDCG@10 and MAP values. Table 4.4 contains the optimal tuning parameter settings for Language Model with 4.2 Comparative Analysis for Selected Score Models 71 Dirichlet smoothing with NDCG@10 and MAP values for all test beds. 
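The Dirichlet-smoothed language model tuned here has a single parameter μ. A minimal sketch of the standard query-likelihood formulation, score(d,q) = sum over query terms of log((tf(t,d) + μ·p(t|C)) / (|d| + μ)), is shown below; it reflects the common textbook formulation rather than the exact code of our framework.

```java
import java.util.List;
import java.util.Map;

// Sketch of query-likelihood retrieval with Dirichlet smoothing:
// p(t|d) = (tf(t,d) + mu * p(t|C)) / (|d| + mu); documents are ranked by the
// sum of log p(t|d) over all query terms. mu is the smoothing parameter
// (e.g., 500 to 1,500 as evaluated above).
public class DirichletLm {
    final double mu;
    final Map<String, Double> collectionLm;   // background model p(t|C)

    DirichletLm(double mu, Map<String, Double> collectionLm) {
        this.mu = mu;
        this.collectionLm = collectionLm;
    }

    double score(Map<String, Integer> termFrequencies, int docLength, List<String> queryTerms) {
        double logLikelihood = 0.0;
        for (String t : queryTerms) {
            Double pC = collectionLm.get(t);
            if (pC == null) continue;                      // term unseen in the collection
            Integer tf = termFrequencies.get(t);
            double tfValue = (tf == null) ? 0.0 : tf;
            logLikelihood += Math.log((tfValue + mu * pC) / (docLength + mu));
        }
        return logLikelihood;
    }
}
```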
Zhao and Yun: For the evaluation of Zhao and Yun’s score we use Dirichlet smooth- ing with smoothing parameter μ ∈ {500, 1000, 2000, 3000, 5000} (5 variants), scaling parameter x ∈ {1.5, 2.0, 2.5, 3.0, 4.0, 5.0} (6 variants), λ ∈ {3.0, 5.0, 7.5, 10.0, 12.5} (5 variants), and proximate centrality chosen between minimum distance, average dis- tance, and summed distance (3 variants). This amounts to 5 · 6 · 5 · 3 = 450 evaluated parameter combinations per test bed. To achieve high NDCG@10 values, usually μ = 500 is a good choice on all test beds; only for INEX μ ∈ {1000, 2000} performs better. The best MAP-oriented runs use μ = 500 for INEX and ROBUST, μ = 1000 for TERABYTE, and μ = 2000 for WEB test beds. For NDCG@10, while among the top runs for ROBUST and TERABYTE the average distance measure is most frequently used and for INEX the minimum distance measure is most frequently used as proximate centrality measure, there are no noticeable tendencies for WEB. For MAP, while the summed distance and minimum distance measures are most frequent among the top runs of ROBUST and INEX, there are no noticeable tendencies for WEB and TERABYTE. x and λ are very heterogeneously chosen among the top runs for all test beds; therefore, it is hard to give a general recommendation for choices of x and λ. optimize NDCG@10 optimize MAP Collection μ x λ P roxB (ti) NDCG@10 μ x λ P roxB (ti) MAP WEB 500 5 7.5 P roxAvgDist(ti) 0.3235 2000 5 5 P roxM inDist(ti) 0.1989 ROBUST 500 4 7.5 P roxAvgDist(ti) 0.4255 500 2.5 3 P roxM inDist(ti) 0.2340 TERABYTE 500 4 7.5 P roxAvgDist(ti) 0.4866 1000 4 3 P roxSumDist(ti) 0.2738 INEX 2000 1.5 10 P roxAvgDist(ti) 0.6008 500 2.5 5 P roxSumDist(ti) 0.3366 Table 4.5: Zhao and Yun’s scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. Table 4.5 contains the optimal tuning parameter settings for Zhao and Yun’s scoring model with NDCG@10 and MAP values for all test beds. Tao and Zhai: For the evaluation of Tao and Zhai’s score we use a language model with Dirichlet smoothing and smoothing parameter μ ∈ {100, 250, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 5000} (11 variants), α ∈ {0.1, 0.2, . . . , 1.5, 2.0} (16 variants), and the kernels MinDist, AvgDist, and MaxDist (3 variants). This results in 11·16·3 = 528 evaluated runs per test bed. The evaluation shows that the choice of the kernel has only a minor influence on result quality, although for the ROBUST and INEX test beds MinDist appears fre- quently among the best runs. Smaller and medium-valued choices of μ perform usually better than large choices. If one aims at optimizing MAP values, μ ∈ {500, 1000, 1500} works well for TERABYTE and INEX, μ ∈ {250, 500, 1000} for ROBUST, and μ ∈ {1000, 1500, 2000, 2500} for WEB. If one aims at optimizing NDCG@10 values, μ ∈ {250, 500, 1000} yields best values for TERABYTE (μ = 500 works especially well), μ ∈ {500, 1000, 1500, 2000} for INEX, μ ∈ {250, 500, 1000, 1500} for ROBUST, and 72 4. Evaluation for Selected Score Models μ ∈ {500, 1000, 1500} for WEB. Setting μ=500 works with all test beds and metrics. It is unclear how to select α: almost all values of α are represented within the best runs on all test beds with both metrics. 
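To illustrate the distance measures ("kernels") varied for Tao and Zhai's model, the sketch below computes MinDist, AvgDist, and MaxDist over the positions of the matched query terms and folds the chosen value into the score via a log(α + e^(−δ)) proximity bonus added to the content score, following our reading of Tao and Zhai's model as summarized in Chapter 2. The exact definitions in the implementation may differ in details, so this is an assumption-laden illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of pairwise distance aggregates (MinDist, AvgDist, MaxDist) over matched
// query terms, plus the log(alpha + e^{-delta}) bonus added to a content score.
public class TaoZhaiProximity {

    // Closest distance between occurrences of two distinct query terms.
    static int pairDistance(List<Integer> positionsA, List<Integer> positionsB) {
        int best = Integer.MAX_VALUE;
        for (int a : positionsA) {
            for (int b : positionsB) {
                best = Math.min(best, Math.abs(a - b));
            }
        }
        return best;
    }

    // delta(q,d) for kernel in {"MinDist", "AvgDist", "MaxDist"}; positions maps each
    // matched query term to its occurrence positions in the document.
    static double delta(Map<String, List<Integer>> positions, String kernel) {
        List<String> terms = new ArrayList<String>(positions.keySet());
        if (terms.size() < 2) return 0.0;                 // undefined for a single matched term
        double min = Double.MAX_VALUE, max = 0.0, sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size(); j++) {
                double d = pairDistance(positions.get(terms.get(i)), positions.get(terms.get(j)));
                min = Math.min(min, d);
                max = Math.max(max, d);
                sum += d;
                pairs++;
            }
        }
        if ("MinDist".equals(kernel)) return min;
        if ("MaxDist".equals(kernel)) return max;
        return sum / pairs;                               // AvgDist
    }

    // Overall score: content score (e.g., a Dirichlet-smoothed LM) plus the proximity bonus.
    static double score(double contentScore, double alpha,
                        Map<String, List<Integer>> positions, String kernel) {
        if (positions.size() < 2) return contentScore;    // no bonus for single-term matches
        return contentScore + Math.log(alpha + Math.exp(-delta(positions, kernel)));
    }
}
```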
optimize NDCG@10 optimize MAP Collection μ α kernel NDCG@10 μ α kernel MAP WEB 1,000 0.9 MaxDist 0.3269 2,000 0.8 MinDist 0.2057 ROBUST 500 1.1 MinDist 0.4207 500 0.7 MinDist 0.2313 TERABYTE 500 0.4 MaxDist 0.4517 1,000 0.6 MinDist 0.2572 INEX 500 0.8 MinDist 0.5205 500 1.0 MinDist 0.3080 Table 4.6: Tao and Zhai’s scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. Table 4.6 contains the optimal tuning parameter settings for Tao and Zhai’s scoring model with NDCG@10 and MAP values for all test beds. ES: As described in Section 2.2, Cummins and O’Riordan linearly combine the ES score with their proximity score combinations learned by Genetic Programming. As we do not have an implementation of this non-trivial Genetic Programming framework, we just use ES as another content-score without any parameters, generating only one run per test bed. Consequently, we cannot provide any optimal tuning parameter settings for NDCG@10 and MAP values. Lv and Zhai: The scoring model proposed by Lv and Zhai builds one positional language model for each word position in the document. As a consequence, it is com- putationally very expensive so that we could not evaluate the TERABYTE test bed. To further reduce computation costs, we have evaluated using the Gaussian kernel with Dirichlet prior smoothing as the Gaussian kernel is considered superior by Lv and Zhai (cf. Section 4.1.3). We have evaluated four parameter combinations, namely σ ∈ {25, 275} and μ ∈ {1000, 2000}. To rank, we scored each document by the best position in that document. For the three test beds (WEB, ROBUST, and INEX), we found that both for NDCG@10 and MAP used as retrieval quality metric, the combination μ = 1000 and σ = 275 yielded the best results. Thus, the intercollection generalization results are perfect (always 1.0000) on this restricted set of parameter combinations for all pairs of test beds. Song et al.: We evaluate Song et al.’s scoring model with k1 ∈ {0.25, 0.4, 2.5} (3 vari- ants), b ∈ {0.3, 0.45} (2 variants), two idf implementations, x ∈ {0, 0.25, 0.5, 0.55, 0.75, 1} (6 variants), y ∈ {0, 0.25, 0.5, 0.75, 1} (5 variants), and dmax ∈ {5, 45, 100} (3 variants) which amounts to 3 · 2 · 2 · 6 · 5 · 3 = 1, 080 parameter combinations. Large k1 (2.5) work best for NDCG@10 with the WEB and TERABYTE test beds. The best parameter combinations for INEX include medium and large choices of k1 (i.e., 0.4 and 2.5), whereas we obtain the best values on ROBUST for small and medium choices of k1 (0.25 and 0.4). For MAP, recommended choices of k1 are more homoge- neous: small and medium choices of k1 (i.e., 0.25 or 0.4) yield the best MAP values 4.2 Comparative Analysis for Selected Score Models 73 for all test beds. For all test beds and both MAP and NDCG@10 metrics, dmax = 5 is very frequent among the top performing parameter settings. Except for TERABYTE and the NDCG@10 metric (where the choice of b is unclear), b set to 0.3 is common for the best runs. The choice of the idf version has only a minor impact on the result quality. Choices of x and y among the best runs are too heterogeneous to say anything meaningful about them. 
optimize NDCG@10 optimize MAP Collection k1 k b idf x y dmax NDCG@10 k1 k b idf x y dmax MAP WEB 2.5 2.5 0.30 idf1 0.25 0.75 5 0.3471 0.4 0.4 0.3 idf3 0.5 0.5 5 0.2114 ROBUST 0.4 0.4 0.3 idf1 0.55 0.25 5 0.4418 0.4 0.4 0.3 idf1 0.5 0.25 5 0.2440 TERABYTE 2.5 2.5 0.45 idf1 0.25 0.25 5 0.5138 0.4 0.4 0.3 idf3 0.75 0.25 5 0.3214 INEX 0.4 0.4 0.3 idf1 0.5 0.0 5 0.6244 0.4 0.4 0.3 idf1 0.55 0.25 5 0.3693 Table 4.7: Song et al.’s scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. Table 4.7 contains the optimal tuning parameter settings for Song et al.’s scoring model with NDCG@10 and MAP values for all test beds. De Kretser and Moffat: We evaluate de Kretser and Moffat’s scoring model in conjunctive and disjunctive mode (2 variants), using the contribution functions triangle, cosine, circle, arc, circle’, and arc’ (6 variants), and the two algorithms to obtain a ranking for documents (2 variants) which generates 2 · 6 · 2 = 24 runs in total. De Kretser and Moffat’s scoring model performs worse than all other scoring models we have evaluated. Nevertheless parameters have a high influence on result quality also for this scoring model; conjunctive evaluation of queries always provides higher retrieval quality than disjunctive evaluation which is especially high for TERABYTE and INEX test beds. For NDCG@10 advantages amount to about 12 and 17 percentage points, for MAP to around 7 and 5 percentage points, respectively. Only for the ROBUST test bed with MAP values, it is not clear whether one should use conjunctive or disjunctive query evaluation. To obtain a ranking for documents, choosing the first algorithm that greedily aggregates scores from different positions and documents (first algorithm) is often a bit better (below 2 percentage points) than considering the position with maximum score per document (second algorithm). This does not hold for the WEB test bed: if one wants to optimize NDCG@10 values, it remains unclear which ranking algorithm yields the better retrieval quality, for MAP values the second algorithm is often slightly better than the first algorithm. optimize NDCG@10 optimize MAP Collection conj./disj. kernel algorithm NDCG@10 conj./disj. kernel algorithm MAP WEB conjunctive circle’ 2nd algorithm 0.2357 conjunctive circle 2nd algorithm 0.1323 ROBUST conjunctive circle 1st algorithm 0.3340 disjunctive circle’ 1st algorithm 0.1493 TERABYTE conjunctive arc 1st algorithm 0.2370 conjunctive circle 1st algorithm 0.1671 INEX conjunctive circle 1st algorithm 0.4092 conjunctive circle’ 1st algorithm 0.2005 Table 4.8: De Kretser and Moffat’s scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. Table 4.8 contains the optimal tuning parameter settings for de Kretser and Moffat’s 74 4. Evaluation for Selected Score Models scoring model with NDCG@10 and MAP values for all test beds. For all test beds, de Kretser and Moffat’s approach falls behind the retrieval quality of the remaining scoring models. 4.2.3 Intercollection and Intracollection Generalization Results In this subsection, we measure both intercollection and intracollection generalization performance of different scoring models. To measure the intercollection generalization performance of a scoring model for a given evaluation metric, Metzler [Met06b] first computes the parameter combination for a training test bed that achieves the highest retrieval quality. 
This parameter combination is used as parameter combination for the test data (another test bed); the resulting retrieval quality m′ is divided by the best retrieval quality m∗ achievable on the test data to compute the effectiveness ratio G = m ′ m∗ . We use one document collection with the corresponding topic set ALL as training test bed to check the retrieval quality for some parameter combinations. Then, we employ the parameter combination that yields the highest retrieval quality for the training data with a different document collection with the corresponding topic set ALL for that second collection (test data). According to Metzler [Met06b], an ideal model that generalizes perfectly achieves an effectiveness ratio of 1. While effectiveness ratios below 0.90 indicate a scoring model’s missing ability to generalize, the most reasonable retrieval models have an effectiveness ratio above 0.95. Table 4.9 shows the intercollection generalization results for various scoring models for both the NDCG@10 and the MAP metric. Büttcher et al.’s scoring model and the LM approach with Dirichlet smoothing generalize especially well: all effectiveness ratios for both metrics are above 95%. In most cases, for the NDCG@10 metric, the other scoring models do not generalize as well as these two approaches, but usually still generalize reasonably well: BM25’s and Rasolofo and Savoy’s effectiveness ratios are always above 93.63%; when trained on TERABYTE or INEX, the ratio even always exceeds 95%. Zhao and Yun’s scoring model’s effectiveness ratio is always above 93.13%; when trained on WEB, ROBUST or TERABYTE, the ratio overscores 95%. For Song et al.’s scoring model the effectiveness ratio overscores 93.18%; when trained on WEB, ROBUST or INEX, the ratio overscores 95%. Only de Kretser and Moffat’s scoring model slightly underscores the 90% bound at a level of 87.91% when it is trained on WEB and tested with TERABYTE which may still be acceptable. For the MAP metric, BM25 and Song et al.’s scoring model have an effectiveness ratio above 95% and thus generalize very well. Zhao and Yun’s scoring model and Tao and Zhai’s scoring model have a high effectiveness ratio of at least 94.33% and 94.28%, respectively; when trained on ROBUST, TERABYTE or INEX, the ratio always ex- ceeds 95%. Rasolofo and Savoy’s scoring model slightly underscores the 90% bound at a level of 88.17%; it may still be acceptable, especially when trained with ROBUST or TERABYTE. de Kretser and Moffat’s scoring model is not able to generalize: when 4.2 Comparative Analysis for Selected Score Models 75 trained on ROBUST and tested on TERABYTE the effectiveness ratio is just 52.43%. Scoring model NDCG@10 MAP Train\Test WEB ROBUST TERABYTE INEX WEB ROBUST TERABYTE INEX BM25 WEB - 0.9795 0.9498 0.9708 - 0.9956 0.9829 0.9895 ROBUST 0.9363 - 0.9790 0.9613 0.9919 - 0.9989 1.0000 TERABYTE 0.9708 0.9819 - 0.9512 0.9875 0.9887 - 0.9919 INEX 0.9860 0.9974 0.9664 - 0.9919 1.000 0.9989 - Büttcher et al. 
WEB - 0.9952 0.9775 0.9927 - 0.9947 0.9816 0.9727 ROBUST 0.9848 - 0.9861 0.9795 0.9802 - 0.9890 0.9879 TERABYTE 0.9789 0.9787 - 0.9696 0.9848 0.9906 - 0.9933 INEX 0.9824 0.9917 0.9930 - 0.9907 0.9827 0.9979 - Rasolofo, Savoy WEB - 0.9795 0.9498 0.9708 - 0.9767 0.8817 1.0000 ROBUST 0.9363 - 0.9790 0.9613 0.9748 - 0.9633 0.9860 TERABYTE 0.9708 0.9819 - 0.9512 0.9712 0.9993 - 0.9944 INEX 0.9860 0.9974 0.9664 - 1.0000 0.9767 0.8817 - LM, Dirichlet WEB - 0.9965 0.9928 0.9944 - 0.9650 0.9815 0.9673 ROBUST 0.9940 - 1.0000 0.9951 0.9809 - 0.9824 1.0000 TERABYTE 0.9940 1.0000 - 0.9951 0.9909 0.9913 - 0.9996 INEX 0.9792 0.9844 0.9541 - 0.9809 1.0000 0.9824 - Zhao, Yun WEB - 0.9997 0.9997 0.9875 - 0.9482 0.9534 0.9433 ROBUST 0.9973 - 1.0000 0.9872 0.9745 - 0.9840 0.9999 TERABYTE 0.9973 1.0000 - 0.9872 0.9831 0.9820 - 0.9882 INEX 0.9492 0.9677 0.9313 - 0.9635 0.9986 0.9779 - Tao, Zhai WEB - 0.9727 0.9717 0.9812 - 0.9526 0.9515 0.9428 ROBUST 0.9782 - 0.9900 0.9971 0.9724 - 0.9916 0.9972 TERABYTE 0.9745 0.9750 - 0.9632 0.9824 0.9805 - 0.9700 INEX 0.9815 0.9917 0.9897 - 0.9666 0.9977 0.9855 - Song et al. WEB - 0.9610 0.9836 0.9789 - 0.9982 0.9929 0.9944 ROBUST 0.9670 - 0.9808 0.9907 0.9777 - 0.9889 0.9997 TERABYTE 0.9724 0.9318 - 0.9413 0.9778 0.9869 - 0.9965 INEX 0.9702 0.9984 0.9831 - 0.9775 0.9991 0.9924 - De Kretser, Moffat WEB - 0.9416 0.8791 0.9244 - 0.9676 0.8924 0.9149 ROBUST 0.9937 - 0.9980 1.0000 0.8070 - 0.5243 0.7134 TERABYTE 0.9782 0.9992 - 0.9987 0.9608 0.9863 - 0.9856 INEX 0.9937 1.0000 0.9980 - 0.9827 0.9813 0.9698 - Table 4.9: Intercollection generalization results for various scoring models. The intracollection generalization measure deals with how well a model trained on one topic set for a given collection generalizes to a different topic set on the same collection. Like for the intracollection generalization measure, the effectiveness ratio G = m ′ m∗ is computed. To this end, the topic set is divided in two halves; one half is used for training, the other half for evaluation. This procedure is repeated 5,000 times to compute an average value for G which represents the intracollection generalization measure. All scoring models exhibit high intracollection generalization values between 98% and 100% on all test beds with both the MAP and NDCG@10 evaluation metrics. Therefore, we do not show the exact values. 4.2.4 Sensitivity Charts Following Metzler’s work [Met06a, Met06b], we compute entropy and spread values for the scoring models. The spread of the effectiveness metric measures the quality difference between the parameter setting with the highest retrieval quality and the parameter setting with the lowest retrieval quality. Therefore, it gives an idea of how bad the results can get if we choose the wrong parameter values. Given a topic set Q and the corresponding relevance assessments R with T = (Q,R), 76 4. Evaluation for Selected Score Models the entropy is defined as H = − ∫ θ P(θ|T ) log P(θ|T ). To estimate P(θ|T ), the following procedure is performed B=5,000 times: in iter- ation b, we repeatedly sample a subset of topics from a test bed, i.e., if we have |Q| topics, we sample |Q| times with repetition. After sampling, we have obtained a subset Tb of T , and determine the best parameter combination θb for Tb. After B iterations, Metzler estimates the posterior P(θ|T ) s.t. P(θ|T ) = ∑B i=1 δ(θ,θi) B , where δ(θ,θi) denotes Kronecker’s delta. 
That means that one counts, for any given parameter combination θ, how often θ has been chosen as an optimal parameter combi- nation during the B iterations and divides this number by the number of iterations B. According to [Met06a], the spread and entropy provide a novel, robust way of looking at parameter sensitivity. Metzler claims that a model with high entropy and low spread is more stable than a model with low entropy but large spread; an ideal model features both low entropy and low spread. We think that this kind of evaluation is only fair when the number of evaluated parameter combinations is similar for all scoring models as the number of evaluated parameter combinations biases the results: the more parameter combinations of a scoring model are evaluated, the potentially higher its entropy and spread. The reason for different numbers of evaluated parameter combinations has two reasons: on the one hand, the number of parameters differs from scoring model to scoring model, on the other hand, evaluating many parameter combinations is infeasible for some scoring models as it is computationally too expensive such that experiments take arbitrarily long time. Therefore, the experiments carried out for Lv and Zhai’s scoring model (four set- tings), for the Dirichlet smoothed language model (five settings), and de Kretser and Moffat’s scoring model (24 settings) are not directly comparable to the remaining scor- ing models. Anyway, we leave them in the sensitivity charts as the entropy and spread values can be considered as a lower bound for these scoring models: if more settings had been evaluated, the values would have potentially increased. In other words: if the spread or entropy values for the scoring model under consideration are already high with a small amount of evaluated settings, the scoring model would also perform bad or even worse given more evaluated settings. When evaluating the sensitivity using NDCG@10, de Kretser and Moffat’s scor- ing model usually features a comparably high spread. This is mainly due to the re- trieval quality difference between runs using conjunctive and disjunctive evaluation. The Dirichlet smoothed language model just uses one parameter (μ) and therefore is less affected by high spreads. Lv and Zhai’s model usually features the lowest entropy which is also caused by the low number of evaluated settings. We show sensitivity charts in Appendix A to depict entropy and spread values for nine scoring models on the WEB, ROBUST, GOV, and INEX test beds. 4.2 Comparative Analysis for Selected Score Models 77 Figures A.13(a) and A.13(b) show the sensitivity of nine scoring models on the WEB test bed for the MAP and NDCG@10 evaluation metric, respectively. Figures A.14(a) and A.14(b) show the sensitivity of scoring models on the ROBUST test bed for MAP and NDCG@10 evaluation metrics, respectively. Figures A.15(a) and A.15(b) show the sensitivity of scoring models on the TERABYTE test bed for MAP and NDCG@10 evaluation metrics, respectively. Figures A.16(a) and A.16(b) show the sensitivity of scoring models on the INEX test bed for MAP and NDCG@10 evaluation metrics, respectively. For the scoring models with 24 or less evaluated parameter settings, the entropy value is naturally very low. Given that small amount of evaluated parameter settings, the spread of de Kretser and Moffat’s scoring model is very high which renders it a scoring model which is difficult to tune. 
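To make the bootstrap procedure behind the entropy and spread estimates concrete, the following sketch resamples the topic set with replacement B times, records which parameter combination wins each resample, and derives the discrete posterior, its entropy, and the spread. The data structures and method names are illustrative; the entropy uses the natural logarithm here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Sketch of Metzler-style sensitivity estimation: bootstrap the topic set B times,
// count how often each parameter combination is optimal, and compute entropy and
// spread. qualityPerConfig[c][t] holds the retrieval quality (e.g., MAP) of
// parameter combination c on topic t.
public class Sensitivity {

    // Returns { entropy, spread }.
    static double[] run(double[][] qualityPerConfig, int iterations, long seed) {
        int numConfigs = qualityPerConfig.length;
        int numTopics = qualityPerConfig[0].length;
        Random random = new Random(seed);
        Map<Integer, Integer> winCounts = new HashMap<Integer, Integer>();

        for (int b = 0; b < iterations; b++) {
            // Sample |Q| topics with replacement.
            int[] sample = new int[numTopics];
            for (int i = 0; i < numTopics; i++) sample[i] = random.nextInt(numTopics);
            // Determine the best configuration for this resampled topic set.
            int best = 0;
            double bestQuality = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < numConfigs; c++) {
                double avg = 0.0;
                for (int t : sample) avg += qualityPerConfig[c][t];
                avg /= numTopics;
                if (avg > bestQuality) { bestQuality = avg; best = c; }
            }
            Integer old = winCounts.get(best);
            winCounts.put(best, old == null ? 1 : old + 1);
        }

        // Entropy of the estimated posterior P(theta|T): -sum_theta P log P.
        double entropy = 0.0;
        for (int count : winCounts.values()) {
            double p = (double) count / iterations;
            entropy -= p * Math.log(p);
        }

        // Spread: best minus worst configuration quality on the full topic set.
        double bestAvg = Double.NEGATIVE_INFINITY, worstAvg = Double.POSITIVE_INFINITY;
        for (int c = 0; c < numConfigs; c++) {
            double avg = 0.0;
            for (int t = 0; t < numTopics; t++) avg += qualityPerConfig[c][t];
            avg /= numTopics;
            bestAvg = Math.max(bestAvg, avg);
            worstAvg = Math.min(worstAvg, avg);
        }
        return new double[] { entropy, bestAvg - worstAvg };
    }
}
```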
Among the scoring models which had at least 72 evaluated parameter settings, Song et al.’s scoring model and Tao and Zhai’s scoring model exhibit always the highest spread. BM25, Büttcher et al.’s, Zhao and Yun’s, as well as Rasolofo and Savoy’s scoring model have usually low spreads (except for Rasolofo and Savoy’s model on TERABYTE with NDCG@10 where the spread is higher) and BM25 usually offers the lowest spread. The entropy value of Büttcher et al.’s and Song et al.’s scoring model are usually highest. For the INEX test bed, Zhao and Yun’s (both for MAP and NDCG@10) and Rasolofo and Savoy’s scoring model (only for MAP) have a higher entropy. Furthermore, Zhao and Yun’s approach has a higher entropy value than Büttcher et al.’s approach for the MAP value on TERABYTE. In our setting, we think that the spread value is more meaningful than the entropy value as it measures how much retrieval quality can decrease if we choose the wrong parameter combination. 4.2.5 Summary Büttcher et al.’s scoring model and LM Dirichlet smoothing provide the best intergen- eralization values for both NDCG@10 and MAP. The other scoring models are slightly behind, but still exceed a level of 90% except de Kretser and Moffat’s scoring model whose effectiveness ratio is just slightly above 50% for MAP. The intracollection gener- alization measures are excellent (98% to 100%) for all scoring models. Scoring models with low spread values include BM25, Büttcher et al.’s, and Zhao and Yun’s scoring model. With the exception of de Kretser and Moffat’s scoring model, all surveyed prox- imity scoring models perform well in relevant sensitivity and generalization measures. We focus later on Büttcher et al.’s scoring model since it combines one of the best intercollection generalization values and a low spread. Chapter 5 Extensions 5.1 Introduction This chapter deals with extensions to the proximity score model proposed by Büttcher et al. [BCL06] described in Section 2.4.2 and provides an extensive experimental study to investigate their impact on retrieval quality. Term proximity has been a common means to improve effectiveness for text retrieval, passage retrieval, and question answering, and several proximity scoring functions have been developed in recent years. Sections 2.4 to 2.7 survey a selection of proximity scoring models developed for text retrieval during the last decade. For XML retrieval, however, proximity scoring has not been similarly successful. To the best of our knowl- edge, there is only one single existing proposal for proximity-aware XML scoring. This proposal has been authored by Beigbeder and was initially described in [Bei07] and ex- tended towards full boolean query support in [Bei10] by the same author. It computes, for each position in an element, a fuzzy score for the query, and then computes the overall score for the element by summing the scores of all positions and normalizing by the element’s length. We provide a more detailed description of this scoring model in Section 5.2.3. The contributions of this chapter are two-fold: 1) In Section 5.2 we propose one of the first XML score models that uses proximity information. This part is based on our work published in [BS08b] and [BST08] which presents a proximity score for content- only queries on XML data. We describe how to adapt the existing scoring model proposed by Büttcher et al. [BCL06] towards XML element retrieval by taking into account the document structure when computing the distance of term occurrences. 
2) In Section 5.3, by means of a case study, we rigorously analyze the potential of explicit phrases for retrieval quality and compare it to the proximity score used in [SBH+07]. This part is based on our work published in [BBS10].

5.2 XML

In this section, we introduce some XML-related background and describe Beigbeder's approach for proximity-enhanced XML retrieval [Bei07, Bei10] as well as his experimental results. Then, we present our own XML score model for content-only queries on XML data that uses proximity information, published in [BS08b] and [BST08]. We show experimental results for two test beds and present a new evaluation metric.

5.2.1 XML Background

In the context of the INEX workshop, documents are Wikipedia articles that have been annotated with XML tags. For our experiments in this chapter, we use the Wikipedia collection used for INEX during the years 2006 to 2008 (cf. Section 3.2.2), which contains tags that can be classified into two categories [DG06a]: a) language-independent general tags that carry structural information derived from the Wikitext format, and b) language-dependent template tags which describe repetitive information. Examples for general tags include article, section, p (which stands for paragraph), title, various forms of links (e.g., collectionlink and unknownlink), and emphasis levels (e.g., emph2 and emph3). Template tags always start with template_ and vary depending on the language of the Wikipedia collection in use. According to the W3C recommendation from 26 November 2008 (http://www.w3.org/TR/xml/), elements are either delimited by start tags and end tags (e.g., <tag> and </tag>), or, for empty elements, by an empty-element tag (e.g., <tag/>
). Each element has a type, identified by name (generic identifier (GI)), and may have a set of attribute specifications. Each XML document can be represented as an element tree. Nodes represent elements and directed edges indicate parent-child relationships be- tween elements in the document under consideration. If the complete collection in- cluding links is considered, the tree structure is converted into the more general graph structure, links being considered as directed edges which may generate loops. Hence, XML retrieval aims at retrieving subtrees/subgraphs from the collection graph as re- sults to an issued query. According to [KGT+08], the two main research questions for the INEX Ad Hoc Track are 1) whether the document annotation helps to identify the relevant portion of a document, and 2) how focused retrieval compares to traditional document-level retrieval. 5.2.2 Notation To discuss proximity scoring models for XML elements, we adapt the notation intro- duced for text retrieval in Section 2.1.2 to the XML element retrieval setting where term positions in an element are defined analogously to term positions in documents. Definition 5.2.1. (element length; position-related notation) Given an element e in an XML document d, the element length of e is defined as le = |e| and corresponds 5.2 XML 81 to the number of term occurrences in e. Given e with length le, we denote the term occurring at position i of e by pi(e), 1 ≤ i ≤ le; if the element is clear from the context, we simply write pi. For a term t, we capture the positions in element e where t occurs by Pe(t) = {i|pi(e) = t} ⊆ {1, . . . , le}; if e is clear from the context, we write P(t). Given a query q = {t1, . . . , tn}, we write Pe(q) := ∪ti∈qPe(ti) for the positions of all query terms in element e, again omitting the suffix e if the element is clear from the context. Given a set of positions P ⊆ {1, . . . , le} and an element e, we write Te(P) to denote the set of terms at the positions of P ⊆ {1, . . . , le} in e. Precisely, Te(P) = {t| i ∈ P ∧pi(e) = t}. Definition 5.2.2. (set of pairs of adjacent query term occurrences; set of pairs of all query term occurrences) We denote pairs of query terms that are adjacent to each other (there might be non-query terms in between) in an element e by Qadj,e(q) := {(i,j) ∈ Pe(q) × Pe(q) | (i < j) ∧ ∀k ∈ {i + 1, . . . ,j − 1} : k �∈ Pe(q)}. Pairs of query terms within a window of dist positions in an element e are defined as Qall,e(q,dist) := {(i,j) ∈ Pe(q) × Pe(q) | (i < j) ∧ (j − i ≤ dist)}. Please note that in this case, the query terms need not to occur consecutively in e. Qall,e(q) is the same but employs a window size of le. 5.2.3 Related Work by Beigbeder This section describes a proposal for proximity-aware XML scoring that has been au- thored by Beigbeder and was initially described in [Bei07] and extended towards full boolean query support in [Bei10] by the same author. He transfers a score model akin to the one proposed by de Kretser and Moffat for text retrieval in [dKM99] (cf. Section 2.5.1) to XML retrieval. The approach answers boolean queries. To this end, it introduces several modes to combine the impacts of the contribution function at position x in document d: • conjunctive mode: cq1∧q2 (x) = min(cq1 (x),cq2 (x)), • disjunctive mode: cq1∨q2 (x) = max(cq1 (x),cq2 (x)), and • complement mode: c¬q1 (x) = 1 − cq1 (x), where qi is a boolean query. 
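The three combination modes can be expressed directly over contribution functions. The sketch below assumes a simple recursive query representation (the type and method names are our own) and leaves the term-level contribution c_t(x), which is defined next, abstract.

```java
// Sketch of Beigbeder's combination modes for contribution functions at a
// position x: conjunction takes the minimum, disjunction the maximum, and the
// complement is 1 minus the contribution. The term-level contribution c_t(x)
// (a triangle-shaped function, described next) is left abstract here; the query
// representation and names are our own.
interface BooleanQueryNode {
    double contribution(int x);
}

// Abstraction over the term-level contribution function c_t(x).
interface TermContribution {
    double at(int x);
}

class TermNode implements BooleanQueryNode {
    private final TermContribution term;               // supplies c_t(x) for one term
    TermNode(TermContribution term) { this.term = term; }
    public double contribution(int x) { return term.at(x); }
}

class AndNode implements BooleanQueryNode {
    private final BooleanQueryNode left, right;
    AndNode(BooleanQueryNode left, BooleanQueryNode right) { this.left = left; this.right = right; }
    public double contribution(int x) {
        return Math.min(left.contribution(x), right.contribution(x));   // conjunctive mode
    }
}

class OrNode implements BooleanQueryNode {
    private final BooleanQueryNode left, right;
    OrNode(BooleanQueryNode left, BooleanQueryNode right) { this.left = left; this.right = right; }
    public double contribution(int x) {
        return Math.max(left.contribution(x), right.contribution(x));   // disjunctive mode
    }
}

class NotNode implements BooleanQueryNode {
    private final BooleanQueryNode child;
    NotNode(BooleanQueryNode child) { this.child = child; }
    public double contribution(int x) {
        return 1.0 - child.contribution(x);                             // complement mode
    }
}
```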
If qi is a term t, the value of the contribution function at position x in d is defined as ct(x) = maxl∈Pd(t)c ′ t(x,l), where c ′ t(x,l) = max(0, 1 − |x−l| s ). The contribution function c′t is triangle-shaped, its height ht is 1, and the spread s is considered a built- in parameter which is kept constant for all terms. For the most frequent elements in the Wikipedia collection used for INEX 2008, Beigbeder distinguishes between manually chosen title-like elements and section-like elements. While title-like elements encompass name, title, template, and caption elements, section-like elements consist of article, section, body, figure, image, page, and div elements. 82 5. Extensions To score full XML documents or passages in XML documents, query terms that occur in title-like elements can extend their influence to the full content of the element and recursively to the elements it contains. The intention of that so-called propagation mode is to reflect the descriptive property of the title element for the section element it entitles. Thus, given any positions l and x, if l is located in a title-like element et and x in a section-like element es that is entitled by et, it follows that ct(x) = 1. For all positions x located in any title-like element, the contribution ct(x) is 1. The score of a document or passage p, respectively that starts at position x1 and ends at position x2 is defined as score(q,p) = ∑ x1≤x≤x2 cq(x) |p| , where the document/passage length |p| = x2 − x1 + 1 is used for score normalization. The approach requires a mapping that keeps information whether a given term posi- tion belongs to a title-like or section-like element. For the propagation mode, additional descendant information for elements is necessary to decide where to propagate scores. Descendants information for XML documents can be kept in pre-/post-order trees, for example. Experimental evaluation: For the experimental evaluation, Beigbeder evaluates the 70 assessed topics from the INEX 2008 Ad Hoc Track (with the INEX Wikipedia collection 2006–2008), drops importance modifiers, and mostly uses keywords in the title field. Some topics are modified before evaluation to fit the boolean model better (e.g., spanish classical guitar players is modified to spanish ( classical | classic ) guitar players). Beigbeder varies the spread s and evaluates three approaches, retrieving only section-like elements: 1) NP-NS (no propagation, no structure) where the structure is ignored (term proximity only used as in text retrieval), 2) NP-S (no propagation, struc- ture) where only section-like elements are retrieved, term proximity influence ends at the boundaries of section-like elements, terms in title-like elements are not propagated, and 3) P-S (propagation, structure) where title-element terms’ propagation is enabled. The best run’s (P-S, s=10) iP-Value of 0.69490141 outperforms the best INEX 2008 run(0.68965708). For s ∈ {5, 50, 500}, Precision-Recall curves are highest for P-S whose iP value benefits from small choices of s. 5.2.4 Proximity Scoring for XML This section presents our proximity-enhanced score model for XML element retrieval based on our work published in [BS08b] and [BST08] which answers content-only queries on XML data. To compute a proximity score for an element e with respect to a query with multiple terms q = {t1, . . . 
, tn}, we first compute a linear representation of e's content that takes e's position in the document into account, and then apply a variant of the proximity score by Büttcher et al. [BCL06] on that linearization. This variant has been first proposed in [SBH+07] and will be described in detail in Section 7.2.2.

Figure 5.1: An XML document and its linearization.

Figure 5.1 shows an example for the linearization process. We start with the sequence of terms in the element's content. Now, as different elements often discuss different topics or different aspects of a topic, we aim at giving a higher weight to terms that occur together in the same element than to terms occurring close together, but in different elements. To reflect this in the linearization, we introduce virtual gaps at the borders of certain elements whose sizes depend on the element's tag (or, more generally, on the tags of the path from the document's root to the element). In the example, gaps of section elements may be larger than those of p (paragraph) elements, because the content of two adjacent p elements within the same section element may be considered related, whereas the content of two adjacent section elements could be less related. Some elements (like those used purely for layout purposes such as bold or for navigational purposes such as link) may get a zero gap size. The best choice for gaps depends on the collection; gap sizes are chosen manually in our experiments.

Based on the linearization, we apply the proximity scoring model of Büttcher et al. [BCL06] (cf. Section 2.4.2) for each element in the collection to find the best matches for a query q = {t1, . . . , tn} with multiple terms. To allow index precomputation without knowing the query load, we reuse the modified variant proposed in [SBH+07] (detailed explanations can be found in Section 7.2.2) that does not only consider pairs of adjacent query term occurrences in documents, but all pairs of query term occurrences (not necessarily adjacent). We further generalize the approach to score elements instead of documents, so the query-independent term weights in the formulas are not inverse document frequencies but inverse element frequencies

ief(t) = log2((N − ef(t) + 0.5) / (ef(t) + 1)),

where N is the number of elements in the collection and ef(t) is the number of elements that contain the term t. Similarly, average and actual lengths are computed for elements. Please note that, unlike [BSTW07], we do not use a tag-specific ief score

iefA(t) = log2((NA − efA(t) + 0.5) / (efA(t) + 1)),

where NA is the tag frequency of tag A and efA(t) is the element frequency of term t as to tag A, i.e., the number of elements (in documents of the corpus) with tag A that contain t in their full-content. We demonstrated in [BS08b] (and also in additional non-submitted results in [BSTW07]) that a global ief value for each term (i.e., ief(t)) achieves better result quality for content-only (CO) queries than tag-specific ief values (i.e., iefA(t)). The BM25 score of an element e for a query q is defined as

scoreBM25(e,q) = Σ_{t∈q} ief(t) · tf(e,t) · (k1 + 1) / (tf(e,t) + K),

where K = k · [(1 − b) + b · le/avgel] with avgel being the average element length in C. b, k1, and k are tuning parameters that are set to b = 0.5 and k = k1 = 1.2, respectively. As for XML element retrieval the element length is important to keep up result quality, we do not ignore it in the proximity score component as in [SBH+07].
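A minimal sketch of the gap-aware linearization follows: positions are assigned to terms while walking the element tree, and an artificial gap whose size depends on the element's tag (e.g., 5 for section and 3 for p elements, or 30 for both, as in the experiments below) is added whenever an element border is crossed. The class and method names are illustrative, and for simplicity each element's direct text is assumed to precede its children; the actual TopX-based implementation differs in detail.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the gap-aware linearization: positions are assigned to terms while
// traversing an element's subtree, and a tag-dependent virtual gap is added at
// element borders so that terms in different elements appear farther apart.
public class Linearizer {

    // Minimal element node: a tag, its direct textual content, and child elements.
    static class Element {
        final String tag;
        final List<String> terms = new ArrayList<String>();
        final List<Element> children = new ArrayList<Element>();
        Element(String tag) { this.tag = tag; }
    }

    private final Map<String, Integer> gapSizes;   // e.g., section -> 5, p -> 3, b -> 0

    Linearizer(Map<String, Integer> gapSizes) { this.gapSizes = gapSizes; }

    // Returns a map from term to its positions in the linearized content of e.
    Map<String, List<Integer>> linearize(Element e) {
        Map<String, List<Integer>> positions = new LinkedHashMap<String, List<Integer>>();
        assignPositions(e, 0, positions);
        return positions;
    }

    private int assignPositions(Element e, int nextPosition, Map<String, List<Integer>> out) {
        for (String term : e.terms) {
            List<Integer> list = out.get(term);
            if (list == null) { list = new ArrayList<Integer>(); out.put(term, list); }
            list.add(nextPosition++);
        }
        for (Element child : e.children) {
            Integer gap = gapSizes.get(child.tag);
            int g = (gap == null) ? 0 : gap;
            nextPosition += g;                               // gap at the child's opening border
            nextPosition = assignPositions(child, nextPosition, out);
            nextPosition += g;                               // gap at the child's closing border
        }
        return nextPosition;
    }
}
```

Given these linearized positions, the acc accumulators and the proximity score described next can be computed per element.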
Hence, the proximity part of an element’s score is computed by plugging the acc values into a BM25-style scoring function: scoreprox(e,q) = ∑ t∈q min{1, ief(t)}acc(e,t) · (k1 + 1) acc(e,t) + K , where acc(e,t) = ∑ (i, j) ∈ Qall,e(q) : pi = t, pj = t ′, t �= t′ ief(t′) (i − j)2 + ∑ (i, j) ∈ Qall,e(q) : pi = t ′, pj = t, t �= t′ ief(t′) (i − j)2 and K as well as the configurable parameters are set like for the BM25 score contribu- tion. The overall score is then the sum of the BM25 score and the proximity score: score(e,q) = scoreBM25(e,q) + scoreprox(e,q). 5.2 XML 85 5.2.5 Experimental Evaluation In order to evaluate our methods, in [BS08b] we used the standard INEX benchmark, namely the INEX Wikipedia collection [DG06a] with the content-only (CO) topics from the INEX Ad Hoc Task 2006. The 111 topics with relevance assessments are shown in Appendix C. Following the methodology of the INEX Focused Task, we computed, for each topic, a list of the 100 best non-overlapping elements with highest scores and evaluated them with the interpolated Precision metric used at INEX 20071 [KPK+07]. Details about the metric are given in Section 3.3. When we check for significant improvements of an approach over the BM25 baseline, we first check for significance using the Wilcoxon signed rank test as it does not make any assumptions about the distribution of differences between pairs of results. If it fails at p<0.10, we try the paired t-test which assumes a normal distribution of differences between pairs of results. In all tables of this section, ‡ and † indicate statistical sig- nificance over the baseline according to the Wilcoxon signed rank test at p<0.05 and p<0.10, respectively. * and / indicate statistical significance over the baseline according to the paired t-test at p<0.05 and p<0.10, respectively. Results for Document-Level Retrieval For our first experiment, we evaluated how good our proximity-aware scoring is at determining documents with relevant content. We limited the elements in the result set to article elements, corresponding to complete Wikipedia articles, and considered different gap sizes, where we report (1) gaps of size 0 for all elements, (2) gaps of size 5 for section and 3 for p elements, and (3) gaps of size 30 for section and p elements. Approaches (1)-(3) all exploit proximity information in the form of scoreprox(e,q), and approaches (2) and (3) increase distances between query term occurrences in different elements by artificial gaps. We call the first approach gap-free, the latter two approaches gap-enhanced models. Additionally, we report results without proximity (i.e., only the BM25 score scoreBM25(e,q) is used to rank elements) as baseline results. Our implementation first computed the 100 best results for the BM25 baseline and then additionally computed the different proximity scores for these results, re-ranking the result list. Table 5.1 shows the results for document-level retrieval with stopword removal. If stemming is enabled, usage of the gap-free model that employs proximity information improves every iP and MAiP value compared to the baseline, and gaps help additionally (except for iP[0.01]). The same holds if stemming is disabled, this time without any 1Due to a bug reported for the original INEX implementation, we used a Java-based reimplementa- tion of the metric. 86 5. 
Extensions stemming, stopword removal no stemming, stopword removal metric baseline (1) (2) (3) baseline (1) (2) (3) iP[0.01] 0.6610 0.6916† 0.6912† 0.6859/ 0.6721 0.7043‡ 0.7045‡ 0.7046‡ iP[0.05] 0.5630 0.5918* 0.5953* 0.5930* 0.5701 0.5904* 0.5954† 0.5953† iP[0.10] 0.5339 0.5496/ 0.5545* 0.5521/ 0.5487 0.5644/ 0.5685* 0.5684* MAiP 0.2682 0.2795‡ 0.2804‡ 0.2798‡ 0.2617 0.2717‡ 0.2725‡ 0.2730‡ Table 5.1: Results for document-level retrieval with stopword removal. exception. In both cases, we get very significant improvements for proximity scores over the baseline. With only a few exceptions, gap-enhanced approaches can further improve the iP result quality over gap-free approaches. For the MAiP metric all approaches achieve significant improvements over the baseline with the Wilcoxon signed rank test at p<0.05. stemming, no stopword removal no stemming, no stopword removal metric baseline (1) (2) (3) baseline (1) (2) (3) iP[0.01] 0.6440 0.6233 0.6186 0.6163 0.6660 0.6068 0.6157 0.6187 iP[0.05] 0.5476 0.4986 0.4964 0.4984 0.5579 0.5081 0.5173 0.5209 iP[0.10] 0.5117 0.4620 0.4604 0.4596 0.5230 0.4775 0.4853 0.4888 MAiP 0.2543 0.2444 0.2432 0.2434 0.2487 0.2360 0.2369 0.2375 Table 5.2: Results for document-level retrieval without stopword removal. Table 5.2 depicts the impact of missing stopword removal on the results for document-level retrieval. The results clearly demonstrate that stopword removal is crucial if we do not want to risk decreasing result quality with proximity scores com- pared to the baseline. If the query contains stopwords, the loss of result quality can be attributed to stopword occurrences near other query terms in some documents; as all pairs of query terms are considered if a document is to be scored, they generate an increased proximity contribution for the corresponding document. We think that these pairs are less meaningful (i.e., carry less semantics) than pairs of non-stopwords. Gap-enhanced models cannot resolve the issue of losing result quality against the baseline. They just reduce the losses if stemming is disabled but do not get even close to the baseline’s result quality. Consequently, all significance tests to show improvements over the baseline fail. In summary, stopword removal is mandatory to get high retrieval quality for document-level retrieval and gap-enhanced approaches can often help additionally to improve the retrieval quality. In most cases, runs that are based on disabled stemming have slight advantages for the absolute iP values over those runs that use stemming. We get the best MAiP values when stemming is enabled and stopwords are removed. Results for Element-Level Retrieval We now evaluate the performance of proximity-aware scoring for element-level retrieval, where we limit the set of elements in the result list to those with article, body, 5.2 XML 87 section, p, normallist, and item tags for efficiency reasons; initial experiments with all tags yielded similar results. As we had to remove overlap, we first computed the best 200 elements for the BM25 baseline, for which we then computed the proximity scores, resorted the list according to the new scores, and removed the overlap between elements. Whenever two elements overlapped, we kept the element with the highest score. 
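The overlap removal step just described can be sketched as follows: candidate elements are visited in descending score order, and an element is kept only if no already kept element is its ancestor or descendant. Elements are identified by a path string here, which is an illustrative assumption; the actual implementation may rely on pre-/post-order information instead (cf. Section 5.2.3).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch of overlap removal for a focused result list: elements are processed in
// descending score order, and an element is dropped if a kept element overlaps it,
// i.e., is an ancestor or descendant. Elements are identified by a path string such
// as "/article[1]/section[2]/p[1]" (an illustrative choice).
public class OverlapRemoval {

    static class ScoredElement {
        final String path;
        final double score;
        ScoredElement(String path, double score) { this.path = path; this.score = score; }
    }

    static boolean overlaps(String pathA, String pathB) {
        return pathA.equals(pathB)
                || pathA.startsWith(pathB + "/")
                || pathB.startsWith(pathA + "/");
    }

    static List<ScoredElement> removeOverlap(List<ScoredElement> candidates, int k) {
        List<ScoredElement> sorted = new ArrayList<ScoredElement>(candidates);
        Collections.sort(sorted, new Comparator<ScoredElement>() {
            public int compare(ScoredElement a, ScoredElement b) {
                return Double.compare(b.score, a.score);   // descending by score
            }
        });
        List<ScoredElement> kept = new ArrayList<ScoredElement>();
        for (ScoredElement candidate : sorted) {
            boolean overlapping = false;
            for (ScoredElement keptElement : kept) {
                if (overlaps(candidate.path, keptElement.path)) { overlapping = true; break; }
            }
            if (!overlapping) {
                kept.add(candidate);
                if (kept.size() == k) break;
            }
        }
        return kept;
    }
}
```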
stemming, stopword removal no stemming, stopword removal metric baseline (1) (2) (3) baseline (1) (2) (3) iP[0.01] 0.6589 0.6847‡ 0.6746† 0.6753† 0.6624 0.6659 0.6681 0.6677 iP[0.05] 0.5344 0.5591‡ 0.5534‡ 0.5544‡ 0.5143 0.5274† 0.5280† 0.5263 iP[0.10] 0.4482 0.4680‡ 0.4643‡ 0.4643‡ 0.4270 0.4441† 0.4413 0.4370 MAiP 0.1793 0.1870‡ 0.1855‡ 0.1854‡ 0.1617 0.1670 0.1669† 0.1666† Table 5.3: Results for element-level retrieval with stopword removal. Table 5.3 illustrates the results for element-level retrieval with stopword removal. The best results and most significant improvements in element-level retrieval can be achieved if stemming is enabled. While the gap-free approach shows significant im- provements over the baseline (Wilcoxon signed rank test at p<0.05, for every metric), the gap-enhanced approaches slightly lose absolute result quality compared to the gap- free approach. However, this does not overly harm the significance of improvements of gap-enhanced approaches over the baseline. For all metrics, except for iP[0.01], we achieve significant improvements with the Wilcoxon signed rank test at p<0.05, for iP[0.01] the improvements are still significant according to the Wilcoxon signed rank test, but only at p<0.10. If stemming is disabled, the usage of proximity improves every iP value compared to the baseline, but gaps help slightly only for early iP values. Sig- nificant improvements over the baseline using the Wilcoxon signed rank test at p<0.10 can be realized only for later iP values. In general, compared to stemming with stopword removal, no stemming with stop- word removal achieves less significant improvements for approaches (1)-(3) over the baseline (if at all) as well as a lower absolute result quality. stemming, no stopword removal no stemming, no stopword removal metric baseline (1) (2) (3) baseline (1) (2) (3) iP[0.01] 0.6409 0.6539* 0.6524/ 0.6484 0.6079 0.6050 0.6045 0.6047 iP[0.05] 0.5077 0.5165/ 0.5163/ 0.5072 0.4736 0.4828 0.4764 0.4758 iP[0.10] 0.4042 0.4139 0.4136 0.4075 0.3627 0.3766‡ 0.3700† 0.3702† MAiP 0.1452 0.1535‡ 0.1533‡ 0.1504† 0.1267 0.1331‡ 0.1308† 0.1307† Table 5.4: Results for element-level retrieval without stopword removal. Table 5.4 shows the results for element-level retrieval without stopword removal. Gap-free models improve every iP and MAiP value over the baseline, except for iP[0.01] if stemming is disabled. Gap-enhanced models slightly lose on absolute result quality compared to gap-free models but still frequently beat the baseline. 88 5. Extensions When we combine stemming with stopword removal, we achieve the best and most significant results for element-level retrieval. Gaps, however, help only for early iP values if stemming is disabled and stopwords are removed. Stopword removal is more important for document-level retrieval than for element-level retrieval if we want to obtain a good result quality, but our approaches benefit from stopword removal at both retrieval granularities. The structure-aware proximity score for XML retrieval that we have presented helps to improve the retrieval effectiveness of gap-free approaches for document-level retrieval, but does not show a similar effect for element-level retrieval. An automated selection of gap sizes by means of relevance feedback techniques could improve the result quality. 5.2.6 Additional Experiments for INEX 2008 This subsection describes additional experiments we have carried out for INEX 2008. 
It extends the experiments from Section 5.2.5 which used the 111 CO topics from the INEX Ad Hoc Task 2006 by another test bed (including the same document collection) used in the INEX 2008 Ad Hoc Track. An overview of the INEX 2008 Ad Hoc Track has been authored by Kamps et al. and has been published in [KGT+08]. 70 topics have been assessed for the INEX 2008 Ad Hoc Track which are depicted in Appendix C, Table C.4. The choice of runs we submitted to the Focused Task at INEX 2008 [BST08] was based on earlier results from SIGIR 2008 [BS08b]: as iP[0.01] is the metric that ranks the runs in INEX, we have chosen the setting that provides the highest retrieval quality at iP[0.01] from our previous experiments (detailed in Section 5.2.5), i.e., no stemming, but stopword removal in document-level retrieval. The Focused Task aims at returning a ranked list of elements or passages in a focused way, i.e., returned elements must not overlap. According to [KGT+08], participants were allowed to submit up to three element result-type runs per task and three passage result-type runs each, for the Focused, Relevant in Context, and Best in Context Task in the Ad Hoc Track. As we have only evaluated element result-type runs for the Focused Task, we have only been allowed to submit three runs as described in the following:

• TopX-CO-Baseline-articleOnly: this run considers the non-stemmed terms in the title of a topic (including the terms in phrases, but not their sequence) except terms in negations and stopwords. We restricted the collection to the top-level article elements and computed the 1,500 articles with the highest scoreBM25 value as described in Section 5.2.4. Note that this approach corresponds to standard document-level retrieval. This run is comparable to the baseline approach for document-level retrieval with stopword removal and disabled stemming used in Section 5.2.5.

• TopX-CO-Proximity-articleOnly: this run re-ranks the results of the baseline run coined TopX-CO-Baseline-articleOnly by adding the proximity score contribution scoreprox as described in Section 5.2.4. We use gaps of size 30 for section and p elements. This run is comparable to the gap-enhanced approach (3) used in Section 5.2.5. Due to the limited number of submittable runs to INEX 2008, we could not evaluate different gap sizes.

• TopX-CO-Focused-all: this element-level run considers the terms in the title of a topic without phrases and negations, allowing all tags for results. Note that, unlike our contributions from earlier years (e.g., [BSTW07]), we do not use a tag-specific ief score, but a single global ief value per term. We demonstrated in [BS08b] that this achieves better result quality for CO queries than tag-specific ief values (cf. Section 5.2.4).

run/metric                      iP[0.00]  iP[0.01]  iP[0.05]  iP[0.10]  MAiP
TopX-CO-Baseline-articleOnly    0.6700    0.6689    0.5940    0.5354    0.2951
TopX-CO-Proximity-articleOnly   0.6804    0.6795    0.5807    0.5265    0.2967
TopX-CO-Focused-all             0.7464    0.6441    0.5300    0.4675    0.1852

Table 5.5: Results: Focused Task INEX 2008, stopword removal, no stemming.

Table 5.5 shows the results for these runs.
It is evident that element-level retrieval generally yields a higher early precision than document-level retrieval, but the quality quickly falls behind that of document-level retrieval which means that results become significantly worse than article-only runs starting at a recall level of 0.01. This is reflected in the results: while the element-level run TopX-CO-Focused-all ranks at position 11 among 61 runs, the document-level runs rank at position 4 (TopX-CO-Baseline-articleOnly) and position 3 (TopX-CO-Proximity-articleOnly) among 61 runs, the last one being our best submitted run. Proximity scoring with gaps can in general help to improve early precision with document-level retrieval. MAiP val- ues are almost equal for the document-level baseline TopX-CO-Baseline-articleOnly and the gap-enhanced model TopX-CO-Proximity-articleOnly. 90 5. Extensions Our experiments in SIGIR 2008 [BS08b] showed significant improvements of the gap-enhanced approach (3) over the baseline. Unfortunately, at INEX 2008 [BST08] comparable runs did not demonstrate equally significant improvements (significance levels are p=18.77% and 35.85% for paired t-test and Wilcoxon signed rank test, re- spectively). As the iP metric returns the maximally achievable precision after the returned re- sults have reached a recall level of at least x, this metric hides the points in the result sets where the retrieval quality originates from. Therefore, for analytical reasons, we have a look at the results using an alternative metric that measures the precision af- ter x characters, abbreviated as P[x characters]. Figure 5.2 provides an example to illustrate how that alternative metric works. Assume that we want to calculate the precision value after 1,000 characters, P[1,000 characters]. We think of the result set as a characterwise concatenation of results for a given run; the evaluation measures the precision after reading the first 1,000 characters which corresponds to the retrieved number of characters carrying relevant content divided by the number of retrieved characters. Figure 5.2 characterwise aligns rectangles that represent the retrieved doc- uments of a fictitious run, relevant characters are represented as yellow boxes. The first retrieved document of that fictitious run has id 12 and consists of 300 characters where the first 200 characters are considered relevant. The second retrieved document with id 57 consists of 600 characters of which 300 characters are relevant. As just the first 100 characters of the third document fit into the 1,000 characters limit, this document cannot generate a positive contribution to the precision (the relevant portion of this document starts after 300 characters only). Hence, the value for P[1,000 characters] is calculated as 200+300 1,000 = 0.5. Figure 5.3 depicts the precision values after x characters for each of the three runs. It turns out that if we are interested in retrieving just a small amount of characters, it is worth considering to use the element-level run (leading up to 1,200 characters). Then the result quality of the element-level run deteriorates quickly and the proximity run outperforms the two other runs. Only very late, after 5,700 read characters, the baseline yields a slightly higher precision than the proximity run. Hence, to improve the retrieval quality, a hybrid approach could return the first 1,200 characters from the element-level run and fill the remaining characters with results from the proximity run. 
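The computation behind P[x characters] is easy to state explicitly. The following Python sketch is a simplified illustration only, not the official INEX evaluation software; the run representation as (document length, relevant character ranges) pairs and the assumption that the relevant characters of document id=57 lie at its beginning are ours.

```python
def precision_at_characters(run, x):
    """Precision after reading the first x characters of a concatenated run.

    run: ranked list of (doc_length, relevant_ranges) per retrieved document, where
    relevant_ranges is a list of (start, end) character offsets (end exclusive) of
    the relevant passages inside that document."""
    read, relevant = 0, 0
    for doc_length, relevant_ranges in run:
        if read >= x:
            break
        budget = min(doc_length, x - read)       # characters of this document that still fit
        for start, end in relevant_ranges:
            # count only the part of the relevant passage that falls within the budget
            relevant += max(0, min(end, budget) - start)
        read += budget
    return relevant / x

# Figure 5.2 example: doc id=12 (300 chars, first 200 relevant), doc id=57
# (600 chars, 300 of them relevant, assumed at its beginning), doc id=21
# (relevant part only after the first 300 characters).
run = [(300, [(0, 200)]), (600, [(0, 300)]), (400, [(300, 400)])]
print(precision_at_characters(run, 1000))        # -> 0.5
```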
Figure 5.3: Comparison of the three runs: P[# characters] values (precision after x read characters, for x from 100 to 9,700, for the runs TopX-CO-Baseline-articleOnly, TopX-CO-Proximity-articleOnly, and TopX-CO-Focused-all).

5.3 Phrases

By means of a case study, we rigorously analyze the potential of explicit phrases for retrieval quality and compare it to the proximity score used in [SBH+07]. This part is based on our work published in [BBS10].

5.3.1 Evaluating the Potential of Phrases

Phrases, i.e., query terms that should occur consecutively in a result document, are a widely used means to improve result quality in text retrieval [CCT97, CTL91, Fag87, LLYM04, MdR05], and a number of methods have been proposed to automatically identify useful phrases, for example [LLYM04, Z+07]. However, there are studies indicating that phrases are not universally useful for improving results, but that the right choice of phrases is important. For example, Metzler et al. [MSC06] reported that phrase detection did not work for their experiments in the TREC Terabyte Track, and Mitra et al. [MBSC97] reported similar findings for experiments on news corpora.

The remainder of this chapter experimentally analyzes the potential of phrase queries for improving result quality through a case study on the TREC Terabyte benchmark. We study the performance improvement through user-identified and dictionary-based phrases over a term-only baseline and determine the best improvement that any phrase-based method can achieve, possibly including term permutations.

Experimental Setup

We did a large-scale study on the effectiveness of phrases for text retrieval with the TREC GOV2 collection, and the 150 topics from the TREC Terabyte Tracks 2004–2006 (topics 701–850) where we used the title only. More details about the collection and TREC can be found in Section 3.2.1. All documents were parsed with stopword removal and stemming enabled. We compared different retrieval methods:

• A standard BM25F scoring model [RZT04] as established baseline for content-based retrieval, with both conjunctive (i.e., all terms must occur in a document) and disjunctive (i.e., not all terms must occur in a document) query evaluation. The boosting weights are chosen as depicted in Table 5.6 and are the same as the ones used in the GOV2 parser of the TopX search engine [TSW05].

• Phrases as additional post-filter on the results of the conjunctive BM25F, i.e., results that did not contain at least one instance of the stemmed phrase were removed. As the TREC topics do not contain explicit phrases, we considered the following ways to find phrases in the queries:

– We performed a small user study where five users were independently asked to highlight any phrases in the titles of the TREC queries.

– As an example of a dictionary-based method for phrase detection, we matched the titles with the titles of Wikipedia articles (after stemming both), following an approach similar to the Wikipedia-based phrase recognition in [Z+07].

– To evaluate the full potential of phrases, we exhaustively evaluated the retrieval quality, i.e., precision for 10 results, of all possible phrases for each topic and chose the best-performing phrase(s) for each topic.
– To evaluate the influence of term order, we additionally considered all pos- sible phrases for all permutations of terms and chose the best-performing phrases, potentially after permutation of terms, for each topic. • A state-of-the-art proximity score by Büttcher [BCL06] (described in Section 2.4.2) as an extension of BM25F, including the modifications from [SBH+07]. This score outperformed other proximity-aware methods on TREC Terabyte; a thor- ough comparative experimental evaluation of various proximity-enhanced scoring models can be found in Section 4.2. We additionally report the best reported results from the corresponding TREC Terabyte tracks, limited to title-only runs. When we checked for significant improvements over the baseline BM25F (conjunctive), we used both the Wilcoxon signed rank (WSR) test and the paired t-test. Results Our small user study showed that users frequently disagree on phrases in a query: on average, two users highlighted the same phrase only in 47% of the queries, with individual agreements between 38% and 64%. For each topic with more than one term, at least one user identified a phrase; for 43 topics, each user identified a phrase (but possibly different phrases). The same user rarely highlighted more than one phrase in a topic. Overall, our users identified 227 different phrases in the 150 topics. Our experimental evaluation of query effectiveness focuses on early precision. We aim at validating if the earlier result by [MBSC97] (on news documents) that phrases do not significantly improve early precision is still valid when considering the Web. 5.3 Phrases 93 tags weight TITLE 4 H1, H2 3 H3-H6, STRONG, B, CAPTION, TH 2 A, META, EM, I, U, DL, OL, UL 1.5 Table 5.6: Boosting weights BM25F. BM25F user 1 user 2 user 3 user 4 user 5 topics (conjunctive) 701-750 (TREC 2004) 0.536 0.512 0.534 0.504 0.546 0.536 751-800 (TREC 2005) 0.634 0.576 0.484 0.548 0.592 0.602 801-850 (TREC 2006) 0.528 0.518 0.500 0.514 0.546 0.526 average 0.566 0.535 0.506 0.522 0.561 0.554 Table 5.7: P@10 for user-identified phrases. Table 5.7 shows precision values for the top-10 results when using the phrases identified by the different users (as strict post-filter on the conjunctive BM25F run). Surprisingly, it seems to be very difficult for users to actually identify useful phrases, there hardly is any improvement. In that sense, the findings from [MBSC97] seem to be still valid today. In the light of these results, our second experiment aims at exploring if phrase queries have any potential at all for improving query effectiveness, i.e., how much can result quality be improved when the ‘optimal’ phrases are identified. Tables 5.8 and 5.9 show the precision at 10 results for our experiment with the different settings introduced in the previous section, separately for each TREC year. BM25F BM25F best user Wikipedia topics (conjunctive) (disjunctive) phrases phrases 701-750 (TREC 2004) 0.536 0.548 0.546 0.566 751-800 (TREC 2005) 0.634 0.630 0.592 0.564 801-850 (TREC 2006) 0.528 0.538 0.546 0.526 average 0.566 0.572 0.561 0.552 Table 5.8: P@10 for different configurations and query loads, first part. It is evident from the tables that an optimal choice of phrases can significantly improve over the result quality of the BM25F baseline, with peak improvements between 12% and 14% when term order remains unchanged, and even 17% to 21% when term permutations are considered2. 
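The 'optimal phrase' numbers above were obtained by brute force: for every topic, every candidate phrase is used as a post-filter and the one with the highest P@10 is kept. The sketch below is our own schematic rendering of that search; the evaluation callback stands in for computing P@10 of the post-filtered run, and the toy scores in the example are invented placeholders (the real values for topic 777 are discussed below).

```python
from itertools import permutations

def candidate_phrases(terms, permute=False):
    """All contiguous sub-sequences of length >= 2, optionally over all term permutations
    (only feasible for the short title queries considered here)."""
    orders = set(permutations(terms)) if permute else {tuple(terms)}
    cands = set()
    for order in orders:
        for i in range(len(order)):
            for j in range(i + 2, len(order) + 1):
                cands.add(order[i:j])
    return cands

def best_phrase(terms, evaluate, permute=False):
    """Pick the candidate phrase with the highest evaluation score, e.g. the P@10 of the
    run post-filtered with that phrase; returns (phrase, score)."""
    return max(((p, evaluate(p)) for p in candidate_phrases(terms, permute)),
               key=lambda x: x[1])

# toy evaluation function standing in for "P@10 of the post-filtered run"
scores = {("hybrid", "fuel"): 0.8, ("fuel", "cars"): 0.3, ("hybrid", "fuel", "cars"): 0.5}
evaluate = lambda p: scores.get(p, 0.2)
print(best_phrase(["hybrid", "fuel", "cars"], evaluate, permute=True))
# -> (('hybrid', 'fuel'), 0.8)
```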
Topics where phrases were most useful include “pol pot” (843), “pet therapy” (793) and “bagpipe band” (794) (which were usually identified by users as well). On the other hand, frequently annotated phrases such as “doomsday cults” (745) and “domestic adoption laws” (754) cause a drastic drop in performance. Interesting examples for improvements when permuting terms are ”hybrid alternative 2both significant according to a paired t-test and Wilcoxon signed rank test, p ≤0.01 94 5. Extensions BM25F proximity best best phrases best title-only topics (conj.) score phrases +permutations TREC run 701-750 (TREC 2004) 0.536 0.574 0.616 0.668 0.588 751-800 (TREC 2005) 0.634 0.660 0.704 0.740 0.658 801-850 (TREC 2006) 0.528 0.578 0.606 0.654 0.654 average 0.566 0.604 0.642 0.687 0.633 Table 5.9: P@10 for different configurations and query loads, second part. fuel cars“ (777) where the best phrase is actually “hybrid fuel” (with a P@10 value of 0.8, compared to 0.5 for the best in-order phrase and 0.2 for term-only evaluation in the form of BM25F (conjunctive)), and “reintroduction of gray wolves” (797) with P@10 of 1.0 with the phrase “wolves reintroduction”, compared to 0.6 otherwise (amongst others: reintroduction of “gray wolves”). The best possible results are way above the best reported results for 2004 and 2005 and get close to the best result from 2006 (which was achieved, among other things, by the use of blind feedback)3. Wikipedia-based phrase recognition, a simple automated approach to phrase recognition, only leads to significant improvements for 2004 (paired t-test and Wilcoxon signed rank test, p≤0.05). For the remaining years we cannot observe significant improvements. Interestingly, the proximity-aware score yields significant improvements over the baseline4; as it automatically considers “soft phrases”, there is no need to explicitly identify phrases here. Discussion and Lessons Learned The experimental analysis for phrase queries in this section yields the following results: • We validated the common intuition that phrase queries can boost performance of existing retrieval models. However, choosing good phrases for this purpose is nontrivial and often too difficult for users, as the result of our user study shows. • Existing methods for automatically identifying phrases can help to improve query performance, but they have their limits (like the methods based on Wikipedia titles evaluated here). While we expect that more complex methods (such as the advanced algorithm introduced in [Z+07]) will get close to the upper bound, they need to include term permutations to exploit the full potential of phrases. The common intuition that term order in queries bears semantics does not seem to match reality in all cases. • Proximity-aware scoring models where the user does not have to explicitly identify phrases can significantly improve performance over a non-proximity-aware scoring model. 3no significance tests possible as we do not have per-topic results for these runs 4paired t-test and Wilcoxon signed rank test, p≤0.1 for TREC 2005 and p≤0.01 for the other two Chapter 6 Top-k Vs. Non-Top-k Algorithms This chapter starts with Section 6.1 that presents various top-k algorithms from the database systems community; they are classified according to the access methods to index lists required by the algorithms. Section 6.2 describes exact top-k algorithms (with and without term proximity component) and approximate top-k algorithms from the information retrieval community. 
The chapter concludes with Section 6.3 that explains some non-top-k algorithms. 6.1 Top-k Algorithms from DB Top-k algorithms aim at efficiently assembling a ranked list of the k objects that match best the user need expressed by means of a top-k query. In the scenarios used through- out this thesis, objects may represent either elements of XML documents or full doc- uments. To process top-k queries efficiently, a number of query processing techniques has been proposed over the last two decades (e.g., [Fag99, FLN03, CwH02, MBG04, GBK00, BGM02]). To score an object, a score aggregation function F aggregates all known scores from different dimensions for this object. Assume that we want to aggregate scores for two objects oi and oi′ . An aggregation function F is called monotone if F(si1, . . . ,sim) ≤ F(si′1, . . . ,si′m) when sij ≤ si′j for every dimension j, where sij and si′j are the scores from dimension j for object oi and oi′ , respectively. The following descriptions assume that top-k algorithms have access to a set of m inverted lists L = {L1, . . . ,Lm} that represent one out of m score dimensions each. These algorithms assume that the scores for objects in each dimension j have been precomputed and stored in an inverted list Lj which is sorted by descending score, i.e., lists start with objects having high scores and end with objects having lower scores. While a sorted access (also called sequential access) denotes an access to an object and its score during a sequential scan of a list, a random access denotes a direct access to an object and its score by the object identifier. Some algorithms use random lookups for promising candidates in dimensions where they have not yet been encountered; as such a random access (RA) is a lot more expensive than a sorted access (SA) (in the 95 96 6. Top-k Vs. Non-Top-k Algorithms order of 50 to 50,000 according to [BMS+06]), an intelligent schedule for these RAs has a great impact on efficiency. The cost of one SA and one RA is denoted cS and cR, respectively. In [Fag02] Fagin defines the middleware cost as cS · #SA + cR · #RA which corresponds to the query execution cost. Algorithms from the family of Threshold Algorithms are similar to dynamic pruning approaches from the IR community. They start with a phase of sequential scans to each list involved in the query execution in an interleaved, round-robin manner. Pro- cessing lists in a round-robin manner characterizes the sequence in which lists are read: (L1,L2, . . . ,Lm,L1, . . .), i.e., circular reads to each list, one after the other. As docu- ments are discovered in this process, they are maintained as candidates in an in-memory pool, where each candidate has a current score also called worstscore (aggregated from the scores in dimensions where the document has been encountered so far). Addition- ally, each candidate object oi has an upper score bound that is computed by setting all unknown scores to the highest possible score highj corresponding to the score at the current scan position (i.e., the last sequentially accessed tuple) of each list Lj : bestscore(oi) = F(pi1,pi2, . . . ,pim), where pij = sij if oi has been seen in Lj and pij = highj otherwise. pij is called predicate of object oi in dimension j, S(oi) denotes those lists in L where oi has been seen, S̄(oi) those lists in L where oi has not been encountered yet. A common choice for a monotonous aggregation function F is simple summation. 
Then, bestscore is defined as follows:

$$\mathit{bestscore}(o_i) = \sum_{j=1}^{m} \begin{cases} s_{ij} & \text{if } L_j \in S(o_i)\\ \mathit{high}_j & \text{if } L_j \in \bar{S}(o_i) \end{cases} \qquad (6.1)$$

To evaluate a top-k query, the algorithms typically maintain two priority queues: a priority queue (ordered by decreasing worstscore) that maintains a list of the k candidates with the highest worstscore values, called the (intermediate) top-k results R, and another priority queue (ordered by increasing bestscore) that maintains the list of remaining candidates C that have the potential to qualify for the final R. The lowest worstscore of any object in R is named min-k. Candidates whose bestscore is not greater than min-k (i.e., the head of R) can be safely removed from C. The execution stops if all candidates in C have been eliminated and no unseen document can qualify for the final results; this is typically the case long before the lists have been completely read.

An excellent survey about top-k query processing techniques in the database systems area has been authored by Ilyas et al. [IBS08]. One way to classify these techniques is by the access methods to index lists required by the algorithms:

1. Sorted and random accesses to every list.
2. No random accesses, only sorted accesses.
3. Sorted accesses with carefully scheduled random accesses.

Marian et al. [MBG04] categorize sources (index lists) by their supported access methods: while S-sources provide sequential accesses only, R-sources provide random accesses only, and SR-sources provide both sequential and random accesses.

6.1.1 Sorted and Random Accesses

The algorithms described in this section use sorted as well as random accesses to lists. This means that all lists have to be SR-sources.

Fagin's algorithm (FA) [Fag99] proceeds in two rounds. In a first round, it performs sorted accesses to all lists in a round-robin manner until at least k objects have been fully evaluated. In a second round, for the remaining objects that have been encountered in at least one, but not all dimensions by sorted accesses, it performs random accesses to the lists representing the missing dimensions. Finally, the aggregation function F is applied to all seen objects and the objects are sorted such that the k objects with the highest scores can be returned.

The Threshold algorithm (TA) [FLN03] performs sorted accesses to all lists in a round-robin manner. When a new object o is seen by a sorted access to list Lx, TA performs random accesses to the remaining lists; therefore, it can compute the final score for o immediately. o is kept in the intermediate top-k results R iff it belongs to the k highest scores seen so far. C is always empty! After each round of sequential accesses, the highi values of the lists change, and the threshold value τ has to be updated: τ is calculated by combining the scores of the items read by the most recent sorted access to each list (i.e., highi for each list Li) in a monotone aggregation function F. If τ underscores the min-k score (the lowest worstscore of the intermediate top-k results), the algorithm can safely terminate as no object that has not been seen yet will be able to overscore the min-k score and thus make it to the top-k results R. TA implicitly assumes that one sorted access costs the same as one random access; this assumption is not valid in the cost model and does not affect correctness, but it does affect TA's runtime behavior: executions can become very expensive as the number of random accesses is not restricted and every sorted access can induce up to m−1 random accesses.
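To make TA's bookkeeping concrete, here is a schematic Python sketch (not the implementation used in this thesis): summation serves as aggregation function, random accesses are modeled by dictionary lookups, and each round of sorted accesses is followed by the threshold test against the min-k score.

```python
import heapq

def threshold_algorithm(lists, random_access, k):
    """lists: m score-sorted lists of (docid, score); random_access[j][docid] gives the
    score of docid in dimension j (0.0 if absent). Returns the top-k (score, docid)."""
    m = len(lists)
    pos = [0] * m
    high = [lists[j][0][1] if lists[j] else 0.0 for j in range(m)]
    topk, scored = [], set()                     # min-heap of (score, docid)
    while any(pos[j] < len(lists[j]) for j in range(m)):
        for j in range(m):                       # one round of round-robin sorted accesses
            if pos[j] >= len(lists[j]):
                continue
            d, s = lists[j][pos[j]]
            pos[j] += 1
            high[j] = s
            if d not in scored:                  # complete d's score by random accesses
                scored.add(d)
                score = sum(random_access[i].get(d, 0.0) for i in range(m))
                if len(topk) < k:
                    heapq.heappush(topk, (score, d))
                elif score > topk[0][0]:
                    heapq.heapreplace(topk, (score, d))
        tau = sum(high)                          # best score any unseen document can reach
        if len(topk) == k and tau <= topk[0][0]: # tau does not exceed min-k: safe to stop
            break
    return sorted(topk, reverse=True)

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.3)]
L2 = [("b", 0.7), ("a", 0.2), ("d", 0.1)]
ra = [dict(L1), dict(L2)]
print(threshold_algorithm([L1, L2], ra, k=1))    # -> [(1.5, 'b')]
```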
The Quick-Combine algorithm [GBK00] is a variant of TA. It uses an indicator Δi = ∂F∂pi · (Si(di − c) − Si(di)) which estimates the utility to read from list Li. The indicator considers 1) the influence of the predicate pi used in list Li on the overall score F and 2) the decay of the score in Li over the last c steps which decreased the upper bound for not yet seen objects (i.e., Si(di − c) − Si(di)). The algorithm chooses the list with maximal Δi and works particularly well for skewed data. Like FA and TA, the Combined algorithm (CA) [FLN03] performs sorted accesses to all lists in a round-robin manner. It makes use of the cost ratio between random and sorted accesses, γ = �cR/cS�: every time the depth of sorted accesses increases by γ, it picks the object o with missing information whose bestscore is largest and performs 98 6. Top-k Vs. Non-Top-k Algorithms random accesses to all lists where o has not been encountered yet, short all lists in S̄(o). The algorithm can safely terminate if it has seen k objects and no object outside the top-k results R has a bestscore that overscores the min-k score and thus may make it to R. This includes the bestscore of the virtual document defined as F(high1, . . . ,highm). The algorithm assumes that random accesses are more expensive than sorted accesses. Each of the picks for random accesses induces up to m − 1 random accesses as for TA; however random accesses are only triggered every γ sequential accesses. One can view CA as a merge between TA and NRA (cf. Section 6.1.2). If γ is very large (e.g., larger than the number of objects in all lists), CA corresponds to NRA. If γ = 1, CA is similar to TA: while CA performs RAs to all lists in S̄(o) for some object o, TA performs RAs to all lists in S̄(o) for every object o seen during round-robin sorted accesses. 6.1.2 No Random Accesses Algorithms in this category only support sorted accesses and do not make use of random accesses. This means that all lists have to support sorted accesses (i.e, they are S-sources or SR-sources). The No Random Access algorithm (NRA) [FLN03] performs sorted accesses to all lists in a round-robin manner. For each seen object, it keeps track of its bestscore and worstscore and the most recently seen scores highi per list Li. The algorithm can safely stop when at least k objects have been seen and for all objects o that are not in the top-k results R (including the virtual document) holds bestscore(o) ≤ min-k. The Stream-Combine algorithm [GBK01] is similar to the NRA algorithm but favors sorted accesses to those lists that are more likely to lead to early termination than others. To estimate the utility of reading next from list Li, it uses an indicator similar to the one used in the Quick-Combine algorithm Δi = #Mi · ∂F∂pi · (Si(di − c) − Si(di)). This indicator also considers the cardinality of Mi, the subset of the intermediate top-k results R whose bestscore would be decreased or whose precise score would be known after reading from list Li. 6.1.3 Carefully Scheduled Random Accesses Algorithms in this category require that at least one list is sequentially accessible (i.e., the list is an S-source or an SR-source) to get an initial set of candidate objects that may make it into the final top-k results. The Upper and Pick algorithms [BGM02, MBG04] have been proposed in the context of Web-accessible sources categorized by their supported access methods. 
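The worstscore/bestscore bookkeeping of NRA can likewise be sketched compactly. The code below is a minimal, non-optimized illustration of ours (plain dictionaries instead of the priority queues an efficient implementation would use, summation as aggregation function); it performs round-robin sorted accesses and stops as soon as neither a seen candidate outside the current top-k nor the virtual unseen document can overtake the min-k score.

```python
import heapq

def nra(lists, k):
    """NRA-style sketch. lists: m inverted lists of (docid, score), each sorted by
    descending score. Returns the top-k (docid, worstscore) pairs."""
    m = len(lists)
    pos = [0] * m                                         # current scan position per list
    high = [lst[0][1] if lst else 0.0 for lst in lists]   # score at current scan position
    seen = {}                                             # docid -> per-dimension scores (None = unseen)
    while True:
        for j in range(m):                                # one round of round-robin sorted accesses
            if pos[j] < len(lists[j]):
                d, s = lists[j][pos[j]]
                pos[j] += 1
                high[j] = s
                seen.setdefault(d, [None] * m)[j] = s
        # worstscore: sum of known scores; bestscore: unknown dimensions filled with high_j
        worst = {d: sum(s for s in v if s is not None) for d, v in seen.items()}
        best = {d: sum(s if s is not None else high[j] for j, s in enumerate(v))
                for d, v in seen.items()}
        topk = heapq.nlargest(k, worst.items(), key=lambda item: item[1])
        min_k = topk[-1][1] if len(topk) == k else 0.0
        in_topk = {d for d, _ in topk}
        exhausted = all(pos[j] >= len(lists[j]) for j in range(m))
        no_candidate = all(best[d] <= min_k for d in seen if d not in in_topk)
        # stop when no seen candidate outside R and no unseen (virtual) document can
        # still exceed min-k, or when all lists are exhausted
        if (len(topk) == k and no_candidate and sum(high) <= min_k) or exhausted:
            return topk

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.3)]
L2 = [("b", 0.7), ("a", 0.2), ("d", 0.1)]
print(nra([L1, L2], k=1))                                 # -> [('b', 1.5)]
```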
The Upper algorithm [BGM02] fills the bestscore-ordered candidate queue C us- ing round-robin sorted accesses to sorted sources (S-sources and SR-sources). In each round, it checks whether C has run empty or the object otop with the highest bestscore in C underscores the threshold τ defined as F(high1, . . . ,highm) (i.e., the bestscore of an unseen document). If one of these two conditions holds, the Upper algorithm performs a sorted access and inserts the read object o into the candidate queue C or updates o’s bestscore (if it has already been in the queue). Given the new highi value, 6.1 Top-k Algorithms from DB 99 τ can be updated. If the score of o is final (i.e., bestscore(o) = worstscore(o)), o is returned as a member of the top-k results. If none of these two conditions holds, the algorithm selects the best source for otop to perform a random access which can come in different implementations. The algorithm stops iff k objects have been returned. The Pick algorithm [MBG04] chooses the object o with the largest difference between worstscore(o) and bestscore(o) to perform a random access. The source to be probed for o is randomly chosen from the set of sources that represent a score dimension not yet known during the evaluation of o. The Minimal Probing (MPro) algorithm [CwH02] works in two phases, 1) the ini- tialization phase which performs only sorted accesses and 2) the probing phase which performs random accesses to complete scores. The initialization phase uses sources that provide sorted access to fill the candidate queue C with objects. Before inserting each object, the MPro algorithm assigns a bestscore value to the object that considers the maximum scores of the remaining, expensive sources representing unknown score dimensions. In each iteration, the probing phase removes the object o with the highest bestscore from the candidate queue and probes its next unevaluated source. If the evaluation for o is complete, o is returned as part of the top-k results, otherwise it is reinserted into the candidate queue. The algorithm stops as soon as k objects have qualified for the top-k results. Finding the optimal probing schedule for each object is an NP-hard problem: thus, optimal probing schedules are approximated using a greedy approach that relies on benefit and cost of each predicate obtained by sampling of ranked lists at query startup time. The authors prefer global scheduling (i.e., the prob- ing sequence is the same for every object) since per-object scheduling would generate the N-fold cost, given N objects in the database. In contrast to the Upper algorithm, MPro expects as input a fixed schedule of accesses to R-sources fixed during the initial sampling phase [MBG04]. Thus, during query processing, it selects only the object to probe next, but avoids source selection at run time that is necessary for Upper. Assum- ing that a sorted access is cheap, this algorithm aims at minimizing the cost of random accesses. IO-Top-k [BMS+06] processes index lists in batches of b sorted accesses which are distributed across the index lists. The authors reuse the cost model (middleware cost) introduced by Fagin in [Fag02] which makes cS · #SA + cR · #RA the overall objective function. The goal of SA scheduling is to optimize in each batch the individual batch sizes bi across all lists, such that some benefit function is maximized and ∑ i∈{1,...,m} bi = b which is equivalent to solving the NP-hard knapsack problem. The authors propose two strategies to handle the problem. 
The Knapsack for Score Reduction (KSR) method aims at reducing highi values as quickly as possible as low highi values allow earlier candidate pruning. Given the current scan positions in the index lists and a budget b, the goal is to find a schedule of individual batch sizes per list such that the total expected reduction of bestscore values for all candidates is maximized. bestscore values are estimated using histograms, assuming uniformly distributed scores. Besides, the optimization expects that the probability of seeing a particular document in list Li 100 6. Top-k Vs. Non-Top-k Algorithms where it has not been encountered yet is (close to) zero as only a small part of a list (i.e., bi entries) is scanned in the next batch. Thus, the expected reduction of that document’s bestscore corresponds to the estimated delta (with the help of histograms) in highi for bi read entries in Li. The Knapsack for Benefit Aggregation (KBA) method performs typically better than KSR, uses the notion of benefit per candidate, and aggregates all candidates’ benefits to decide about bi choices. The goal is to achieve low SA costs in the overall objective function. Like KSR, it uses histograms and current scan positions but also uses knowledge (obtained by the scans so far) about the candidate under view. The goal of RA scheduling is 1) to increase min-k in the beginning and 2) clarify scores for candidates to allow early termination in later stages. The authors propose two strategies coined Last-Probing and Ben-Probing. Last-Probing proceeds in two phases: the first phase consists of several rounds of SAs, the second phase performs only RAs. The second phase is started iff 1) the expected cost for RAs is less than the number of all SAs done up to that point and 2) ∑m i=1 highi ≤ min-k. The first condition aims at balancing the costs of SAs and RAs, the second condition ensures that all top-k items have been seen at this point. Ben-Probing uses a probabilistic cost model to compare the benefit of performing RAs against performing SAs. Costs are compared every b steps when SA scheduling has to be done anyway. The most efficient variant, coined RR-LAST mode, does round-robin sequential ac- cesses and schedules all RAs only at the end of the partial scans of inverted lists, namely, when the expected cost for RAs is below the cost of all sequential accesses so far. 6.2 Top-k Algorithms from IR Besides top-k algorithms from the database systems area (cf. Section 6.1), there are also approaches for top-k query processing in the information retrieval domain (e.g., [AM06, BCH+03b, SC07, TMO10, DS11]). Following the classification by Ding and Suel [DS11], indexes can be organized in different ways: • document-sorted: the postings in each inverted list are sorted by docid. • impact-sorted: the postings in each inverted list are sorted by their impact, i.e., their influence on the score of a document which assumes that the scoring function is decomposable (i.e., one can sum up contributions of single term entries). • impact-layered: the postings are organized in layers, with postings in layer i having a higher impact than the ones in layer i+ 1. Each layer’s entries are sorted by docid. In impact-sorted and impact-layered indexes the postings with the highest impact can be found at the start of the inverted lists such that they are read first during query processing. This property makes impact-sorted and impact-layered indexes popular for early termination algorithms. 
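As a small illustration of these organizations, the following sketch (a toy example of ours; the integer impacts and the chosen layer boundaries are arbitrary and do not follow a particular published quantization scheme) turns a docid-sorted posting list into an impact-sorted list and into an impact-layered list whose layers are again docid-sorted.

```python
def impact_sorted(postings):
    """postings: list of (docid, impact). Returns the list sorted by descending impact."""
    return sorted(postings, key=lambda p: (-p[1], p[0]))

def impact_layered(postings, boundaries):
    """Group postings into layers by impact; boundaries are descending lower bounds,
    e.g. [7, 4, 1] creates layers with impact >= 7, >= 4, >= 1. Layers stay docid-sorted."""
    layers = [[] for _ in boundaries]
    for docid, impact in postings:
        for i, b in enumerate(boundaries):
            if impact >= b:
                layers[i].append((docid, impact))
                break
    return [sorted(layer) for layer in layers]

postings = [(1, 3), (4, 8), (9, 5), (12, 8), (17, 1)]    # docid-sorted input
print(impact_sorted(postings))
# [(4, 8), (12, 8), (9, 5), (1, 3), (17, 1)]
print(impact_layered(postings, [7, 4, 1]))
# [[(4, 8), (12, 8)], [(9, 5)], [(1, 3), (17, 1)]]
```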
Impact-sorted indexes cannot use docids for compression 6.2 Top-k Algorithms from IR 101 since docids increase and decrease with decreasing impact. They can be compressed if the number of distinct impacts is small or if small integer numbers are used as impacts. Impact-layered indexes which employ a small number of layers may be better to com- press, but do not reach the same compression level as document-sorted indexes whose docid gaps are smaller. Only very few early termination techniques use document-sorted index structures. The IR community usually categorizes index traversal approaches as follows: • DAAT (document-at-a-time): the postings of document dj are processed before the postings of document dj+1. Each document is assigned a final score before the next document is scored and a set of the k documents with the currently highest scores is maintained. • TAAT (term-at-a-time): the inverted list of query term ti is processed before the inverted list of query term ti+1. Documents’ partial scores are maintained in so-called accumulators which keep both candidates C and (intermediate) top-k results R. As TAAT approaches do not know the final scores of objects imme- diately, but maintain partial scores instead, the memory footprint is larger for TAAT than for DAAT approaches. • SAAT (score-at-a-time): this approach is neither strictly TAAT nor DAAT. All inverted lists are open at the same time and pointers to list entries that make larger contributions to document scores are processed first. It requires the index structures to be organized in an impact-sorted or impact-layered form. 6.2.1 Exact Top-k Algorithms from IR This subsection elaborates on top-k algorithms from the IR community that deliver exact top-k results, but do not involve proximity scores. Anh and Moffat [AM06] propose dynamic pruning methods that use impact-layered indexes. For each document, all queryable terms are sorted by decreasing term fre- quency value within that document. Given the ordering, each term in the document is assigned an impact between an upper limit u (decided at indexing time) and 1. The number of terms assigned to lower valued layers is exponentially growing. Stop words are always assigned the lowest impact 1. Considering a document with nd distinct terms, of which ns stop words, the base of the layer is defined as B = (nd − ns + 1)1/u. The layers contain (B−1)Bi items, where i ∈ {0, . . . ,u−1}; in [AM06] Anh and Moffat choose u = 8. This approach allows a high compression and storage of documents in impact order. The index structure resembles the inverted block-index structure used by Bast et al. for IO-Top-k [BMS+06] (Section 6.1). Bast et al. partition each index list into blocks, which are ordered by descending score. Within each block, index entries are stored in item ID order which is comparable to document-sorted partial indexes. The authors describe a pruning method that proceeds in four stages and relies on an SAAT approach with an impact-layered index. The algorithm maintains accumulators that keep track of candidate documents C (and their worstscores) which may qualify 102 6. Top-k Vs. Non-Top-k Algorithms for the top-k results R. Furthermore, it keeps track of the current top-k results R, the min-k score (i.e., the lowest worstscore among the items in R, worstscore(Rk)). Ri denotes the result at rank i, i.e., the result with the ith largest score. For each list Li, the impact of the next not yet processed document in the inverted list is stored as nexti. 
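Before turning to the phases of the pruning algorithm, the impact assignment described at the beginning of this subsection can be made concrete. The following code is our reading of the scheme (rounding of the fractional layer sizes and tie-breaking in the tf ordering are assumptions): the non-stopword terms of a document are ordered by decreasing term frequency and filled into layers of geometrically growing size (B − 1)B^i with B = (n_d − n_s + 1)^(1/u), while stop words always receive the lowest impact 1.

```python
def assign_impacts(term_tfs, stopwords, u=8):
    """Assign integer impacts in 1..u to the terms of one document.
    term_tfs: dict term -> term frequency within the document."""
    impacts = {t: 1 for t in term_tfs if t in stopwords}     # stop words: lowest impact
    content = sorted((t for t in term_tfs if t not in stopwords),
                     key=lambda t: (-term_tfs[t], t))        # by decreasing tf
    n = len(content)
    if n == 0:
        return impacts
    B = (n + 1) ** (1.0 / u)                  # B = (n_d - n_s + 1)^(1/u)
    rank = 0
    for i in range(u):
        end = min(n, round(B ** (i + 1) - 1))  # cumulative size of layers 0..i
        for t in content[rank:end]:
            impacts[t] = u - i                 # layer 0 holds the (B-1) highest-impact terms
        rank = end
    for t in content[rank:]:                   # rounding leftovers end up in the lowest layer
        impacts[t] = 1
    return impacts

doc = {"the": 14, "proximity": 5, "score": 5, "index": 3, "query": 2,
       "term": 2, "list": 1, "disk": 1, "cache": 1}
print(assign_impacts(doc, stopwords={"the"}, u=3))
# e.g. {'the': 1, 'proximity': 3, 'score': 2, 'index': 2, 'query': 1, ...}
```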
As aggregation function Anh and Moffat use the commonly used simple summation. The algorithm runs in four phases: the initial OR phase accepts new candidates to be added. The subsequent AND phase only updates already existing candidates and top-k results but does not add new ones. The REFINE phase only considers documents that are in the top-k results and reorders them. The final IGNORE phase ignores the remainder of all inverted lists. The initial OR phase can be quit if no document that is not yet in C or R (i.e., that has not yet been read in an inverted list), can make it to the final top-k results R which holds if min-k ≥ ∑ Li∈L nexti. This criterion corresponds to the stopping criterion for the virtual document o applied in the NRA (Section 6.1.2): bestscore(o) ≤ min-k with bestscore(o) = F(high1, . . . ,highm) and highj = nextj . The subsequent AND phase can be left when the set of top-k results will not change any more: min-k ≥ max{bestscore(d) : d ∈ C,d �∈ R}, where bestscore(d) = worstscore(d) + ∑ Lj∈S̄(d) nextj. This criterion corresponds to the stopping criterion for all candidate objects applied in the NRA. The REFINE phase can be stopped when the sequence of the top-k results will not change any more. This holds if for all top-k documents the bestscore of the document at rank i is not larger than the worstscore of the document at rank i − 1 : ∀Ri,Ri−1 ∈ R,i ≤ k : bestscore(Ri) ≤ worstscore(Ri−1). The final IGNORE phase can ignore all remaining postings then. In addition, the authors propose a method which limits the number of entries read after the OR phase and reaches precision@20 values comparable to the algorithm just described, already when stopping reading inverted lists after 30% of the entries that have not been read during the OR phase. The method features low memory requirements and is way faster than exhaustive evaluation since only needed fragments of inverted lists are transferred from disk. Strohman and Croft [SC07] keep the entire index in main memory to avoid expensive random accesses to disk such that their query processing cost is determined by the number of read bytes. The impact-layered index uses the same impact model as Anh and Moffat, with only eight different integer valued term weights [AM06]. The index 6.2 Top-k Algorithms from IR 103 is organized in segments where every segment contains a set of documents sharing the same impact value. Each segment is document-sorted. The algorithm is based on Anh and Moffat’s approach adding some optimizations: while Anh and Moffat’s algorithm prunes candidates only once, once the top-k results are known (just before the REFINE phase starts), Strohman and Croft eliminate candi- dates after each inverted list segment has been processed: document d can be removed from the candidates if min-k ≥ bestscore(d); doing so, the number of candidates to be updated can be reduced earlier. In addition, the authors propose a technique to optimize the list-length depen- dent skipping distance such that inverted list skipping can be applied during both the AND and the REFINE phase. As both accumulators and inverted list segments are document-sorted, large sections in the inverted lists not worth decoding can be identi- fied and skipped. The WAND approach devised by Broder et al. [BCH+03b] uses a document-sorted index with DAAT-based query processing. It maintains a list of top-k items R scored so far, sorted by decreasing score, and sets the threshold τ to the score of the kth item, Rk. 
(Scores of items are always complete scores, i.e., bestscore=worstscore.) Furthermore, for each inverted list, the algorithm keeps track of the current scan position as well as the maximum score in the respective list. The algorithm uses pivoting in order to skip postings and proceeds in multiple iterations: at the beginning of each iteration, the pointers to the inverted lists are ordered by ascending current docid. Then the inverted lists’ maximum scores are aggregated (in the sequence of the ordered pointers), one after the other, until τ is exceeded. The term corresponding to the inverted list where τ has been exceeded is called pivot term, and the current document in the respective inverted list is called pivot document; the pivot document has the smallest docid with the chance to exceed τ. However, the pivot document is only valid, if the current docids of all preceding inverted lists are equal to the pivot docid. Then the corresponding document can be scored. Otherwise, the cursor of one of the preceding term lists is moved to the pivot docid and the next iteration can start. Ding and Suel’s Block-Max WAND (BMW) algorithm [DS11] is a state-of-the-art DAAT algorithm that uses dynamic pruning and which is based on the WAND algo- rithm. The focus of [DS11] is on top-k early termination query processing (in the sense of non-exhaustive evaluation) using main-memory based index structures. The authors devise an inverted index structure called block-max index which sorts the inverted list in docid order like the input for the WAND algorithm, but organizes the compressed inverted list in blocks. For each block, it stores the maximum impact score of a posting in the block in an uncompressed form to allow skipping long list parts. The inverted lists’ block size is 64 or 128 documents (postings) and supports decompressing individual blocks. There is an additional table outside the inverted list blocks (to avoid cache line effects) that stores the maximum (or minimum) docid and the block size. Storing this extra information only slightly increases the index size. For the WAND algorithm, skipping is limited because the inverted indexes only store 104 6. Top-k Vs. Non-Top-k Algorithms the maximum impact score of the entire list. Storing the maximum impact score per block for block-max indexes enhances the skipping known from the WAND algorithm. This way, the upper bound approximation of document impact scores can be lowered and large performance improvements achieved. The authors distinguish between deep pointer movement in an inverted list which usually involves block decompression and shallow pointer movement which moves the current pointer to the corresponding block without decompressing the block. To this end, the shallow pointer movement relies on the block boundary information stored in the additional table. As stated before, the BMW algorithm is based on WAND and thus uses pivoting in order to skip postings and proceeds in multiple iterations: before evaluating a pivoting docid, first the shallow pointers are moved to check whether the document can make it to the top-k based on the maximum score per block. If not, another candidate is chosen. Instead of moving the cursor in one list to pivot docid+1, Ding and Suel choose d′ = min{c1, . . . ,cp−1,cp}, where c1 to cp−1 are the block boundaries plus one of the first p− 1 lists and cp is the current docid in the pivot term list. This approach greatly improves skipping compared to moving the cursor in one list to pivot docid+1. 
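The pivoting step shared by WAND and BMW is compact enough to sketch directly. The function below (a simplified single iteration of ours; cursor movement, scoring, and the block-max refinements of BMW are omitted, and the data layout is an assumption) orders the query terms by their cursors' current docids, accumulates the per-list maximum scores, and returns the pivot docid, i.e., the smallest docid that could still exceed the threshold τ.

```python
def find_pivot(cursors, tau):
    """cursors: list of (current_docid, list_max_score) per query term, with
    current_docid = None for exhausted lists. Returns the pivot docid or None."""
    active = sorted((c for c in cursors if c[0] is not None), key=lambda c: c[0])
    upper = 0.0
    for docid, max_score in active:
        upper += max_score             # aggregate maximum scores in docid order
        if upper > tau:                # first term where tau can be exceeded: pivot term
            return docid               # its current docid is the pivot document
    return None                        # no remaining document can exceed tau

# three term lists with current docids 12, 40, 97 and their list-wide maximum scores
print(find_pivot([(40, 1.0), (12, 0.8), (97, 2.5)], tau=2.0))   # -> 97
print(find_pivot([(40, 1.0), (12, 0.8), (97, 2.5)], tau=1.5))   # -> 40
```

The pivot is only scored if the current docids of all preceding lists already equal the pivot docid; otherwise one of those cursors is advanced, as described above.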
Ding and Suel experimentally validate that the basic BMW algorithm outperforms their implementation of Strohman and Croft’s approach which is again faster than the WAND approach. Extensions include an impact-layered index organization and docid reassignments. The idea of docid reassignments is to give similar web-pages close docids to improve index compression. Ding and Suel attribute docids by alphabetical ordering of the URLs as in their earlier work with Yan [YDS09b]; it seems that after reassignment, documents in the same block tend to have more similar scores which in addition helps to speed up query processing. The average speed is still slightly slower than for exhaustive conjunctive evaluation, but the difference is greatly narrowed. The goal of impact-layered index organization here is to put high-scoring documents in the same layer and thus avoid spiky scores in the remaining layers. This approach is supposed to avoid reading as many blocks in the remaining layers as possible during the execution of the BMW algorithm. The impact-layered index is split into N layers and each layer is treated like a separate term with the disadvantage of the larger number of terms per query. Ding and Suel choose N=2 to avoid decreasing performance, only lists with at least 50,000 postings are split with 2% of the postings added to the first layer. That way runtime can be decreased further, almost meeting the average speed of exhaustive conjunctive evaluation. 6.2.2 Exact Top-k Algorithms from IR with a Term Proximity Com- ponent This subsection presents approaches that deliver exact top-k results and incorporate some term proximity scoring into dynamic pruning. There has been little work pub- lished that incorporates proximity scores to accelerate top-k text retrieval. [ZSLW07, 6.2 Top-k Algorithms from IR 105 ZSYW08, YSZ+10] use a combination of term as well as term proximity scores simi- lar to the solution presented later in Chapter 7 and make additional use of PageRank scores. [ZSLW07] propose a PageRank ordered index structure that segments in- dex entries based on the tags that surround the text the index entry has been gen- erated from. This results in a long body tag segment and a short segment for all remaining tags. An extension of this index structure splits the body tag segment in two segments, based on whether a document’s term weighting score exceeds or underscores a threshold score. A very recent approach to use term pair indexes to improve bounds in top-k text retrieval was presented in [YSZ+10] which focuses on the index building and query processing with term-pair indices on every local ma- chine of a cluster. They use an order-aware proximity score resulting in two term pair lists per term pair. Like [ZSLW07, ZSYW08], the approach is only applicable for two-term-queries. To keep the ranking flexible [YSZ+10] store position informa- tion while we make use of an integrated proximity score (cf. Section 7.2.2). Our approximate approach that we will present in Chapter 8 elaborates on the trade-off between index size and result quality which was not mentioned in [YSZ+10]. In con- trast to [ZSLW07, ZSYW08, YSZ+10, SBH+07, BS12], [TMO10] do not create term pair lists statically but dynamically during query processing to save on disk space. Document-sorted term pair posting lists are generated from two document-sorted sin- gle term lists by a merge join-based operation. 
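The merge join-based generation just mentioned can be sketched as follows. This is our own simplified rendering in which the pair score is a plain count of co-occurrences within a text window; the order-aware score actually used, and the I/O considerations behind reading only single term lists, are described in the following.

```python
def merge_pair_postings(list_a, list_b, window):
    """Merge join of two docid-sorted positional posting lists; emits (docid, pair_count),
    where pair_count is the number of position pairs lying within the given text window."""
    out, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        (da, pos_a), (db, pos_b) = list_a[i], list_b[j]
        if da < db:
            i += 1                       # advance the cursor with the smaller docid
        elif db < da:
            j += 1
        else:                            # same docid in both lists: a term pair posting
            count = sum(1 for pa in pos_a for pb in pos_b if 0 < abs(pa - pb) < window)
            if count:
                out.append((da, count))
            i += 1
            j += 1
    return out

A = [(3, [5, 40]), (7, [10])]            # postings: (docid, word positions), docid-sorted
B = [(3, [7]), (9, [1, 2])]
print(merge_pair_postings(A, B, window=8))   # -> [(3, 1)]
```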
To save on I/O operations, only single term lists are read from disk and decompressed: i.e., the pointer of the list with the minimum current docid in two document-sorted single term lists is moved. If the two pointers point to the same document, this document qualifies for the term pair list. The authors analyze two existing DAAT dynamic pruning strategies, MaxScore (terminates one item's scoring if its score cannot exceed min-k) and WAND (cf. Section 6.2), and modify them to support proximity scores. They accelerate MaxScore and WAND in a two-stage approach: in a first stage, only single term posting lists are processed like in WAND or MaxScore. In a second stage, term pairs are subsequently processed using early termination (with the MaxScore strategy). The order-aware proximity score uses the sequential term dependence (SD) model for Markov Random Fields [MC05] (cf. Section 2.7.3). For all pairs of adjacent query terms it captures the number of exact phrase occurrences and term pair occurrences in a text window of size 8 for a document d. In [MOT11], the upper bound of a term pair (ti, ti+1)'s frequency is approximated by the smaller of the maximum term frequencies in the term posting lists L(ti) and L(ti+1), i.e., $\min(\max_{d \in L(t_i)} tf(t_i,d),\; \max_{d \in L(t_{i+1})} tf(t_{i+1},d))$, as no term pair can occur more often in a document than the least frequent of the constituent terms. If a term pair posting is selected for scoring, the exact term pair frequency for the given text window size is computed using position lists in both single term posting lists. Otherwise, this computation can be avoided.

6.2.3 Approximate Top-k Algorithms from IR

While the approaches described above compute the exact top-k results for a scoring model with queries on an indexed collection, the approaches described in this subsection only approximate the top-k results, which is often good enough in terms of result quality. Approximate top-k algorithms include probabilistic result pruning [TWS04], execution with limited budget [SSLM+09], and improving score bounds for proximity scores by means of pruned bigram indexes [ZSYW08].

In contrast to dynamic pruning approaches which maintain full index lists and evaluate only a fragment of the indexed documents at query processing time, static pruning approaches such as [SCC+01, BC06] discard postings considered not important already at indexing time. This reduces the amount of information stored on hard disk and often opens the opportunity to keep the indexes in the memory of a single machine, which saves on I/O time during processing. If indexes are wisely pruned, the retrieval quality of the top-k results is comparable to dynamic pruning approaches, usually at the expense of lower recall values.

[SCC+01] introduced list pruning with quality guarantees for the scores of query results, assuming top-k style queries with a fixed (or at least bounded) k. For each list L, they consider the score sk(L) at position k (the kth highest score) in L, and drop each entry from that list whose score is below ε · sk(L), where 0 < ε < 1 is a tuning parameter. They assume a given k, ε, and the original score S that uses unpruned lists as input. They prove that for each query q with r terms there is a scoring function S′ such that for every document (1 − εr)S(q,d) ≤ S′(q,d) ≤ S(q,d). S′ is similar to a scoring function on pruned lists except for the case that a document's entries have been pruned away in too many dimensions such that its score becomes zero.
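A minimal sketch of this score-based pruning, assuming uncompressed in-memory lists of (docid, score) pairs, looks as follows; lists shorter than k are simply left unpruned here.

```python
def epsilon_prune(postings, k, eps):
    """Keep only entries whose score is at least eps * s_k(L), the k-th highest score
    in the list. postings: list of (docid, score)."""
    if len(postings) <= k:
        return list(postings)
    s_k = sorted((s for _, s in postings), reverse=True)[k - 1]
    return [(d, s) for d, s in postings if s >= eps * s_k]

L = [(1, 5.0), (2, 4.0), (3, 1.0), (4, 0.5)]
print(epsilon_prune(L, k=2, eps=0.5))   # threshold 0.5 * 4.0 = 2.0 -> [(1, 5.0), (2, 4.0)]
```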
Experiments are carried out with the topics 401-450 from the Ad Hoc Task of TREC-7 in short and long variants: short queries use the titles only, whereas long queries use titles and descriptions. Choosing � = 50% provides similar P@10 values as on the unpruned index. [BC06] prune lists using a document-centric approach. The approach decides, based on a term’s contribution to a document’s Kullback-Leibler divergence from the text collection’s global language model, whether the corresponding posting should remain in the index. For each document d in the text collection their best-performing approach (DCP (λ)Rel) keeps only the postings for the top-kd terms in d, where kd = �dt(d) · λ� and λ is a user-defined pruning parameter. Using a pruned index with λ = 0.1 (i.e., 10% of each document’s terms are kept) generates a result quality slightly worse than using an unpruned BM25 index evaluated with the 50 Ad Hoc topics from the TREC 2005 Terabyte Track. Its size of 1,570MB corresponds to 12% of the size of an unpruned index. Given a fixed response time, DCP (λ)Rel can provide a better result quality than two other strategies at most recall levels on the TREC 2004 and 2005 Terabyte Track test beds. The first strategy indexes a constant number of terms per document and the 6.3 Non-Top-k Algorithms 107 second strategy performs term-centric pruning which keeps the k best postings for the n most frequent terms. [ZSYW08] combine static index pruning with dynamic pruning techniques. They use pruned uncompressed bigram indexes derived by static index pruning as an additional input to dynamic pruning top-k processing. The pruned bigram indexes lower the upper bound for term proximity scores during query processing of two-term-queries. The prun- ing technique discards 1) bigrams when they are rare in the collection (as that two-term query is unlikely to be issued) and 2) bigrams when both terms are rare in the collection (inducing short term indexes which can be processed quickly). A combination of the pruned bigram index and a two-segment index (i.e., high and low score segment), with each segment ordered by PageRank score, processes retrieved results most efficiently. [WLM11] propose a cascade ranking model, a sequence of increasingly complex ranking models. The first stage returns the highest scoring documents according to the first applied scoring model. Each subsequent stage first prunes candidates and then refines the scoring for the remaining candidates used as input to the next stage. The authors propose rank-based, score-based, and score distribution-based prunings. Unigram features and bigram proximity features (both ordered and unordered term occurrences) as proposed in [MC05] are integrated into a Dirichlet and BM25 score, respectively. A boosting algorithm (based on AdaRank [XL07]) learns the cascade sequence and feature weights of the individual scoring functions. To this end it uses a tradeoff metric that weights effectiveness and efficiency (costs). The cost of a scoring model depends on the normalized average run time over a set of training queries and the input size of this stage. 6.3 Non-Top-k Algorithms The highly efficient top-k or dynamic pruning algorithms (cf. 
Sections 6.1 and 6.2) that are frequently applied for efficient query processing incur a non-negligible processing overhead for maintaining candidates and candidate score bounds, for mapping newly read index entries to a possibly existing partially read document using hash joins, and for regularly checking if the algorithm can stop. In scenarios with short index lists, this processing overhead is not necessary. Instead, it is sufficient to exhaustively evaluate queries in a DAAT fashion. If the lists are long, one should prefer a top-k algorithm instead. The n-way merge join is a DAAT algorithm which receives n docid-ordered lists as input and in each join step calculates the (full) score for the document dcurrent having the next smallest not yet evaluated docid. If the algorithm is executed in the exhaustive OR mode (disjunctive query evaluation), dcurrent does not have to be seen in every list. If the algorithm is executed in the exhaustive AND mode (conjunctive query evaluation), the current docid must be seen in every list and the computation of dcurrent’s score can be skipped if one list pointer points to a different document. If the score is higher than the min-k value, the document is kept in a heap of candidate 108 6. Top-k Vs. Non-Top-k Algorithms results, otherwise it is dropped as it cannot make it into the top-k results R any more. For every list it keeps track of the position up to which the list has been read so far and iterates to the next item if the document in this list has just been evaluated. If the items of all lists have been read completely, the algorithm terminates. Once all index entries have been read, the content of the heap is returned. One commonly used approach to accelerate query processing is to perform ranking in two phases. The first phase that uses a simple and easily-to-compute ranking model (e.g., BM25) pre-selects the documents to be re-ranked in the second phase with a usually more complex, not that easily-to-compute scoring model. Furthermore, using phrases is a common means in term queries to restrict the results to those that exactly contain the phrase and is often useful for effective query evaluation [CCB95]. A simple way to efficiently evaluate phrases are word-level indexes, inverted files that maintain positional information [WMB99]. There have been some proposals for specialized index structures for efficient phrase evaluation that utilize term pair indexes and/or phrase caching, but only in the context of boolean retrieval and hence not optimized for top- k style retrieval with ranked results [CP06, W+99, WZB04]. There are proposals to extend phrases to window queries, where users can specify the size of a window that must include the query terms to favor documents containing all terms within such a window [MSTC04, PA97, BAYS06]. However, this line of works has treated term prox- imity only as an afterthought after ranking, i.e., proximity conditions are formulated as a simplistic Boolean condition (e.g., requiring query terms to appear within the user-specified window size) and optimized as separate post-pruning step after ranked evaluation. Chapter 7 Casting Proximity Scoring Models into Top-k Query Processing 7.1 Introduction The first part of this chapter describes how we can modify Büttcher et al.’s scoring model to make it fit into top-k algorithm-based query processing. It is based on our work published in [SBH+07]. 
There has been a number of proposals in the literature for proximity-aware scor- ing schemes summarized in Chapter 2; however, there are only a few proposals that efficiently find the best results to queries in a top-k style with dynamic pruning tech- niques (cf. Section 6.2.2). We show that integrating proximity in the scoring model can not only improve retrieval effectiveness, but also improve retrieval efficiency; using pruned index lists, we gain up to two orders of magnitude compared to standard top-k processing algorithms for purely occurrence-based scoring models on unpruned lists. This insight opens the door for using a light-weight n-ary merge join in combination with pruned document-sorted index lists published in [BS08a] which realizes a similar speed up by one or two orders of magnitude compared to an evaluation with a top-k system such as TopX [TSW05] using unpruned lists. Hence, we can avoid top-k dynamic pruning techniques that maintain a candidate pool and compute best-/worstscores for result candidates to finally come up with the top-k results. Besides saving the overhead costs, this simple approach keeps up the excellent precision values and saves much disk space. The second part of this chapter aims at evaluating the feasibility of the proximity- enhanced scoring models surveyed in Chapter 2 for top-k algorithm-based query pro- cessing. Thereby, we try to figure out how to apply the techniques presented in the first part of this chapter to other scores. 109 110 7. Casting Proximity Scoring Models into Top-k Query Processing 7.2 Proximity Scoring 7.2.1 Proximity Scoring Models We focus on proximity scoring models that use a linear combination of a content-based score with a proximity score as they are usually more easily decomposable into their features and thus more straight forward to index than integrated scoring models. We have described a selection of such linear combination and integrated scoring models in Section 2.4 and Section 2.5, respectively. The particular scoring model we use is a scoring model proposed by Büttcher et al. [BC05, BCL06] (labelled Büttcher’s scoring model from now on) which has been described in detail in Section 2.4.2. We have experimentally validated in Section 4.2 that for the Web Track and Robust Track test beds, Büttcher’s scoring model is among the scoring models that provide the highest precision, MAP, and NDCG values. For the Terabyte Track test beds, it yields the highest retrieval quality for all test beds and retrieval metrics (except for topics 751-800 with the MAP metrics where it performs slightly weaker than Song et al.’s scoring model) and for all INEX test beds the high- est retrieval quality with all metrics. According to Metzler [Met06b], an ideal model that generalizes perfectly achieves an effectiveness ratio of 1. While effectiveness ratios below 90% indicate a scoring model’s missing ability to generalize, the most reason- able retrieval models have an effectiveness ratio above 95%. In Section 4.2.3 we have demonstrated for the MAP and NDCG@10 metrics that Büttcher’s scoring model has an effectiveness ratio that overscores 95% which holds only for two scores in our eval- uation. All scoring models exhibit high intracollection generalization values between 98% and 100% on all test beds with both the MAP and NDCG@10 metrics. In Sec- tion 4.2.4, we have shown for both the MAP and NDCG@10 metrics that Büttcher’s scoring model exhibits a relatively low spread, but a relatively high entropy. 
In our setting, we think that the spread value is more meaningful than the entropy value as it measures how much retrieval quality can decrease if we choose the wrong parameter combination. An initial set of experiments aimed at validating that Büttcher’s score outperformed the BM25 score for various parameter settings and thus shows improvements indepen- dent of the parameter choice. In particular, we wanted to find out whether the original parameter setting from [BC05, BCL06] is appropriate and can be used for our exper- iments. To this end, with the 100 topics from the TREC Terabyte Track, Ad Hoc Tasks 2004 and 2005 on the GOV2 collection, we evaluated the effect of Büttcher’s score over the BM25 score alone for 60 combinations of values for k1 and b, for pre- cision at different cutoffs and MAP. For all experiments, the results with Büttcher’s score were always at least as good as the results with BM25, significantly better (with p ≤ 0.05 for a signed t-test) for 42 configurations in precision at 10 results, for 59 configurations in precision at 100 results, and always for MAP. We use the parame- ter setting from [BC05, BCL06] (k1 = k = 1.2,b = 0.5), which was among the best configurations in our experiments as well. 7.2 Proximity Scoring 111 7.2.2 Modification of Büttcher’s Scoring Model To include Büttcher’s proximity score into query processing, it would be intriguing to use a standard word-level inverted list, i.e., an inverted list that stores with each document also the positions of the term occurrences in the document, and compute proximity scores on the fly as a document is encountered. We could use the tf(t,d) values for each query term t in d to compute a bestscore for a document: to this end we would have to ’construct’ a document that maximizes the pscore(d,q) value by putting tf(t,d) times query term t into the conceived document (we do not know the real document since we have not read the word-level inverted lists yet). This boils down to a combinatorial problem. For two-term queries {ti, tj} it is already hard to solve; if ti and tj share the same tf value in d, one has to place them alternately in the conceived document to maximize the pscore value as only non-equal adjacent query terms generate a proximity contribution. If ti and tj have different tf values in d, we first have to place the term with the lower tf value (w.l.o.g. ti) and then try to group the term occurrences of tj around the occurrences of ti. The longer the query, the more complex the combinatorial problem gets. However, this approach is not feasible in a top-k style processing as it is not possible to compute tight score bounds for candidates which in turn disables dynamic pruning and in addition the combinatorial problem does not seem to be trivial, especially for long queries. For an efficient computation of the top-k results, we need to precompute and store proximity score information in index lists that can be sequentially scanned and compute tight score bounds for early termination. The main problem with Büttcher’s scoring function in this respect is that the accumulator value accd(t) is computed as a sum over adjacent query term occurrences, which is inherently query dependent, and we cannot precompute query-independent information. An additional, minor issue is that the scoring function includes the document length which cannot be easily factorized into a precomputed score contribution. 
To solve this, we slightly modify Büttcher's original scoring function; this does not have much influence on result quality, but allows precomputation. In addition to dropping the document length component (by setting b = 0 in the formula), we consider every pair of query term occurrences, not only adjacent occurrences. The modified accumulation function acc' is defined as

acc'_d(t_k) = \sum_{(i,j) \in Q_{all,d}(q):\, p_i = t_k,\, p_i \neq p_j} \frac{idf(p_j)}{(i-j)^2} + \sum_{(i,j) \in Q_{all,d}(q):\, p_j = t_k,\, p_i \neq p_j} \frac{idf(p_i)}{(i-j)^2}.   (7.1)

As the value of acc'_d(t_k) depends not only on d and t_k, but also on the other query terms, we still cannot precompute this value independently of the query. However, we can reformulate the definition of acc'_d(t_k) as follows:

acc'_d(t_k) = \sum_{t \in q} idf(t) \cdot \underbrace{\left( \sum_{(i,j) \in Q_{all,d}(q):\, p_i = t_k,\, p_j = t,\, p_i \neq p_j} \frac{1}{(i-j)^2} + \sum_{(i,j) \in Q_{all,d}(q):\, p_i = t,\, p_j = t_k,\, p_i \neq p_j} \frac{1}{(i-j)^2} \right)}_{:= acc_d(t_k,t)}   (7.2)
= \sum_{t \in q} idf(t) \cdot acc_d(t_k, t).   (7.3)

We have now represented acc'_d(t_k) as a monotonic combination of query term pair scores acc_d(t_k, t). We can precompute these pair scores for all term pairs occurring in documents and arrange them in index lists that are sorted by descending acc_d(t_k, t) scores. Note that term order does not play a role, i.e., acc_d(t_k, t) = acc_d(t, t_k). Including these lists in the sequential accesses of our processing algorithm, we can easily compute upper bounds for acc'_d(t_k) analogously to query term dimensions by plugging in the score at the current scan position in the lists where d has not yet been encountered. The current score of a document is then computed by evaluating our modified Büttcher score with the current value of acc'_d, and the upper bound is computed using the upper bound for acc'_d; this is correct as the modified Büttcher score is monotonic in acc'_d.
7.3 Indexing and Evaluation Framework
7.3.1 Precomputed Index Lists and Evaluation Strategies
Our indexing framework consists of the following precomputed and materialized index structures, each primarily used for sequential access, but with an additional option for random access:
• Term index list (short: term list): for each single term t a list that contains an entry for each document d where this term occurs (i.e., tf(t,d) > 0). This entry has the form (d.docid, scoreBM25(d,t)) where d.docid is a unique numerical id for document d. TL(t) denotes the term list of term t. The chosen parameters have been disclosed in Section 7.2.1.
• Proximity index list (short: proximity list): for each single term pair (t1, t2) a list that contains an entry for each document d where this term pair occurs within any text window of size W (we will discuss the window size in Section 7.3.4). This entry has the form (d.docid, accd(t1, t2)) where the proximity contribution of (t1, t2) for d is stored in accd(t1, t2). t1 and t2 are lexicographically ordered (i.e., t1 < t2) such that for any single term pair combination we keep the corresponding proximity list just once. PXL(t1, t2) denotes the proximity list for the term pair (t1, t2).
• Combined index list (short: combined list): for each single term pair (t1, t2) a list that contains an entry for each document d where this term pair occurs within any text window of size W (we will discuss the window size in Section 7.3.4).
This entry has the form (d.docid, accd(t1, t2), scoreBM25(d,t1), scoreBM25(d,t2)) where the proximity contribution of (t1, t2) for d is stored in accd(t1, t2). t1 and t2 are lexicographically ordered (i.e., t1 < t2) such that for any single term pair combination we keep the corresponding combined list just once. CL(t1, t2) denotes the combined list for the term pair (t1, t2).
Both PXLs and CLs are term pair lists (short: pair lists). The order of entries in the index lists depends on the algorithm used for query processing. Entries can be ordered either by docid or by descending scores (scoreBM25 for the term lists, accd values for the term pair lists). We illustrate the layout of our index lists with score-based ordering in Figure 7.1. It depicts the term, proximity, and combined index lists which can be used to process the query {bike, trails}.
[Figure 7.1 depicts example lists TL(bike) and TL(trails) with entries of the form (d.docid, scoreBM25(d,t)) sorted by descending scoreBM25, as well as PXL(bike, trails) with entries (d.docid, accd(ti, tj)) and CL(bike, trails) with entries (d.docid, accd(ti, tj), scoreBM25(d,ti), scoreBM25(d,tj)), both sorted by descending accd(ti, tj).]
Figure 7.1: Score-ordered term, proximity, and combined index lists which can be used to process the query {bike, trails} in several processing strategies.
The index structures depicted in Figure 7.1 can be combined into several processing strategies:
• TL: this corresponds to standard, text-based retrieval (just BM25 scores are employed) without using proximity scores. To process the query {bike, trails}, it uses the two term lists TL(bike) and TL(trails).
• PXL: this scans only the proximity lists and uses the proximity part of our modified Büttcher scoring function for ranking. To process the query {bike, trails}, this strategy uses the proximity list PXL(bike, trails).
• TL+PXL: this scans proximity and content score lists (which would be the straightforward implementation of our scoring model with a Threshold algorithm). To process the query {bike, trails}, this strategy uses the two term lists TL(bike) and TL(trails) as well as the proximity list PXL(bike, trails).
• TL+CL: this strategy, which is the main contribution of this chapter, exploits the additional content scores in the CLs to reduce the uncertainty about the score of documents with high proximity scores early in the process, which often allows early termination of the algorithm. We can additionally tighten the bounds when a CL for a pair (t1, t2) runs empty: if a document was seen in the TL for t1, but not in the CL for (t1, t2), it is certain that it will not appear in the TL for t2 any more. To process the query {bike, trails}, this strategy uses the two term lists TL(bike) and TL(trails) as well as the combined list CL(bike, trails).
We restrict ourselves to answering soft phrase queries. Once the index has been built from term pair occurrences within a text window, strict phrase queries can no longer be processed with a pair-based index. This means that we cannot exclude those documents from the result set that do not contain the terms from the phrase consecutively. However, proximity scores are usually higher for documents with phrase occurrences than for those without phrase occurrences.
Therefore, documents with phrase occurrences are not pruned away from the pair lists such that they will be very likely to be considered during query processing. Query-independent weights such as PageRank weights [BP98] may be stored in term lists. As they are small in size for commonly used document collections, they may be kept even in main memory (but this has also not been considered in other papers such as [DS11, SC07]). If updates are needed, the updates will just have to be carried out in one place - the term lists. There has been a noticeable amount of work using precomputed lists for docu- ments containing two or more terms to speed up processing of conjunctive queries, for example [CCKS07, KPSV09, LS05], for centralized search engines, and [PRL+07] for distributed search engines. None of these approaches includes proximity scores, so they can only improve processing performance, not result quality. Another bunch of papers deals with efficiently precomputing indexes for phrase queries [BWZ02, CP06, WZB04], but again they do not include proximity scores. Some of these consider the problem of reducing the index size while providing decent performance for most queries, usually by restricting to phrases or term pairs in frequently occurring queries. 7.3.2 Evaluation Setup We evaluated our algorithms with the Java-based, open-source TopX search en- gine1 [TSW05] which stores index lists in an Oracle database. Our experiments were run using the GOV2 collection with roughly 25 million documents, corresponding to about 426 GB of data (see Section 3.2.1 for more details). We evaluated our methods with the 100 Ad Hoc topics (topic numbers 701-800) from the 2004 and 2005 TREC Terabyte Track, Ad Hoc Tasks. The topic sets are listed in Tables B.1 and B.2. As we are focusing on top-k retrieval, we measured precision values at several cutoffs. 1 http://topx.sourceforge.net 7.3 Indexing and Evaluation Framework 115 To evaluate efficiency, we measured the number of sequential (SA) and random (RA) accesses to the index lists and the number of bytes transferred from disk, assuming sizes of 8 bytes for scores and docids. As random accesses are usually much more expensive than sequential accesses, we additionally compute a byte-based abstract cost Cost(γ) = #bytes(SA) + γ · #bytes(RA) for each run, based on the cost ratio γ := cR/cS of random to sequential accesses; we used γ values of 100 and 1,000 to determine abstract costs. We indexed the documents with the indexer included in the TopX system with stopword removal enabled and computed the pair lists needed for the queries with an additional tool. We ran the results with TopX configured in RR-LAST mode and a batch size of 5,000, i.e., round-robin sequential accesses in batches of 5,000 items to the index lists and postponing random accesses to the end. 7.3.3 Results Table 7.1 shows our experimental results for top-10 retrieval with stemming enabled. Configuration P@10 #SA #RA #bytes(SA) #bytes(RA) Cost(100) Cost(1,000) TL 0.56 24,175,115 196,174 386,801,840 1,569,392 543,741,040 1,956,193,840 TL+PXL 0.60 24,743,914 149,166 395,902,624 1,193,328 515,235,424 1,589,230,624 TL+CL 0.60 4,362,509 8,663 108,743,568 79,256 116,669,168 187,999,568 PXL 0.40 867,095 2,925 13,873,520 23,400 16,213,520 37,273,520 Table 7.1: Experimental results for top-10 retrieval of 100 Ad Hoc topics from the 2004 and 2005 TREC Terabyte Track, Ad Hoc Tasks. 
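As a quick check of the abstract cost measure defined above against the TL row of Table 7.1 (all numbers are taken from the table):

Cost(100)   = 386,801,840 + 100 · 1,569,392   = 386,801,840 + 156,939,200   = 543,741,040
Cost(1,000) = 386,801,840 + 1,000 · 1,569,392 = 386,801,840 + 1,569,392,000 = 1,956,193,840

Both values match the Cost(100) and Cost(1,000) columns reported for TL.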
It is evident that the configuration TL+CL improves P@10 to 0.60 over the original BM25 setting (which corresponds to TL with a P@10 value of 0.56), with a t-test and a Wilcoxon signed-rank test confirming statistically significant improvements at p < 0.01. The configuration TL+PXL with simple proximity lists achieves the same improvement in precision as it uses the same scoring function as TL+CL, whereas scanning only the PXLs exhibits poor result precision. We verified by additional experiments that the retrieval quality of our modification of Büttcher's scoring model was as good as the original version of Büttcher's scoring model.
In addition to the improved retrieval quality of TL+CL over the TL baseline, TL+CL dramatically reduces the number of accesses, bytes transferred, and abstract costs by a factor of 5 to 10. This is due to the additional content scores available in CL and the better bounds. The configuration TL+PXL needs to run longer than TL+CL until it can safely stop. Scanning only the PXLs is much faster (at the expense of result quality).
index/limit | unpruned size (#items) | required space
TL | 3.191·10^9 | 47.5GB
PXL/CL (estimated) | 1.410·10^12 | 20.5TB / 41.0TB
Table 7.2: Index sizes in items and required space for unpruned indexes.
Table 7.2 shows the index sizes (number of list entries and required space) for term (exact) and pair lists (estimated). As the complete set of pair lists was too large to completely materialize it, we randomly sampled 1,500,000 term pairs with a frequency of at least 10, of which about 1.2% had a non-empty pair list. They are calculated/estimated according to the kind of data stored in the lists as described in Section 7.3.1, assuming an uncompressed storage. We assume that document identifiers and scores have a size of 8 bytes each. Therefore one TL entry or PXL entry (consisting of document identifier and BM25 score or accumulated score, respectively) takes a size of 16 bytes, whereas one CL entry takes a size of 32 bytes as it stores the document identifier, the accumulated score, and two BM25 scores.
It is evident that keeping all pair lists requires a prohibitive amount of disk space (for the GOV2 collection, the estimated disk space to store unpruned PXL and CL indexes amounts to 20.5TB and 41.0TB, respectively): for large collections, the size of the inverted lists may be too large to completely store them, especially when the index includes term pair lists. As we do not consider only adjacent terms, but any terms occurring in the same document, a complete set of pair lists will be much larger than the original text collection. Lossless index compression techniques (see, e.g., [dMNZBY00]) are one way to solve this problem, but the compression ratio will not be sufficient for really huge collections. We therefore apply index pruning (which is a lossy index compression technique) to reduce the size of the index, while at the same time sacrificing as little result quality as possible. Following the literature on inverted lists for text processing, a common way is pruning lists horizontally, i.e., dropping entries towards the end of the lists. These entries have low scores and hence will not play a big role when retrieving the best results for queries. Unlike term lists, term pair lists contain many entries with very low scores (as the score depends on the distance of term occurrences), so the pruning effect on pair lists should be a lot higher than on term lists.
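Before turning to pruning, the estimates in Table 7.2 can be verified directly from the entry sizes just given (a worked check, assuming 1 GB = 2^30 bytes and 1 TB = 2^40 bytes):

TL:  3.191·10^9 entries · 16 bytes ≈ 51.1·10^9 bytes  ≈ 47.5 GB
PXL: 1.410·10^12 entries · 16 bytes ≈ 22.6·10^12 bytes ≈ 20.5 TB
CL:  1.410·10^12 entries · 32 bytes ≈ 45.1·10^12 bytes ≈ 41.0 TB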
7.3.4 Results with Pruned Index Lists
Our indexing framework provides three different pruning methods, mainly geared towards term pair lists. First, we heuristically limit the distance of term occurrences within a document, as occurrences within a large distance have only a marginal contribution to the proximity score. Second, we heuristically limit the list size to a constant, usually in the order of a few thousand entries. Third, we leverage the seminal work by Soffer et al. [SCC+01] for pair lists. They introduced list pruning with quality guarantees for the scores of query results, assuming top-k style queries with a fixed (or at least bounded) k. For each list Li, they consider the score sk(Li) at position k of the list, and drop each entry from that list whose score is below ε · sk(Li), where 0 < ε < 1 is a tuning parameter.
We first study the size of our indexes at different levels of pruning for an index (without stemming as this is an upper bound for the index size with stemming). Table 7.3 shows the influence of index list pruning on the number of index items.
index/limit | 500 | 1,000 | 1,500 | 2,000 | 2,500 | 3,000 | unpruned
TL | 295 | 355 | 402 | 442 | 472 | 496 | 3,191
PXL/CL (est.) | 368,761 | 435,326 | 481,949 | 515,079 | 542,611 | 566,277 | 1,410,238
PXL/CL, accd ≥ 0.01 (est.) | 23,050 | 28,855 | 34,023 | 38,985 | 42,085 | 45,186 | 87,049
Table 7.3: Index sizes (million items) with different length limits, with and without minimum acc-score requirement.
index/limit | 500 | 1,000 | 1,500 | 2,000 | 2,500 | 3,000 | unpruned
TL | 4.4GB | 5.3GB | 6.0GB | 6.6GB | 7.0GB | 7.4GB | 47.5GB
PXL (est.) | 5.4TB | 6.3TB | 7.0TB | 7.5TB | 7.9TB | 8.2TB | 20.5TB
PXL, accd ≥ 0.01 (est.) | 343.5GB | 430GB | 507GB | 580.9GB | 627.1GB | 673.3GB | 1.3TB
CL (est.) | 10.7TB | 12.7TB | 14.0TB | 15.0TB | 15.8TB | 16.5TB | 41.0TB
CL, accd ≥ 0.01 (est.) | 686.9GB | 860GB | 1.0TB | 1.1TB | 1.2TB | 1.3TB | 2.5TB
Table 7.4: Index sizes (disk space) with different length limits, with and without minimum acc-score requirement.
It is evident that keeping all pair lists, even with a length limit, is infeasible. However, limiting the text window size to 10 reduces the number of items in the CL index noticeably to at most a factor of 8-15 over the unpruned term index, which may be tolerated given the cheap disk space available today. We mark settings with limited window sizes by accd ≥ 0.01; one term occurrence of both ti and tj in a text window of 10 amounts to an accd(ti, tj) contribution of at least 0.01.
Table 7.4 shows the index sizes (required disk space) for the very same lists. The size of TLs is not a big issue as the unpruned TLs only amount to 47.5GB, and they can be further downsized using maximum list lengths. The far more critical indexes are PXLs and CLs, which exhibit prohibitive estimated sizes of 20.5TB and 41.0TB, respectively. Limiting the list size helps, although the lists remain too large. Additionally restricting PXLs and CLs by a minimum acc-score of 0.01 finally leads to tolerable sizes between 343.5GB and 673.3GB for PXLs and between 686.9GB and 1.3TB for CLs. As we show later in Table 7.5, excellent results can be achieved when limiting the index size to 1,000 entries per list. Hence, we need less than 900GB of disk space to execute TL+CL(1,000; accd ≥ 0.01) on a document collection with 426GB of data. Note that in this setting, both TLs and CLs keep at most 1,000 entries and CL entries require a minimum acc-score of 0.01. Additional lossless compression may further reduce the index sizes.
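The following sketch illustrates how a pair list could be built and pruned along the lines described above; it is illustrative code with invented names, not the actual indexer, and it takes the window size, the list length limit, the minimum acc-score, and the ε parameter of Soffer et al. as inputs.

import java.util.*;

// Hypothetical sketch of building and pruning a proximity list PXL(t1,t2); not the thesis
// indexer. Entries carry (docid, acc score); the pruning options mirror the ones above.
public class PairListPruning {

    record Entry(int docid, double acc) {}

    // acc_d(t1,t2): sum of 1/(i-j)^2 over all occurrence pairs within a window of size w.
    static double accScore(int[] posT1, int[] posT2, int w) {
        double acc = 0.0;
        for (int i : posT1)
            for (int j : posT2)
                if (i != j && Math.abs(i - j) <= w)
                    acc += 1.0 / ((double) (i - j) * (i - j));
        return acc;
    }

    // Horizontal pruning: keep at most maxLength entries, drop entries below minAcc
    // (window-based pruning) and below eps * s_k (Soffer et al., s_k taken from this list).
    static List<Entry> prune(List<Entry> list, int maxLength, double minAcc, double eps, int k) {
        List<Entry> sorted = new ArrayList<>(list);
        sorted.sort((a, b) -> Double.compare(b.acc(), a.acc())); // descending acc score
        double sk = sorted.size() >= k ? sorted.get(k - 1).acc() : 0.0;
        List<Entry> pruned = new ArrayList<>();
        for (Entry e : sorted) {
            if (pruned.size() >= maxLength) break;
            if (e.acc() < minAcc || e.acc() < eps * sk) continue;
            pruned.add(e);
        }
        return pruned;
    }
}

With a window of size 10, an isolated occurrence pair contributes 1/10^2 = 0.01, which is exactly the minimum acc-score used above to mark window-limited settings.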
We then evaluated retrieval quality with pruned (term and combined) index lists, where we used combinations of window-based pruning with a maximal size of 10, fixed-length index lists, and the pruning technique by Soffer et al. [SCC+01] for k = 10. All measurements were done without random accesses (i.e., using NRA), hence we report only a single cost value based on the number of bytes transferred by sequential accesses. Additional experiments without this constraint in RR-LAST mode showed that TopX only rarely attempts to make RAs in this setting as the pruned lists are often very short: hence, RR-LAST degenerates into NRA.
Table 7.5 shows the experimental results for top-10 queries in this setup, again with stemming enabled.
Configuration | P@10 | #SA | bytes(SA) | cost
TL+CL (accd ≥ 0.01) | 0.60 | 5,268,727 | 111,119,408 | 111,119,408
TL (500) | 0.27 | 148,332 | 2,373,312 | 2,373,312
TL (1,000) | 0.30 | 294,402 | 4,710,432 | 4,710,432
TL (1,500) | 0.32 | 439,470 | 7,031,520 | 7,031,520
TL (2,000) | 0.34 | 581,488 | 9,303,808 | 9,303,808
TL (2,500) | 0.36 | 721,208 | 11,539,328 | 11,539,328
TL (3,000) | 0.37 | 850,708 | 13,611,328 | 13,611,328
TL+CL (500) | 0.53 | 295,933 | 7,178,960 | 7,178,960
TL+CL (1,000) | 0.58 | 591,402 | 14,387,904 | 14,387,904
TL+CL (1,500) | 0.58 | 847,730 | 20,605,312 | 20,605,312
TL+CL (2,000) | 0.60 | 1,065,913 | 25,971,904 | 25,971,904
TL+CL (2,500) | 0.60 | 1,253,681 | 30,648,064 | 30,648,064
TL+CL (3,000) | 0.60 | 1,424,363 | 34,904,576 | 34,904,576
TL+CL (ε = 0.010) | 0.60 | 4,498,890 | 87,877,520 | 87,877,520
TL+CL (ε = 0.025) | 0.60 | 3,984,801 | 73,744,304 | 73,744,304
TL+CL (ε = 0.050) | 0.60 | 4,337,853 | 75,312,336 | 75,312,336
TL+CL (ε = 0.100) | 0.60 | 5,103,970 | 84,484,976 | 84,484,976
TL+CL (ε = 0.200) | 0.58 | 6,529,397 | 105,584,992 | 105,584,992
TL+CL (500; ε = 0.025) | 0.54 | 281,305 | 6,628,528 | 6,628,528
TL+CL (1,000; ε = 0.025) | 0.58 | 521,519 | 12,034,320 | 12,034,320
TL+CL (1,500; ε = 0.025) | 0.59 | 732,919 | 16,606,064 | 16,606,064
TL+CL (2,000; ε = 0.025) | 0.60 | 910,721 | 20,377,904 | 20,377,904
TL+CL (2,500; ε = 0.025) | 0.60 | 1,060,994 | 23,519,296 | 23,519,296
TL+CL (3,000; ε = 0.025) | 0.60 | 1,191,956 | 26,211,376 | 26,211,376
TL+CL (500; accd ≥ 0.01) | 0.58 | 290,788 | 6,931,904 | 6,931,904
TL+CL (1,000; accd ≥ 0.01) | 0.60 | 543,805 | 12,763,376 | 12,763,376
TL+CL (1,500; accd ≥ 0.01) | 0.61 | 780,157 | 18,117,552 | 18,117,552
TL+CL (2,000; accd ≥ 0.01) | 0.61 | 984,182 | 22,734,544 | 22,734,544
TL+CL (2,500; accd ≥ 0.01) | 0.61 | 1,166,144 | 26,854,608 | 26,854,608
TL+CL (3,000; accd ≥ 0.01) | 0.61 | 1,325,250 | 30,466,512 | 30,466,512
Table 7.5: Experimental results for top-10 retrieval with pruned lists.
It is evident that TL+CL with length-limited lists and a minimum acc-score constraint (limited window size) gives a factor of 50-150 over the unpruned TL baseline in terms of saved cost, while yielding the same result quality (TL+CL (1,000; accd ≥ 0.01)). Using TL as processing strategy with term lists of limited length is a lot worse in effectiveness. Pruning with ε is not as efficient, and large values for ε in fact increase cost: many entries from the pair lists are pruned away, but with them the additional content scores from these entries are lost as well. In combination with length limiting, results are comparable to our best configuration, but with slightly longer lists. Figures 7.2 to 7.5 illustrate some of these experimental results. We obtain the best precision values when limiting the list size to 1,500 or more elements (for TL+CL(#items; accd ≥ 0.01) runs).
Out of the approaches depicted in Figures 7.2 and 7.3, TL+CL(#items) is the approach with the worst precision values at the highest cost. TL+CL(#items; accd ≥ 0.01) provides the best precision values at a medium cost, whereas TL+CL(#items; ε = 0.025) only comes up with a slightly better precision than TL+CL(#items), however at the best costs. For mere static index list pruning, precision values are most favorable for choices of ε below 0.1.
[Figures 7.2 and 7.3 plot cost and P@10 for the TL+CL(#items), TL+CL(#items; ε = 0.025), and TL+CL(#items; accd ≥ 0.01) runs over list lengths from 500 to 3,000 items; Figures 7.4 and 7.5 plot Cost(100) and P@10 for TL+CL with ε varied from 0.00 to 0.20.]
Figure 7.2: TL+CL approaches: cost.
Figure 7.3: TL+CL approaches: P@10.
Figure 7.4: TL+CL (ε varied): cost.
Figure 7.5: TL+CL (ε varied): P@10.
Table 7.6 demonstrates that, compared to TL+CL with pruned lists, TL+PXL with pruned lists suffers from a strongly reduced retrieval quality. This is due to the fact that documents from pruned CLs are often not among the top documents in TLs, such that their BM25 scores are missing in pruned TLs: these additional BM25 scores from the CLs have a decisive impact on the retrieval quality, as the results of runs using TL+PXL with pruned lists deteriorate. Runs with minimum acc-score constraints and the pruning technique by Soffer et al. [SCC+01] delivered comparable results. Hence, we do not consider TL+PXL settings with pruned lists any more.
Configuration | P@10 | P@100
TL+CL (accd ≥ 0.01) | 0.60 | 0.39
TL+PXL (1,000) | 0.43 | 0.26
TL+PXL (2,000) | 0.45 | 0.28
TL+PXL (3,000) | 0.47 | 0.30
TL+PXL (3,000; ε = 0.025) | 0.47 | 0.30
TL+PXL (3,000; accd ≥ 0.01) | 0.47 | 0.30
TL+CL (1,000) | 0.58 | 0.37
TL+CL (2,000) | 0.60 | 0.38
TL+CL (3,000) | 0.60 | 0.39
Table 7.6: Retrieval quality for top-10 and top-100 retrieval with pruned lists.
Given the obvious importance of the BM25 scores in pruned CLs for the high retrieval quality of TL+CL(#items), we now investigate to what extent the result quality changes if we use only the BM25 scores from the pruned CLs. To this end, we devise another list structure called CTL (combined term index list) that is based on CL but removes the acc-score dimension. For each single term pair (t1, t2) there is a list that contains an entry for each document d where this term pair occurs within any text window of size W. This entry has the form (d.docid, scoreBM25(d,t1), scoreBM25(d,t2)). t1 and t2 are lexicographically ordered (i.e., t1 < t2) such that for any single term pair combination we keep the corresponding combined list just once. CTL(t1, t2) denotes the combined term index list for the term pair (t1, t2). Pruned CTLs keep those entries from the corresponding CLs where the total contribution from both BM25 score dimensions is highest.
Table 7.7 compares the retrieval quality for top-100 retrieval of pruned TL+CTL and TL+CL settings. For a list length of 1,000, P@100 for TL+CL is comparable to TL+CTL; for longer lists, TL+CL outperforms TL+CTL. Due to this advantage, in Chapter 8, we will focus on TL+CL settings when we determine pruning levels for index structures.
Configuration | P@100
TL+CTL (1,000) | 0.3768
TL+CTL (2,000) | 0.3753
TL+CTL (3,000) | 0.3769
TL+CL (1,000) | 0.3710
TL+CL (2,000) | 0.3841
TL+CL (3,000) | 0.3877
Table 7.7: Retrieval quality for top-100 retrieval with pruned TL+CTL and TL+CL settings.
Podnar et al. [PRL+07] use static index list pruning in a peer-to-peer setting with terms and term sets as keys. They distinguish discriminative keys (DKs) that occur in at most DFmax documents and non-discriminative keys (NDKs) that occur in more than DFmax documents. Posting lists of NDKs are truncated to their best DFmax entries. A key is called intrinsically discriminative if it is a DK and all smaller subsets of this key are NDKs. A key is called a highly discriminative key (HDK) if it has at most smax terms, occurs in a window of size w, and in addition is an intrinsically discriminative key. Full posting lists are stored for HDKs. Each peer maintains a local index which is built in several iterations, starting with one-term keys up to smax-term keys, and adds local HDKs and NDKs with their posting lists to the global network. The P2P network maintains the global posting lists and notifies the responsible peers if an inserted HDK becomes globally non-discriminative. In that case, the peers in charge of the globally non-discriminative keys start expanding the keys with additional terms to produce new HDKs of increased key size. The authors use BM25 scores as a scoring model, which is similar to using TL+CTL(DFmax).
As pruning along the lines of Soffer et al. [SCC+01] in particular is done for a specific value of k, it is interesting to see how good results using the index pruned with k = 10 are for larger values of k. Tables 7.8 and 7.9 show the results for top-100 retrieval with pruned and unpruned lists. Even though proximity awareness cannot improve much on result quality, most runs with pruning are at least as effective as the unpruned runs, while saving one or two orders of magnitude in accesses, bytes transferred, and cost. The combination of length-limited lists and limited window size is again best, with a peak factor of more than 320 over the unpruned TL baseline at the same quality (TL+CL (1,000; accd ≥ 0.01)).
Configuration | P@100 | MAP@100 | #SA | #RA | #bytes(SA) | #bytes(RA)
TL | 0.37 | 0.13 | 42,584,605 | 434,233 | 681,353,680 | 3,473,864
TL+PXL | 0.39 | 0.14 | 44,450,513 | 394,498 | 711,208,208 | 3,155,984
TL+CL | 0.39 | 0.14 | 12,175,316 | 32,357 | 302,386,896 | 380,552
PXL | 0.27 | 0.09 | 867,095 | 2,925 | 13,873,520 | 23,400
TL+CL (accd ≥ 0.01) | 0.39 | 0.14 | 17,714,952 | 0 | 346,997,712 | 0
TL+CL (500) | 0.34 | 0.11 | 310,469 | 0 | 7,558,816 | 0
TL+CL (1,000) | 0.37 | 0.13 | 610,983 | 0 | 14,838,144 | 0
TL+CL (1,500) | 0.38 | 0.13 | 904,910 | 0 | 21,911,520 | 0
TL+CL (2,000) | 0.38 | 0.14 | 1,184,658 | 0 | 28,615,776 | 0
TL+CL (2,500) | 0.39 | 0.14 | 1,457,093 | 0 | 35,138,176 | 0
TL+CL (3,000) | 0.39 | 0.14 | 1,723,204 | 0 | 41,493,728 | 0
TL+CL (500; ε = 0.025) | 0.33 | 0.11 | 281,485 | 0 | 6,631,408 | 0
TL+CL (1,000; ε = 0.025) | 0.36 | 0.12 | 527,171 | 0 | 12,156,256 | 0
TL+CL (1,500; ε = 0.025) | 0.37 | 0.13 | 753,012 | 0 | 17,054,112 | 0
TL+CL (2,000; ε = 0.025) | 0.37 | 0.13 | 957,593 | 0 | 21,371,376 | 0
TL+CL (500; accd ≥ 0.01) | 0.34 | 0.12 | 290,968 | 0 | 6,934,784 | 0
TL+CL (1,000; accd ≥ 0.01) | 0.37 | 0.13 | 551,684 | 0 | 12,940,576 | 0
TL+CL (1,500; accd ≥ 0.01) | 0.38 | 0.13 | 802,538 | 0 | 18,638,752 | 0
TL+CL (2,000; accd ≥ 0.01) | 0.38 | 0.13 | 1,039,466 | 0 | 23,969,632 | 0
TL+CL (2,500; accd ≥ 0.01) | 0.38 | 0.13 | 1,261,124 | 0 | 28,907,200 | 0
TL+CL (3,000; accd ≥ 0.01) | 0.38 | 0.13 | 1,483,154 | 0 | 33,856,144 | 0
Table 7.8: Experimental results for top-100 retrieval with unpruned and pruned lists.
Configuration | Cost(100) | Cost(1,000)
TL | 1,028,740,080 | 4,155,217,680
TL+PXL | 1,026,806,608 | 3,867,192,208
TL+CL | 340,442,096 | 682,938,896
PXL | 16,213,520 | 37,273,520
TL+CL (accd ≥ 0.01) | 346,997,712 | 346,997,712
TL+CL (500) | 7,558,816 | 7,558,816
TL+CL (1,000) | 14,838,144 | 14,838,144
TL+CL (1,500) | 21,911,520 | 21,911,520
TL+CL (2,000) | 28,615,776 | 28,615,776
TL+CL (2,500) | 35,138,176 | 35,138,176
TL+CL (3,000) | 41,493,728 | 41,493,728
TL+CL (500; ε = 0.025) | 6,631,408 | 6,631,408
TL+CL (1,000; ε = 0.025) | 12,156,256 | 12,156,256
TL+CL (1,500; ε = 0.025) | 17,054,112 | 17,054,112
TL+CL (2,000; ε = 0.025) | 21,371,376 | 21,371,377
TL+CL (2,500; ε = 0.025) | 25,288,646 | 25,288,646
TL+CL (3,000; ε = 0.025) | 28,924,720 | 26,211,376
TL+CL (500; accd ≥ 0.01) | 6,934,784 | 6,934,784
TL+CL (1,000; accd ≥ 0.01) | 12,940,576 | 12,940,576
TL+CL (1,500; accd ≥ 0.01) | 18,638,752 | 18,638,752
TL+CL (2,000; accd ≥ 0.01) | 23,969,632 | 23,969,632
TL+CL (2,500; accd ≥ 0.01) | 28,907,200 | 28,907,200
TL+CL (3,000; accd ≥ 0.01) | 33,856,144 | 33,856,144
Table 7.9: Costs for top-100 retrieval with unpruned and pruned lists.
7.3.5 Comparison: TopX (RR-LAST Mode) on Unpruned Lists vs. Merge Join on Pruned Lists
Our experiments so far have shown that lists cut at a maximum length of 2,000 (when combined with window- or epsilon-based pruning even less) can retain the retrieval quality of unpruned index lists. So we might be able to save the overhead costs induced by the family of threshold algorithms.
[Figure 7.6 illustrates the merge-join architecture for the query {bike, trails, map}: the score-ordered term lists TL(bike), TL(trails), TL(map) with entries (d.docid, scoreBM25(d,t)) and the combined lists CL(bike, trails), CL(bike, map), CL(map, trails) with entries (d.docid, accd(ti, tj), scoreBM25(d,ti), scoreBM25(d,tj)) are pruned and reorganized into ascending docid order and then fed into a merge join that produces the top-k results in a heap.]
Figure 7.6: Example: query = {bike, trails, map}, merge join with processing strategy TL+CL using pruned term lists and combined lists.
The highly efficient top-k or dynamic pruning algorithms [AM06, FLN03] that are frequently applied for efficient query processing incur a non-negligible processing overhead for maintaining candidate lists and candidate score bounds, for mapping newly read index entries to a possibly existing partially read document using hash joins, and for regularly checking if the algorithm can stop. In our scenario with index lists that are pruned to a rather short maximal length, this processing overhead is not necessary since we will almost always read the complete lists anyway. Instead, it is sufficient to evaluate queries in a document-at-a-time fashion. Our merge-based processing architecture is depicted in Figure 7.6 for index lists relating to the example query terms bike, trails, and map, and consists of the following components:
1. After pruning index lists to a fixed maximal size (and, possibly, using a minimal score cutoff for combined lists), we re-sort each list in ascending order of docids and optionally compress it.
2. At query time, the n term and combined lists for the query are combined using an n-way merge join that combines entries for the same document and computes its score. The n-way merge join receives the n document-sorted lists as input and in each join step calculates the score for the next smallest not yet evaluated docid. If that score is higher than the current kth best score, the document is kept in a heap of candidate results (e.g., described on page 125 in [MRS08]), otherwise it is dropped as it cannot make it into the top-k results any more. For every list, the merge join keeps track of the position up to which the list has been read so far and iterates to the next item if the document in this list has just been evaluated. If the items of all lists have been read completely, the algorithm terminates. Note that we process queries in a disjunctive manner, i.e., docids that do not occur in every list can still qualify for the top-k results.
3. Once all index entries have been read, the content of the heap is returned.
Instead of maintaining a heap with the currently best k results, an even simpler implementation could keep all results as result candidates and sort them in the end; however, this would increase the memory footprint of the execution as not k, but all encountered documents and their scores need to be stored. Independent of the actual algorithm, processing a query with our pruned index lists has a guaranteed maximal abstract execution cost (i.e., the number of index entries read from disk during processing a query), so worst- and best-case runtime are very similar and basically depend only on the number of lists involved in the execution and the cutoff for list lengths.
This is a great advantage over using non-pruned term lists with algorithms for dynamic pruning and early stopping, which can read large and uncontrollable fractions of the index lists to compute the results, and may give arbitrarily bad results when stopped earlier [SSLM+09].
Our experiments use a server running the Microsoft Windows 2003 Enterprise 64-bit edition on a Dual Core AMD Opteron CPU with 2.6GHz and 32 GB RAM. Both TopX and the merge join-based approach have been executed in a Sun Java 1.6 VM that was allowed to use at most 4 GB RAM, although the real memory requirements are way below 4 GB. Index lists have been stored in an Oracle 10g DBMS. The baseline of the experiments has been computed with TopX with a batch size of 100,000 in RR-LAST mode, i.e., round-robin sequential accesses to the index lists in batches of 100,000 items and postponing the random accesses to the end of the query processing. Unlike the previous experiments where we used only abstract cost measures, we will now show that the abstract cost advantages translate into accelerated query processing by measuring real runtimes (in ms).
Table 7.10 shows average precision values and measured runtimes (in ms) for various TopX- and merge join-based runs. Please note that we encounter minor differences in result quality compared to the experiments carried out in the first part of this chapter. This is due to a modified policy for tie-breaking; in Table 7.10, documents with lower docids are given preference in case of equally scored documents. The ordering of the runs in terms of result quality and execution speed remains the same, however.
Run | k=10: P@k, t[ms] | k=30: P@k, t[ms] | k=50: P@k, t[ms] | k=70: P@k, t[ms] | k=100: P@k, t[ms]
TopX runs:
TL | 0.57, 5,220 | 0.49, 6,129 | 0.45, 6,868 | 0.42, 7,974 | 0.38, 8,827
TL+PXL | 0.61, 6,266 | 0.52, 8,835 | 0.48, 11,127 | 0.45, 10,363 | 0.41, 15,524
TL+CL | 0.61, 821 | 0.52, 1,350 | 0.48, 1,651 | 0.45, 1,955 | 0.41, 2,042
Merge Join runs:
TL (1,000) | 0.30, 35 | 0.24, 36 | 0.22, 35 | 0.20, 34 | 0.18, 35
TL (2,000) | 0.34, 68 | 0.27, 71 | 0.24, 64 | 0.22, 67 | 0.20, 67
TL (3,000) | 0.37, 101 | 0.29, 103 | 0.27, 100 | 0.25, 99 | 0.22, 100
TL+CL (1,000) | 0.60, 78 | 0.51, 80 | 0.46, 80 | 0.43, 81 | 0.39, 79
TL+CL (2,000) | 0.61, 156 | 0.51, 155 | 0.47, 153 | 0.44, 155 | 0.40, 158
TL+CL (3,000) | 0.61, 225 | 0.52, 230 | 0.47, 224 | 0.44, 223 | 0.40, 232
TL+CL (1,000; ε = 0.025) | 0.60, 79 | 0.51, 80 | 0.46, 80 | 0.43, 81 | 0.39, 81
TL+CL (2,000; ε = 0.025) | 0.61, 154 | 0.51, 153 | 0.47, 153 | 0.44, 155 | 0.40, 153
TL+CL (3,000; ε = 0.025) | 0.61, 223 | 0.52, 223 | 0.47, 225 | 0.44, 225 | 0.40, 224
TL+CL (1,000; accd ≥ 0.01) | 0.62, 70 | 0.51, 73 | 0.46, 70 | 0.42, 71 | 0.38, 76
TL+CL (2,000; accd ≥ 0.01) | 0.62, 137 | 0.51, 135 | 0.47, 134 | 0.44, 135 | 0.39, 134
TL+CL (3,000; accd ≥ 0.01) | 0.62, 193 | 0.51, 199 | 0.47, 191 | 0.44, 191 | 0.40, 191
Table 7.10: Comparison: TopX with unpruned lists vs. merge join on pruned lists.
TL+CL runs are faster than TL+PXL runs at similar precision values, and TL runs exhibit a decreased precision compared to TL+CL and TL+PXL runs. While run times for TopX runs usually grow with increasing k, run times of the merge join runs are independent of k: TopX can terminate the query evaluation early depending on the number of results; for merge joins, the lists are read completely, no matter how k is chosen. For the merge join implementation, run times are linearly proportional to the lengths of the read lists. The merge join implementation of the pruned TL+CL lists can keep up the excellent precision values of the TopX runs with unpruned lists.
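A compact sketch of the n-way merge join over docid-sorted, pruned lists described above (illustrative code, not taken from TopX or our prototype; for simplicity, the per-list scores are summed instead of being combined by the modified Büttcher scoring function):

import java.util.*;

// Simplified sketch of the disjunctive n-way merge join over docid-sorted, pruned index
// lists; names are illustrative. Each input list is an array of entries sorted by
// ascending docid; scores of entries for the same docid are aggregated.
public class NWayMergeJoin {

    record Posting(int docid, double score) {}
    record Result(int docid, double score) {}

    static List<Result> topK(List<Posting[]> lists, int k) {
        int n = lists.size();
        int[] pos = new int[n]; // current read position per list
        // min-heap holding the current top-k results, ordered by score
        PriorityQueue<Result> heap = new PriorityQueue<>(Comparator.comparingDouble(Result::score));
        while (true) {
            // next smallest not yet evaluated docid across all lists
            int current = Integer.MAX_VALUE;
            for (int i = 0; i < n; i++)
                if (pos[i] < lists.get(i).length)
                    current = Math.min(current, lists.get(i)[pos[i]].docid());
            if (current == Integer.MAX_VALUE) break; // all lists read completely
            // aggregate the document's score and advance the lists that contained it
            double score = 0.0;
            for (int i = 0; i < n; i++) {
                if (pos[i] < lists.get(i).length && lists.get(i)[pos[i]].docid() == current) {
                    score += lists.get(i)[pos[i]].score();
                    pos[i]++;
                }
            }
            // keep the document only if it can still make it into the top-k results
            if (heap.size() < k) heap.add(new Result(current, score));
            else if (score > heap.peek().score()) { heap.poll(); heap.add(new Result(current, score)); }
        }
        List<Result> results = new ArrayList<>(heap);
        results.sort((a, b) -> Double.compare(b.score(), a.score()));
        return results;
    }
}

Because every list is read to the end, the run time of such a merge join depends only on the number of lists and their bounded lengths, which is why the measured run times of the merge join runs in Table 7.10 are independent of k.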
Using a light-weight n-ary merge join in combination with pruned index lists sorted by docid, we achieve substantial performance gains. That way, we save much disk space and accelerate query processing by one to two orders of magnitude compared to an evaluation with TopX using unpruned lists.
7.3.6 Conclusion of the Experiments
We have presented novel algorithms and implementation techniques for efficient evaluation of top-k queries on text data with proximity-aware scoring. We have shown that our techniques can speed up evaluation by one or two orders of magnitude, trading in runtime for cheap disk space and maintaining the very high result quality (effectiveness) of proximity-aware scoring models. Furthermore, we have shown that the abstract cost advantages can be turned into substantial runtime benefits using a light-weight n-ary merge join in combination with pruned document-sorted index lists. The speed-up of one to two orders of magnitude compared to an evaluation with TopX in RR-LAST mode using unpruned lists can be confirmed, still providing the same excellent precision values and in addition saving much disk space.
7.4 Feasibility of Scoring Models for Top-k Query Processing
While some of the techniques presented in Chapter 2 demonstrate significant improvements in result quality (cf. Chapter 4), they do not consider the problem of how these scores can be efficiently implemented in a search engine. Usually, implementations therefore resort to enriching term index lists with position information (e.g., [YDS09a]) and compute proximity scores after having determined an initial set of documents with 'good' text scores (e.g., cf. Section 2.4.1). Orthogonal to this kind of work, we discuss the feasibility of scoring models for top-k query processing with early termination in the sense of whether they can be used for NRA-based query processing (cf. Section 6.1). Early termination means that the index lists do not have to be processed completely; reading can usually stop long before all index entries have been read. In particular, we want to explore whether the scoring models are suitable for use with the index model based on pair lists as presented in Section 7.3.1. Assessing this involves a judgment of whether we can build a queryload-independent index that allows us to precompute score bounds for result candidates during query processing (cf. Section 7.2.2) to enable early termination by means of tight score bounds.
7.4.1 Linear Combinations of Scoring Models
We start our discussion with the class of linear combinations of scoring models, which combine content and proximity score models.
Rasolofo and Savoy: Rasolofo and Savoy process queries in two steps: step one computes the k documents with the highest cscore values, step two re-ranks the documents from step one using proximity scores. One way to make Rasolofo and Savoy's approach feasible for top-k query processing works in analogy to the approach used for the modification of Büttcher's score (Section 7.2.2) and employs term and term pair lists to store the information needed for cscore and pscore computations, respectively. Rasolofo and Savoy use

score_{BM25}(d,q) = \sum_{t_i \in q} \frac{(k_1 + 1) \cdot tf(t_i,d)}{k \cdot [(1 - b) + b \cdot \frac{l_d}{avgdl}] + tf(t_i,d)} \cdot \max\{0, \log\frac{N - df(t_i)}{df(t_i)}\} \cdot qtf'(t_i), where qtf'(t_i) = \frac{qtf(t_i)}{k_3 + qtf(t_i)}.

A term list for term ti could keep an entry of the form (d.docid, scoreBM25(d,ti)/qtf'(ti)) for each document d with at least one occurrence of term ti.
The list is ordered by descending scoreBM25(d,ti)/qtf'(ti) values. Dividing scoreBM25(d,ti) by qtf'(ti) removes the query dependency from scoreBM25(d,ti). Additionally, the term list can maintain the idf^2(ti) score in order to compute qw(t_i) = idf^2(t_i) \cdot \frac{qtf(t_i)}{k_3 + qtf(t_i)}, which is required to compute pscore values. A term pair list for a term pair (ti, tj) can be used for pscore computations if the text window size dist is kept fixed. For each document with at least one occurrence of the term pair in the text window, it contains an entry of the form (d.docid, wd(ti, tj, dist)) and is sorted by descending wd(ti, tj, dist) values. During query processing, the BM25 score-related part in the term list is multiplied by qtf'(ti), which, for a given query, remains constant for all entries of ti's term list. The resulting cscore for ti is combined with the pscore part, which can be derived from wd(ti, tj, dist) and qw(ti), qw(tj), respectively.
Top-k query processing should definitely be applied to step one to allow early termination. Step two could use either random accesses to fetch proximity scores just for the documents from step one if the term pair index supports random accesses, or step two could scan the complete term pair lists, skipping documents not retrieved in step one. In the latter case it may be conceivable to order term pair lists by docid (best with a block structure and skip pointers) and perform a merge join of the document-sorted result lists from step one with the term pair lists.
Optionally, query processing could merge the two steps into one and process term and term pair lists together in a single step. Considering cscore and pscore at the same time, this approach would not re-rank just the top-k documents with the highest BM25 scores. Thus, this approach would omit the precomputation of the top-k documents with the highest BM25 scores. Like our top-k variant of Büttcher's score (cf. Section 7.2.2), this could enable faster termination through earlier, tighter score bounds.
Uematsu et al.: Adapting Uematsu et al.'s approach for top-k query processing is not easily possible: when it comes to computing score bounds for the pscore part, we need to know the number of sentences where all query terms co-occur, coocc(q,d). Building up lists with the number of co-occurrences for all possible query term combinations would solve that problem. As the query load is not known in advance, this would quickly render indexes prohibitively large. As a compromise, we may store d.docid and coocc(q,d) values in lists ordered by descending coocc value just for two-term queries or selected frequent queries from query logs. For the general case of n-term queries, however, the precomputation of pscore values is problematic as indexes are likely to end up being too large. To save on runtime, an approximation comparable to Tonellotto et al.'s approach (cf. Section 6.2.2) could be used: coocc(q,d) cannot be greater than the minimum of coocc({ti, tj},d) for all {ti, tj} ⊆ q, which could be used as an upper bound for coocc(q,d). One could generate term pair lists for (ti, tj) with (d.docid, coocc({ti, tj},d)) entries ordered by descending coocc({ti, tj},d) values.
For each term t, we build a term list (ordered by descending BM25 score) with (d.docid, scoreBM25(d,t)) entries in the same way as for Büttcher's approach (with a modified idf score) but with additional sentence-level posting lists: it may be necessary to use position lists to clarify the final coocc(q,d) value since coocc(q,d) is not decomposable into coocc({ti, tj},d) values.
Monz: Monz' approach cannot be straightforwardly incorporated into top-k query processing. The Lnu.ltc score (cscore) could be computed by means of an lnu-score ordered term list for each term t. That term list keeps the idf(t) value and a list of (d.docid, lnu(d,t)) tuples for each document that contains t. The ltc(t) value (which uses only idf and qtf values) can be computed before reading the first entry from the lnu list and factorized into the documents' Lnu.ltc scores. However, normalization by the maximum Lnu.ltc score achievable by any document in the collection is an issue: it requires knowledge about the top-1 result over the complete query and cannot yet be computed after reading only the first entry of each term list. Computing minimum matching spans is inherently query-dependent. Their computation requires knowledge about the positions of query terms and the subset of query terms that occur in each document. Hence, precomputing the pscore which builds upon minimum matching spans is problematic as we do not know the queryload in advance. For two-term queries, proximity scores (the product of the ssr and mtr features) can be stored in a term pair list. In principle, this is doable for n-term queries (n > 2) as well; however, the space requirements to store these lists quickly render this approach impractical. Still, precomputation could be done for selected, very frequent queries from a query log. Keeping posting lists with term position information for any (term, document) pair would be sufficient to compute minimum matching spans but does not support precomputing score bounds for the pscore part.
Tao and Zhai: Tao and Zhai combine a baseline content score, which can be either the KL-divergence or the BM25 score, with a proximity score. The content score part can be easily stored in term lists which allow top-k query processing with dynamic pruning. A term list for term t keeps an entry for each document d with at least one occurrence of term t. Term lists that use KL-divergence as cscore maintain entries of the form (d.docid, scoreKL(d,t)/qtf(t)), where

score_{KL}(d,t)/qtf(t) = \ln\left(1 + \frac{tf(t,d)}{\mu \cdot p(t|C)}\right) + \ln\frac{\mu}{l_d + \mu} and p(t|C) = \frac{ctf(t)}{l_C}.

Term lists that use BM25 scores as cscore follow the schema described for Rasolofo and Savoy's approach and contain (d.docid, scoreBM25(d,t)/qtf'(t)) entries. The lists are ordered by descending scoreKL(d,t)/qtf(t) and scoreBM25(d,t)/qtf'(t) values, respectively. The qtf and qtf' values are constant for a given query-term combination and are incorporated while scoring a document at query processing time.
Tao and Zhai propose a pscore of the general form π(d,q) = log(α + e^{−δ(d,q)}) that employs a selection of span-based and distance aggregation measures to populate δ(d,q). Span-based measures require position information and are inherently query-dependent, so we cannot precompute scores. To compute the Span value, we need knowledge about the maximum and minimum position of all query term occurrences in a document.
Building up a term list for each term with the minimum and maximum position of that term's occurrences in a document is not good enough. As long as there is just one missing dimension, the maximum and minimum position can still change (unless they are already 1 and ld, respectively); score bounds cannot be precomputed that way. Term pair lists can resolve this issue just for two-term queries and use the difference of maximum and minimum position. To compute the MinCover value, we face a similar problem, since the length of the shortest document part that covers each query term at least once has to be found. It seems that MinCover cannot be decomposed into term pair lists to perform early candidate pruning as position information is required for all query terms. The normalized versions of Span and MinCover share the same problems.
The situation is better for distance aggregation measures which can be represented by a term pair list for each term pair {ta, tb}. For each document di with at least one occurrence of both ta and tb, the term pair list for {ta, tb} contains an entry of the form (di.docid, mindist(ta, tb, di)), where mindist(t_a, t_b, d_i) = \min\{|a - b| : p_a(d_i) = t_a \wedge p_b(d_i) = t_b\}. Term pair lists may be sorted either by descending or ascending score since, as we will see in the following descriptions, we need both highj and lowj scores to compute score bounds, which is different from typical scenarios involving the NRA where lists are ordered by descending score. Therefore, it may be good to read term pair lists from both ends in order to decrease their high and increase their low values at the same time. If the list is sorted by descending score, highj is the score at the current scan position and lowj the lowest score available in the list, respectively. If the list is sorted by ascending score, highj is the highest score available in the list and lowj the score at the current scan position, respectively.
We reuse the notation from Section 6.1 to compute a document di's bestscore and worstscore values (c and p indexes denote the cscore and pscore component, respectively):

bestscore(d_i) = \frac{1}{2} \cdot bestscore_c(d_i) + \frac{1}{2} \cdot bestscore_p(d_i) and worstscore(d_i) = \frac{1}{2} \cdot worstscore_c(d_i) + \frac{1}{2} \cdot worstscore_p(d_i).

For a given query, Lcscore is the set of term lists and Lpscore is the set of term pair lists, both with remaining unread entries. Sc(di) ⊆ Lcscore and Sp(di) ⊆ Lpscore denote the sets of term and term pair lists where di has been seen, whereas S̄c(di) = Lcscore − Sc(di) and S̄p(di) = Lpscore − Sp(di) represent the sets of not completely processed term and term pair lists where di has not yet been encountered.
The minimum pair distance (MinDist) is the smallest distance over all query term pairs in document di with MinDist(d_i,q) = \min_{t_a,t_b \in T_{d_i}(P_{d_i}(q)),\, t_a \neq t_b} \{mindist(t_a, t_b, d_i)\}. Hence,

\pi(d_i,q) = \log(\alpha + e^{-\delta(d_i,q)}) = \log(\alpha + e^{-\min_{t_a,t_b \in T_{d_i}(P_{d_i}(q)),\, t_a \neq t_b} \{mindist(t_a,t_b,d_i)\}}).

The corresponding worstscore and bestscore for the proximity part π(di,q) with MinDist can be calculated as follows:

worstscore_p(d_i) = \log(\alpha + e^{-\min(\max_{L_j \in \bar{S}_p(d_i)}(high_j),\ \min_{L_j \in S_p(d_i)}(mindist(t_a,t_b,d_i)))}) and
bestscore_p(d_i) = \log(\alpha + e^{-\min(\min_{L_j \in \bar{S}_p(d_i)}(low_j),\ \min_{L_j \in S_p(d_i)}(mindist(t_a,t_b,d_i)))}).

As worstscore_p(d_i) represents the minimally possible proximity score for d_i, the exponent of the exponential function must become as negative as possible.
Consequently, we use the maximum among all highj for Lj ∈ S̄p(di), as higher values render the argument of −min larger and the exponent more negative. As bestscore_p(d_i) represents the maximally possible proximity score for d_i, the exponent of the exponential function must become as little negative as possible. Hence, we use the minimum of all lowj values, as lower values render the argument of −min smaller and the exponent less negative. \min_{L_j \in S_p(d_i)}(mindist(t_a, t_b, d_i)) represents MinDist in the dimensions where di has already been encountered and is used both for worstscore and bestscore computations. The exponent starts with −min as −δ(di,q) = −MinDist(di,q).
The average pair distance (AvgDist) is the average distance over all query term pairs in document di with AvgDist(d_i,q) = \frac{2}{n(n-1)} \sum_{t_a,t_b \in T_{d_i}(P_{d_i}(q)),\, t_a \neq t_b} mindist(t_a, t_b, d_i), where n is the number of unique matched query terms in di. Hence,

\pi(d_i,q) = \log(\alpha + e^{-\delta(d_i,q)}) = \log(\alpha + e^{-\frac{2}{n(n-1)} \sum_{t_a,t_b \in T_{d_i}(P_{d_i}(q)),\, t_a \neq t_b} mindist(t_a,t_b,d_i)}).

\frac{2}{n(n-1)} corresponds to \frac{1}{\binom{n}{2}}, which is the reciprocal of the number of term pairs and is used to build the average distance over all query term pairs. The worstscore and bestscore for the proximity part π(di,q) with AvgDist can be calculated as follows:

worstscore_p(d_i) = \log(\alpha + e^{-\max_{A \subseteq \bar{S}_p(d_i)}[f(A,S_p(d_i)) \cdot (\sum_{L_j \in A} high_j + \sum_{L_j \in S_p(d_i)} mindist(t_a,t_b,d_i))]}) and
bestscore_p(d_i) = \log(\alpha + e^{-\min_{A \subseteq \bar{S}_p(d_i)}[f(A,S_p(d_i)) \cdot (\sum_{L_j \in A} low_j + \sum_{L_j \in S_p(d_i)} mindist(t_a,t_b,d_i))]}),

where f(A, S_p(d_i)) = \frac{2}{(|A| + |S_p(d_i)|)(|A| + |S_p(d_i)| - 1)} and (|A| + |S_p(d_i)|) represents the true value for n in di. As worstscore_p(d_i) represents the minimally possible proximity score for d_i, we have to select the subset A of term pair lists where di has not been encountered yet (i.e., A ⊆ S̄p(di)) such that the exponent of the exponential function becomes as negative as possible. We use highj for Lj ∈ A, as higher values render the argument of −max larger and the exponent more negative. In contrast, bestscore_p(d_i) represents the maximally possible proximity score for d_i: we have to select the subset A of term pair lists where di has not been encountered yet (i.e., A ⊆ S̄p(di)) such that the exponent of the exponential function is as little negative as possible. We use lowj for Lj ∈ A, as lower values render the argument of −min smaller and the exponent less negative. Both for worstscore_p(d_i) and bestscore_p(d_i), \sum_{L_j \in S_p(d_i)} mindist(t_a, t_b, d_i) represents the contribution of the proximity score dimensions where di has already been seen.
The maximum pair distance (MaxDist) is the maximum distance over all query term pairs in document di with MaxDist(d_i,q) = \max_{t_a,t_b \in T_{d_i}(P_{d_i}(q)),\, t_a \neq t_b} \{mindist(t_a, t_b, d_i)\}. Hence,

\pi(d_i,q) = \log(\alpha + e^{-\delta(d_i,q)}) = \log(\alpha + e^{-\max_{t_a,t_b \in T_{d_i}(P_{d_i}(q)),\, t_a \neq t_b} \{mindist(t_a,t_b,d_i)\}}).

The worstscore and bestscore for the proximity part π(di,q) using MaxDist can be calculated as follows:

worstscore_p(d_i) = \log(\alpha + e^{-\max(\max_{L_j \in \bar{S}_p(d_i)}(high_j),\ \max_{L_j \in S_p(d_i)}(mindist(t_a,t_b,d_i)))}) and
bestscore_p(d_i) = \log(\alpha + e^{-\max(\min_{L_j \in \bar{S}_p(d_i)}(low_j),\ \max_{L_j \in S_p(d_i)}(mindist(t_a,t_b,d_i)))}).

As long as S_p(d_i) = ∅, worstscore_p(d_i) = \log(\alpha + e^{-l_{d_i}}) for all distance aggregation measures. This takes account of the case that there is only one query term match in a document, for which MinDist, MaxDist and AvgDist are all defined as the length of the document l_{d_i}.
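To make the bound computation above concrete, a minimal sketch for the MinDist aggregation (illustrative code with invented names, not part of our framework; it assumes the mindist values already seen for a document and the high_j and low_j values of the remaining pair lists are available, and it handles the single-match case via the document length):

import java.util.List;

// Illustrative sketch: bounds for the proximity part pi(d,q) = log(alpha + exp(-MinDist(d,q)))
// as defined above for the MinDist aggregation measure.
public class MinDistBounds {

    static double pi(double alpha, double delta) {
        return Math.log(alpha + Math.exp(-delta));
    }

    // worstscore_p: exponent as negative as possible; the unseen dimensions are represented
    // by the maximum of their high_j values, as in the formula above.
    static double worstScoreP(double alpha, List<Double> seenMinDists,
                              List<Double> highUnseen, double docLength) {
        if (seenMinDists.isEmpty()) return pi(alpha, docLength); // single query term match
        double seenMin = seenMinDists.stream().mapToDouble(Double::doubleValue).min().getAsDouble();
        double unseenMax = highUnseen.stream().mapToDouble(Double::doubleValue).max().orElse(seenMin);
        return pi(alpha, Math.min(unseenMax, seenMin));
    }

    // bestscore_p: exponent as little negative as possible; the unseen dimensions are
    // represented by the minimum of their low_j values, as in the formula above.
    static double bestScoreP(double alpha, List<Double> seenMinDists,
                             List<Double> lowUnseen, double docLength) {
        double seenMin = seenMinDists.stream().mapToDouble(Double::doubleValue).min().orElse(docLength);
        double unseenMin = lowUnseen.stream().mapToDouble(Double::doubleValue).min().orElse(seenMin);
        return pi(alpha, Math.min(unseenMin, seenMin));
    }
}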
To safely stop, the following inequality must be fulfilled for not yet seen (virtual) documents:
$$\tfrac{1}{2}\cdot\sum_{L_j\in L_{cscore}} qtf'(t_j)\cdot high_j+\tfrac{1}{2}\cdot\log(\alpha+e^{-\min_{L_j\in L_{pscore}}(low_j)})<\text{min-}k.$$
The left side of the inequality represents the bestscore for not yet seen documents, the right side the smallest worstscore of the temporary top-k results. This holds for all presented distance aggregation measures. For AvgDist and MaxDist, $bestscore_p$ becomes largest if the virtual document is contained in only one term pair list, the one with the lowest $low_j$ score. For MaxDist, we can then safely remove the maximum from the exponent (as there is only one dimension). For MinDist, $bestscore_p$ becomes largest if the virtual document is contained in the term pair list with the lowest $low_j$ score. The only difference to AvgDist and MaxDist is that it may also occur in other term pair lists; however, this would not affect the $bestscore_p$ bound of the virtual document.
7.4.2 Integrated Score Models
This subsection discusses the feasibility of selected integrated score models for top-k query processing.
De Kretser and Moffat: De Kretser and Moffat describe two algorithms to compute a ranking of documents. As the first algorithm follows a greedy approach, we focus on the second. The second algorithm uses the maximum score among the scores at positions of query term occurrences in $d$ as document score. The algorithm could be implemented by means of lists for ordered term pairs $(t_i,t_j)$, $PL(t_i,t_j)$. They consist of entries of the form $(d.docid, x, \sum_{l\in P_d(t_j)} c'_{t_j}(x,l))$, ordered by descending $\sum_{l\in P_d(t_j)} c'_{t_j}(x,l)$ values, where $x$ is a position with a term occurrence of term $t_i$, i.e., $x\in P_d(t_i)$. $c'_{t_j}(x,l)$ equals $c_{t_j}(x,l)/qtf(t_j)$ as the $qtf(t_j)$ values from the height component in $c_{t_j}(x,l)$ are only known at query processing time. Evaluating a query $q$ processes all $PL(t_i,t_j)$ indexes where $t_i,t_j\in q$. In this setup, $bestscore(d,q)$ equals $\max_{x\in P_d(q)} bestscore(d,q,x)$, where $bestscore(d,q,x)$ is the highest score achievable at a query term position $x$. Note that this approach may not be usable in practice due to blown-up indexes: every document generates not only one entry per term pair $(t_i,t_j)$, but one entry per occurrence of $t_i$ in $d$. In addition, as the term pairs are ordered, we need two term pair lists per term pair.
Song et al.: Song et al. partition documents into groups of subsequent query term occurrences, so-called espans. Espans (the number of query terms in them and their density, respectively) do not seem to be representable by term or term pair lists as they highly depend on the positions of all query term occurrences in a document. The solution to the problem of determining bestscore bounds for documents could follow an approach similar to the one we sketched in Section 7.2.2 for Büttcher et al.'s score. Again, everything boils down to a combinatorial problem where we construct hypothetical documents by means of $tf(t,d)$ scores that maximize the document score, which depends on spans. Compared to the solution for Büttcher et al.'s score, this problem seems to be even more difficult as the document additionally has to be split into espans.
Mishne and de Rijke: Mishne and de Rijke's scoring model follows an "everything-is-a-phrase" approach, which means that every term-level n-gram of an ordered query forms a phrase. Proximity terms relax phrases to term set occurrences in a text window.
All these approaches share the problem that queries are not length-limited and consequently phrases and proximity terms are not length-limited either. While the idf values of phrases could be estimated by aggregating idf values stored as meta-information per term list, we do not see a way to cast the scoring model into term or term pair lists without keeping and reading posting lists with position information of individual terms, which would prevent precomputation of score bounds. Position lists are required to assess whether phrase terms occur adjacent to each other or in a text window, respectively, and not only to clarify the final score value. Like for Uematsu et al., an approximation comparable to Tonellotto et al. (cf. Section 6.2.2) could be used to compute an upper bound for the tf values of a phrase $p$: $tf(p,d)$ cannot be greater than the minimum of $tf(t,d)$ over all single terms $t$ in $p$. The accuracy of the upper bound could be improved by using term pair lists instead. This would alleviate the problem, but not solve it.
7.4.3 Language Models with Proximity Components
This subsection discusses the feasibility of two language models with proximity components for top-k query processing.
Lv and Zhai: Lv and Zhai's approach can be adapted to top-k query processing with early termination. Ranking documents comes in three variants which all aggregate scores at document positions. The score at position $i$ in $d$ is defined as
$$S(d,q,i)=-\sum_{t\in V} p(t|q)\cdot\log\frac{p(t|q)}{p(t|d,i)}=-\sum_{t\in V} p(t|q)\cdot\log p(t|q)+\sum_{t\in V} p(t|q)\cdot\log p(t|d,i)$$
$$=-\sum_{t\in V}\frac{qtf(t)}{|q|}\cdot\log\frac{qtf(t)}{|q|}+\sum_{t\in V} p(t|q)\cdot\log p(t|d,i)\ \propto\ \sum_{t\in V} p(t|q)\cdot\log p(t|d,i)=\sum_{t\in V}\frac{qtf(t)}{|q|}\cdot\log p(t|d,i)=\sum_{t\in q}\frac{qtf(t)}{|q|}\cdot\log p(t|d,i).$$
As $-\sum_{t\in V}\frac{qtf(t)}{|q|}\cdot\log\frac{qtf(t)}{|q|}$ is constant for all documents given a query $q$, it can be omitted without influencing the document ranking. We can further restrict the summation to query terms, since non-query terms have no influence on $S(d,q,i)$ because their $qtf$ value is always 0.
The positional language model (in the non-smoothed version) at position $i$ in document $d$ is defined as $p(t|d,i)=\frac{c'(t,i)}{\sum_{t'\in V} c'(t',i)}$, where $c'(t,i)=\sum_{j=1}^{l_d} c(t,j)\cdot k(i,j)$. Hence, $S(d,q,i)\propto\sum_{t\in q}\frac{qtf(t)}{|q|}\cdot\log\frac{c'(t,i)}{\sum_{t'\in V} c'(t',i)}$. Given a fixed kernel and spread, it is possible to store for every term $t$ a term list with entries of the form $(d.docid, d.i, \log\frac{c'(t,i)}{\sum_{t'\in V} c'(t',i)})$, where $d.i$ is the scored position in document $d$, ordered by descending $\log\frac{c'(t,i)}{\sum_{t'\in V} c'(t',i)}$ values. Our solution for de Kretser and Moffat's approach described in Section 7.4.2 also assumes a fixed kernel and spread, but uses ordered term pair lists instead of term lists.
The Jelinek-Mercer smoothed variant of the positional language model is defined as
$$p_{JM}(t|d,i)=(1-\lambda)p(t|d,i)+\lambda p(t|C)=(1-\lambda)\frac{c'(t,i)}{\sum_{t'\in V} c'(t',i)}+\lambda\frac{ctf(t)}{l_C}.$$
Hence, $S(d,q,i)\propto\sum_{t\in q}\frac{qtf(t)}{|q|}\cdot\log[(1-\lambda)\frac{c'(t,i)}{\sum_{t'\in V} c'(t',i)}+\lambda\frac{ctf(t)}{l_C}]$. Given a fixed kernel, spread, and weight $\lambda$, it is possible to store for every term $t$ a term list with entries of the form $(d.docid, d.i, \log[(1-\lambda)\frac{c'(t,i)}{\sum_{t'\in V} c'(t',i)}+\lambda\frac{ctf(t)}{l_C}])$, ordered by descending score values, with $d.i$ being the scored position in document $d$. $\frac{qtf(t)}{|q|}$ is only known at query processing time and is factored into the score by simply multiplying the indexed scores.
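To make the per-position entries concrete, the following sketch computes the (non-smoothed) positional language model scores $\log p(t|d,i)$ for one document. It assumes a Gaussian kernel $k(i,j)=e^{-(i-j)^2/(2\sigma^2)}$, which is among the kernels considered by Lv and Zhai, and hypothetical helper names; sorting the resulting entries per term by descending score would yield the term lists described above.

    import math
    from collections import defaultdict

    def positional_lm_entries(doc_terms, sigma):
        # doc_terms: the document as a list of terms, list index = position
        # returns {term: [(position, log p(t|d,i)), ...]} for the non-smoothed PLM
        kernel = lambda i, j: math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
        entries = defaultdict(list)
        for i in range(len(doc_terms)):
            # c'(t, i) = sum_j c(t, j) * k(i, j); c(t, j) = 1 iff t occurs at position j
            c_prime = defaultdict(float)
            for j, t in enumerate(doc_terms):
                c_prime[t] += kernel(i, j)
            z = sum(c_prime.values())  # sum over all vocabulary terms at position i
            for t, value in c_prime.items():
                entries[t].append((i, math.log(value / z)))
        return entries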
The positional language model with Dirichlet prior smoothing is defined as
$$p_{DP}(t|d,i)=\frac{c'(t,i)+\mu p(t|C)}{\sum_{t'\in V} c'(t',i)+\mu}=\frac{c'(t,i)+\mu\frac{ctf(t)}{l_C}}{\sum_{t'\in V} c'(t',i)+\mu}=\frac{\frac{c'(t,i)\cdot l_C+\mu\cdot ctf(t)}{l_C}}{\sum_{t'\in V} c'(t',i)+\mu}=\frac{c'(t,i)\cdot l_C+\mu\cdot ctf(t)}{l_C\cdot(\sum_{t'\in V} c'(t',i)+\mu)}.$$
Hence, $S(d,q,i)\propto\sum_{t\in q}\frac{qtf(t)}{|q|}\cdot\log\frac{c'(t,i)\cdot l_C+\mu\cdot ctf(t)}{l_C\cdot(\sum_{t'\in V} c'(t',i)+\mu)}$. Given a fixed kernel and spread, we propose to store term lists for every term $t$ with entries of the form $(d.docid, d.i, \log\frac{c'(t,i)\cdot l_C+\mu\cdot ctf(t)}{l_C\cdot(\sum_{t'\in V} c'(t',i)+\mu)})$, where $d.i$ is the scored position in document $d$, ordered by descending score values.
In the following, we elaborate on bestscore and worstscore bounds for documents when Dirichlet prior smoothing is used, which provides the best results in the studies of Lv and Zhai. We detail score bounds for each of the three ranking options suggested by Lv and Zhai. While $S(d,i)$ denotes the set of term lists where an entry for document $d$ at position $i$ has been encountered, $\bar{S}(d,i)$ represents the set of term lists which still contain unread entries and where $d$ has not been seen yet. We define $f(d,i,t)=\log\frac{c'(t,i)\cdot l_C+\mu\cdot ctf(t)}{l_C\cdot(\sum_{t'\in V} c'(t',i)+\mu)}$.
If we score all documents by the best position in that document, the score bounds are
$$worstscore(d)=\max_{i\in\{1,\ldots,l_d\}}\{0,\ \sum_{L_j\in S(d,i)}\tfrac{qtf(t_j)}{|q|} f(d,i,t_j)\}$$
and
$$bestscore(d)=\max_{i\in\{1,\ldots,l_d\}}\{\sum_{L_j\in S(d,i)}\tfrac{qtf(t_j)}{|q|} f(d,i,t_j)+\sum_{L_j\in\bar{S}(d,i)}\tfrac{qtf(t_j)}{|q|} high_j\}.$$
If we score all documents by the average of the best $k$ positions in that document, we get
$$worstscore(d)=\tfrac{1}{k}\cdot\sum_{i\,\in\,\text{top-}k\text{ of }S(d,q,\cdot)\text{ worstscores}}\max\{0,\ \sum_{L_j\in S(d,i)}\tfrac{qtf(t_j)}{|q|} f(d,i,t_j)\}$$
and
$$bestscore(d)=\tfrac{1}{k}\cdot\sum_{i\,\in\,\text{top-}k\text{ of }S(d,q,\cdot)\text{ bestscores}}\left(\sum_{L_j\in S(d,i)}\tfrac{qtf(t_j)}{|q|} f(d,i,t_j)+\sum_{L_j\in\bar{S}(d,i)}\tfrac{qtf(t_j)}{|q|}\cdot high_j\right).$$
If we score all documents using a weighted score based on various spreads $\beta_\sigma$ with $\sigma\in R$, i.e., $S(d,q)=\sum_{\sigma\in R}\beta_\sigma\cdot\max_{i\in\{1,\ldots,l_d\}}\{S_\sigma(d,q,i)\}$, we obtain
$$worstscore(d)=\max_{i\in\{1,\ldots,l_d\}}\{0,\ \sum_{\sigma\in R}\beta_\sigma\cdot\sum_{L_j\in S(d,i)}\tfrac{qtf(t_j)}{|q|} f(d,i,t_j)\}$$
and
$$bestscore(d)=\max_{i\in\{1,\ldots,l_d\}}\{\sum_{\sigma\in R}\beta_\sigma\cdot(\sum_{L_j\in S(d,i)}\tfrac{qtf(t_j)}{|q|} f(d,i,t_j)+\sum_{L_j\in\bar{S}(d,i)}\tfrac{qtf(t_j)}{|q|} high_j)\}.$$
To safely stop, the following inequality must be fulfilled for not yet seen (virtual) documents and holds for every score variant: $\sum_{L_j\in L}\frac{qtf(t_j)}{|q|} high_j<\text{min-}k$, where $L$ represents the set of lists with remaining unread entries. The left part of the inequality represents the bestscore of not yet seen documents. As a virtual document has not been encountered in any dimension, for every bestscore computation we can ignore the part which sums up contributions over $L_j\in S(d,i)$. Instead we expect to see virtual documents in all lists with remaining unread entries. When scoring documents by the best position, the virtual document $d$ has the following bestscore: $bestscore(d)=\max_{i\in\{1,\ldots,l_d\}}\{\sum_{L_j\in L}\frac{qtf(t_j)}{|q|} high_j\}$. If we assume that all $high_j$ point to the same $(d,i)$ pair, we obtain the highest possible bestscore for $d$, which corresponds to the left side of the inequality. When scoring documents by the average of the best $k$ positions in that document, the virtual document $d$ has the following bestscore: $bestscore(d)=\frac{1}{k}\cdot\sum_{i\,\in\,\text{top-}k\text{ of bestscores at positions in }d}\sum_{L_j\in\bar{S}(d,i)}\frac{qtf(t_j)}{|q|}\cdot high_j$.
If we assume that all $high_j$ point to the same $(d,i)$ pair (with $(d,i)$ as top-1 bestscore) and the remaining $k-1$ bestscore positions in $d$ have the same bestscore as $(d,i)$, we obtain the highest possible bestscore for $d$. Hence, $bestscore(d)=\frac{1}{k}\cdot k\cdot\sum_{L_j\in L}\frac{qtf(t_j)}{|q|}\cdot high_j$, which corresponds to the left side of the inequality. When scoring documents using a weighted score based on various spreads $\beta_\sigma$ with $\sigma\in R$, we obtain $bestscore(d)=\max_{i\in\{1,\ldots,l_d\}}\{\sum_{\sigma\in R}\beta_\sigma\cdot(\sum_{L_j\in L}\frac{qtf(t_j)}{|q|} high_j)\}$. With $\sum_{\sigma\in R}\beta_\sigma=1$ and assuming again that all $high_j$ point to the same $(d,i)$ entry, we obtain the left side of the inequality.
Zhao and Yun: We can cast Zhao and Yun's retrieval model into index lists to allow top-k query processing with early termination. To this end, we first transform the score into indexable components:
$$score(d,q)=\sum_{t_i\in q,\ tf(t_i,d)>0} p(t_i|\hat{\theta}_q)\log\frac{p_s(t_i|d,u)}{\alpha_d\cdot p(t_i|C)}+\log\alpha_d$$
$$=\sum_{t_i\in q,\ tf(t_i,d)>0}\frac{qtf(t_i)}{|q|}\log\left(\frac{\frac{tf(t_i,d)+\lambda Prox_B(t_i)+\mu\cdot\frac{ctf(t_i)}{l_C}}{l_d+\sum_{i=1}^{|V|}\lambda Prox_B(t_i)+\mu}}{\frac{\mu}{l_d+\sum_{i=1}^{|V|}\lambda Prox_B(t_i)+\mu}\cdot\frac{ctf(t_i)}{l_C}}\right)+\log\alpha_d$$
$$=\sum_{t_i\in q,\ tf(t_i,d)>0}\frac{qtf(t_i)}{|q|}\log\left(\frac{l_C\cdot(tf(t_i,d)+\lambda Prox_B(t_i)+\mu\cdot\frac{ctf(t_i)}{l_C})}{\mu\cdot ctf(t_i)}\right)+\log\frac{\mu}{l_d+\sum_{i=1}^{|V|}\lambda Prox_B(t_i)+\mu}$$
$$\approx\sum_{t_i\in q,\ tf(t_i,d)>0}\frac{qtf(t_i)}{|q|}\log\left(\frac{tf(t_i,d)\cdot l_C}{\mu\cdot ctf(t_i)}+\frac{\lambda Prox_B(t_i)\cdot l_C}{\mu\cdot ctf(t_i)}+1\right)+\log\frac{\mu}{l_d+\sum_{t\in q}\lambda Prox_B(t)+\mu}.$$
$\log\frac{\mu}{l_d+\sum_{i=1}^{|V|}\lambda Prox_B(t_i)+\mu}\approx\log\frac{\mu}{l_d+\sum_{t\in q}\lambda Prox_B(t)+\mu}$ as $Prox_B(t)$ becomes very small for non-query terms in $V$.
For each term $t_i$, we maintain a term list which keeps entries of the form $(d.docid, \frac{tf(t_i,d)\cdot l_C}{\mu\cdot ctf(t_i)})$, ordered by descending $\frac{tf(t_i,d)\cdot l_C}{\mu\cdot ctf(t_i)}$ values. For each term pair $\{t_i,t_j\}$, like for Tao and Zhai's scoring models, we keep a list with $(d.docid, mindist(t_i,t_j,d))$ entries. Analogously to the approach we proposed for Tao and Zhai's retrieval model, term pair lists may be sorted either by descending or by ascending score since, as we will see in the following descriptions, we need both $high_j$ and $low_j$ scores to compute score bounds, which is different from typical scenarios involving the NRA where lists are ordered by descending score. If a list is sorted by descending score, $high_j$ is the score at the current scan position and $low_j$ the lowest score available in the list; if it is sorted by ascending score, $high_j$ is the highest score available in the list and $low_j$ the score at the current scan position.
For a given query, $L_{cscore}$ is the set of term lists. $L_{pscore}(t_i)$ stands for the set of not completely read term pair lists for $t_i$ and a different query term. $R_p(d,t_i)$ denotes the set of completely read term pair lists where $d$ was not encountered, one partner term is $t_i$ and the other one a different query term. While $R_p(t_i)$ denotes the set of completely read term pair lists, $\bar{R}_p(t_i)$ denotes the set of not yet completely read term pair lists, where one partner term is $t_i$ and the other one a different query term. $S_p(d,t_i)$ represents the set of term pair lists for $t_i$ and a different query term where $d$ has been encountered. $\bar{S}_p(d,t_i)$ is the set of not completely read term pair lists for $t_i$ and a different query term where $d$ has not yet been encountered.
If we use minimum distance as proximate centrality, i.e., $Prox_{MinDist}$, we obtain the following score bounds for the proximity component in the non-$\alpha_d$ part:
$$worstscore_p(t_i,d)=\frac{\lambda x^{-\min(l_d,\ \min_{L_{ij}\in S_p(d,t_i)}(mindist(t_i,t_j,d)))}\cdot l_C}{\mu\cdot ctf(t_i)}$$
and
$$bestscore_p(t_i,d)=\frac{\lambda x^{-\min(l_d,\ \min_{L_{ij}\in S_p(d,t_i)}(mindist(t_i,t_j,d)),\ \min_{L_{ij}\in\bar{S}_p(d,t_i)}(low_{ij}))}\cdot l_C}{\mu\cdot ctf(t_i)}.$$
$worstscore_p(t_i,d)$ gets smallest if the exponent of $x$ is as negative as possible. This is the case if $d$ is not contained in any list $L_{ij}\in\bar{S}_p(d,t_i)$, which would result in $Dis(t_i,t_j,d)=l_d$. For those lists where $d$ has been encountered we have to set $mindist(t_i,t_j,d)$ as distance. $bestscore_p(t_i,d)$ gets largest if the exponent of $x$ is as little negative as possible. To this end, we expect to see $d$ in the list $L_{ij}$ from $\bar{S}_p(d,t_i)$ having the smallest $low_{ij}$ value. For $worstscore_{p,\alpha_d}(q)$ and $bestscore_{p,\alpha_d}(q)$, the actions to take to minimize and maximize the score, respectively, are switched since $x$ does not stand in the numerator, but in the denominator. Consequently, for $\alpha_d$, we obtain the following score bounds:
$$worstscore_{p,\alpha_d}(q)=\frac{\mu}{l_d+\sum_{t_i\in q}\lambda x^{-\min(l_d,\ \min_{L_{ij}\in S_p(d,t_i)}(mindist(t_i,t_j,d)),\ \min_{L_{ij}\in\bar{S}_p(d,t_i)}(low_{ij}))}+\mu}$$
and
$$bestscore_{p,\alpha_d}(q)=\frac{\mu}{l_d+\sum_{t_i\in q}\lambda x^{-\min(l_d,\ \min_{L_{ij}\in S_p(d,t_i)}(mindist(t_i,t_j,d)))}+\mu}.$$
To safely stop, the following inequality must be fulfilled for not yet seen (virtual) documents:
$$\sum_{L_i\in L_{cscore}}\frac{qtf(t_i)}{|q|}\cdot\log\left(high_i+\frac{\lambda x^{-\min(l_d,\ \min_{L_{ij}\in L_{pscore}(t_i)}(low_{ij}))}\cdot l_C}{\mu\cdot ctf(t_i)}+1\right)+\log\frac{\mu}{l_d+\sum_{t_i\in q}\lambda x^{-l_d}+\mu}<\text{min-}k.$$
$high_i$ is the highest score in the term list of $t_i$, which becomes 0 if the list has been read completely. As the left side of the inequality is to be maximized, we expect for the non-$\alpha_d$ part that we see $d$ with the lowest possible $low_{ij}$ score. If all term pair lists have been read, we have to use $l_d$. For the $\alpha_d$ part, we expect $d$ not to be seen in any of the remaining term pair lists and hence employ $l_d$.
If we use average distance as proximate centrality, i.e., $Prox_{AvgDist}$, we get the following score bounds for the proximity component in the non-$\alpha_d$ part:
$$worstscore_p(t_i,d)=\frac{\lambda x^{-\max_{A\subseteq\bar{S}_p(d,t_i)}[f(S_p(d,t_i),A)\cdot(g(S_p(d,t_i))+\sum_{L_{ij}\in A} high_{ij}+\sum_{L_{ij}\in R_p(d,t_i)} l_d)]}\cdot l_C}{\mu\cdot ctf(t_i)}$$
and
$$bestscore_p(t_i,d)=\frac{\lambda x^{-\min_{A\subseteq\bar{S}_p(d,t_i)}[f(S_p(d,t_i),A)\cdot(g(S_p(d,t_i))+\sum_{L_{ij}\in A} low_{ij}+\sum_{L_{ij}\in R_p(d,t_i)} l_d)]}\cdot l_C}{\mu\cdot ctf(t_i)},$$
where $f(S_p(d,t_i),A)=\frac{1}{|S_p(d,t_i)|+|A|-1}$ and $g(S_p(d,t_i))=\sum_{L_{ij}\in S_p(d,t_i)} mindist(t_i,t_j,d)$. For the computation of $worstscore_p(t_i,d)$, we again aim at making the exponent of $x$ as negative as possible. To this end, we maximize the bracketed term over all subsets $A$ of $\bar{S}_p(d,t_i)$; we expect $d$ to be encountered in all lists in $A$ with the highest possible value $high_{ij}$. Hence, $f(S_p(d,t_i),A)=\frac{1}{n-1}$ where $n=|S_p(d,t_i)|+|A|$. For $bestscore_p(t_i,d)$, we make the exponent of $x$ as little negative as possible. Therefore, we expect $d$ to be encountered in all lists in $A$ with the lowest possible value $low_{ij}$. For $\alpha_d$, we obtain the following score bounds:
$$worstscore_{p,\alpha_d}(q)=\frac{\mu}{l_d+\sum_{t_i\in q}\lambda x^{-\min_{A\subseteq\bar{S}_p(d,t_i)}[f(S_p(d,t_i),A)\cdot(g(S_p(d,t_i))+\sum_{L_{ij}\in A} low_{ij}+\sum_{L_{ij}\in R_p(d,t_i)} l_d)]}+\mu}$$
and
$$bestscore_{p,\alpha_d}(q)=\frac{\mu}{l_d+\sum_{t_i\in q}\lambda x^{-\max_{A\subseteq\bar{S}_p(d,t_i)}[f(S_p(d,t_i),A)\cdot(g(S_p(d,t_i))+\sum_{L_{ij}\in A} high_{ij}+\sum_{L_{ij}\in R_p(d,t_i)} l_d)]}+\mu}.$$
$worstscore_{p,\alpha_d}(q)$ and $bestscore_{p,\alpha_d}(q)$ are handled analogously to the other proximate centrality variants.
To safely stop, the following inequality must hold for not yet seen (virtual) documents:
$$\sum_{L_i\in L_{cscore}}\frac{qtf(t_i)}{|q|}\cdot\log\left(high_i+\frac{\lambda x^{-\min_{A\subseteq\bar{R}_p(t_i)}[\frac{1}{|A|-1}\cdot(\sum_{L_{ij}\in A} low_{ij}+\sum_{L_{ij}\in R_p(t_i)}\min_{d\in C} l_d)]}\cdot l_C}{\mu\cdot ctf(t_i)}+1\right)+\log\frac{\mu}{l_d+\sum_{t_i\in q}\lambda x^{-\max_{A\subseteq\bar{R}_p(t_i)}[\frac{1}{|A|-1}\cdot(\sum_{L_{ij}\in A} high_{ij}+\sum_{L_{ij}\in R_p(t_i)}\max_{d\in C} l_d)]}+\mu}<\text{min-}k.$$
$high_i$ is the highest score in the term list of $t_i$, which becomes 0 if the list has been read completely. For the non-$\alpha_d$ part, we aim at rendering the exponent of $x$ as little negative as possible: to this end, we use $low_{ij}$ for all lists in $A$ and the minimum document length in the collection for completely read lists. For the $\alpha_d$ part, we aim at rendering the exponent as negative as possible. Therefore, we use $high_{ij}$ for lists in $A$ and the maximum length of any document in the collection for completely read lists.
If we use summed distance as proximate centrality, i.e., $Prox_{SumDist}$, we can work with the following score bounds for the proximity component in the non-$\alpha_d$ part:
$$worstscore_p(t_i,d)=\frac{\lambda x^{-(g(S_p(d,t_i))+\sum_{L_{ij}\in R_p(d,t_i)} l_d+\sum_{L_{ij}\in\bar{S}_p(d,t_i)} l_d)}\cdot l_C}{\mu\cdot ctf(t_i)}$$
and
$$bestscore_p(t_i,d)=\frac{\lambda x^{-(g(S_p(d,t_i))+\sum_{L_{ij}\in R_p(d,t_i)} l_d+\sum_{L_{ij}\in\bar{S}_p(d,t_i)} low_{ij})}\cdot l_C}{\mu\cdot ctf(t_i)},$$
where $g(S_p(d,t_i))=\sum_{L_{ij}\in S_p(d,t_i)} mindist(t_i,t_j,d)$. $worstscore_p(t_i,d)$ gets smallest if the exponent of $x$ is as negative as possible. To accomplish that, we set $l_d$ for all lists where $d$ has not been encountered (i.e., lists in $R_p(d,t_i)$ or $\bar{S}_p(d,t_i)$), which means that $d$ will not be seen in those lists any more. $bestscore_p(t_i,d)$ gets largest if the exponent of $x$ is as little negative as possible. For each list $L_{ij}$ in $\bar{S}_p(d,t_i)$ we set $low_{ij}$ to fulfill this goal. $worstscore_{p,\alpha_d}(q)$ and $bestscore_{p,\alpha_d}(q)$ are handled analogously to the other proximate centrality variants. For $\alpha_d$, we obtain the following score bounds:
$$worstscore_{p,\alpha_d}(q)=\frac{\mu}{l_d+\sum_{t_i\in q}\lambda x^{-(g(S_p(d,t_i))+\sum_{L_{ij}\in R_p(d,t_i)} l_d+\sum_{L_{ij}\in\bar{S}_p(d,t_i)} low_{ij})}+\mu}$$
and
$$bestscore_{p,\alpha_d}(q)=\frac{\mu}{l_d+\sum_{t_i\in q}\lambda x^{-(g(S_p(d,t_i))+\sum_{L_{ij}\in R_p(d,t_i)} l_d+\sum_{L_{ij}\in\bar{S}_p(d,t_i)} l_d)}+\mu}.$$
To safely stop, the following inequality must be fulfilled for not yet seen (virtual) documents:
$$\sum_{L_i\in L_{cscore}}\frac{qtf(t_i)}{|q|}\cdot\log\left(high_i+\frac{\lambda x^{-(\sum_{L_{ij}\in L_{pscore}(t_i)} low_{ij}+\sum_{L_{ij}\in R_p(t_i)}\min_{d\in C} l_d)}\cdot l_C}{\mu\cdot ctf(t_i)}+1\right)+\log\frac{\mu}{l_d+\sum_{t_i\in q}\lambda x^{-(\sum_{L_{ij}\in R_p(t_i)}\max_{d\in C} l_d+\sum_{L_{ij}\in L_{pscore}(t_i)}\max_{d\in C} l_d)}+\mu}<\text{min-}k.$$
$high_i$ is the highest score in the term list of $t_i$, which becomes 0 if the list has been read completely. For the non-$\alpha_d$ part (to render the exponent of $x$ as little negative as possible), we expect for the not yet completely read term pair lists that the virtual document is encountered with the lowest possible value $low_{ij}$. For the completely read term pair lists in $R_p(t_i)$ we use the minimum length of any document in the collection. For the $\alpha_d$ part, we aim at rendering the exponent as negative as possible. Therefore, we use the maximum length of any document in the collection.
7.4.4 Learning to Rank
In this subsection, we discuss the feasibility of some Learning to Rank approaches for early termination in an NRA setting.
Svore et al.: Svore et al.'s approach is based on Song et al.'s work, which incorporates assessing the goodness of espans. If the learned scoring model involves these kinds of features, it shares the same problems (cf. Section 7.4.2). This also applies to proximity match features which require knowledge about spans in each document.
Non-span related features, however, can be cast into term lists ($\lambda_{BM25}$ features) or term pair lists ($\lambda_{BM25\text{-}2}$ features) and make early termination possible.
Metzler and Croft: In principle, Metzler and Croft's retrieval model can be cast into score-ordered index lists to make early termination possible. For each of the three kinds of potential functions, we need a separate list with docids and their feature values, ordered by descending feature value:
• For single terms $q_i$, a term list for $q_i$ could store entries of the form $(D.docid, \log[(1-\alpha_D)\frac{tf(q_i,D)}{l_D}+\alpha_D\frac{ctf(q_i)}{l_C}])$.
• For ordered potential functions representing phrases "$q_i,\ldots,q_{i+k}$", we could store entries of the form $(D.docid, \log[(1-\alpha_D)\frac{tf_{\#1(q_i,\ldots,q_{i+k}),D}}{l_D}+\alpha_D\frac{ctf_{\#1(q_i,\ldots,q_{i+k})}}{l_C}])$.
• For unordered potential functions representing ordered or unordered occurrences of query term sets $\{q_i,\ldots,q_j\}$, we could store entries of the form $(D.docid, \log[(1-\alpha_D)\frac{tf_{\#uwN(q_i,\ldots,q_j),D}}{l_D}+\alpha_D\frac{ctf_{\#uwN(q_i,\ldots,q_j)}}{l_C}])$.
The sum of the weighted scores from the three potential functions can be used to compute score bounds during query processing. As usual, storing entries for single terms is not an issue. The problem that prevents this approach from being practical for top-k query processing, without restrictions on the queries, is the huge number of possible phrases and sets of query terms that may occur within a text window.
Cummins and O'Riordan: Cummins and O'Riordan use genetic programming to learn a scoring model. The baseline scores can be cast into term lists. $score_{ES}$ is materialized analogously to $score_{BM25}$ as described for Rasolofo and Savoy's approach. A term list for term $t_i$ could keep an entry of the form $(d.docid, score_{ES}(d,t_i)/qtf(t_i))$ for each document $d$ with at least one occurrence of term $t_i$, where the list is ordered by descending $score_{ES}(d,t_i)/qtf(t_i)$ values. The learned retrieval model is feasible for top-k query processing if only measures are used which can be cast into term pair lists. These measures include measures 1 to 8 listed in Section 2.7.4. The document length (measure 12) is known at indexing time so that it can also be incorporated into precomputed scores. The number of unique query terms in a document (measure 13) can be captured by reading both term and term pair lists. For learned scoring models where we can factor out the qt-part of the original score, we could follow the approach we have proposed for making the AvgDist measure by Tao and Zhai feasible for top-k querying. While scanning the lists, a document's qt value ranges from the number of distinct query terms seen so far for that document to the number of query terms: the bestscore maximizes and the worstscore minimizes the score value over all possible qt values. Having completely read a $score_{ES}$- or $score_{BM25}$-based term list can shrink the range of qt. Measures 10 and 11 (i.e., FullCover and MinCover) correspond to measures used by Tao and Zhai (FullCover corresponds to Span in Tao and Zhai's paper), are inherently query-dependent, and cannot be easily decomposed into term pair lists. More details can be found in the paragraph about Tao and Zhai's approach in Section 7.4.1.
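The range of possible qt values described above can be tracked with a few counters per candidate document. A minimal sketch (Python, hypothetical helper names), under the assumption that we know, for each candidate, the distinct query terms it has been seen with so far and which term lists have already been read completely:

    def qt_range(query_terms, seen_terms, exhausted_terms):
        # query_terms: all terms of the query
        # seen_terms: distinct query terms the document has been seen with so far
        # exhausted_terms: query terms whose term lists are completely read; if the
        #   document never appeared there, it cannot match these terms anymore
        lower = len(seen_terms)
        upper = len(query_terms) - len(set(exhausted_terms) - set(seen_terms))
        return lower, upper

The worstscore then minimizes and the bestscore maximizes the learned score over all qt values in this range.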
7.4.5 Summary
We have shown that, for a surprisingly high fraction of proximity score-enhanced retrieval models, it is possible to cast them into precomputed term and term pair lists: Rasolofo and Savoy's scoring model can be cast into precomputed term and term pair lists similar to our modification of Büttcher's approach. Tao and Zhai's approach is feasible for index precomputation when one of the three distance aggregation measures is used. However, different from a conventional NRA strategy that orders lists by decreasing impact, here term pair lists may be ordered by ascending impact since both low and high values of lists are needed to determine worstscores and bestscores of documents. Thus, it may be worthwhile thinking about reading from both ends of the lists for faster termination. De Kretser and Moffat's second algorithm can be cast into term pair lists with document-position related scores, although the index may take up too much space to be practical; the maximum score over all positions becomes the document score. Lv and Zhai's approach can be cast into term lists that keep scores for (document, position) pairs. Zhao and Yun's approach keeps three kinds of index lists: term lists for the content score part, term pair lists for the proximity component, and one query-independent document-constant list. Cummins and O'Riordan's approach can be used for top-k query processing if we restrict learned scores to consist only of components that can be cast into term pair lists.
Some scoring models cannot be cast into index lists for various reasons: Tao and Zhai's model with span-based measures (Span and MinCover) requires position information at query processing time. Precomputing index lists for Monz' approach is problematic since minimum spans are inherently query-load dependent. The approaches by Song et al. and Svore et al. rely on espans which can be determined only if we know, for any query term occurrence in a document, which query term follows next in the document and at which position. The approaches suggested by Mishne and de Rijke and by Metzler and Croft rely on tf values of phrasal occurrences and term set occurrences within text windows, respectively. Precomputing and storing them for arbitrary n-grams exceeds reasonable space requirements.
Chapter 8 Index Tuning for High-Performance Query Processing
The first part of this chapter introduces a joint framework for trading off index size and result quality. It provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance under controlled result quality, given an upper bound for the index size. The framework allows us to selectively materialize lists for pairs based on a query log to further reduce the index size. Extensive experiments with two large text collections demonstrate runtime improvements of more than one order of magnitude over existing text-based processing techniques with reasonable index sizes. This part is based on our article published in [BS12] and enriched with results from our participation in the INEX 2009 Ad Hoc and Efficiency Tracks [BS09] and the TREC 2010 Web Track [BS10]. The second part of this chapter introduces a new index structure to improve cold cache performance by reducing the number of fetched lists, traded in for more read bytes. This part is based on our work presented in [BS11].
8.1 Introduction 8.1.1 Motivation In Chapter 7 we have presented an approach to integrate proximity scores as an integral part of query processing. This showed that proximity scores can improve not only result quality, but also efficiency, by means of index pruning. However, the index parameters for pruning were chosen in an ad hoc manner, lacking systematic optimization. We now extend results from Chapter 7 towards a configurable indexing framework which can be tuned either for maximal and dependable query performance under result quality control or for maximal result quality given an index size budget. Existing methods for the integration of proximity scores into efficient query processing algorithms for quickly 141 142 8. Index Tuning for High-Performance Query Processing computing the best k results (e.g., [CCKS07, PRL+07]) make use of precomputed lists of documents where tuples of terms, usually pairs, occur together, usually incurring a huge index size compared to term-only indexes, or focusing on conjunctive queries only. There are existing techniques for lossy index compression that materialize only a subset of all term pairs, e.g., those term pairs occurring in queries of a query log. In contrast and orthogonally to these techniques, this chapter aims at limiting the size of each term pair list by limiting the maximal list length and imposing a minimal proximity score per tuple in a term pair list. At the same time, the choice of term pair index lists to be materialized can be based on frequent queries in a query log. Our method can be tuned towards either guaranteeing maximal result quality or maximal query performance at controlled result quality within a given index size constraint. For both optimization goals, the result of the method is a set of pruned index lists of a fixed maximal length, which means that the worst-case cost for evaluating a query with this index can be tightly bound as well. In our experiments with the GOV2 collection (reported in Section 8.5), we show that 310 entries per list can be enough to give the same result quality as a standard score taking only term frequencies into account. We have measured an average warm cache retrieval time of less than 30ms at a cache size of just 64MB for a standard query load of 50,000 queries, an average cold cache retrieval time of 127ms and a hot cache retrieval time of less than 1ms. In this configuration, the size of the compressed index is 95GB, only slightly larger than the compressed collection. Similar query processing costs can be achieved for much larger collections, such as the recent ClueWeb09 collection. 8.1.2 Contributions This chapter makes the following important contributions: • It introduces a tunable indexing framework for terms and term pairs for opti- mizing index parameters towards either maximal result quality or maximal query processing performance under result quality control, given a maximal index size. • It allows a selective materialization of term pair index lists based on information from a query log. • The resulting indexes provide dependable query execution times while providing a result quality comparable to or even better than unpruned term indexes. • It experimentally demonstrates that the resulting index configurations allow query processing that yields almost one order of magnitude performance gain compared to a state-of-the-art top-k algorithm while returning results of at least comparable quality. 8.1.3 Outline of the Chapter The remainder of this chapter is structured as follows. 
Section 8.2 elaborates on the in- dex organization and the employed index compression techniques. Section 8.3 presents 8.2 Indexes 143 the index tuning framework within the MapReduce paradigm and formulates tuning as an optimization problem that considers both index size and retrieval quality. Section 8.4 shows how the size of the index can be reduced further using a query log. Section 8.5 experimentally evaluates our index tuning techniques from Section 8.3 with two large text collections from TREC, namely GOV2 and ClueWeb09 (cf. Section 3.2.1), for dif- ferent result size cardinalities. It can be tuned either towards effectiveness or efficiency, given a size limit for the pruned indexes, both in the presence and absence of relevance assessments. We compare the query processing performance of merge joins with pruned indexes as input to a state-of-the-art document-at-a-time algorithm that uses dynamic pruning on unpruned indexes and provide additional results for a proximity-enhanced variant of that state-of-the-art document-at-a-time algorithm. Query processing per- formance is measured both by abstract measures and average query processing times for different cache settings. Furthermore, we evaluate the effect of query log-based combined list pruning. Additional results with ClueWeb09 demonstrate the scalability of our index tuning approach. As a third collection we use the textual content of the INEX Wikipedia collection from 2009 (cf. Section 3.2.2). We present results from our participation in the INEX 2009 Efficiency and Ad Hoc Tracks. Section 8.6 presents a novel hybrid index structure that accelerates cold cache query processing, trading off a reduced number of index lists for an increased number of bytes to read. Please note that the techniques described in this chapter are not limited to this particular proximity scoring model (cf. Section 7.2.2) we use throughout the chapter: whenever bigram features, representing a proximity score contribution, can be stored in term pair index lists, our techniques can be applied as well. In Section 7.4 which inves- tigates the feasibility of various proximity scoring models for top-k query processing, we have described which approaches can be cast into term pair index lists. 8.2 Indexes Our studies in Section 7.3 have shown that by means of pruned term and combined index lists as input to a TL+CL processing strategy, we can achieve the best retrieval quality among the presented processing approaches for many pruning levels. We have described the abstract layout of these two index structures in Section 7.3.1 and will now discuss the efficient physical implementation of these index structures. The index tuning framework described in this chapter transparently supports all kinds of index compression. We will now introduce our proof-of-concept implementation of index compression which applies delta and v-byte encoding [CMS10, ZM06]; we did not perform any specific optimization for the parameters, for example the number of bits to represent a score, but we think that the values we chose are reasonable. Our inverted lists are usually sorted by docid, but may also be sorted by descending score (scoreBM25 for term lists, accd for combined lists). Due to the implementation of our tuning framework (cf. Section 8.3.2) which par- allelizes the indexing process across a cluster of servers, each index list is assigned to 144 8. Index Tuning for High-Performance Query Processing one of several partitions. 
Figure 8.1 depicts the general structure of our term list index.
Figure 8.1: Index and data files for TLs.
The hashcode of a term determines the partition where its term list (TL) is stored. For each partition we generate one index file and one data file. All data files together contain the complete index information and consist of (key, value) pairs: each key is a term whose value is its TL. The index files are used to find the start address to look up in the data file where a term's TL may be stored. Each index file stores every kth key, where the keys are stored in ascending lexicographical order. Every key in an index file is assigned an address offset (in bytes) which points to the position in the data file of the corresponding partition where the key and its TL are stored. The access structure to find the inverted list for a given key is implemented analogously to that of MapFiles in Hadoop [Whi09]; again, this is just a proof-of-concept implementation, we could alternatively have implemented the access structure with B+-trees, for example. In the system's initialization phase, before processing any query, all index files are loaded into main memory. To locate the inverted list for a key, we first determine its partition id by means of its hashcode. The key or its closest smaller neighbor key (in lexicographic order) is determined in the in-memory index using binary search, then the data file is searched linearly from the offset of that key until either the right list is found or a larger key is encountered; in the latter case, there is no list for that key in the data file, i.e., the key is not in the index. Indexing combined lists (CLs) works analogously to indexing TLs, with the only difference that the keys are term pairs instead of terms and the values are CLs instead of TLs. Our indexes are materialized with 54 partitions and k=128 as step width; however, these numbers are configurable.
Figure 8.2: Compressed TLs in docid-order (per term: a header with the offset to the next TL, the maximum scoreBM25, the idf value, and the number of documents, followed by docid-ordered entries with rounded normalized BM25 scores).
We will now describe in detail how the data files are organized. Figure 8.2 and Figure 8.3 show the structure of compressed TLs and CLs stored in docid order, respectively. In both figures, we mark the encoding and data types by different kinds of lines: green solid lines indicate UTF-8 encoding (consuming two bytes plus the number of UTF-8 bytes), violet dotted lines a v-byte encoding (of flexible size), and orange dashed lines float-typed data (consuming 4 bytes each). Figure 8.2 shows that each TL(t) is preceded by a header that contains the UTF-8 encoded term t, the v-byte encoded byte offset value to the beginning of the next TL (needed to search for the right list in the data file), and the maximum scoreBM25 value of TL(t), which is required to reconstruct the stored BM25 scores in the corresponding TL. Furthermore, we maintain the idf(t) value that is required to process CLs as described later. Additionally, we store the number of documents for each TL.
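To illustrate the lookup procedure just described, the following minimal sketch (Python, hypothetical names) keeps a sparse in-memory index of every k-th key per partition and scans the data of that partition linearly from the offset of the closest smaller key; a real implementation would store byte offsets into the on-disk data file instead of list positions and would use a stable hash function.

    import bisect

    def partition_of(term, num_partitions=54):
        # the term's hashcode determines its partition
        return hash(term) % num_partitions

    class PartitionLookup:
        def __init__(self, data, k=128):
            # data: (key, inverted_list) pairs in ascending key order ("data file")
            self.data = data
            # sparse "index file": every k-th key with its position
            self.sparse_keys = [data[pos][0] for pos in range(0, len(data), k)]
            self.sparse_pos = list(range(0, len(data), k))

        def lookup(self, key):
            # key or its closest smaller neighbor, found by binary search
            i = bisect.bisect_right(self.sparse_keys, key) - 1
            if i < 0:
                return None  # key is smaller than the first key of the partition
            # linear scan from that offset until the key or a larger key is found
            for pos in range(self.sparse_pos[i], len(self.data)):
                k_, value = self.data[pos]
                if k_ == key:
                    return value
                if k_ > key:
                    return None  # larger key encountered: key is not in the index
            return None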
The actual TL contains a list of pairs that contain the docid and its rounded normalized BM25 score, where roundedNormScoreBM25(d,t) is defined as
$$roundedNormScoreBM25(d,t)=round\left((2^{14}-1)\cdot\frac{scoreBM25(d,t)}{\max_{d'\in C}(scoreBM25(d',t))}\right).$$
As this value is in $[0, 2^{14}-1]$, it can be encoded into at most 2 bytes with v-byte encoding.
Figure 8.3: Compressed CLs in docid-order (per term pair: a header with the offset to the next CL, the maximum acc score, the maximum scoreBM25-modulo-idf values for both terms, and the number of documents, followed by docid-ordered entries with three rounded normalized scores each).
Figure 8.3 shows that the header for each CL(ti,tj) contains the UTF-8 encoded term pair string ti$tj ('$' is the term delimiter), and the v-byte encoded byte offset value to the term pair of the next CL in the same data file. Furthermore, the header contains the maximum $acc_d(t_i,t_j)$ score in that CL, named maxAcc(ti,tj), and the maximum scoreBM25 modulo idf values for both scoreBM25 dimensions in CL(ti,tj), named maxScoreBM25ModIDF1(ti) and maxScoreBM25ModIDF2(tj), respectively. We do not include idf scores in the index as they are not yet known at CL indexing time. As TLs and CLs are used in combination for query processing, the idf scores can be obtained from the TLs at query processing time. Like for each TL, for convenience during query processing, we store the number of documents included in each CL. The actual CL contains a list of tuples; each tuple contains the docid plus three scores:
• $roundedNormAcc_d(t_i,t_j)=round\left((2^{14}-1)\cdot\frac{acc_d(t_i,t_j)}{\max_{d'\in C}(acc_{d'}(t_i,t_j))}\right)$,
• $roundedNormScoreBM25ModIDF1(d,t_i)=round\left((2^{14}-1)\cdot\frac{scoreBM25(d,t_i)/idf(t_i)}{\max_{d'\in C}(scoreBM25(d',t_i)/idf(t_i))}\right)$, and
• $roundedNormScoreBM25ModIDF2(d,t_j)=round\left((2^{14}-1)\cdot\frac{scoreBM25(d,t_j)/idf(t_j)}{\max_{d'\in C}(scoreBM25(d',t_j)/idf(t_j))}\right)$.
Like in TLs, each v-byte encoded rounded normalized score does not require more than two bytes. When both CLs and TLs are docid-ordered, the docid values in each list are first delta-encoded and then stored as v-bytes, and the score(s) of the entries are encoded as v-bytes with at most 2 bytes per score. For score order, TLs are sorted by descending scoreBM25 values which are delta-encoded and then stored as v-bytes; the corresponding docids are encoded as v-bytes. CLs are sorted by descending $acc_d$ scores which are delta-encoded and then stored as v-bytes; the corresponding docids and the scoreBM25 contributions for the two terms represented by the combined list are encoded as v-bytes. In score order, ties are broken using the docid.
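As an illustration of the byte-level layout of a docid-ordered list body (headers omitted), the following sketch delta-encodes docids and stores the deltas and the 14-bit rounded normalized scores as v-bytes. It uses one common v-byte convention (7 payload bits per byte, high bit set on the last byte) and hypothetical helper names, so it is a sketch of the scheme rather than the exact on-disk format.

    def vbyte(n):
        # variable-byte encoding: 7 payload bits per byte, high bit marks the last byte
        out = bytearray()
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
        return bytes(out)

    def rounded_norm_score(score, max_score):
        # 14-bit rounded normalized score, fits into at most two v-byte bytes
        return round((2 ** 14 - 1) * score / max_score)

    def encode_tl_body(postings, max_score):
        # postings: (docid, scoreBM25) pairs in ascending docid order
        out = bytearray()
        prev_docid = 0
        for docid, score in postings:
            out += vbyte(docid - prev_docid)               # delta-encoded docid
            out += vbyte(rounded_norm_score(score, max_score))
            prev_docid = docid
        return bytes(out)

A CL body would store three such scores per entry, as listed above.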
While ties are rare for term lists, they are more frequent for combined lists which is due to the fact that accd scores as sorting criterion in combined lists are more similar than BM25 scores as sorting criterion in term lists. 8.3 Parameter Tuning 8.3.1 Tuning as Optimization Problem We have demonstrated in Section 7.3 that using term and combined index lists together for query processing can reduce processing cost by an order of magnitude compared to using only term index lists and a standard top-k algorithm. At the same time, the proximity component of the score helps to additionally improve result quality. However, these great properties come at a big price: an index that maintains complete informa- tion for all combined lists will be several orders of magnitude larger than the original collection of documents and is therefore infeasible even for medium-sized collections. We proposed to keep only prefixes of fixed length of each list, and demonstrated that this improved both result quality and query performance while greatly reducing index size. Section 7.3 also included experiments indicating that term pair occurrences that are more than approximately 10 positions apart (runs marked with accd ≥ 0.01) hardly play a role for result quality and can therefore usually be ignored. We take over this finding, so whenever we talk about term pair occurrences, we mean occurrences of dif- ferent terms within a window of at most 10 positions in the same document. Note, however, that all our methods are still valid when this constraint is relaxed. However, in Section 7.3, we did not provide any means for selecting the list length cutoff, which usually depends on the document collection and on the required result quality. There is a tradeoff between index size and quality: longer lists usually mean better results, but also a bigger index, while setting the length cutoff very low will greatly reduce index size, but at the same time also hurt result quality. This section introduces an automated method to tune index parameters such that both the size of the resulting index and the quality of results generated using this 148 8. Index Tuning for High-Performance Query Processing index meet predefined requirements. (Note that for the moment, our approach keeps all combined lists, but limits the information stored in each list. We will discuss in Section 8.4 how a subset of all combined lists can be selected based on the occurrence of the pairs in a query log.) We will proceed as follows: we first define two parameters for tuning the index size, then we show how to estimate the size of an index given the tuning parameters. Next, we define measures for the quality of a pruned index, and finally, we formally define index tuning as an optimization problem and show how to solve it. Parameters We start with defining two parameters to tune the selection of index entries stored in each term or combined index list: • Minimal score cutoff: we keep only index entries with a score that is not below a certain lower limit m. • List length cutoff: we keep at most the l entries from each list that have the highest scores. These two parameters allow us to systematically reduce the size of the resulting index with a controllable influence on result quality. Figure 8.4 shows how the index size for GOV2, relative to an unpruned index, changes with varying l and m. Figure 8.4: Relative index size with varying list length and minscore cutoffs. 
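The two cutoffs translate directly into a pruning step per index list. A minimal sketch (Python, hypothetical names), shown here for a combined list with acc scores; for term lists only the length cutoff is applied, as noted below:

    def prune_list(entries, l, m=0.0):
        # entries: (docid, score) pairs of one index list
        # keep only entries with score >= m (minimal score cutoff),
        # then at most the l entries with the highest scores (list length cutoff)
        kept = [(d, s) for d, s in entries if s >= m]
        kept.sort(key=lambda e: e[1], reverse=True)
        return kept[:l]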
We denote the index consisting of all term index lists for collection C by T(C), and the index consisting of all term and combined index lists for C by I(C). We will 8.3 Parameter Tuning 149 use the term inverted lists synonymously for index lists. We write I(C,l,m) for the index for document collection C that consists of term and combined index lists, where each list is limited to the l entries with highest score and the combined lists contain only entries with an accd-score of at least m. We use the similar notation T(C,l) for an index consisting of only term lists where each list contains only the l entries with highest score. Note that we do not perform score-based pruning on term lists. We omit C when the collection is clear from the context. Index Size An important constraint in our optimization process is the maximal storage space that the final pruned index is allowed to occupy. We will denote the size of an index I in bytes by |I|. The size of an uncompressed index depends on (1) the aggregated number N(I) of index entries in all lists, (2) the size s of each index entry in bytes, (3) the number of different keys K(I) (i.e., terms and/or term pairs) in the index, and (4) the per-key overhead a of the access structure to associate a key with an offset in the inverted file. For a compressed index, s is not constant, but depends on the entry and the previous entry (due to delta encoding). We can formally define the size of the index I as |I| := s · N(I) + a · K(I). This simple definition is only valid when all index lists are of the same type. In our application, we may have two different index lists, term lists and combined lists, which may differ in number of entries, number of keys, and entry size. We therefore write Nt(I) for the number of term list entries in index I and Nc(I) for the number of combined list entries in I, with N(I) = Nt(I) + Nc(I), and use a similar notation for s and K(I). The more accurate size of an index I is then |I| := st · Nt(I) + sc · Nc(I) + a · (Kt(I) + Kc(I)). For an uncompressed index, assuming that integers and floats need 4 bytes to store, we can set st := 4 + 4 = 8 (document ID and content score) and sc := 4 + 4 + 4 + 4 = 16 (document ID, proximity score, and content scores for both terms). We can estimate a similarly (for example, by assuming that a corresponds to the average key length plus the space for a pointer into the inverted file). We are typically interested in estimating the size of a pruned index I(l,m) or T(l) without actually materializing it (because materializing it takes a lot of time and the index may be too large to be completely materialized anyway). In the following we dis- cuss how to estimate |I(l,m)|, the adaptation to |T(l)| is straightforward. We consider only a sample P of all possible keys (i.e., terms and term pairs) and use it to approxi- mate the distribution of list lengths, given a list length cutoff l and minimal score cutoff m. Formally, we denote by X(l,m) a random variable for the length of an index list in index I(l,m), and want to estimate the distribution F(l,m) of that random variable, i.e., estimate F(l,m; x) = P[X(l,m) ≤ x]. We sample the index lists for a subset P of n keys chosen independently from all keys; each sample yields a value Xi(l,m) for the 150 8. Index Tuning for High-Performance Query Processing length of that list in I(l,m). 
Using the empirical distribution function [Was05], we can estimate the cdf of this distribution as
$$\hat{F}_n(l,m;x):=\frac{\sum_{i=1}^n J(X_i(l,m)\le x)}{n},\quad\text{where}\quad J(X_i(l,m)\le x)=\begin{cases}1 & \text{if } X_i(l,m)\le x\\ 0 & \text{else.}\end{cases}$$
All we actually need is the expected length $E[F(l,m)]$, which can again be estimated from the sample as the mean $\bar{X}(l,m)$ of the $X_i(l,m)$ values [Was05]. Assuming that there are $K(P)$ keys in the sample, the expected number of entries in the index for the sample is therefore $K(P)\cdot\bar{X}(l,m)$. To extend this estimate to the complete collection, we make sure that the size of $P$ relative to the size of the collection is known, for example by sampling $p\%$ of all keys (this can be easily implemented using hash values of keys). The expected number of keys in the index is therefore $\frac{100\cdot K(P)}{p}$, and the expected number of entries in the index is
$$N(l,p):=\frac{100\cdot K(P)}{p}\cdot\bar{X}(l,m).$$
The size estimator for a compressed index is built similarly, but instead of computing just the length $X_i(l,m)$, we materialize and compress the list and use its actual size, avoiding the need to estimate the average value of $s$. As the space of feasible values for the parameters $l$ and $m$ is in principle infinitely large, we cannot compute the estimate for all combinations. Instead, our implementation considers only selected step sizes for $l$ and $m$, computes estimates for those values, and interpolates sizes for other value combinations. We currently consider a step size of 100 for $l$ and 0.05 for $m$.
Index Quality
Intuitively, the fewer entries we keep in each list, the more the quality of query results will degrade, since the probability that relevant documents are dropped from the pruned lists increases. The goal is to find values for $m$ and $l$ that maximize index quality while generating an index that fits into a predefined amount of memory. We now define different notions of index quality measures $M(C,l,m,k)$ for index $I(C,l,m)$ and a fixed number $k$ of results.
In the best case, a set of predefined reference or training topics $\Lambda$ is available that includes human assessments of the relevance of documents in the collection. Such a set of topics can be built, for example, by first selecting a set of representative topics from a query log, then computing top-k results for different parameter settings, pooling those results per topic, and having human assessors determine the relevance of each result. Alternatively, click logs could be used to estimate the relevance of results (but with much lower confidence). Topic sets of this kind are frequently available for test collections such as TREC .GOV or .GOV2, but they cannot be reused for different document collections. Given such a set $\Lambda$ of reference topics, we denote by $p_\Lambda[k;I]$ the average quality of the top-k results over all topics (e.g., precision@k or NDCG@k) computed using index $I$; our implementation currently uses average precision at $k$. We can now define effectiveness-oriented and efficiency-oriented absolute index quality:
• Effectiveness-oriented absolute index quality: this is quantified as the ratio of the quality of the first $k$ results with the pruned index to the quality of the first $k$ results with the unpruned index or, formally, $\frac{p_\Lambda[k;I(C,l,m)]}{p_\Lambda[k;I(C)]}$.
• Efficiency-oriented absolute index quality: this is quantified as the reciprocal of the maximal query processing cost per query term and query term pair (i.e., $\frac{1}{l}$) when the result quality of the pruned index is not worse than that of an unpruned term-only index without proximity lists (formally, when $\frac{p_\Lambda[k;I(C,l,m)]}{p_\Lambda[k;T(C)]}\ge 1$), and 0 otherwise.
Here, the effectiveness-oriented index quality measure aims at finding the best possible results by including as much proximity information in the index as possible. The efficiency-oriented quality measure, on the other hand, assumes that the quality of a term-only index is already sufficient and tries to minimize the length of index lists (assuming that query processing efforts are directly proportional to the lengths of index lists).
For most applications, such a set of reference topics does not exist or would be too expensive to generate. In this case, we fix a set $\Gamma$ of queries (e.g., representative samples from a query log) and use relative quality to estimate how good results with the pruned index are, compared to results with the unpruned index. We define, for each query $\gamma_i\in\Gamma$, the set of relevant results to be the top-k documents with some index configuration $I'$ and use this to compute the result quality of index configuration $I$. When the quality measure is precision, this boils down to computing the overlap of the top-k results with index configurations $I$ and $I'$. We formally denote the resulting quality of index $I$ as $p_\Gamma[k;I|I']$.
We can now define relative index quality measures in an analogous way to the absolute measures defined before. However, we would then always favor index configurations that produce exactly the results of the corresponding unpruned index, as we assume that any results not in the top-k results with the unpruned index are non-relevant. This is often overly conservative in practice, as many of the new results will be relevant to the user as well, so it is usually sufficient to provide a "high" overlap, not a perfect one. We therefore introduce another application-specific tuning parameter $\alpha$ that denotes the threshold for relative quality above which we accept an index configuration. This is especially important for efficiency-oriented index quality: we cannot expect that we will get the same results with the pruned index with term and combined lists as with just the unpruned term lists, so achieving an overlap of 1 there would be impossible. Instead, we use $I(C)$ also in that case and set $\alpha$ to a value below 1.
• Effectiveness-oriented relative index quality: this is the relative result quality of the pruned index $p_\Gamma[k;I(C,l,m)|I(C)]$.
• Efficiency-oriented relative index quality: this is the reciprocal of the maximal query processing cost per query term and query term pair (i.e., $\frac{1}{l}$) when the relative result quality of the pruned index $p_\Gamma[k;I(C,l,m)|I(C)]$ is at least $\alpha$, and 0 otherwise.
Index Tuning
We can now formally specify the index tuning problem:
Problem 1. Given a collection $C$ of documents, an upper limit $S$ for the index size, a target number of results $k$, and an index quality measure $M$, estimate parameters $m$ and $l$ such that $M(C,l,m,k)$ is maximized, under the constraint that $|I(C,l,m)|\le S$. When there is more than one combination of $m$ and $l$ that maximizes the quality measure and satisfies the size constraint, pick one where the index size is minimal.
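For the relative measures above, the per-query quality with precision as the underlying measure reduces to a top-k overlap. A minimal sketch (Python, hypothetical names) of the two relative quality notions for a single query:

    def relative_quality(topk_pruned, topk_reference, k):
        # p_Gamma[k; I | I'] for one query with precision as quality measure:
        # overlap of the top-k results of the pruned and the reference index
        return len(set(topk_pruned[:k]) & set(topk_reference[:k])) / k

    def efficiency_oriented_relative_quality(rel_quality, l, alpha):
        # 1/l (reciprocal of the maximal per-list processing cost) if the
        # relative quality reaches the threshold alpha, and 0 otherwise
        return 1.0 / l if rel_quality >= alpha else 0.0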
Note that even though the index is tuned for a specific number k of results, it can be still used to retrieve any other number of results. We will experimentally validate in Section 8.5.2 that result quality does not degrade much in these cases. 8.3.2 Implementation of the Tuning Framework We implemented our tuning framework within the MapReduce paradigm [DG08], di- viding the tuning process into several map-reduce operations. As stated before, the input to the tuning process is the collection C, a target index size S, a target number of results k, and an index quality measure M that includes a set of training topics T . Additionally, we fix the fraction p of index keys (for both terms and term pairs) to be sampled. The tuning process then proceeds in the following order, where each step is implemented as a map-reduce operation: 1. Compute index for sample and training topics. The map phase considers each document in the collection, parses it, and creates index entries for terms and term pairs that are either part of the sample or the training topics. These entries are still incomplete, because the final BM25 scores can be computed only when global properties of the collection are known, so they contain only term frequencies and document lengths (but already complete accd(t1, t2) values for term pairs); their key is the term or term pair. The reduce phase then combines items with the same key into an index list, completing their scores as all global parameters of the score (average document length, number of documents, and document frequency of each term) are now known1. The output of this phase are two indexes, one for the sample, the other for the set of training topics. 1 At least Hadoop 0.20 does not directly provide these global parameters to the reduce phase, so we need to store them in files and aggregate them in each reducer. The alternative would be to combine the initial map with a do-nothing reducer, include additional map-reduce operations to compute the global values, and then have a map-reduce operation with a do-nothing mapper and the reducer we just described. 8.3 Parameter Tuning 153 2. Prepare the estimator for the index size. The map phase considers each key in the sample and computes, for each combination (l,m) it considers, the size s of the corresponding index list when pruned according to the l and m cutoffs (or the size of its compressed representation for compressed indexes), which is then written out with key (l,m). The algorithm starts with l = k and increases it by the step size for l, and considers all values for m, starting at 0 and increasing it by the step size for m. The reduce phase combines all values for a single pair of (l,m) cutoffs and computes the average index list size for this cutoff. This value is then stored in an on-disk data structure as size estimate for (l,m). This phase also counts the overall number of keys in the sample. 3. Prepare solving the optimization problem. In an initial map-reduce oper- ation, we compute the baseline precisions. The map phase then considers each topic with its corresponding assessments and computes, for each (l,m) pair pro- vided by the size estimator, the quality of the index for this topic. This can be efficiently implemented by a stepwise incremental join algorithm. In the first step, the algorithm sets l = k, i.e., it reads the first k entries from each list and incrementally computes results for (k,m), starting at the highest value for m and decreasing it by the step size of m. 
This yields, for each m, a temporary set of results with (partial) scores, from which the k documents with the highest partial score are considered as result. The index quality for this result is computed and written out with key (k,m). If the score of the entry at position k is less than m (i.e., the list would be cut before it), the value m is marked as completed and will not be considered later. As soon as m exceeds the score of the last read entry, all smaller values for m will get the same index quality.

In the following steps, the algorithm reads more entries from each list corresponding to the step size for l. Assume that it read up to l entries from each list. It continues with the temporary set of partial results from the previous step and the highest value for m not yet marked as completed and repeats the above process. This phase ends when either all values for l have been considered or all lists have been completely read. It is evident that each entry of the lists is read at most once, so the complexity is linear in the aggregated number of entries in the index lists for this topic.

Note that for the efficiency-oriented quality measures, the map phase does not write the actual index quality measure introduced in Section 8.3.1: instead of 1/l, the reciprocal of the maximal query processing cost per query term and query term pair, respectively, the map phase writes the actual precision of the top-k results. This is because only those (l,m) combinations are valid that can provide a given precision (averaged over all training topics in T). The reduce phase will later convert the precision values for each valid (l,m) combination into such an index quality measure.

The reduce phase averages, for each combination of (l,m), the per-topic index quality values computed by the map phase, and computes the final index quality for this combination. For the efficiency-oriented measures, this means that it compares the average precision with the result quality of the term-only index and uses 1/l as final index quality when the average precision is high enough. If the (l,m) combination has a non-zero index quality, the reducer estimates its size using the size estimator. For each (l,m) combination with a non-zero index quality that matches the size constraint S, the reduce phase outputs an (l,m,q,s) tuple, where q is the index quality and s is the index size.

4. Compute an approximate solution of the optimization problem. The following centralized phase scans all output tuples from the previous step and determines the tuple (l,m,q,s) with the highest quality (a minimal sketch of this selection step follows the list). Optionally, it can further explore the solution space around (l,m) for better solutions. The output of this step is an approximate solution to Problem 1.

5. Materialize the final index. Analogously to phase 1, the final index is materialized in a single map-reduce operation. Note that each mapper can already restrict the index entries it generates: for term pair entries, it does not emit any entries whose score is below m, and for term entries, it emits only the l entries with highest scores (which can be achieved using an additional combiner). An additional optimization for this step would be to generate only an approximation of the final index: if there are M mappers used to parse the collection, each mapper needs to emit at most (β/M) · l entries, where β ≥ 1 is a tuning parameter that steers the expected number of entries missing in the final index.
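As a simple illustration of the centralized selection in step 4, the following Java sketch scans the (l,m,q,s) tuples produced by step 3 and applies the selection rule of Problem 1 (maximal quality, ties broken by minimal index size). All class and field names are hypothetical, and the size constraint is assumed to have been enforced already when the tuples were emitted.

    import java.util.List;

    // Hypothetical names; one candidate per (l,m) combination that already
    // satisfies the size constraint S.
    final class CutoffCandidate {
        final int l;          // list length cutoff
        final double m;       // minimal score cutoff
        final double quality; // index quality q for (l,m)
        final double size;    // estimated index size s

        CutoffCandidate(int l, double m, double quality, double size) {
            this.l = l; this.m = m; this.quality = quality; this.size = size;
        }
    }

    final class CutoffSelector {
        // Selection rule of Problem 1: maximal quality, ties broken by minimal size.
        static CutoffCandidate selectBest(List<CutoffCandidate> candidates) {
            CutoffCandidate best = null;
            for (CutoffCandidate c : candidates) {
                if (best == null
                        || c.quality > best.quality
                        || (c.quality == best.quality && c.size < best.size)) {
                    best = c;
                }
            }
            return best; // may be null if no (l,m) combination satisfies the constraints
        }
    }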
8.4 Log-Based Term Pair Pruning

Even with relatively short list length cutoffs l̄, the overall space consumption of the pruned combined lists can still be considerable, because there are far more combined lists than term lists. On the other hand, the majority of combined lists are unlikely to ever occur in any query. A possible solution is to selectively materialize only combined lists for term pairs that occur at least t times in a query log, which can drastically reduce the number of lists. When counting term pairs in the query log, we consider each query separately, build up all possible term pair combinations for that particular query, and finally count for each term pair the number of occurrences over the complete query log. However, when one of these unlikely queries is issued for which some or even all of the required combined lists are unavailable, answering it using the pruned term lists and the available subset of combined lists only may affect the result quality for this query.

Figure 8.5 demonstrates this effect, using the AOL query log and our training topics on GOV2 (see Section 8.5.1), with l = 4310 and m = 0.00. The x-axis of this chart shows different values for the threshold t of term pairs in the AOL log, and the y-axis shows the precision at 10 results. The line with diamonds depicts the result of running our merge-based algorithm from Section 7.3.5 with the available index lists only. It is evident that the higher the threshold, the lower the result quality gets, which can be explained by fewer and fewer combined lists being materialized. For very high thresholds (not depicted in the chart), the precision drops to 0.396, compared to 0.617 when using all lists.

[Figure 8.5: Effect of log-based pruning on query performance (on training topics). X-axis: minimal frequency in the query log; y-axis: precision@10; plotted series: pruned lists only, full TL when one pair list missing, full TL when all pair lists missing.]

To overcome this negative effect, we propose to keep the unpruned term index lists when log-based pruning is applied. As soon as at least one combined list for a query term is missing (variant 1) or, alternatively, all combined lists for a query term are missing (variant 2), we read the available combined lists and the unpruned term list for that term. This improves result quality to at least the quality of an unpruned term index, but at the same time incurs an increased cost for query evaluation as longer term lists have to be read. Figure 8.5 also depicts the effect of these approaches on result quality (line with squares: read full term lists when at least one pair is missing; line with triangles: read full term lists when all combined lists are missing). It is evident that this combined execution helps to keep precision close to the level of the precision with the unpruned T(C) index only (which is 0.585).

Our tuning framework can be extended to consider only combined lists where the corresponding term pair occurs at least t times in a query log, and to tune the parameters to reach the optimization goals even with this limited selection of combined lists. Maintaining only the term pair lists for term pairs that appear in previous queries (e.g., term pairs that appear in queries from a query log) may be restrictive for rare queries that will appear in the future.
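The term pair counting described above can be sketched in Java as follows. This is an illustration only (names are hypothetical), and it assumes that each unordered term pair is counted at most once per query before the counts are aggregated over the complete log.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    final class LogBasedPairSelection {
        // Returns the set of term pairs (encoded as "t1|t2" with t1 < t2) that
        // co-occur in at least t queries of the log.
        static Set<String> frequentPairs(List<List<String>> queryLog, int t) {
            Map<String, Integer> pairCounts = new HashMap<String, Integer>();
            for (List<String> query : queryLog) {
                // consider each query separately; deduplicate and sort its terms,
                // then build all possible term pair combinations
                List<String> terms = new ArrayList<String>(new TreeSet<String>(query));
                for (int i = 0; i < terms.size(); i++) {
                    for (int j = i + 1; j < terms.size(); j++) {
                        String pair = terms.get(i) + "|" + terms.get(j);
                        Integer count = pairCounts.get(pair);
                        pairCounts.put(pair, count == null ? 1 : count + 1);
                    }
                }
            }
            Set<String> keep = new HashSet<String>();
            for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
                if (e.getValue() >= t) keep.add(e.getKey());
            }
            return keep;
        }
    }

Only the pairs returned by frequentPairs would then be considered for combined list materialization.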
For those few queries which are affected, the results of the proposed approach would not benefit from term proximity. However, we can still achieve a retrieval quality similar to that of BM25 scores, as we can use the unpruned term lists. If rare queries become more frequent over time, one may consider using an updated query log file to update the index structures.

To accelerate query processing and to save on accumulators, we split unpruned term lists into two pieces: the l̄ entries with the highest scores are stored in docid order and the remaining entries in score order. When processing a query where some combined lists are missing, in a first phase we process the first piece of the term lists and the available combined lists with the merge-based algorithm from Section 7.3.5, keeping all documents and their scores in memory. After that, in a second phase, a standard top-k algorithm (in our case NRA, cf. Section 6.1) consumes the second piece of the term lists, using the already read documents as candidates. The accd contribution for non-available combined lists is 0 in both steps. This algorithm will terminate more quickly than running it on the unpruned term lists alone, and will usually give better results due to the proximity score from the combined lists.

We give a more detailed explanation of how we process queries for the case of a 4-term query {t1, t2, t3, t4} where one pruned combined list is missing (due to insufficient frequency of that term pair in the query log). W.l.o.g. assume that the missing combined list is CL(t1, t2). For both variants, we load pruned CLs for (t1, t3), (t1, t4), (t2, t3), (t2, t4), and (t3, t4). For variant 1, we load the pruned TLs for t3 and t4 and the unpruned TLs for t1 and t2, as at least one combined list for t1 and at least one combined list for t2 is missing, namely CL(t1, t2). In the first phase, we process the pruned TLs, the first, docid-ordered pieces of the unpruned TLs for t1 and t2, and the available pruned CLs using an n-way merge join algorithm. In the second phase, we process the second, score-ordered pieces of the unpruned TLs for t1 and t2 using NRA with the already seen documents as candidates. For variant 2, we load only pruned TLs for all query terms, since for every query term at least one combined list is available. As we work with pruned lists only, we only execute an n-way merge join on the pruned lists; the second phase is not needed.
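The two-piece layout of the unpruned term lists can be sketched as follows, assuming a simple in-memory representation; the types and names are illustrative and do not reflect our on-disk index format.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    final class TermListEntry {
        final long docId;
        final double score;
        TermListEntry(long docId, double score) { this.docId = docId; this.score = score; }
    }

    final class SplitTermList {
        final List<TermListEntry> docIdOrderedPrefix;    // lBar highest-scoring entries, sorted by docId
        final List<TermListEntry> scoreOrderedRemainder; // remaining entries, sorted by descending score

        SplitTermList(List<TermListEntry> allEntries, int lBar) {
            List<TermListEntry> byScore = new ArrayList<TermListEntry>(allEntries);
            Collections.sort(byScore, new Comparator<TermListEntry>() {
                public int compare(TermListEntry a, TermListEntry b) {
                    return Double.compare(b.score, a.score); // descending score
                }
            });
            int cut = Math.min(lBar, byScore.size());
            List<TermListEntry> prefix = new ArrayList<TermListEntry>(byScore.subList(0, cut));
            Collections.sort(prefix, new Comparator<TermListEntry>() {
                public int compare(TermListEntry a, TermListEntry b) {
                    // ascending document id for the merge join phase
                    return a.docId < b.docId ? -1 : (a.docId > b.docId ? 1 : 0);
                }
            });
            this.docIdOrderedPrefix = prefix;
            this.scoreOrderedRemainder = byScore.subList(cut, byScore.size());
        }
    }

The docid-ordered prefix is consumed by the merge join in the first phase, while the score-ordered remainder feeds the NRA phase.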
Section 8.5.3 compares the query processing performance of merge joins with pruned indexes as input to the recently proposed Block-Max WAND (BMW) algorithm [DS11], a state-of-the-art document-at-a-time algorithm that uses dynamic pruning on unpruned indexes. Besides the original BMW, we provide additional results for our proximity score-enhanced BMW variant. The goal of Section 8.5.3 is to compare the query performance of merge joins with pruned indexes to dynamic pruning with unpruned indexes. The query processing performance is measured both by abstract measures (e.g., the number of opened lists, average number of entries and bytes read from disks) and average query processing times for hot and cold cache settings. A running system (i.e., a warm cache scenario) is simulated using an LRU-based cache. For various cache sizes, we report cache hit ratios, the number of non-cached lists, the warm cache query processing times, and for BMW, in addition, the number of read blocks. Section 8.5.4 evaluates the effect of query log-based combined list pruning, one way to shrink the index size that is orthogonal to index compression techniques. Section 8.5.5 summarizes the conclusions from Section 8.5.2 to Section 8.5.4. Section 8.5.6 presents additional results with ClueWeb09 and aims at demonstrating the scalability of our index tuning approach by means of similar experiments as the ones shown in Section 8.5.2. Section 8.5.7 describes our efforts that apply our tuning framework to the INEX 2009 test bed for the Efficiency Track. 8.5.1 Setup We evaluated our methods with two standard text collections from TREC2, the GOV2 collection and the ClueWeb09 collection (cf. Section 3.2.1 for more details about both collections). The TREC GOV2 collection consists of approximately 25 million docu- ments from U.S. governmental Web sites with an uncompressed size of approximately 426GB. We used the 100 Ad Hoc Task topics from the TREC 2004 and 2005 Terabyte Tracks3 (cf. Appendix B, Tables B.1 and B.2) as training topics for tuning index pa- rameters, and the 50 Ad Hoc Task topics from the TREC 2006 Terabyte Track (cf. Appendix B, Table B.3) for testing the quality of results. We used the AOL query log4 for the log-based technique. We measure result quality as precision values P@k, i.e., the average number of relevant results among the first k results and additionally report normalized discounted cumulative gain (NDCG@k) [JK02] that considers the order of results, not the result set as a whole. The ClueWeb09 collection5 consists of approximately 1 billion Web documents crawled in January and February 2009. Following standards at the TREC Web Track, we consider only the approximately 500 million English documents (e.g., also used by [NC10]), from which we chose the 50% documents with the smallest probabilities to be spam according to the Waterloo Fusion spam ranking6 (spaminess has also been 2 http://trec.nist.gov 3 http://trec.nist.gov/data/terabyte.html 4 http://gregsadetsky.com/aol-data/ 5 http://boston.lti.cs.cmu.edu/Data/clueweb09/ 6 http://durum0.uwaterloo.ca/clueweb09spam/ 158 8. Index Tuning for High-Performance Query Processing used in [BFC10] for example). The resulting document set has an uncompressed size of about 6TB. We use the 50 topics from the Ad Hoc Task of TREC Web Track 20097 (cf. Appendix B, Table B.4) to train and optimize the index parameters; documents without assessment are considered non-relevant. The 50 topics from the Ad Hoc Task of Web Track 20108 (cf. 
Appendix B, Table B.5, non-assessed topics are marked) are employed as test topics. Due to a few missing relevance assessments for the test top- ics, we only provide precision values for our runs submitted to Web Track 2010, based on a subset of 48 topics with assessments as published in our contribution to TREC 2010 [BS10]. The most significant part of our experiments is built on the GOV2 collection, since the number of available topics for the ClueWeb09 collection is lower and the assess- ments sparser than for the GOV2 counterpart. Therefore, we run only a limited set of experiments on the ClueWeb09 collection and report detailed tuning results for GOV2 only. As an additional test bed we employ the test bed from INEX 2009. The INEX Wikipedia collection from 2009 consists of approximately 2.67 million articles and 1.4 billion elements. There are two types of queries used in our experiments: 115 Type A topics from the Ad Hoc Track which includes classic Ad Hoc-style focused passage or element retrieval with a combination of NEXI CO and CAS queries. They are partially enriched with phrasetitle elements that indicate important phrases in the title field. 115 Type B topics have been generated from the Type A topics by running Rocchio- based blind feedback on the results of the article-only Ad Hoc reference run. Therefore, Type B topics can consist of partly more than 100 keywords. We show results from our participation in the INEX 2009 Ad Hoc and Efficiency Tracks where we evaluate CO queries (i.e., title fields and partially phrasetitle fields). Type A topics are listed in Appendix C, Table C.5 to C.8, Type B topics are not listed due to their length. Whenever we report times for parameter tuning or index construction, they were measured on a cluster of 10 servers in the same network, where each server had 8 CPU cores plus 8 virtual cores through hyperthreading, 32GB of memory, and four local hard drives of 1TB each. The cluster was running Hadoop 0.20 on Linux, with replication level set to two. Query execution times are reported using a single core of a CPU of a single node in the cluster. All algorithms are implemented in Java 1.6. We distinguish between test indexes and full indexes: while test indexes contain only lists for terms that are part of topics in a given test bed and term pairs that can be built from each topic, full indexes do not make this restriction, but contain lists for all terms and term pairs (occurring in a window of size W=10). If not explicitly stated differently, we build up test indexes to keep the indexing effort manageable as to both the time required for index construction and the required space on disk. Hence, we can maintain many indexes for evaluation on the same disk. 7 http://trec.nist.gov/data/web09.html 8 http://trec.nist.gov/data/web10.html 8.5 Experimental Evaluation 159 8.5.2 Index Tuning on GOV2 We evaluated our index tuning techniques from Section 8.3 for different maximal index sizes and result counts (10 and 100). We report all results in this section with and without index compression. We are aware of the fact that compression is preferable to using uncompressed indexes, because the data can be read faster and decompressing is less expensive than reading more data. We do not exclude experiments without com- pression from this section as we want to be able to quantify the effects of compression within our tuning framework. The effect of log-based combined list pruning will be evaluated in Section 8.5.4. 
For each setting, we first estimated index parameters using the training topics, built an index with these parameters, and then evaluated result quality on the test topics. Absolute Index Quality Table 8.1 shows the results of index tuning on the training topics with selected size limits below the collection size, for uncompressed indexes. In this table, each row shows results for a given index size constraint and number of query results, namely the resulting index parameters, the estimated and real index size for these parameters, and the result quality on the training and test topics with this index. The rows with size limit ∞ denote the corresponding unpruned indexes with term+combined lists (named I(C)) or term lists (named T(C)), respectively. To build up the unpruned combined lists, we consider only term pair occurrences within a text window of fixed size W = 10 as used in Section 7.3. Estimating one set of parameters took approximately 5 hours, where about 3.5 hours were required for the first map-reduce phase to build the index for the sample and the training topics. The time for building the final index strongly depends on the chosen parameters; for a full index with up to 310 entries per list and a score threshold of 0.05, this took less than five hours on our cluster. Opt. size size[GB] P@k on NDCG@k on goal k limit l m est. real train test train test 10 100GB 19010 0.40 96.4 96.9 0.596 0.572 0.4766 0.5121 200GB 19010 0.15 170.5 170.8 0.610 0.578 0.4819 0.5153 effective- 400GB 4310 0.00 396.4 396.4 0.617 0.592 0.4874 0.5262 ness ∞ I(C) 757.0 0.614 0.578 0.4875 0.5158 oriented 100 100GB 10200 0.35 97.4 97.6 0.3899 0.3146 0.4117 0.4174 index 200GB 17900 0.15 169.1 169.4 0.3975 0.3244 0.4190 0.4256 quality 400GB 4200 0.00 394.3 394.3 0.4035 0.3176 0.4245 0.4204 ∞ I(C) 757.0 0.4108 0.3338 0.4264 0.4324 efficiency- 10 100GB 5010 0.30 87.2 87.0 0.586 0.578 0.4634 0.5174 oriented 200GB 310 0.05 128.1 127.9 0.588 0.534 0.4658 0.4732 index ∞ T (C) 22.9 0.585 0.538 0.4701 0.4852 quality 100 400GB 800 0.00 270.6 270.6 0.3848 0.2850 0.4093 0.3888 ∞ T (C) 22.9 0.3847 0.3002 0.4078 0.3954 Table 8.1: GOV2: index tuning results for absolute index quality without index com- pression. It is evident that all indexes with the estimated parameters meet the index size 160 8. Index Tuning for High-Performance Query Processing constraint. For the effectiveness-oriented quality goal, all precision results (for the training and, more importantly, also for the test topics) are better than the precision with an unpruned term-only index (significantly better under a paired t-test with p < 0.05 when the size limit is at least 200GB), so the additional combined index lists help to improve precision even when they are pruned. For the efficiency-oriented quality goal, it turns out that already very short list prefixes (310 entries for top-10, 800 entries for top-100 results) are enough to yield results with a quality comparable to standard term indexes, given a sufficiently large index size constraint. If this constraint is too tight, short lists cannot guarantee the quality target. 0.36 0.38 0.4 0.42 0.44 0.46 0.48 0.5 0.52 0.54 10 40 70 100 N D C G @ k k 0.26 0.29 0.32 0.35 0.38 0.41 0.44 0.47 0.5 0.53 0.56 0.59 0.62 10 40 70 100 P @ k k 19010,0.40 19010,0.15 4310,0.00 10200,0.35 17900,0.15 4200,0.00 I(C) 5010,0.30 310,0.05 800,0.00 T(C) Figure 8.6: P@k and NDCG@k on test topics for effectiveness- and efficiency-oriented absolute index quality without index compression. 
Although we tune for document retrieval of either the best k=10 or k=100 result documents, we are aware that sometimes it may be necessary to retrieve a number of results k′ that is different from the number of results k used for tuning. Figure 8.6 shows P@k and NDCG@k values for the test topics with all index configurations from Table 8.1 for varying numbers of retrieved results, namely 10, 40, 70, and 100 result documents. It is evident that result quality with indexes tuned for k=10 results does not degrade much when returning longer result lists, i.e., choosing k′ greater than 10. Differences for NDCG are a bit larger than for precision as we tuned our indexes using the precision measure. Compared to the original runs from the TREC 2006 Terabyte Track [BCS06], our tuned indexes do well in terms of precision. The best P@20 we get for the effectiveness-oriented goal is 0.5310 (for (10200, 0.35)), none of the P@20 values underscores 0.5210. Our best indexes outperform 14 of 20 competitors in P@20. Note that our index tuning was not carried out with the TREC 2006 topics but with the training topics and for retrieval of the top-10 or top-100 results instead of the top-20, which imposes a penalty on us. For the efficiency-oriented goal the best index (5010, 0.30) reaches a P@20 of 0.5210; very short list lengths deteriorate in later precision values, at 0.4520 for (310, 0.05). Table 8.2, which has the same layout as the table before, shows the results of in- dex tuning on the training topics with selected size limits below the collection size for 8.5 Experimental Evaluation 161 compressed indexes. Our index compression scheme is effective: an index in configu- ration (4310, 0.00) requires 396.4GB uncompressed, but only 248.8GB compressed, an index in configuration (5010, 0.30) requires 87.0GB uncompressed, but only 55.4GB compressed, and an index in configuration (310, 0.05) requires 127.9GB uncompressed, but only 94.9GB compressed. Opt. size size[GB] P@k on NDCG@k on goal k limit l m est. real train test train test 10 50GB 6810 0.75 48.3 48.1 0.589 0.586 0.4612 0.5144 70GB 19010 0.30 64.6 64.5 0.599 0.574 0.4779 0.5140 100GB 19010 0.20 98.5 98.0 0.608 0.574 0.4795 0.5139 effective- 200GB 19010 0.05 173.8 172.8 0.615 0.584 0.4861 0.5182 ness 400GB 4310 0.00 249.7 248.8 0.617 0.592 0.4874 0.5262 oriented ∞ I(C) 468.9 0.614 0.578 0.4875 0.5158 index 100 50GB 10400 0.85 49.9 49.7 0.3795 0.3124 0.4017 0.4120 quality 70GB 19800 0.30 64.9 64.7 0.3926 0.3250 0.4156 0.4249 100GB 20000 0.20 99.0 98.5 0.3970 0.3248 0.4189 0.4260 200GB 19700 0.05 174.4 173.4 0.4008 0.3264 0.4223 0.4273 400GB 14400 0.00 295.1 293.5 0.4067 0.3272 0.4263 0.4288 ∞ I(C) 468.9 0.4108 0.3338 0.4264 0.4324 10 50GB 6310 0.75 47.9 47.7 0.585 0.588 0.4591 0.5156 70GB 5010 0.30 55.6 55.4 0.586 0.578 0.4633 0.5173 100GB 310 0.05 94.9 94.9 0.588 0.534 0.4658 0.4732 efficiency- 200GB 310 0.05 94.9 94.9 0.588 0.534 0.4658 0.4732 oriented 400GB 310 0.05 94.9 94.9 0.588 0.534 0.4658 0.4732 index ∞ T (C) 14.5 0.585 0.538 0.4701 0.4852 quality 100 50GB 10400 0.85 49.9 49.7 0.3795 0.3124 0.4017 0.4120 70GB 6000 0.30 56.9 56.7 0.3847 0.2988 0.4077 0.4026 100GB 3600 0.15 84.6 84.2 0.3849 0.2946 0.4099 0.3979 200GB 900 0.00 193.8 193.5 0.3861 0.2868 0.4108 0.3902 400GB 900 0.00 193.8 193.5 0.3861 0.2868 0.4108 0.3902 ∞ T (C) 14.5 0.3847 0.3002 0.4078 0.3954 Table 8.2: GOV2: index tuning results for absolute index quality with index compres- sion. 
It is evident that all indexes with the estimated parameters meet the index size constraint, and the size estimator only slightly overestimates the final index size. For the effectiveness-oriented quality goal, all precision results for indexes with a size constraint of at least 70GB (for the training and, more importantly, also for the test topics) are better than the precision with an unpruned term-only index (significantly better under a paired t-test with p < 0.05 when the size limit is at least 200GB), so the additional combined index lists help to improve precision even when they are pruned. NDCG results behave similarly. For the efficiency-oriented quality goal, it turns out that already very short list prefixes (310 entries for top-10, 900 entries for top-100 results) are enough to yield results with a quality comparable to standard term indexes, given a sufficiently large index size constraint. If this constraint is too tight, short lists cannot guarantee the quality target. Note that index tuning for different index size limits may result in identical optimal index parameters (l,m); for example, efficiency-oriented index quality tuning for top- 162 8. Index Tuning for High-Performance Query Processing 10 retrieval and size limits between 100GB and 400GB results in (310,0.05). Recall from the description of efficiency-oriented index quality and Problem 1 in Section 8.3 that index list pruning aims at minimizing the list length l and as an afterthought the index size; anyway, the respective pruned index has to provide at least the P@10 quality of unpruned term lists. As (l,m) combinations with l < 310 or l = 310 and m > 0.05 cannot provide the same P@10 values that unpruned term lists provide, the optimal index parameters (310,0.05) remain constant. In this case, the resulting pruned index does not use the full amount of space given by the size limit, which comes as a consequence of the efficiency-oriented quality definition. Efficiency-oriented tuning aims at minimizing the maximum query processing cost per list and provides at least the same retrieval quality as using unpruned T(C) indexes. If the retrieval quality goal can be met with small indexes, we do not waste space. 0.23 0.26 0.29 0.32 0.35 0.38 0.41 0.44 0.47 0.5 0.53 0.56 0.59 10 40 70 100 P @ k k 6810,0.75 19010,0.30 19010,0.20 19010,0.05 4310,0.00 10400,0.85 19800,0.30 20000,0.20 19700,0.05 14400,0.00 I(C) 0.34 0.36 0.38 0.4 0.42 0.44 0.46 0.48 0.5 0.52 0.54 10 40 70 100 N D C G @ k k Figure 8.7: P@k and NDCG@k on test topics for effectiveness-oriented absolute index quality with index compression. 0.23 0.26 0.29 0.32 0.35 0.38 0.41 0.44 0.47 0.5 0.53 0.56 0.59 10 40 70 100 P @ k k 6310,0.75 5010,0.30 310,0.05 10400,0.85 6000,0.30 3600,0.15 900,0.00 T(C) 0.34 0.36 0.38 0.4 0.42 0.44 0.46 0.48 0.5 0.52 0.54 10 40 70 100 N D C G @ k k Figure 8.8: P@k and NDCG@k on test topics for efficiency-oriented absolute index quality with index compression. 8.5 Experimental Evaluation 163 Although we tune for retrieving either the best k=10 or k=100 result documents, it can often happen that a different number of results should be retrieved. Figures 8.7 and 8.8 show precision and NDCG values for the test topics with all index configurations from Table 8.2 for varying numbers of retrieved results, namely 10, 40, 70, and 100 result documents. 
In Figure 8.7, for each choice of k, the rightmost bar represents I(C), the baseline for effectiveness-oriented tuning; in Figure 8.8, the rightmost bar for each k represents T(C), the baseline for efficiency-oriented tuning. It is evident that result quality with pruned indexes tuned for k=10 results does not degrade much relative to the result quality provided by T(C) (the baseline for efficiency-oriented tuning) or I(C) (the baseline for effectiveness-oriented tuning) when returning more results. Even if we select the weakest setting at late precision values, namely (310,0.05), we still achieve a P@100 value of 0.26 compared to 0.30 for T(C). Differences for NDCG are slightly larger, which could be expected since we tuned for precision, not NDCG.

Relative Index Quality

We first performed an experiment to estimate good values for α, the application-specific tuning parameter that denotes the threshold for relative quality above which we accept an index configuration: we computed, for a selection of possible values for α, optimal index parameters (l,m) for the training topics under relative index quality, then instantiated the corresponding pruned indexes and compared the resulting absolute precisions (using the assessments from TREC) to the precision of the same topics with I(C) and T(C). The results of this experiment are displayed in Table 8.3. This allows us to estimate values for α that are sufficient to yield precision values similar to those of the unpruned term-only index for the efficiency-oriented measure. A good choice for α is 0.75, as p[100; I(C,l,m)] / p[100; T(C)] is then close to 1, which means that, using pruned term and pruned combined lists for top-100 document retrieval, we achieve a precision comparable to that using unpruned term lists.

α                                  0.7      0.75     0.8      0.85     0.9      0.95
p[100; I(C,l,m)] / p[100; I(C)]    0.9343   0.9471   0.9626   0.9759   0.9914   1.0010
p[100; I(C,l,m)] / p[100; T(C)]    0.9945   1.0081   1.0246   1.0388   1.0553   1.0655

Table 8.3: Relative result quality for different values of α.

Table 8.4 gives tuning results for relative index quality with uncompressed indexes. We can get close to the result quality for top-10 results of an unpruned index with the effectiveness-oriented techniques (we even get better quality for some scenarios), for both the test and the training topics. For top-100 results, the situation is slightly worse: there is a small gap to the quality of an unpruned index (which, however, may be tolerable). For the efficiency-oriented indexes, we achieve comparable or even better precision values than for the unpruned term indexes, at a reasonable index size of less than 100GB. Figure 8.9 depicts P@k and NDCG@k values for efficiency-oriented and effectiveness-oriented index quality goals on all (l,m) combinations from Table 8.4, for varying numbers of retrieved results. It is evident that the relative index quality approach ensures retrieval quality on test topics even without relevance assessments.

[Figure 8.9: P@k and NDCG@k on test topics for effectiveness- and efficiency-oriented relative index quality without index compression; plotted configurations: (12010,0.30), (19810,0.15), (19810,0.05), (18100,0.35), (19800,0.15), (19800,0.05), I(C), (1910,0.30), (8800,0.30), T(C).]
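The acceptance test against α that underlies the relative quality measures can be sketched in Java as follows; the names are hypothetical, and document identifiers are assumed to be longs.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class RelativeIndexQuality {

        // Overlap of the top-k results obtained with the pruned index and the
        // top-k results obtained with the unpruned reference index, i.e.,
        // |topK(I) ∩ topK(I')| / k.
        static double overlapAtK(List<Long> topKPruned, List<Long> topKReference, int k) {
            Set<Long> reference = new HashSet<Long>(
                    topKReference.subList(0, Math.min(k, topKReference.size())));
            int hits = 0;
            for (Long docId : topKPruned.subList(0, Math.min(k, topKPruned.size()))) {
                if (reference.contains(docId)) hits++;
            }
            return hits / (double) k;
        }

        // Efficiency-oriented relative index quality: 1/l if the overlap reaches
        // the application-specific threshold alpha, and 0 otherwise.
        static double efficiencyOriented(double overlap, double alpha, int l) {
            return (overlap >= alpha) ? 1.0 / l : 0.0;
        }
    }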
Like stated before, the result quality of indexes tuned for k=10 results does not degrade much for more retrieved results relative to T(C) and I(C), respectively and for both retrieval measures. Opt. size size[GB] overlap P@k on NDCG@k on goal k limit l m est. real on train train test train test 10 100GB 12010 0.30 99.7 100.1 0.837 0.591 0.580 0.4711 0.5179 200GB 19810 0.15 171.4 171.8 0.893 0.609 0.578 0.4809 0.5153 effective- 400GB 19810 0.05 293.3 293.4 0.924 0.615 0.584 0.4858 0.5182 ness ∞ I(C) 757.0 - 0.614 0.578 0.4875 0.5158 oriented 100 100GB 18100 0.35 100.0 100.4 0.773 0.3899 0.3252 0.4120 0.4263 index 200GB 19800 0.15 171.4 171.8 0.829 0.3983 0.3248 0.4201 0.4260 quality 400GB 19800 0.05 293.3 293.4 0.868 0.4008 0.3266 0.4222 0.4274 ∞ I(C) 757.0 - 0.4108 0.3338 0.4264 0.4324 efficiency- 10 100GB 1910 0.30 73.1 72.7 0.750 0.574 0.554 0.4549 0.5030 oriented ∞ T (C) 22.9 - 0.585 0.538 0.4701 0.4852 index 100 100GB 8800 0.30 95.3 95.4 0.750 0.3903 0.3104 0.4121 0.4125 quality ∞ T (C) 22.9 - 0.3847 0.3002 0.4078 0.3954 Table 8.4: GOV2: Index tuning results for relative index quality without index com- pression. Table 8.5 gives tuning results for relative index quality with compressed indexes. We can get close to the result quality for top-10 results of an unpruned index with the effectiveness-oriented techniques (we even get better quality for some scenarios), for both the test and the training topics. For top-100 results, the situation is slightly worse, there is a small gap to the quality of an unpruned index (which, however, may be tolerable). For the efficiency-oriented indexes, we achieve comparable or even better precisions than the unpruned term indexes, at a reasonable index size of less than 100GB. Figures 8.10 and 8.11 depict precision and NDCG values for efficiency-oriented and effectiveness-oriented index quality goals on all (l,m) combinations from Table 8.5, 8.5 Experimental Evaluation 165 Opt. size size[GB] overlap P@k on NDCG@k on goal k limit l m est. real on train train test train test 10 50GB 7010 0.55 49.9 49.7 0.776 0.584 0.588 0.4587 0.5130 70GB 19810 0.30 64.9 64.7 0.854 0.596 0.576 0.4767 0.5140 100GB 19810 0.20 98.9 98.4 0.882 0.606 0.580 0.4786 0.5139 effective- 200GB 19810 0.05 174.5 173.5 0.924 0.614 0.584 0.4858 0.5182 ness 400GB 19810 0.00 306.8 304.9 0.946 0.613 0.586 0.4854 0.5210 oriented ∞ I(C) 468.9 - 0.614 0.578 0.4875 0.5158 index 100 70GB 20000 0.30 64.9 64.8 0.786 0.3923 0.3252 0.4155 0.4248 quality 100GB 20000 0.20 99.0 98.5 0.819 0.3968 0.3250 0.4189 0.4260 200GB 19800 0.05 174.5 173.5 0.867 0.4007 0.3266 0.4223 0.4274 400GB 19800 0.00 306.7 304.8 0.903 0.4062 0.3282 0.4259 0.4290 ∞ I(C) 468.9 - 0.4108 0.3338 0.4264 0.4324 10 50GB 1910 0.30 48.6 48.4 0.750 0.572 0.556 0.4549 0.5030 70GB 1910 0.30 48.6 48.4 0.750 0.572 0.556 0.4549 0.5030 100GB 1010 0.10 91.3 91.1 0.755 0.585 0.568 0.4614 0.5064 200GB 510 0.00 176.1 176.0 0.752 0.614 0.556 0.4846 0.4957 efficiency- 400GB 510 0.00 176.1 176.0 0.752 0.614 0.556 0.4846 0.4957 oriented ∞ T (C) 14.5 - 0.585 0.538 0.4701 0.4852 index 100 70GB 8900 0.30 59.6 59.4 0.750 0.3903 0.3102 0.4123 0.4124 quality 100GB 4100 0.15 86.1 85.7 0.751 0.3874 0.2978 0.4119 0.4001 200GB 2100 0.05 130.1 129.7 0.752 0.3844 0.2940 0.4115 0.3976 400GB 1200 0.00 203.5 203.2 0.750 0.3893 0.2900 0.4127 0.3948 ∞ T (C) 14.5 - 0.3847 0.3002 0.4078 0.3954 Table 8.5: GOV2: index tuning results for relative index quality with index compres- sion. for varying numbers of retrieved results. 
It is evident that the relative index quality approach ensures retrieval quality on test topics even without relevance assessments. 8.5.3 Query Processing with GOV2 We compared the query processing performance using pruned indexes as an input for our merge-based technique from Section 7.3.5 with the recently proposed Block-Max WAND (BMW) algorithm [DS11] as a state-of-the-art document-at-a-time algorithm. BMW requires index lists sorted in document order where entries are grouped in blocks of fixed size; for each block, the maximal score, the maximal document ID, and its size are maintained to enable skipping complete blocks during execution. We extended BMW to support proximity scores, providing two kinds of index lists as input: • term index lists as described in Section 7.3.1, but ordered by document ID, and • proximity index lists as described in Section 7.3.1, but ordered by document ID. Both kinds of index lists are augmented by the block structure. We compress the document IDs by delta- and v-byte encoding and store all scores using v-byte encoding. The block size is 64 documents as used in [DS11]. We denote the respective index consisting of all term index lists for collection C by T(C)BM W and the index consisting of all term and proximity index lists for C by I′(C)BM W . In our implementation of BMW, skipped blocks are not read from disk if the index list is not in memory. Note 166 8. Index Tuning for High-Performance Query Processing 0.26 0.29 0.32 0.35 0.38 0.41 0.44 0.47 0.50 0.53 0.56 0.59 10 40 70 100 P @ k k 7010,0.55 19810,0.30 19810,0.20 19810,0.05 19810,0.00 20000,0.30 20000,0.20 19800,0.05 19800,0.00 I(C) 0.36 0.38 0.4 0.42 0.44 0.46 0.48 0.5 0.52 0.54 10 40 70 100 N D C G @ k k Figure 8.10: P@k and NDCG@k on test topics for effectiveness-oriented relative index quality with index compression. 0.26 0.29 0.32 0.35 0.38 0.41 0.44 0.47 0.5 0.53 0.56 0.59 10 40 70 100 P @ k k 1910,0.30 1010,0.10 510,0.00 8900,0.30 4100,0.15 2100,0.05 1200,0.00 T(C) 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 10 40 70 100 N D C G @ k k Figure 8.11: P@k and NDCG@k on test topics for efficiency-oriented relative index quality with index compression. 8.5 Experimental Evaluation 167 that it may seem appealing to simply store term position information in the term list entries and use this for proximity scoring. However, it would no longer be possible to skip blocks with this simple solution since there are no maximal proximity scores. On the other hand, using precomputed proximity lists could help to improve performance (as shown, for example, in Section 7.3 for NRA, a standard top-k algorithm and TopX in RR-LAST mode). Please note that BMW evaluates queries in a disjunctive manner like the n-way merge joins described in Section 7.3.5: this means that matched documents neither have to contain all query terms nor all query terms have to appear within a maximum distance to each other. Further note that any processing algorithm would show similar performance when run on the pruned lists; the goal of this section is to compare query performance with pruned indexes to dynamic pruning on unpruned indexes. To assess processing performance, we mainly use query processing times, but we also consider abstract cost measures such as the number of opened lists, the average number of entries or bytes read from disk. These abstract measures are not influenced by transient effects like caching or other processes running on the same machine and mask out the quality of the actual implementation. 
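To illustrate the block structure described above, the following sketch shows the per-block metadata and one possible form of the block-skipping test. This is only an illustration of the block-max idea under our assumptions, not our actual BMW implementation, and all names are hypothetical.

    // Hypothetical names; per-block metadata and a possible skipping test.
    final class BlockMetadata {
        final long maxDocId;   // largest document id in the block
        final double maxScore; // largest score of any entry in the block
        final int sizeInBytes; // size of the (compressed) block, needed to skip it on disk

        BlockMetadata(long maxDocId, double maxScore, int sizeInBytes) {
            this.maxDocId = maxDocId;
            this.maxScore = maxScore;
            this.sizeInBytes = sizeInBytes;
        }

        // A block cannot contribute a top-k result if even its maximal score,
        // added to the score upper bound from the other query lists, does not
        // exceed the score of the current k-th best document.
        static boolean canSkip(double upperBoundFromOtherLists,
                               double blockMaxScore,
                               double topKThreshold) {
            return upperBoundFromOtherLists + blockMaxScore <= topKThreshold;
        }
    }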
We consider two extreme settings: (1) With hot caches, all index lists are loaded into memory before running the first query, corresponding to the setting used in the BMW paper [DS11]. (2) With cold caches, all index lists are completely loaded from disk, which is ensured by flushing the file system cache before running each query, which corresponds to a very conservative setting. We will examine other caching scenarios later in this section. Processing times are measured with a single-threaded, Java-based implementation running on a single core of a single cluster node. These measurements were taken by running the complete batch of queries five times and taking the average. In addition, we invoke the garbage collector before running each query to avoid side effects caused by garbage collection during query execution. Whenever we use the average symbol �, we build the average over all topics of the query load under consideration. For the mere sake of completeness, we additionally provide NRA-based query performance values for runs that employ indexes without compression. In the tables, those runs are denoted T(C) and I(C), respectively. For all measurements in this section, processing the training topics accessed 300 term lists and 334 combined lists, and processing the test topics accessed 144 term lists and 153 combined lists (for both pruned and non-pruned index lists). Results with uncompressed indexes are depicted in Tables 8.6 and 8.7 for training and test topics that show the number of read index entries as well as the number of bytes and runtimes with cold and warm caches, averaged over all topics. Results with the top-k algorithm NRA on the unpruned indexes are included in the rows for I(C) and T(C), respectively. For the efficiency-oriented indexes, these results clearly demonstrate that query processing on the pruned indexes is up to two orders of magnitude more efficient than on the unpruned indexes. For top-10 results, we require less than 1,800 reads per topic on average with an index of 128GB size, which is less than one disk block per index list. For the effectiveness-oriented indexes, the pruned index requires 168 8. Index Tuning for High-Performance Query Processing Opt. size size[GB] �reads·105 �bytes·105 �thot[ms] �tcold[ms] goal k limit l m est. real train test train test train test train test 10 100GB 19010 0.40 96.4 96.9 0.63 0.61 6.01 5.81 28.39 31.61 226.89 243.57 200GB 19010 0.15 170.5 170.8 0.66 0.64 6.56 6.24 28.21 27.78 232.91 245.80 effective- 400GB 4310 0.00 396.4 396.4 0.21 0.19 2.32 2.07 9.53 8.61 177.43 182.61 ness ∞ I(C) 757.0 8.43 3.37 71.73 29.70 898.29 429.83 1368.07 1020.75 oriented 100 100GB 10200 0.35 97.4 97.6 0.38 0.36 3.76 3.53 16.43 15.43 199.59 203.76 index 200GB 17900 0.15 169.1 169.4 0.63 0.61 6.27 5.96 27.52 25.96 236.69 243.03 quality 400GB 4200 0.00 394.3 394.3 0.20 0.19 2.27 2.03 9.34 8.47 174.68 175.37 ∞ I(C) 757.0 16.81 12.91 71.73 29.70 1978.76 1628.06 2276.06 2068.93 efficiency- 10 100GB 5010 0.30 87.2 87.0 0.21 0.19 2.14 2.00 8.87 8.44 172.76 181.61 oriented 200GB 310 0.05 128.1 127.9 0.02 0.02 0.21 0.20 1.19 1.11 135.03 131.86 index ∞ T (C) 22.9 14.05 9.45 112.40 75.60 1550.40 926.13 1764.85 1311.10 quality 100 400GB 800 0.00 270.6 270.6 0.04 0.04 0.52 0.49 2.30 2.22 145.27 150.60 ∞ T (C) 22.9 20.33 15.09 112.40 75.60 3453.74 2159.41 4078.60 2669.32 Table 8.6: GOV2: query performance for absolute index quality without index com- pression. Opt. size size[GB] �reads·105 �bytes·105 �thot[ms] �tcold[ms] goal k limit l m est. 
real train test train test train test train test 10 100GB 12010 0.30 99.7 100.1 0.44 0.41 4.31 4.06 20.46 22.00 203.65 214.74 200GB 19810 0.15 171.4 171.8 0.69 0.66 6.77 6.44 29.81 29.77 232.32 243.00 effective- 400GB 19810 0.05 293.3 293.4 0.72 0.68 7.34 6.86 31.95 29.81 240.28 257.10 ness ∞ I(C) 757.0 8.43 3.37 71.73 29.70 898.29 429.83 1368.07 1020.75 oriented 100 100GB 18100 0.35 100.0 100.4 0.61 0.59 5.83 5.61 26.35 24.85 219.07 232.50 index 200GB 19800 0.15 171.4 171.8 0.69 0.66 6.76 6.44 30.07 27.99 245.75 259.52 quality 400GB 19800 0.05 293.3 293.4 0.72 0.68 7.34 6.86 32.07 29.47 247.53 258.77 ∞ I(C) 757.0 16.81 12.91 71.73 29.70 1978.76 1628.06 2276.06 2068.93 efficiency- 10 100GB 1910 0.30 73.1 72.7 0.09 0.08 0.96 0.87 4.05 3.77 146.11 151.57 oriented ∞ T (C) 22.9 14.05 9.45 112.40 75.60 1550.40 926.13 1764.85 1311.10 index 100 100GB 8800 0.30 95.3 95.4 0.33 0.32 3.37 3.15 14.67 13.53 187.33 191.77 quality ∞ T (C) 22.9 20.33 15.09 112.40 75.60 3453.74 2159.41 4078.60 2669.32 Table 8.7: GOV2: query performance for relative index quality without index compres- sion. up to one order of magnitude less reads than the unpruned index. For absolute index quality tuning, query performance for larger indexes is actually better, because the smaller indexes need to use long list length cutoffs, but high minscore cutoffs to meet the index size constraint, which makes query processing expensive. For relative index quality tuning, query performance for larger indexes slightly deteriorates, because the larger indexes use longer list length cutoffs but also provide higher precision values. The runtimes reported in these tables demonstrate that the theoretical cost advantage of our approach is very beneficial in practice for hot cache as well as cold cache scenarios, with average hot cache times of about 1ms for top-10 retrieval with the best efficiency- oriented index. This corresponds to two to three orders of magnitude performance advantage over standard top-k algorithm evaluation on unpruned term index lists. Unlike that, the number of read items and the runtime of our technique does not 8.5 Experimental Evaluation 169 increase when retrieving more than 10 results by the nature of the merge join. Opt. size size[GB] �reads·105 �bytes·105 �thot[ms] �tcold[ms] goal k limit l m est. 
real train test train test train test train test 10 50GB 6810 0.75 48.3 48.1 0.26 0.25 1.33 1.26 8.09 8.21 90.47 85.26 70GB 19010 0.30 64.6 64.5 0.64 0.61 2.93 2.82 17.70 17.98 136.49 133.44 100GB 19010 0.20 98.5 98.0 0.66 0.63 3.11 2.96 17.58 18.52 138.57 135.51 effective- 200GB 19010 0.05 173.8 172.8 0.70 0.66 3.37 3.16 18.54 18.16 151.66 139.99 ness 400GB 4310 0.00 249.7 248.8 0.21 0.19 1.18 1.06 6.31 5.90 82.11 73.19 oriented ∞ I′(C)BM W 221.0 3.96 2.58 19.13 14.57 66.65 49.88 594.42 506.89 index 100 50GB 10400 0.85 49.9 49.7 0.38 0.36 1.83 1.72 11.43 11.43 115.27 107.04 quality 70GB 19800 0.30 64.9 64.7 0.66 0.63 3.01 2.91 17.66 18.43 143.13 141.09 100GB 20000 0.20 99.0 98.5 0.69 0.66 3.22 3.08 18.28 19.77 149.89 152.08 200GB 19700 0.05 174.4 173.4 0.72 0.68 3.46 3.25 19.36 18.74 157.14 148.71 400GB 14400 0.00 295.1 293.5 0.59 0.54 2.98 2.71 16.44 15.58 149.70 144.95 ∞ I′(C)BM W 221.0 7.07 4.79 30.72 22.96 118.67 78.49 676.27 553.75 10 50GB 6310 0.75 47.9 47.7 0.24 0.23 1.26 1.19 6.92 6.83 89.12 80.72 70GB 5010 0.30 55.6 55.4 0.21 0.19 1.10 1.03 6.06 5.90 82.32 71.26 100GB 310 0.05 94.9 94.9 0.02 0.02 0.12 0.11 0.81 0.88 13.82 13.25 200GB 310 0.05 94.9 94.9 0.02 0.02 0.12 0.11 0.81 0.88 13.82 13.25 efficiency- 400GB 310 0.05 94.9 94.9 0.02 0.02 0.12 0.11 0.81 0.88 13.82 13.25 oriented ∞ T (C)BM W 10.5 5.76 3.62 25.26 18.08 73.28 61.13 546.09 466.72 index 100 50GB 10400 0.85 49.9 49.7 0.38 0.36 1.83 1.72 11.43 11.43 115.27 107.04 quality 70GB 6000 0.30 56.9 56.7 0.24 0.23 1.26 1.18 7.00 6.82 90.02 84.38 100GB 3600 0.15 84.6 84.2 0.16 0.15 0.91 0.83 5.04 4.90 72.40 61.75 200GB 900 0.00 193.8 193.5 0.05 0.05 0.32 0.30 1.89 1.86 16.42 15.86 400GB 900 0.00 193.8 193.5 0.05 0.05 0.32 0.30 1.89 1.86 16.42 15.86 ∞ T (C)BM W 10.5 8.21 5.40 34.17 24.92 102.50 68.76 600.41 495.58 Table 8.8: GOV2: query performance for absolute index quality with index compres- sion. Tables 8.8 and 8.9 show the result of the performance experiments for training and test topics. They include the number of read index entries as well as the number of bytes and runtimes with cold and hot caches, averaged over all topics. Results with BMW on unpruned indexes are included in the rows for I′(C)BM W and T(C)BM W ; here, we only count the number of index entries in blocks we load into main memory, not in all blocks of the lists. Due to anomalies in the corresponding tables in [BS12], we have measured the average cold cache times again and have updated Tables 8.8 and 8.9 accordingly. For the efficiency-oriented indexes, these results clearly demonstrate that query processing on the pruned indexes can be one order of magnitude more efficient than BMW for both hot and cold caches. For top-10 results, we read less than 1,800 index entries (approximately 12KB) per topic on average with an index of 94.9GB size, whereas BMW needs to access more than 360,000 entries per topic. Similar results can be achieved for top-100 results with an index of 193.5GB size. For the effectiveness- oriented indexes, the performance gap is smaller. Note that query performance for larger indexes is sometimes better because the smaller indexes need to use long list length cutoffs, but high minscore cutoffs to meet the index size constraint, which makes query processing expensive. For relative index quality tuning, query performance for larger indexes slightly deteriorates, because the larger indexes use longer list length cutoffs but usually also provide higher precision and NDCG values. The runtimes 170 8. 
Index Tuning for High-Performance Query Processing Opt. size size[GB] �reads·105 �bytes·105 �thot[ms] �tcold[ms] goal k limit l m est. real train test train test train test train test 10 50GB 7010 0.55 49.9 49.7 0.27 0.26 1.37 1.29 12.17 12.74 96.38 89.20 70GB 19810 0.30 64.9 64.7 0.66 0.63 3.01 2.91 30.24 28.53 143.48 135.55 100GB 19810 0.20 98.9 98.4 0.68 0.65 3.20 3.05 31.38 26.59 150.23 138.29 200GB 19810 0.05 174.5 173.5 0.72 0.68 3.47 3.26 32.27 30.79 157.71 137.68 effective- 400GB 19810 0.00 306.8 304.9 0.76 0.71 3.75 3.46 32.65 30.87 165.67 145.72 ness ∞ I′(C)BM W 221.0 3.96 2.58 19.13 14.57 66.65 49.88 594.92 506.89 oriented 100 50GB 10300 0.80 50.0 49.9 0.37 0.35 1.82 1.71 16.62 14.92 113.14 107.20 index 70GB 20000 0.30 64.9 64.8 0.66 0.64 3.03 2.93 34.14 26.08 145.09 136.99 quality 100GB 20000 0.20 99.0 98.5 0.69 0.66 3.22 3.07 28.74 29.66 149.69 142.79 200GB 19800 0.05 174.5 173.5 0.72 0.68 3.47 3.26 34.67 35.78 155.30 145.58 400GB 19800 0.00 306.7 304.8 0.76 0.71 3.75 3.46 33.23 29.95 168.20 150.73 ∞ I′(C)BM W 221.0 7.07 4.79 30.72 22.96 118.67 78.49 676.27 553.75 10 50GB 1910 0.30 48.6 48.4 0.09 0.08 0.52 0.47 3.79 3.03 40.36 36.04 70GB 1910 0.30 48.6 48.4 0.09 0.08 0.52 0.47 3.79 3.03 40.36 36.04 100GB 1010 0.10 91.3 91.1 0.05 0.05 0.33 0.31 2.23 1.95 27.93 27.38 200GB 510 0.00 176.1 176.0 0.03 0.03 0.19 0.18 1.72 1.13 19.62 18.27 efficiency- 400GB 510 0.00 176.1 176.0 0.03 0.03 0.19 0.18 1.72 1.13 19.62 18.27 oriented ∞ T (C)BM W 10.5 5.76 3.62 25.26 18.08 73.28 61.13 546.09 466.72 index 100 50GB 10300 0.80 50.0 49.9 0.37 0.35 1.82 1.71 16.62 14.92 113.14 107.20 quality 70GB 8900 0.30 59.6 59.4 0.34 0.32 1.69 1.59 15.17 13.72 115.14 104.44 100GB 4100 0.15 86.1 85.7 0.18 0.17 1.01 0.92 7.40 8.21 78.90 70.51 200GB 2100 0.05 130.1 129.7 0.10 0.10 0.62 0.57 5.00 5.48 53.33 47.13 400GB 1200 0.00 203.5 203.2 0.06 0.06 0.41 0.38 2.82 3.24 36.07 27.97 ∞ T (C)BM W 10.5 8.21 5.40 34.17 24.92 102.50 68.76 600.41 495.58 Table 8.9: GOV2: query performance for relative index quality with index compression. reported in these tables demonstrate that the theoretical cost advantage of our approach is very beneficial in practice for hot cache as well as cold cache scenarios, with average hot cache times of about 1ms for top-10 retrieval with the best efficiency-oriented index. Please note that we have used pruned test indexes in the tables. The (310,0.05) test index needs on average 13.82ms and 13.25ms in a cold cache scenario for training and test topics, respectively. This seems to be extremely fast and is probably influenced by the dense arrangement of the test index structures on the hard disk and the resulting non-controllable disk caching side effects. To allow a comparison, we have built the (310,0.05) full index: for training and test topics �tcold values are 75.13ms and 68.65ms, respectively. Unlike BMW, the number of read entries and the runtime of our technique does not increase when retrieving more than 10 results by the nature of the merge join (however at the price of a slightly reduced result quality). As an interesting side result, we see that the additional proximity lists can sometimes improve query performance for BMW because they allow tighter score bounds, which is similar to the earlier results in Chapter 7 for the standard top-k algorithm NRA and TopX in RR-LAST mode. 
For top-10 retrieval, BMW with term and proximity lists (denoted as I′(C)BM W in Table 8.8) takes on average 66ms with hot caches for the training topics and reads on average 396K entries, whereas using only term lists (T(C)BM W ) takes on average 73ms and reads on average 576K entries. With cold caches, BMW with only term indexes is better due to the expensiveness of opening more index lists (546ms vs 594ms). 8.5 Experimental Evaluation 171 cache cache hit ratio #non-cached l m size[MB] [bytes] [#lists] lists �twarm[ms] 310 0.05 8 28.98% 29.29% 161,393 39.85 310 0.05 16 37.05% 37.08% 143,613 36.04 310 0.05 32 44.36% 43.89% 128,069 32.70 310 0.05 64 50.54% 49.39% 115,525 29.67 310 0.05 1024 54.44% 52.77% 107,801 28.92 Table 8.10: Efficiency Track: real system performance, merge join, various LRU cache sizes with a (310,0.05) full index. For efficiency reasons, storing position information in the term-only index to com- pute proximity scores on the fly as a document is encountered is not an option for us. As argued in Section 7.2.2, that approach is not feasible for top-k style processing as it is not possible to compute tight score bounds for candidates which in turn disables dynamic pruning. To compute the top-k results efficiently, we need to precompute proximity informa- tion into index lists that can be sequentially scanned and compute tight score bounds for early pruning. As briefly discussed in Section 6.3, an alternative would be to first determine a set of candidate documents with ’good’ term list scores and later re-rank only the candidate documents by computing proximity scores from their position infor- mation (cf. Section 2.4.1 for an example). This requires a large enough set of candidate documents in the first step, which is potentially expensive to compute - if we choose the set too small, we may leave out potentially relevant documents and decrease result quality for the top-k results. As shown in Table 8.8 for BMW (T(C)BM W ), this can easily cause high additional processing costs. In a second line of experiments, we ran the 50,000 queries from the TREC Terabyte Efficiency Track 2005 with the fastest index configuration determined by efficiency- oriented index-tuning, the (310, 0.05) full index setting, comparing it again to BMW. In addition to the hot and cold cache settings used before, we also consider warm caches, a more realistic simulation of a running system. We implemented an LRU cache of configurable size to store the least recently used index lists. This LRU cache was emptied before running the first query; we then ran all queries sequentially. To minimize side effects caused by file system caching and garbage collector activities during query processing, we emptied the file system cache and invoked the garbage collector before executing each query. In this scenario that corresponds to a steady-state execution in a running search engine, processing with the (310, 0.05) full index takes less than 30ms on average for an LRU cache size of 64MB, compared to an average of 0.7ms for hot caches and 127ms for cold caches, respectively. Table 8.10 shows performance values for query processing with a (310, 0.05) full index for different cache sizes. It depicts the cache hit ratio both for the number of read bytes (cache hit ratio[bytes]) and number of read lists (cache hit ratio[#lists]) as well as the number of non-cached lists and the average warm cache running times. Even for a very small cache size of 8MB, we need less than 40ms on average to process 172 8. 
Index Tuning for High-Performance Query Processing cache cache hit ratio #non-cached k index size[MB] [bytes] [#lists] lists #read blocks �twarm[ms] 10 T (C)BM W 64 3.62% 2.15% 115,938 397,735,689 204.63 100 T (C)BM W 64 3.62% 2.15% 115,938 573,001,133 259.01 10 T (C)BM W 1024 46.39% 26.58% 86,990 397,735,689 173.16 10 I′(C)BM W 1024 44.02% 14.57% 197,638 312,500,850 206.36 Table 8.11: Efficiency Track: real system performance, BMW, various LRU cache sizes. a query, and starting with a cache size of 64MB, we need less than 30ms on average. Note that the overall number of index lists used in this experiment is 228,247. 0 50 100 150 200 250 300 350 400 1 2 3 4 5 6 7 8 9 number of query terms ø time[ms], 8MB ø time[ms], 16MB ø time[ms], 32MB ø time[ms], 64MB ø time[ms], 1GB time[ms]), 8MB time[ms]), 16MB time[ms]), 32MB time[ms]), 64MB time[ms]), 1GB Figure 8.12: Efficiency Track: real system performance for a (310,0.05) full index for various query and LRU cache sizes. Figure 8.12 depicts the average running times and their standard deviations de- pending on the number of keywords in the query and the cache size. The larger the cache size, the higher the cache hit ratio which lowers the average processing time. As expected, the average running time is monotonous in the number of query terms as more query terms potentially lead to more fetched index lists at processing time. However the standard deviation of running times for a given query length is low (and does not depend on the cache size) such that the average running time is usually a good approximation for the expected running time of a query. Table 8.11 shows performance values for query processing with BMW on T(C)BM W and I′(C)BM W indexes with varying cache sizes. The number of processed lists for this query load amounts to 118,483 for T(C)BM W which consist of 1,061,659,742 blocks in 8.5 Experimental Evaluation 173 196.9GB size, and 231,346 for I′(C)BM W indexes which consist of 1,099,312,741 blocks in 202.5GB size9. As expected, increased cache sizes help to speed up query processing (204.63ms vs 173.16ms for 64MB vs 1GB LRU cache size with T(C)BM W ). I′(C)BM W is slower than T(C)BM W at the same cache size: although the number of read blocks decreases, more non-cached lists have to be loaded. We see that the run time increases with growing k. While processing queries with T(C)BM W with 64MB LRU cache size takes 204.63ms to retrieve k=10 results, the same index requires 259.01ms to retrieve k=100 results. In contrast, the run time is independent of the result set cardinality for our pruned indexes as they are processed completely by an n-way merge join. If we compare the warm cache performance of our pruned lists processed in a merge join algorithm (Table 8.10) to that of BMW with a T(C)BM W index (Table 8.11) at an LRU cache size of 64MB, we observe that the number of non-cached lists that have to be fetched from hard disk is similar (115,525 vs 115,938). Anyway we achieve a speedup of a factor of 7 (�twarm=204.63ms) for top-10 and a factor of 9 for top-100 retrieval (�twarm=259.63ms), since the cache hit ratio measured in bytes is about 50% for our approach compared to less than 4% for the BMW approach. To achieve a similar cache hit ratio for the T(C)BM W index as for our approach, we need to increase the LRU cache size to 1GB. This means that, compared to our approach, we need 16 times as much cache at a processing speed that is 5 times slower. 
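A byte-bounded LRU cache for index lists, as used in the warm cache experiments, can be sketched with an access-ordered LinkedHashMap. The sketch below is illustrative (names are hypothetical), and the actual implementation may differ.

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    final class IndexListCache {
        private final long capacityInBytes;
        private long usedBytes = 0;

        // accessOrder = true yields least-recently-used iteration order
        private final LinkedHashMap<String, byte[]> cache =
                new LinkedHashMap<String, byte[]>(1024, 0.75f, true);

        IndexListCache(long capacityInBytes) {
            this.capacityInBytes = capacityInBytes;
        }

        synchronized byte[] get(String listKey) {
            return cache.get(listKey); // null signals a cache miss: load the list from disk
        }

        synchronized void put(String listKey, byte[] compressedList) {
            byte[] old = cache.put(listKey, compressedList);
            if (old != null) usedBytes -= old.length;
            usedBytes += compressedList.length;
            // evict least recently used lists until the byte budget is respected again
            Iterator<Map.Entry<String, byte[]>> it = cache.entrySet().iterator();
            while (usedBytes > capacityInBytes && it.hasNext()) {
                Map.Entry<String, byte[]> lru = it.next();
                usedBytes -= lru.getValue().length;
                it.remove();
            }
        }
    }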
8.5.4 Log-based Pruning with GOV2 We evaluated our log-based technique for pruning term pairs from the index, using the same training and test queries as before and the AOL query log. Note that all indexes in this section are uncompressed as we consider log-based pruning and compression of indexes as orthogonal ways to shrink index sizes whose effects are shown separately. Table 8.12 shows index tuning results for t=1, i.e., materializing combined lists for Opt. size size[GB] P@k on goal k limit l m est. real train test effectiveness- 10 100GB 5010 0.00 96.0 96.0 0.610 0.586 oriented 100 100GB 19800 0.10 93.6 93.6 0.3927 0.3192 index quality 400GB 20000 0.00 293.4 293.4 0.3969 0.3196 efficiency- 10 50GB 2210 0.15 47.8 47.8 0.538 0.462 oriented 100GB 1810 0.00 64.7 64.7 0.587 0.554 index 100 50GB 7200 0.35 50.0 50.2 0.3771 0.3036 quality 100GB 3800 0.00 86.2 86.2 0.3847 0.2980 Table 8.12: Index tuning results with log-based pruning (t=1) for absolute index quality. term pairs that occur at least once in the query log. It is evident that using log-based pruning helps to get smaller index sizes for the efficiency-based techniques. The index size reduces down to 47.8GB, which is approximately twice the size of the unpruned, uncompressed term index (22.9GB). The result quality of indexes created by log-based 9There are more lists here because our pruning technique may completely drop a pair when all entries in its list have a score below the minscore threshold. 174 8. Index Tuning for High-Performance Query Processing pruning remains similar as for the unpruned, uncompressed term index. However, index tuning using log-based pruning results in much longer index lists than index tuning without log-based pruning. The longer lists in turn affect runtime (cf. Table 8.13): Opt. P@k �reads·105 �thot[ms] �tcold[ms] goal k l m train test train test train test train test efficiency- 10 2210 0.15 0.538 0.462 1.22 1.97 162.4 241.6 427.1 514.2 oriented 1810 0.00 0.587 0.554 1.20 1.96 162.2 229.6 441.3 525.7 index 100 7200 0.35 0.3771 0.3036 1.38 2.12 229.4 388.8 504.3 685.7 quality 3800 0.00 0.3847 0.2980 1.29 2.03 255.6 341.8 471.0 649.7 Table 8.13: Query performance with log-based pruning (t=1) for absolute index quality. compared to the best results without log-based pruning, query processing takes an order of magnitude longer. The longer run times are due to the score-ordered part of the unpruned term lists as detailed in Section 8.4 which may be processed to a large extent in the NRA phase to preserve retrieval quality if all corresponding combined lists are missing. Anyway, it is still faster than BMW with unpruned T(C) in the cold cache setting. We could not achieve the quality goal for the effectiveness-based methods as there were not enough combined lists left to boost quality enough; we did not evaluate performance for these settings. Log-based pruning therefore mostly serves to reduce index size in situations with strong resource constraints where indexes need to be loaded from disk. Here, it can still improve execution cost, while result quality stays comparable to a term-only index. 8.5.5 Summary of Conclusions and Limitations of the Approach Finding (l,m) parameters to build pruned GOV2 indexes is reasonably fast and takes about five hours, while building a final full index is dependent on the resulting parame- ters and takes less than five hours for (l,m)=(310,0.05). 
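As a reference for the pruning steps summarized here, the following is a minimal sketch of how the (l, m) cutoff and the log-based term pair selection (t=1) from Section 8.5.4 can be applied when building a pruned index; the query log format, the tokenization, and the in-memory data structures are simplified assumptions, not our distributed implementation.

```python
from itertools import combinations

def log_term_pairs(query_log, t=1):
    """Unordered term pairs that occur in at least t queries of the log (simplified tokenization)."""
    counts = {}
    for query in query_log:
        terms = sorted(set(query.lower().split()))
        for pair in combinations(terms, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, count in counts.items() if count >= t}

def prune_list(entries, l, m=0.0):
    """Keep at most the first l entries of a score-ordered list whose score is at least m."""
    return [(doc, score) for doc, score in entries[:l] if score >= m]

def build_log_pruned_pair_index(combined_lists, query_log, l, m, t=1):
    """Materialize pruned combined lists only for term pairs seen in the query log.
    combined_lists: {(t1, t2): [(docid, score), ...]} sorted by descending score;
    for merge join processing, the pruned lists would then be re-sorted by docid."""
    keep = log_term_pairs(query_log, t)
    return {pair: prune_list(lst, l, m)
            for pair, lst in combined_lists.items() if pair in keep}
```

Term lists are pruned with the same (l, m) cutoff but are never dropped; only combined lists are subject to the log-based selection.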
Already very short list prefixes of term and combined lists are sufficient to yield a result quality comparable to the one of unpruned term lists. This comes at the expense of more opened lists but saves on the number of read bytes and tuples, respectively. As shown for BMW, processing unpruned term indexes using dynamic pruning techniques is more expensive than pro- cessing pruned term and combined lists in an n-way merge join. P@k’ and NDCG@k’ values of pruned indexes tuned for k results do not degrade much for larger result set cardinalities k′. We show that, in the absence of relevance assessments, we can use the overlap between top-k results on pruned term and combined lists and the top-k results on unpruned lists as a substitute for the relevance assessments. The relative index quality approach ensures retrieval quality on test topics even without relevance assessments for the training topics – it works better for early P@k and NDCG@k, but does not degrade much for larger k. Processing performance for training and test topics is excellent compared to the BMW algorithm. For a bigger query load, the 50,000 TREC Terabyte Efficiency Track queries from 2005, our experiments show the viability of our approach at an LRU cache size of 64MB: although the number of non-cached lists for our pruned index and the T(C)BM W index 8.5 Experimental Evaluation 175 is comparable, our processing speed is 7 and 9 times faster for top-10 and top-100 retrieval, respectively, as the longer lists of the T(C)BM W index require more cache space. To achieve a similar cache ratio for the T(C)BM W index as for our pruned index, we would need to increase the LRU cache size to 1GB, at a processing speed which is still 5 times slower than for our approach. Log-based pruning helps to get smaller resulting indexes at similar result quality at the expense of increased list lengths and an increased number of loaded lists which incurs increased running time during query processing. 8.5.6 Results with ClueWeb09 size[GB] P@k �reads·105 �bytes·105 �thot[ms] �tcold[ms] Opt. goal k (l, m) est. real train test train test train test train test train test effect.-o. 10 (1410,1.0) 640.2 636.2 0.200 - 0.04 0.04 0.24 0.18 2.32 1.25 67.35 56.38 index qual. 100 (4600,0.3) 979.0 977.9 0.1344 - 0.14 0.11 0.75 0.56 5.35 3.33 83.69 61.44 effic.-o. 10 (410,1.0) 503.6 499.8 0.190 0.2292 0.01 0.01 0.07 0.06 0.77 0.47 79.61 71.15 index qual. 100 (400,1.0) 464.6 462.8 0.1170 - 0.01 0.01 0.07 0.06 0.63 0.44 69.05 55.60 Table 8.14: ClueWeb09: index tuning results for absolute index quality and evaluation of query performance, size limit set to S=1TB. To demonstrate the scalability of our index tuning approach, we carried out experi- ments on the ClueWeb09 collection similar to the ones shown in Section 8.5.2 for GOV2. For all experiments we keep the index size limit of S=1TB fixed which corresponds to about 17% of the size of the uncompressed spam-reduced English part of the ClueWeb09 collection (cf. Section 3.2.1 for details). We have published the corresponding results in [BS10]. Again, we consider two baselines, the unpruned T(C) and I(C) indexes; however, as building the unpruned I(C) index exceeded our disk capacity, we had to limit each index list to the first 20 million entries. We expect that this restriction will—if at all—have only a negligibly small influence on the result quality. 
The result quality of results created with these baselines was assessed using the available assessments for the training topics; here, P@10 was 0.180 for T(C) and 0.198 for I(C), and P@100 was 0.1110 for T(C) and 0.1324 for I(C). For the test topics, we submitted runs created for the baseline indexes to the 2010 TREC Web Track, Ad Hoc Task, which yielded a P@10 of 0.2250 for both T(C) and I(C) indexes, which was somehow unexpected (we had expected I(C) to yield a higher precision than T(C)). This may be partially caused by sparser relevance assessments for the Web Track Topics, partially by giving a lower assessment priority to our two baseline runs compared to our third run (which was created with one of the pruned indexes). For P@100 our expectations are met: I(C) yields a precision of 0.1358 compared to T(C) which yields a precision of 0.1294. Table 8.14 shows the results for absolute index quality tuning on the training topics for compressed ClueWeb09 indexes. With index parameters tuned for efficiency and top-10 document retrieval, the best index configuration turns out to be (410, 1.00) at 176 8. Index Tuning for High-Performance Query Processing an index size of less than 500GB. Processing training topics with this pruned index requires 1,302 reads on average per topic and takes less than 1ms for hot and 80ms for cold caches, providing a result quality comparable to T(C). Processing test topics with the same index is even slightly faster due to shorter index lists (1,045 reads on average), reflected in improved hot cache times and cold cache times of about 70ms. The corresponding run has been submitted to the 2010 TREC Web Track, Ad Hoc Task as well, with a P@10 of 0.2292, which is slightly higher than the P@10 of T(C) and slightly higher than the P@10 value for the training topics. As NDCG@k has not been used as retrieval quality metric in the 2010 TREC Web Track, Ad Hoc Task, we only report precision values. Efficiency-oriented indexes for top-100 document retrieval require less than 500GB disk space and provide running times of less than 1ms for hot caches and around 70ms for cold caches on the training topics at a result quality comparable to the T(C) baseline run. Query processing on the test topics is again faster, with 55ms on average with cold caches. Our effectiveness-oriented index tuned for top-10 retrieval (1410, 1.00) requires 640GB and provides a retrieval quality comparable to I(C). Query execution takes about 2ms for the training topics and slightly more than 1ms for the test topics for hot caches. Effectiveness-oriented indexes for top-100 retrieval require less than 1TB disk space and thus stay within our index size limit, providing again a result quality comparable to I(C). Here, query execution takes about 5ms and 3ms for hot caches, whereas cold cache times range below 85ms and 65ms for training and test topics, respectively. The results show that the size estimator also works effectively on the ClueWeb09 collection with only minor overestimation. size[GB] �reads �bytes �thot �tcold Opt. goal k (l, m) est. real overlap P@k ·104 ·105 [ms] [ms] effect.-o. 10 (9810,1.00) 927.9 923.6 0.986 0.198 2.89 1.45 8.46 152.17 index qual. 100 (19000,0.80) 1008.3 1006.5 0.972 0.1330 5.52 2.72 13.77 135.93 effic.-o. 10 (110,1.00) 395.3 391.6 0.856 0.160 0.04 0.02 0.28 70.89 index qual. 100 (200,1.00) 408.1 407.6 0.781 0.1010 0.06 0.04 0.43 82.95 Table 8.15: ClueWeb09: index tuning results for relative index quality and evaluation of query performance, size limit set to S=1TB. 
Table 8.15 shows the results for relative index quality tuning on ClueWeb09 with the training topics. While the effectiveness-oriented approaches result in indexes which deliver result quality comparable to I(C) (at the price of longer lists compared to ab- solute index quality), result quality with the efficiency-oriented indexes falls shortly behind BM25 score quality, but the difference would still be tolerable in applications. We assume that this effect can at least partly be attributed to the fact that relevance assessments from TREC 2009 are very sparse compared to those from earlier years; unassessed documents contribute to the overlap with the groundtruth, but do not in- crease precision values if they are in the result list of a query, even though a user may consider them relevant. Although the indexed part of the ClueWeb09 collection is one order of magnitude 8.5 Experimental Evaluation 177 larger in size than GOV2 (6TB vs. 426GB uncompressed), the required index space does not grow as fast as the collection (e.g., index size grows from 94.9GB to 499.8GB for the efficiency setting (310, 0.05) on GOV2 compared to (410, 1.00) on ClueWeb09). For absolute index quality tuning, the indexes tend to have shorter list lengths on ClueWeb09 such that query processing is often even faster on ClueWeb09 indexes. 8.5.7 Results with INEX 2009 This part is based on our participation in INEX 2009 [BS09] which describes our efforts that apply our tuning framework to the INEX 2009 test bed for the Efficiency Track. We tune the index structures for different choices of result size k. To allow comparison as to retrieval quality with non-pruned index structures, we also depict our results from the Ad Hoc Track. The scoring model we used in INEX 2009 corresponds to the one we used in INEX 2008 [BST08], this time retrieving article elements only. Details about the scoring model can be found in Section 5.2.4. Ad Hoc Track For our contribution to the Ad Hoc Track, we removed all tags from the XML documents in the official INEX 2009 collection and worked on their textual content only. The last two runs have been submitted to INEX 2009, the first is the non-submitted baseline: • MPII-COArBM’: a content-only (CO) run that considers the stemmed terms in the title of a topic (including the terms in phrases, but not their sequence) except terms in negations and stop words. We restrict the retrieval to the top-level article elements and compute the 1,500 articles with the highest scoreBM25 value as described in our contribution to INEX 2008 [BST08]. Note that this approach corresponds to standard document-level retrieval. This run is the actual non- submitted baseline to enable a comparison to the submitted runs which all use proximity information. The corresponding run in Section 5.2.6 has been named TopX-CO-Baseline-articleOnly. • MPII-COArBP: a CO run which aims to retrieve the 1,500 articles with the highest scoreBM25+scoreprox values, where scoreprox is calculated based on all possible stemmed term pairs in the title of a topic (including the terms in phrases, but not their sequence) except terms in negations and stop words. • MPII-COArBPP: a CO run which is similar to MPII-COArBP but calculates the scoreprox part based on a selection of stemmed term pairs. Stemmed term pairs are selected as follows: we consider all stemmed tokens in phrases that occur both in the phrasetitle and in the title and are no stop words. 
The modified phrases in the phrasetitle are considered one at a time to combine term pairs usable to calculate scoreprox. If the phrasetitle is empty, we use approach MPII-COArBP. The results in Table 8.16 show that computing our proximity score with a subset of term pairs based on information taken from the phrasetitles (MPII-COArBPP) does not 178 8. Index Tuning for High-Performance Query Processing run iP[0.00] iP[0.01] iP[0.05] iP[0.10] MAiP MPII-COArBM’ 0.5483 0.5398 0.5112 0.4523 0.2392 MPII-COArBP 0.5603 0.5516(26) 0.5361 0.4692 0.2575 MPII-COArBPP 0.5563 0.5477(28) 0.5283 0.4681 0.2566 Table 8.16: Results for the Ad Hoc Track: interpolated precision at different recall levels (ranks for iP[0.01] are in parentheses) and mean average interpolated precision. improve the iP values compared to using all term pairs (MPII-COArBP). As expected, MPII-COArBP leads to a slight improvement over MPII-COArBM’. Efficiency Track In the following, we describe our effort in INEX 2009 to tune our index structures for efficient query processing, taking into account the expected retrieval quality and index size. After that we briefly explain the approaches used by the other participants in the Efficiency Track and conclude with the results of the Efficiency Track. Like for index tuning with the GOV2 and ClueWeb09 collection, we aim to prune TLs and CLs after a fixed number of entries per list (plus an optional minimum score requirement for CLs) and employ them as input to a merge join. To measure retrieval quality, one usually compares the retrieval results with a set of relevance assessments. As at the time of index tuning we did not have any relevance assessments and we aim at maximum query processing speed, we tuned for efficiency- oriented relative index quality. To this end, for each number of results k required by INEX (k ∈{15,150,1500}), we first built up a groundtruth as a substitute for relevance assessments. That groundtruth consists of the top-k results obtained through process- ing the I(C) index. Note that this corresponds to the k highest scoring results of MPII-COArBP. We have found in Section 8.5.2 for the GOV2 collection that it was reasonable to use an overlap of α=75% between the top-k documents obtained by query processing on pruned TLs and CLs and the top-k documents of the groundtruth. This is enough to achieve the retrieval quality of T(C), i.e., BM25 retrieval quality. (Note that the overlap is computed by the amount of overlapping documents and is not based on the number of characters returned.) The optimization process follows the description given in Section 8.3.1. Please note that for our submission to INEX 2009 we used an early implementation of the tuning framework described in Section 8.3.2 which supported only uncompressed indexes: for all list lengths l ranging between 10 and 20,000 (step size of 100) and minimal score cutoffs between 0 and 1 (step size 0.05), we estimate the index size first by hashcode- based sampling 1% of all terms and term pairs. In our experiments, we restrict the processing of the query load to those indexes that meet the index size constraint set to S=100 GB. Table 8.17 presents the tuning results based on Type A queries with efficiency- 8.5 Experimental Evaluation 179 k (l̄, m̄) overlap est. 
size[GB] Nt(I) Nc(I) st · Nt(I)[MB] sc · Nc(I)[GB] 15 (210,0.00) 0.7629 62.8 66.4·106 3.55·109 507 52.9 150 (610,0.00) 0.7639 74.3 91.3·106 4.31·109 697 64.2 1500 (1810,0.00) 0.754 84.4 123.3·106 4.97·109 941 74.1 Table 8.17: Tuning results based on Type A queries with efficiency-oriented relative index tuning, uncompressed indexes. oriented relative index tuning for the three result cardinalities k. The size of the access structure at · Kt(I) + ac · Kc(I) is estimated to 9.4 GB for all choices of k with Kc(I) = 7.62 · 108 and an average term pair length of 12.17 in the sample. run (l, m) �thot[ms] �tcold[ms] iP[0.00] iP[0.01] iP[0.05] iP[0.10] MAiP MPII-eff-15 (210,0.00) 8.8 216.5 0.575 0.559 0.511 0.400 0.177 MPII-eff-150 (610,0.00) 13.2 242.5 0.574 0.560 0.531 0.466 0.233 MPII-eff-1500 (1810,0.00) 27.1 287.0 0.566 0.553 0.532 0.464 0.248 Table 8.18: Efficiency Track results, type A queries. Table 8.18 shows the results of the tuned index structures for type A queries. For performance reasons, tuning was carried out using the type A queries only. To process type B queries, we used the same pruned indexes. MPII-eff-k depicts the optimal list lengths for different choices of k, the average cold and hot cache running times, and interpolated precision values at different recall levels. While measuring the cold cache running times, we have emptied the filesystem cache after each query execution, not just after each batch. To collect the hot cache running times, in a first round we fill the cache by processing the complete query load and measure the running times in the second round. The difference between the cold and hot cache running times can be considered as I/O time. Queries are processed using the pruned index structures which run (l, m) �thot[ms] �tcold[ms] iP[0.00] iP[0.01] iP[0.05] iP[0.10] MAiP MPII-eff-15 (210,0.00) 604.3 7,630.4 0.374 0.356 0.304 0.272 0.099 MPII-eff-150 (610,0.00) 922.7 10,235.3 0.391 0.379 0.338 0.315 0.157 MPII-eff-1500 (1810,0.00) 1,492.3 12,979.9 0.391 0.379 0.337 0.316 0.162 Table 8.19: Efficiency Track results, type B queries. have been reordered by docid to allow merge join query processing. The pruned index is created by Hadoop and, in that early version, stored in a MapFile which is accessed by Hadoop in a non-optimized way during query execution: hence, there is still room for performance improvements. These performance improvements have been realized in later implementations (see experiments with other test beds) by means of our own file-based inverted list implementation and access methods detailed in Section 8.2. It turns out that already very short list prefixes are sufficient to lead to a result quality comparable to MPII-COArBP at early recall levels (until iP[0.01]) and to MPII-COArBM’ at later recall levels. Table 8.19 shows the results of the tuned index structures for type B queries. It is 180 8. Index Tuning for High-Performance Query Processing clear that in our setting type B queries that consist of partly more than 100 keywords cannot be executed as fast as type A queries. Many thousands of possible pruned CLs per query have to be fetched from hard disk before the evaluation can start. Other participants In the following, we will describe the approaches pursued by the other participants in the Efficiency Track of INEX 2009. Spirix: SPIRIX [WK09] is a P2P system which uses distributed search techniques for XML retrieval and splits collection, index, and search load over the P2P network. 
The employed P2P protocol is based on a Distributed Hash Table (DHT). The authors exploit XML structure to reduce the number of messages sent between peers. To compute the structural similarity with indexed articles for CAS queries, the authors use four groups of functions, which are used in different combinations for ranking and routing. The authors have implemented adaptions of several scoring models, namely the BM25, BM25E, and tf · idf model. SPIRIX has participated in the Ad Hoc and Efficiency track where the precision values were competitive with centralized solutions. The authors claim that they can reduce the total amount of different structures to 1539. There is no information about the number of structures in the full index such that the extent of the reduction remains unclear. The system significantly improves on early precision measures when it uses structural similarity. However an improvement from iP[0.01]=59% to 59.9% comes at the price of about 50 times slower query processing. In contrast to the other participants in the Efficiency Track who aim at providing fast query execution times, the authors redefine efficiency as getting the P2P system to scale, i.e., load balancing on large collections. MPII-TopX2: MPII-TopX2 used in INEX 2009 [TAS09] is based on the earlier reim- plementation of TopX for INEX 2008 [TAS08] and extends it by a new distributed XML indexing component. It supports a CAS-specific distributed index structure with a par- allelization of all indexing steps. The overall time for indexing, which is done in a 3-pass process, amounts to 20 hours on a single node system, and 4 hours on a cluster with 16 nodes. Retrieval modes include the Article mode (retrieves only article elements), CO Mode (retrieves any kind of elements), and CAS Mode (supports path queries with NEXI or XPath2.0 syntax). Entire lists or their prefixes, respectively, can be cached; the decoded and decompressed data structures can be reused by MPII-TopX2. Keys are tag-term pairs with a term propagation upwards in the XML tree. tf and ef values are computed for each tag-term pair. Due to the large collection size, collection-wide statistics are approximated. The scoring model in use is an XML-specific extension to BM25. The index uses an inverted block structure and compresses the blocks into a customized compact binary format. The Otago System: The system developed at the University of Otago [TJG09] uses a dictionary of terms organized in two levels: the first level stores the first four bytes and 8.5 Experimental Evaluation 181 the length of every term string, and the position to retrieve the term block that belongs to the term prefix. Terms with the same four bytes prefix are stored in the same term block which stores terms statistics: these include ctf and df values, the offset to locate the postings list, the length of the postings list, the uncompressed length of the postings list, and the position to locate the term suffix which is stored at the end of the block. At start-up, only the first level dictionary is loaded into memory. Query processing allows to set two parameters, namely lower-k and upper-K. While lower-k specifies how many documents to return, upper-K specifies how many documents to read from each tf-sorted postings list. (If there are ties, the postings with the same tf-value as the Kth posting are also evaluated.) When upper-K is specified, the complete postings list is decompressed, but only the documents with the highest tf-values are processed. 
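To illustrate the upper-K selection just described, here is a minimal sketch (not the actual Otago implementation) that keeps the postings with the K highest tf values, including the ties on the K-th value:

```python
def upper_k_postings(postings, upper_k):
    """postings: list of (docid, tf) pairs, assumed sorted by descending tf.
    Returns the prefix holding the upper_k highest tf values; postings tied with
    the K-th tf value are also kept, as in the description above."""
    if upper_k >= len(postings):
        return postings
    threshold = postings[upper_k - 1][1]        # tf value of the K-th posting
    cut = upper_k
    while cut < len(postings) and postings[cut][1] == threshold:
        cut += 1                                # include ties on the threshold tf
    return postings[:cut]
```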
This is similar to impact-layered indexes presented in Section 6.2.1. Each tf-layer stores the document ids in increasing order and compresses them by delta-encoding. Postings are compressed by v-byte encoding. The authors employ a special version of the quick sort algorithm that partitions the accumulators by their score so that only the top-partition has to be sorted. The employed scoring model is a modified version of BM25. Only operating system caching is used with the disk cache flushed before each run. A memory layer of 5.3GB is allocated with a usage of 97%. Lower-k is chosen as 15, 150, and 1,500 as required by the Efficiency Track. Each choice of lower-k is combined with upper-K set to 1, 15, 150, 1,500, 15,000, 150,000, and 1,500,000 which generates 21 runs. The runs which yield the highest MAiP of 29% and 30%, respectively, set lower-k to 1,500 and use an upper-K of at least 15,000. Early precision (iP[0.01]) values are good unless upper-K is chosen small (i.e., too few entries per postings list have been read); peak values between 58% and 60% are achieved for upper-K choices of at least 15,000. The average run time is split into I/O and CPU part. For Type A topics, the I/O costs are more or less constant (between 56ms and 58ms on average per topic), whereas CPU costs increase with increasing Upper-K (around 20ms for Upper-K≤1,500, up to about 65ms for upper-K=1,500,000). For Type B topics, the I/O costs are again very similar, as for all choices of lower-k and upper-K the same number of postings is retrieved from disk, causing the same disk I/O (between 325ms and 350ms). The CPU costs are similar for upper-K values of at most 150 (around 30ms) independent of the lower-k choice. Due to the increased number of postings lists for Type B topics, compared to Type A topics, the CPU time increases way more (up to 217ms for upper-K=1,500,000). The best MAiP value of 18% is achieved for lower-k=1,500 and upper-K=150,000. Run time is dominated by the I/O costs. Lower-k values above 15,000 lead to an increased run time to sort the top-partition of the accumulators. Efficiency Track Results Figures 8.13 and 8.14 describe the performance of the submitted runs in terms of efficiency and effectiveness (MAiP metrics) for type A and type B topics, respectively. Figures 8.15 and 8.16 describe the performance of the submitted runs in terms of efficiency and effectiveness (iP[0.01] metrics) for type A and type B topics, respectively. 182 8. Index Tuning for High-Performance Query Processing Figure 8.13: MAiP values: type A queries. Figure 8.14: MAiP values: type B queries. 8.6 Hybrid Index Structure for Efficient Text Retrieval 183 Figure 8.15: iP values: type A queries. Spirix is very slow because it uses a distributed P2P search setting. The Otago system is fastest for type B queries since it employs highly optimized C++ code using impact layered indexes with a modified version of BM25 scores. MPII-TopX2 makes use of an inverted block structure and compresses the blocks into a customized compact binary format. Our own runs labelled MPII-Prox use merge joins with pruned indexes, tuned as described before. Squares in the left rectangle depict our hot cache runs which represent the best case where every list comes from the cache, squares in the right rectangle show cold cache runs which represent the worst case where every list lookup causes I/O costs. Our approach performs index pruning in a retrieval quality- aware manner to realize performance improvements and smaller indexes at the same time. 
Our best tuned index structures provide the best CPU times for type A queries among all Efficiency Track participants while still providing at least BM25 retrieval quality. Due to the number of query terms, Type B queries which consist of partly more than 100 keywords cannot be processed equally performant as type A queries: the number of pair lists to be fetched from harddisk for type B queries before they can be evaluated can easily be in the order of thousands. 8.6 Hybrid Index Structure for Efficient Text Retrieval This section is based on our work published in [BS11]. Query processing with pre- computed term pair lists can improve efficiency for some queries, but suffers from the 184 8. Index Tuning for High-Performance Query Processing Figure 8.16: iP values: type B queries. quadratic number of index lists that need to be read. Here, we present a novel hybrid index structure that aims at decreasing the number of index lists retrieved at query processing time, trading off a reduced number of index lists for an increased number of bytes to read. 8.6.1 Introduction While precomputed indexes for term pairs can greatly improve performance for short queries, they are not that efficient for long queries or when lists are not available in a cache, but need to be read from disk. This disadvantage is rooted in the quadratic number of term pair lists that need to be accessed for every query. Especially with the pruning methods proposed earlier in this chapter that store only a small number of entries per term pair list, query processing time is dominated by the time to locate and open index lists. Reducing the number of index lists for processing a query can therefore significantly improve efficiency, even if more data must be read from each list. We base on and extend the index framework for TLs and CLs presented in Section 8.2. Experimental results indicated that it is enough to heuristically keep only the best few thousand entries in each list to achieve good result quality. 8.6 Hybrid Index Structure for Efficient Text Retrieval 185 8.6.2 Hybrid Index Framework To accelerate query processing, especially for medium-sized queries, it is necessary to reduce the number of lists accessed by each query. For a query with 5 terms, up to 10 CLs and 5 TLs need to be opened. The hybrid index framework that we have proposed in [BS11] can reduce this to at most 10 lists in the best case, reducing the number of lists to open by 33%. We achieve this by combining the CL for a term pair (t1, t2) with the TLs for t1 and t2, yielding an extended combined index list (CLExt) that now contains the best documents for both the term pair and the two single terms. We can expect that many documents will be included in two or three of the lists, so that the number of entries in the resulting CLExt will be less than the aggregated number of entries of the three source lists. Within the CLExt, we store all entries in the same format, replacing unknown scores by 0, and sort all entries by their docid. (1,9.3) (12,7.2) (5,5.0) (2,4.5) (1,5.9) (2,8.6) (4,9.1) (25,4.6) (2,3.0,4.5,8.6) (4,0.7,1.5,9.1) (12,0.5,7.2,3.0) (9,0.2,1.7,2.0) (1,0.0,9.3,5.9) (2,3.0,4.5,8.6) (4,0.7,1.5,9.1) (5,0.0,5.0,0.0) (9,0.2,1.7,2.0) (12,0.5,7.2,3.0) (25,0.0,0.0,4.6) CL(bike, trails) as ce n d in g d o ci d TL(bike) TL(trails) CLExt(bike, trails) as ce n d in g d o ci d as ce n d in g d o ci d Figure 8.17: Hybrid index CLExt. 
merge join top-k results (heap) (1,0.0,9.3,5.9) (2,3.0,4.5,8.6) (4,0.7,1.5,9.1) (5,0.0,5.0,0.0) (9,0.2,1.7,2.0) (25,0.0,0.0,4.6) (12,0.5,7.2,3.0) as ce n d in g d o ci d Figure 8.18: Merge join with hybrid in- dex CLExt for query {bike, trails, map}. Figure 8.17 shows how to combine two TLs and one CL into one CLExt and Fig- ure 8.18 how a merge join works with the hybrid index CLExt. At query processing time, only CLExts need to be read, reducing the number of index lists by n (for queries with n terms). For queries with 3 terms, the number of lists is only 3 compared to 6 in the existing TL+CL approach. For queries with a larger number of terms, the technique is less effective since there is still a relatively large number of CLExts to read, and information from one TL is now included in several CLExts, so some of the information read during query processing is not needed. We will see later that the break-even point is around 8 terms per query. Note that TLs 186 8. Index Tuning for High-Performance Query Processing need to be kept in the index for queries that consist of just a single term. If we build a hybrid index as we just explained, the size of that index will be a lot larger than the size of the index with just TLs and CLs. While this comes as a surprise at first view, it has a simple explanation: many pairs of terms hardly occur together in the same document’s text window of size W=10, so the corresponding CL is very short, but they frequently occur in isolation, so the (prefix of the) TL of each term in the index is long. The CLExt for such pairs is therefore orders of magnitude larger than the CL for the same pair. We can lower the required space for the hybrid index type size build time CL 93.2 GB <5h TL 1.6 GB CLExt 3.1 TB 70h CLExtQLog 131.7 GB 4.5h Table 8.20: Index sizes and build times for full (310,0.05) indexes. index by using additional information on how frequently pairs are used, for example from a query log. We then build CLExts only for term pairs that are used frequently enough; for all other pairs we keep the old CL scheme. This drastically reduces the size of the hybrid index, while still providing reasonable performance improvements. With the TREC GOV2 collection, generating CLExts only for term pairs that occur at least once in the AOL query log reduced the on-disk size of the CLExts from over 3TB to 131.7GB; the on-disk size of all CLs in the standard index was 93.2GB. Table 8.20 shows index sizes and build times for different index types. 8.6.3 Experimental Evaluation For the experimental evaluation we have built full compressed indexes. We evaluated our proposed hybrid index with the GOV2 collection, using the 150 Ad Hoc topics from the TREC 2004–2006 Terabyte Track, Ad Hoc Tasks and the first 10,000 queries from the Terabyte Track, Efficiency Task (EffTrack) 2005 [CSS05] as test beds. All TLs and CLs are pruned to at most 310 entries, and entries in CLs have an acc-score of at least 0.05; experiments in Section 8.5.2 have shown that this is enough to yield a similar quality for top-10 documents as produced by unpruned TLs. We report average cold-cache runtimes (averaged over six independent runs) and access costs for top-10 retrieval with the original index (TL+CL) and the hybrid index with log-based pruning (TL+CLExtQLog); file-system caches were emptied before running each query, which is a very conservative setting. Note that runtimes and cost are largely independent of the number of retrieved results. 
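Before turning to the measurements, a minimal sketch of the CLExt construction of Figure 8.17 and of the docid-ordered n-way merge of Figure 8.18: the dictionary-based inputs and the score aggregation are simplified assumptions, and term scores that arrive via several CLExts for the same document are counted only once, as discussed above.

```python
import heapq
from itertools import groupby

def build_clext(cl_pair, tl_a, tl_b):
    """Combine a term pair CL with the TLs of both terms into one CLExt:
    one (docid, pair_score, score_a, score_b) entry per docid, missing scores
    stored as 0.0, and the whole list sorted by docid (cf. Figure 8.17).
    Inputs are dicts mapping docid -> score."""
    docs = set(cl_pair) | set(tl_a) | set(tl_b)
    return sorted((d, cl_pair.get(d, 0.0), tl_a.get(d, 0.0), tl_b.get(d, 0.0))
                  for d in docs)

def merge_join_clexts(tagged_clexts, k):
    """tagged_clexts: list of ((t1, t2), clext) with docid-sorted CLExts.
    n-way merge by docid; a term score delivered by several CLExts for the
    same document is only counted once."""
    def tag(pair, clext):
        for doc, pair_score, score_a, score_b in clext:
            yield (doc, pair, pair_score, score_a, score_b)

    streams = [tag(pair, clext) for pair, clext in tagged_clexts]
    top = []                                            # min-heap of (score, docid)
    for doc, group in groupby(heapq.merge(*streams), key=lambda e: e[0]):
        pair_total, term_scores = 0.0, {}
        for _, (t1, t2), p, s1, s2 in group:
            pair_total += p                             # proximity part, summed over pairs
            term_scores[t1] = max(term_scores.get(t1, 0.0), s1)   # de-duplicate term scores
            term_scores[t2] = max(term_scores.get(t2, 0.0), s2)
        heapq.heappush(top, (pair_total + sum(term_scores.values()), doc))
        if len(top) > k:
            heapq.heappop(top)                          # keep only the k best documents
    return sorted(top, reverse=True)
```

For the three-term query {bike, trails, map}, this opens only the three CLExts for (bike, trails), (bike, map), and (trails, map) instead of three TLs plus three CLs.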
For the Terabyte Track, Ad Hoc Tasks queries, using the hybrid index improved runtime from 59ms to 49ms per query over the original index; for the EffTrack queries, the improvement was even better (55ms vs 42ms per query). This clearly shows that our hybrid index can greatly improve cold-cache performance. We will now evaluate the impact for queries of different lengths, and the influence of log-based pruning.
Figure 8.19: Average runtimes for Terabyte and EffTrack queries (in ms, grouped by number of query terms).
Figure 8.19 reports average query times for the two test beds, grouped by the number of terms per query. Improvements are best for short queries, but we see improvements up to 7 terms. The chart also indicates the standard deviations, which are low.
Figure 8.20: Average cost in bytes and average number of opened lists, for the EffTrack queries.
Figure 8.20 details the average number of bytes read per query for the EffTrack. The hybrid index reads up to twice as many bytes from disk, but is (as we saw before) still faster because it needs to open fewer lists (also depicted in this figure by triangles and diamonds).
Figure 8.21: Effect of query term pair coverage in the AOL query log on runtime, for the EffTrack queries.
Figure 8.21 shows the influence of log-based pruning on runtime. We computed, for each EffTrack query, the fraction of term pairs covered in the log, and grouped queries into five buckets from low coverage (0%-20%) to high coverage (80%-100%). Our method gives a benefit only for queries with a term pair coverage of at least 60%; however, these are the most frequent queries in this load (indicated by the black dots). For the remaining queries, our method does not create a performance penalty.
8.7 Conclusion
We clearly demonstrated that indexing terms and term pairs, together with tunable list pruning, is a viable method to improve either result quality or, at a quality similar to that of pure term indexes, processing performance. Results with effectiveness-oriented indexes are comparable to the best results using unpruned indexes, and efficiency-oriented index configurations yield almost one order of magnitude performance gain compared to a state-of-the-art top-k algorithm. We have demonstrated that our hybrid index structure significantly improves cold-cache query processing times by almost 25% on standard benchmark queries from the TREC Terabyte and Efficiency Tracks by decreasing the number of fetched index lists, at the price of reading more from each list. The highest improvements are achieved for short queries.
Chapter 9
Conclusion and Outlook
9.1 Conclusion
In the presence of growing data, the need for efficient query processing under result quality and index size control becomes more and more a challenge to search engines. This work has shown how to use proximity scores to make query processing effective and efficient with focus on either of the optimization goals. This thesis made the following important contributions:
• We have presented a comprehensive comparative analysis of proximity score models and a rigorous analysis of the potential of phrases and have adapted a leading proximity score model for XML data.
• We have discussed the feasibility of all presented proximity score models for top-k query processing and have presented a novel index combining a content and proximity score that helps to accelerate top-k query processing and improves result quality.
• We have presented a novel, distributed index tuning framework for term and term pair index lists that optimizes pruning parameters by means of well-defined optimization criteria under disk space constraints. Indexes can be tuned with emphasis on efficiency or effectiveness: the resulting indexes yield fast processing at high result quality.
• We have shown that pruned index lists processed with a merge join outperform top-k query processing with unpruned lists at a high result quality.
• Moreover, we have presented a hybrid index structure for improved cold cache run times.
9.2 Outlook
There are still some interesting open challenges that deserve future attention. As our index tuning framework only uses term and term pair lists (i.e., combined lists), a possible extension would be to precompute (selected) term n-tuple lists for n > 2. Furthermore, extending index lists with more features could generate better retrieval quality at lower query processing costs: some query-independent features like the PageRank score may be easily integrated into term indexes (cf. Section 7.3.1), whereas other features may pose real challenges for integration into the existing framework and call for novel index structures.
The evaluation and integration of further pruning methods, such as the document-centric pruning by Büttcher and Clarke (see Section 6.2.3 for more details), into our index tuning framework may be worth investigating.
Possible improvements as to index construction, index tuning, and index maintenance are as follows: improving the construction of the final index by reducing the number of temporary index entries, and improving the estimation stage, which currently needs to parse the complete document collection, would reduce the time required to build up an index and to find optimal pruning parameters. Further consideration of the impact of pruned indexes on cache effectiveness and a careful index layout that groups frequently co-occurring lists close to each other will be a key extension for further improving processing time with cold caches. Index maintenance such as supporting index updates would be especially beneficial for dynamic data such as email collections. So far, the index has to be completely rebuilt and the optimization process repeated if new documents are added to a collection. For dynamic data, it is required to develop means to incrementally add new documents and to regularly re-optimize tuning parameters without the need to completely rebuild the index.
Finally, our hybrid index structure may be improved further: it reduces cold-cache query processing times by decreasing the number of fetched index lists, at the price of reading more from each list. As the highest performance improvements are achieved for short queries, future work may concentrate on improving performance for long queries, for example by precomputing lists for frequently used phrases or removing non-important pair lists. Extending our index tuning framework to optimize pruning parameters for our hybrid index structure would certainly enrich our framework.
Appendix A
Retrieval Quality and Sensitivity
Figures A.1-A.3: Web Track (WT10g): best NDCG@10/NDCG@100, P@10/P@100, and MAP values per scoring model and topic set.
Figures A.4-A.6: Robust Track: best NDCG, precision, and MAP values per scoring model and topic set.
Figures A.7-A.9: Terabyte Track: best NDCG, precision, and MAP values per scoring model and topic set.
Figures A.10-A.12: INEX: best NDCG, precision, and MAP values per scoring model and topic set.
Figures A.13-A.16: WEB, ROBUST, TERABYTE, and INEX: sensitivity of the scoring models (spread over entropy) for MAP and NDCG@10.
The compared scoring models are BM25, Büttcher et al., Rasolofo/Savoy, LM (Dirichlet), Zhao/Yun, Tao/Zhai, Lv/Zhai (where applicable), ES, Song et al., and de Kretser/Moffat.
Appendix B
TREC
num: title                                       num: title
701: U.S. oil industry history                   726: Hubble telescope repairs
702: Pearl farming                               727: Church arson
703: U.S. against International Criminal Court   728: whales save endangered
704: Green party political views                 729: Whistle blower department of defense
705: Iraq foreign debt reduction                 730: Gastric bypass complications
706: Controlling type II diabetes                731: Kurds history
707: Aspirin cancer prevention                   732: U.S. cheese production
708: Decorative slate sources                    733: Airline overbooking
709: Horse racing jockey weight                  734: Recycling successes
710: Prostate cancer treatments                  735: Afghan women condition
711: Train station security measures             736: location BSE infections
712: Pyramid scheme                              737: Enron California energy crisis
713: Chesapeake Bay Maryland clean               738: Anthrax hoaxes
714: License restrictions older drivers          739: Habitat for Humanity
715: Schizophrenia drugs                         740: regulate assisted living Maryland
716: Spammer arrest sue                          741: Artificial Intelligence
717: Gifted talented student programs            742: hedge funds fraud protection
718: Controlling acid rain                       743: Freighter ship registration
719: Cruise ship damage sea life                 744: Counterfeit ID punishments
720: Federal welfare reform                      745: Doomsday cults
721: Census data applications                    746: Outsource job India
722: Iran terrorism                              747: Library computer oversight
723: Executive privilege                         748: Nuclear reactor types
724: Iran Contra                                 749: Puerto Rico state
725: Low white blood cell count                  750: John Edwards womens issues
Table B.1: TREC 2004 Terabyte Track, Ad Hoc Task topics.
TREC num: title num: title 751: Scrabble Players 776: Magnet schools success 752: Dam removal 777: hybrid alternative fuel cars 753: bullying prevention programs 778: golden ratio 754: domestic adoption laws 779: Javelinas range and description 755: Scottish Highland Games 780: Arable land 756: Volcanic Activity 781: Squirrel control and protections 757: Murals 782: Orange varieties seasons 758: Embryonic stem cells 783: school mercury poisoning 759: civil war battle reenactments 784: mersenne primes 760: american muslim mosques schools 785: Ivory-billed woodpecker 761: Problems of Hmong Immigrants 786: Yew trees 762: History of Physicians in America 787: Sunflower Cultivation 763: Hunting deaths 788: Reverse mortgages 764: Increase mass transit use 789: abandoned mine reclamation 765: ephedra ma huang deaths 790: women’s rights in Saudi Arabia 766: diamond smuggling 791: Gullah geechee language culture 767: Pharmacist License requirements 792: Social Security means test 768: Women in state legislatures 793: Bagpipe Bands 769: Kroll Associates Employees 794: pet therapy 770: Kyrgyzstan-United States relations 795: notable cocker spaniels 771: deformed leopard frogs 796: Blue Grass Music Festival history 772: flag display rules 797: reintroduction of gray wolves 773: Pennsylvania slot machine gambling 798: Massachusetts textile mills 774: Causes of Homelessness 799: Animals in Alzheimer’s research 775: Commercial candy makers 800: Ovarian Cancer Treatment Table B.2: TREC 2005 Terabyte Track, Ad Hoc Task topics. 211 num: title num: title 801: Kudzu Pueraria lobata 826: Florida Seminole Indians 802: Volcano eruptions global temperature 827: Hidden Markov Modeling HMM 803: May Day 828: secret shoppers 804: ban on human cloning 829: Spanish Civil War support 805: Identity Theft Passport 830: model railroads 806: Doctors Without Borders 831: Dulles Airport security 807: Sugar tariff-rate quotas 832: labor union activity 808: North Korean Counterfeiting 833: Iceland government 809: wetlands wastewater treatment 834: Global positioning system earthquakes 810: timeshare resales 835: Big Dig pork 811: handwriting recognition 836: illegal immigrant wages 812: total knee replacement surgery 837: Eskimo History 813: Atlantic Intracoastal Waterway 838: urban suburban coyotes 814: Johnstown flood 839: textile dyeing techniques 815: Coast Guard rescues 840: Geysers 816: USAID assistance to Galapagos 841: camel North America 817: sports stadium naming rights 842: David McCullough 818: Chaco Culture National Park 843: Pol Pot 819: 1890 Census 844: segmental duplications 820: imported fire ants 845: New Jersey tomato 821: Internet work-at-home scams 846: heredity and obesity 822: Custer’s Last Stand 847: Portugal World War II 823: Continuing care retirement communities 848: radio station call letters 824: Civil Air Patrol 849: Scalable Vector Graphics 825: National Guard Involvement in Iraq 850: Mississippi River flood Table B.3: TREC 2006 Terabyte Track, Ad Hoc Task topics. 212 B. 
TREC topic number: query topic number: query 1:obama family tree 26:lower heart rate 2:french lick resort and casino 27:starbucks 3:getting organized 28:inuyasha 4:toilet 29:ps 2 games 5:mitchell college 30:diabetes education 6:kcs 31:atari 7:air travel information 32:website design hosting 8:appraisals 33:elliptical trainer 9:used car parts 34:cell phones 10:cheap internet 35:hoboken 11:gmat prep classes 36:gps 12:djs 37:pampered chef 13:map 38:dogs for adoption 14:dinosaurs 39:disneyland hotel 15:espn sports 40:michworks 16:arizona game and fish 41:orange county convention center 17:poker tournaments 42:the music man 18:wedding budget calculator 43:the secret garden 19:the current 44:map of the united states 20:defender 45:solar panels 21:volvo 46:alexian brothers hospital 22:rick warren 47:indexed annuity 23:yahoo 48:wilson antenna 24:diversity 49:flame designs 25:euclid 50:dog heat Table B.4: TREC 2009 Web Track, Ad Hoc Task topics. 213 topic number: query topic number: query 51:horse hooves 76:raised gardens 52:avp 77:bobcat 53:discovery channel store 78:dieting 54:president of the united states 79:voyager 55:iron 80:keyboard reviews 56:uss yorktown charleston sc 81:afghanistan 57:ct jobs 82:joints 58:penguins 83:memory 59:how to build a fence 84:continental plates 60:bellevue 85:milwaukee journal sentinel 61:worm 86:bart sf 62:texas border patrol 87:who invented music 63:flushing 88:forearm pain 64:moths 89:ocd 65:korean language 90:mgb 66:income tax return online 91:er tv show 67:vldl levels 92:the wall 68:pvc 93:raffles 69:sewing instructions 94:titan 70:to be or not to be that is the question 95:earn money at home1 71:living in india 96:rice 72:the sun 97:south africa 73:neil young 98:sat 74:kiwi 99:satellite 75:tornadoes 100:rincon puerto rico1 Table B.5: TREC 2010 Web Track, Ad Hoc Task topics (1 indicates non-assessed topics). Appendix C INEX topic id title 289 emperor ”Napoleon I” Polish 290 ”genetic algorithm” 291 Olympian god or goddess 292 Italian Flemish painting Renaissance -French -German 293 wifi security encryption 294 user interface design usability guidelines 295 software intellectual property patent license 296 ”Borussia Dortmund” + European Championship Intercontinental Cup 297 ”cool jazz” ”West coast” musician 298 George Orwell life books essays 1984 Eric Arthur Blair Animal Farm 300 Airbus A380 ordered 301 Algebraic Vector Space Model generalized vector space model Latent Semantic Indexing Topic-Based vector space model extended Boolean model enhanced topic based Salton SMART 302 ”web services” security standards 303 fractal applications -art 304 allergy treatments 305 ”revision control system” 306 theories Studies genre classification structuralist Plato forms Aristotle forms 308 wedding traditions and customs 309 ”Ken Doherty” finals tournament 310 Novikov self-consistency principle and time travel 311 global warming cause and effects 312 recessive genes and hereditary disease or genetic disorder 313 ”immanuel kant” ”moral philosophy” ”categorical imperative” 314 food additive toxin carcinogen ”E number” 315 spider hunting insect 316 differents disciplines and movements for gymnastics sport 317 tourism paris visit museum cathedral Table C.1: INEX CO topics with relevance assessments, Ad Hoc Track 2006, part 1 215 216 C. 
topic id  title
318  the atlantic ocean islands and the slave trade
319  "northern lights" "polar lights" "aurora borealis" "solar wind" "magnetic field" earth
320  paris transport "Gare de Lyon" "Gare du Nord"
321  buildings designed Antoni Gaudi Barcelona architect
322  castles kasteel in the netherlands
323  founder ikea
324  composition of planet rings
325  "Cirque du Soleil" shows
326  Scotland tourism
327  cloning animals accepted "United States of America"
328  NBA European basketball player
329  "national dress" +Scottish
330  "nobel prize" laureate physics dutch Netherlands
331  figure tulips
332  NCAA basketball tournament "march madness"
333  steve wozniak steve jobs
334  "Silk Road" China
335  prepare acorn eat
336  species of monotreme
337  security algorithms in computer networks
338  high blood pressure effect
339  Toy Story
340  Reinforcement Learning + Q-Learning
341  microkernel operating systems
342  "birthday party" "nick cave"
343  fantasy novel goodkind book
344  XML database
345  Sex Pistols concert audience Manchester music scene
346  +unrealscript language api tutorial
347  +"state machine" figure moore mealy
348  drinking water abstraction +germany
349  proprietary implementation +protocol +wireless +security
350  Animal flight
351  "Chinese wedding" custom tradition
352  Faster-than-light travel
353  pseudocode for in-place sorting algorithm
354  novel adaptations for science fiction films
355  +"Best Actress" +"Academy Award" -Supporting -nominated winner film
356  use of natural language processing in information retrieval
357  babylonia babylonian assyriology
358  ontologies information retrieval semantic indexing
360  solar energy for domestic electricity and heating
361  "Europe after the second world war" + democracy
362  effect nuclear power plant accident
Table C.2: INEX CO topics with relevance assessments, Ad Hoc Track 2006, part 2

topic id  title
363  Bob Dylan Eric Clapton
364  +mushroom poisonous poisoning
365  economy peru international investment tourism
366  Fourier transform applications
368  Hymenoptera +Apocrita -Symphyta +reproduction queen bees wasps hornets
369  Pillars of Hercules + Mythology
371  escaped convict "William Buckley"
372  Purpose of voodoo rituals.
373  Australia's involvement in Echelon spy network
374  Aid following the 2004 Tsunami.
375  states countries nuclear proliferation nonproliferation treaty npt
376  "diabetes mellitus" "type 2" symptoms
378  +rules "team sports" +indoor +ball world -football -basketball -handball -voleyball
379  "Helms-Burton law" "United States" embargo against Cuba consequences economy
380  Symptoms: headache, fatigue, nausea
381  ubiquitous computing and application
382  "greek mythology" aphrodite
383  Informations about the city of Lyon in France
384  politics political albert einstein
385  arnold schwarzenegger stars cast
386  fencing +weapon
387  bridge types
388  rhinoplasty
390  Insomnia "what are the causes" +sleep
391  Cricket "How to Play"
392  Australian Aboriginals "stolen generation"
395  September 11 "conspiracy theories"
399  "mobile phone" country UMTS
400  "non violent" revolution country
401  movie award "eddie murphy" "jim carrey" "robin williams"
402  country european capital
403  color television analog standard description
404  french france singer
405  "The Old Man and the Sea"
406  book architecture
407  "Football World Cup" +"Miracle of Bern"
409  Hybrid Vehicles -biology "fuel efficiency" "fuel sources" model engine
410  Routers and Switches +computer -travel -light network types history
411  +GSM, +CDMA, system,standard,clear battery coverage roaming price.
413  Coordinates and Population of capital cities of Europe
Table C.3: INEX CO topics with relevance assessments, Ad Hoc Track 2006, part 3

topic id  title
544  meaning of life
545  dance style
546  19th century imperialism
547  Greek revolution 1821
550  dna testing -forensic -maternity -paternity
551  pollen allergy
552  keyboard instrument -electronic
553  spanish classical +guitar players
555  +Amsterdam picture image
556  vegetarian person -she -woman
557  electromagnetic waves
559  vodka producing countries
561  Portuguese typical dishes
562  algerian war
563  Virginia Woolf novels
565  discovery "by chance" serendipity
570  introduced animals
574  guitar tapping
576  aircraft formation
577  genetically modified food safety
578  childbirth tradition
579  diet descriptions
580  "european basketball players" +nba
581  wine tasting
582  famous bouddhist places
585  +"International brigades" Spanish Civil War
586  "magnetic levitation" technology
587  autistic spectrum disorder
592  berbers of north africa
595  car company
596  Jennifer Lopez
597  expert on database
598  mahler symphony song
600  Japanese culture food
601  Townships of Michigan
602  Webster's Dictionary
603  Tata Motors Company in India
607  law legislation act +nuclear -family
609  mechanism RAID storage
610  Nikola Tesla inventions patents
611  rotary engines in cars
613  wireless network security
616  +Egypt museum pyramid
617  +acne treatment side effects
624  "open source" information retrieval systems
626  "Bayes Filter" +application
628  MBA school in Canada
629  science fiction film
634  vauban
635  Linux Operating System
636  Image File Formats
637  Java Programming Language
641  Museum Picasso France
642  social networks mining
643  wikipedia vandalism
644  virtual museums
646  "records management" +metadata -system
647  time travel theories
649  flower meaning
650  ale
656  bilingualism children "language acquisition"
657  scrabble game rules
659  technological singularity concept and implications
666  party primaries in the United States
667  Wine regions in Europe
668  Codebreaking at Bletchley Park
669  coin collecting
673  intrusion detection
675  Environmental Impacts of Earthquakes
677  terracotta figures +horse
Table C.4: Assessed INEX CO topics, Ad Hoc Track 2008
topic id | title | phrasetitle
2009001 | Nobel prize | "Nobel prize"
2009002 | best movie | "best movie"
2009003 | yoga exercise | "yoga exercise"
2009004 | mean average precision reciprocal rank references precision recall proceedings journal | "mean average precision" "reciprocal rank" "precision recall" "recall precision"
2009005 | chemists physicists scientists alchemists periodic table elements | "periodic table"
2009006 | opera singer italian spanish -soprano | "opera singer"
2009007 | financial and social man made catastrophes adversity misfortune -"natural disaster" | -"natural disaster" "financial misfortune" "financial disaster" "financial catastrophe" "financial adversity" "social disaster" "social catastrophe"
2009008 | israeli director actor actress film festival | "israeli director" "israeli actor" "israeli actress" "film festival"
2009009 | election +victory australian labor party state council -federal | "election victory" "state election" "council election" "australian labor party"
2009010 | applications bayesian networks bioinformatics | "bayesian networks"
2009011 | olive oil health benefit | "olive oil" "health benefit"
2009012 | vitiligo pigment disorder cause treatment | "treatment of vitiligo" "cause of vitiligo" "pigment disorder"
2009013 | native american indian wars against colonial americans | "native american" "american indian" "wars against colonial americans"
2009014 | content based image retrieval | "content based" "image retrieval" "content based image retrieval"
2009015 | Voice over IP | none
2009016 | cycle road skill race | "road bike" "road race"
2009017 | rent buy home | none
2009018 | Dwyane Wade | "Dwyane Wade"
2009019 | Latent semantic indexing | "latent semantic indexing"
2009020 | IBM computer | "IBM computer"
2009021 | wonder girls | "wonder girls"
2009022 | Szechwan dish food cuisine | "Szechwan dish" "Szechwan food" "Szechwan cuisine"
2009023 | "plays of Shakespeare" +Macbeth | "plays of Shakespeare"
2009024 | cloud computing | "cloud computing"
2009025 | scenic spot in Beijing | "scenic spot"
2009026 | generalife gardens | none
2009027 | Zhang Yimou | "Zhang Yimou"
2009028 | fastest speed bike scooter car motorcycle | none
2009029 | personality type career famous | "personality type"
2009030 | popular dog cartoon character | "cartoon character"
Table C.5: INEX 2009 - Type A queries, part 1
topic id | title | phrasetitle
2009031 | sabre | none
2009032 | evidence theory dempster schafer | "evidence theory" "dempster schafer"
2009033 | Al-Andalus taifa kingdoms | "taifa kingdoms"
2009034 | the evolution of the moon | none
2009035 | Bermuda Triangle | "Bermuda Triangle"
2009036 | notting hill film actors | "notting hill" "film actors"
2009037 | movies directed tarantino | "tarantino movie"
2009038 | french colony africa independence | "french colony"
2009039 | roman architecture | "roman architecture"
2009040 | steam engine | "steam engine"
2009041 | The Scythians | "the Scythians"
2009042 | sun java | "sun java"
2009043 | NASA missions | "NASA missions"
2009044 | OpenGL Shading Language GLSL | "OpenGL Shading Language"
2009045 | new age musician | "new age"
2009046 | Penrose tiles tiling theory | "Penrose tiles" "Tiling theory"
2009047 | "Kali's child" criticisms reviews Psychoanalysis of Ramakrishna's mysticism | "Kali's child" "Psychoanalysis of Ramakrishna's mysticism"
2009048 | biometric technique | "biometric technique"
2009049 | Chicago Symphony Orchestra | "Chicago Symphony Orchestra"
2009050 | valentine's day | "valentine's day"
2009051 | Rabindranath Tagore Bengali literature | "Rabindranath Tagore" "Bengali literature"
2009052 | newspaper spain headquarter Madrid | none
2009053 | finland car industry manufacturer saab sisu | "car industry" "car manufacturer"
2009054 | tampere region tourist attractions | "tampere region" "tourist attraction"
2009055 | european union expansion | "european union"
2009056 | higher education around the world | "higher education"
2009057 | movie Slumdog Millionaire directed by Danny Boyle | "Slumdog Millionaire" "Danny Boyle"
2009058 | Tiananmen Square protest 1989 | "Tiananmen Square" "protest 1989"
2009059 | failure tolerance in distributed systems | "failure tolerance" "distributed systems"
2009060 | hard disk technology | "hard disk"
Table C.6: INEX 2009 - Type A queries, part 2

topic id | title | phrasetitle
2009061 | france second world war normandy | "second world war"
2009062 | social network group selection | "group selection in social network" "social network" "group selection"
2009063 | D-Day normandy invasion | "normandy invasion"
2009064 | stock exhange insider trading crime | "stock exhange" "insider trading"
2009065 | sunflowers Vincent van Gogh | "Vincent van Gogh"
2009066 | folk metal groups finland | "folk metal"
2009067 | probabilistic models in information retrieval | "probabilistic models" "information retrieval"
2009068 | China great wall | "Great Wall"
2009069 | Singer in Britain's Got Talent | "Britain's Got Talent"
2009070 | health care reform plan | "health care reform" "health care plan"
2009071 | earthquake prediction | "earthquake prediction"
2009072 | +professor "information retrieval" "computer science" | "information retrieval" "computer science"
2009073 | web link network analysis | "web link analysis" "link analysis" "network analysis"
2009074 | web ranking scoring algorithm | "web ranking" "scoring algorithm"
2009075 | tourism in tunisia | none
2009076 | sociology and social issues and aspects in science fiction | "social aspects" "social issues" "science fiction"
2009077 | torrent client technology | "Torrent technology"
2009078 | supervised machine learning algorithm | "supervised machine learning algorithm" "machine learning"
2009079 | dangerous paraben bisphenol-A | none
2009080 | international game show formats | "game show" "show formats"
2009081 | Maya calendar | "Maya calendar"
2009082 | south african nature reserve | "south african" "nature reserve"
2009083 | therapeutic food | "therapeutic food"
2009084 | food allergy | "food allergy"
2009085 | operating system +mutual +exclusion | "operating system" +"mutual exclusion"
2009086 | airbus a380 | none
2009087 | history bordeaux | none
2009088 | "hatha yoga" deity asana | "hatha yoga"
2009089 | world wide web history | "world wide web"
2009090 | Telephone history | none
Table C.7: INEX 2009 - Type A queries, part 3

topic id | title | phrasetitle
2009091 | Himalaya trekking peak | none
2009092 | ski +waxing -water -wave | "ski waxing"
2009093 | French revolution | "French revolution"
2009094 | global warming human activity | "global warming" "human activity"
2009095 | Weka software | none
2009096 | Eiffel | none
2009097 | location marcel duchamp work | "Marcel Duchamp"
2009098 | Pandemic Death | none
2009099 | movie houdini | none
2009100 | search algorithm with plural keywords | "search algorithm" "plural keywords"
2009101 | alchemy in Asia including Japan China and India | "alchemy in Asia"
2009102 | historical ninja stars | "ninja stars"
2009103 | photograph world earliest | "earliest photograph"
2009104 | lunar mare formation mechanism | "lunar mare" "formation mechanism"
2009105 | Musicians Jazz | "Jazz musicians"
2009106 | +"amy macdonald" +love +song | "amy macdonald" "love song"
2009107 | design science sustainability renewable energy synergy | "design science" "design science sustainability" "renewable energy"
2009108 | sustainability indicators metrics | "sustainability indicator" "sustainability metric"
2009109 | circus acts skills | "circus act" "circus skills"
2009110 | paul is dead hoax theory | +"paul is dead"
2009111 | europe solar power facility | "solar power" "facility in Europe"
2009112 | rally car female OR woman driver | "rally car" "female driver" "woman driver"
2009113 | Toy Story Buzz Lightyear 3D rendering Computer Generated Imagery | "Toy Story" "Buzz Lightyear" "3D rendering" "Computer Generated Imagery"
2009114 | self-portrait | "self portrait"
2009115 | virtual museums | "virtual museum"
Table C.8: INEX 2009 - Type A queries, part 4

List of Figures

2.1 Non-relevant document for query surface area of a triangular pyramid. 10
2.2 A poem with position information. 16
2.3 Plots according to formulas in [dKM99]. 23
2.4 Arc and circle replaced to fit the plots in [dKM99, dKM04]. 23
2.5 Example: triangle-shaped contribution function. 24
2.6 Example: aggregated score score_x. 25
2.7 Example: highest aggregated score score_x located at a non-query term location. 26
2.8 detectEspans pseudocode. 28
2.9 Three variants of the MRF model for our running example query, i.e., S_q=(sea, shell, song). We depict (left) the full independence (FI) variant, (middle) the sequential dependence (SD) variant, (right) the full dependence (FD) variant. 37
5.1 An XML document and its linearization. 83
5.2 Example: illustration for metric P[#characters]. 89
5.3 Comparison of the three runs: P[#characters] values. 91
7.1 Score-ordered term, proximity, and combined index lists which can be used to process the query {bike, trails} in several processing strategies. 113
7.2 TL+CL approaches: cost. 119
7.3 TL+CL approaches: P@10. 119
7.4 TL+CL (� varied): cost. 119
7.5 TL+CL (� varied): P@10. 119
7.6 Example: query={bike, trails, map}, merge join with processing strategy TL+CL using pruned term lists and combined lists. 122
8.1 Index and data files for TLs. 144
8.2 Compressed TLs in docid-order. 145
8.3 Compressed CLs in docid-order. 146
8.4 Relative index size with varying list length and minscore cutoffs. 148
8.5 Effect of log-based pruning on query performance (on training topics). 155
8.6 P@k and NDCG@k on test topics for effectiveness- and efficiency-oriented absolute index quality without index compression. 160
8.7 P@k and NDCG@k on test topics for effectiveness-oriented absolute index quality with index compression. 162
8.8 P@k and NDCG@k on test topics for efficiency-oriented absolute index quality with index compression. 162
8.9 P@k and NDCG@k on test topics for effectiveness- and efficiency-oriented relative index quality without index compression. 164
8.10 P@k and NDCG@k on test topics for effectiveness-oriented relative index quality with index compression. 166
8.11 P@k and NDCG@k on test topics for efficiency-oriented relative index quality with index compression. 166
8.12 Efficiency Track: real system performance for a (310,0.05) full index for various query and LRU cache sizes. 172
8.13 MAiP values: type A queries. 182
8.14 MAiP values: type B queries. 182
8.15 iP values: type A queries. 183
8.16 iP values: type B queries. 184
8.17 Hybrid index CLExt. 185
8.18 Merge join with hybrid index CLExt for query {bike, trails, map}. 185
8.19 Average runtimes for Terabyte and EffTrack queries. 187
8.20 Average cost in bytes and average number of opened lists, for the EffTrack queries. 187
8.21 Effect of query term pair coverage in the AOL query log on runtime, for the EffTrack queries. 188
A.1 Web Tracks test beds (WT10g): best NDCG values 192
A.2 Web Tracks (WT10g): best precision values 193
A.3 Web Track (WT10g): best MAP values for each scoring model 194
A.4 Robust Track: best NDCG values 195
A.5 Robust Track: best precision values 196
A.6 Robust Track: best MAP values for each scoring model 197
A.7 Terabyte Track: best NDCG values 198
A.8 Terabyte Track: best precision values 199
A.9 Terabyte Track: best MAP values for each scoring model 200
A.10 INEX: best NDCG values 201
A.11 INEX: best precision values 202
A.12 INEX: best MAP values for each scoring model 203
A.13 WEB: sensitivity of scoring models 204
A.14 ROBUST: sensitivity of scoring models 205
A.15 TERABYTE: sensitivity of scoring models 206
A.16 INEX: sensitivity of scoring models 207

List of Tables

2.1 Overview: BM25 variations. 13
2.2 Espan goodness features. 36
2.3 Model feature sets. 36
2.4 Overview: Features used in each scoring model. Additional remarks: 1 needs also the number of documents N in the collection, 2 requires ctf and tf if Jelinek-Mercer or Dirichlet prior smoothing are used, 3 determines the set of features dependent on the employed setting (e.g., df and tf for unigrams/bigrams, respectively, plus lists of important phrases, etc.), 4 may use tf values not only for terms but also for n-grams and unordered occurrences of n-gram terms, and 5's set of features may differ dependent on the learned proximity score. 45
3.1 Some TREC test beds 52
4.1 BM25: optimal tuning parameter setting with NDCG@10 and MAP values. 68
4.2 Büttcher et al.'s scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. 69
4.3 Rasolofo and Savoy's scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. 70
4.4 Language Model with Dirichlet smoothing: optimal tuning parameter setting with NDCG@10 and MAP values. 70
4.5 Zhao and Yun's scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. 71
4.6 Tao and Zhai's scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. 72
4.7 Song et al.'s scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. 73
4.8 De Kretser and Moffat's scoring model: optimal tuning parameter setting with NDCG@10 and MAP values. 73
4.9 Intercollection generalization results for various scoring models. 75
5.1 Results for document-level retrieval with stopword removal. 86
5.2 Results for document-level retrieval without stopword removal. 86
5.3 Results for element-level retrieval with stopword removal. 87
5.4 Results for element-level retrieval without stopword removal. 87
5.5 Results: Focused Task INEX 2008, stopword removal, no stemming. 89
5.6 Boosting weights BM25F. 93
5.7 P@10 for user-identified phrases. 93
5.8 P@10 for different configurations and query loads, first part. 93
5.9 P@10 for different configurations and query loads, second part. 94
7.1 Experimental results for top-10 retrieval of 100 Ad Hoc topics from the 2004 and 2005 TREC Terabyte Track, Ad Hoc Tasks. 115
7.2 Index sizes in items and required space for unpruned indexes. 115
7.3 Index sizes (million items) with different length limits, with and without minimum acc-score requirement. 117
7.4 Index sizes (disk space) with different length limits, with and without minimum acc-score requirement. 117
7.5 Experimental results for top-10 retrieval with pruned lists. 118
7.6 Retrieval quality for top-10 and top-100 retrieval with pruned lists. 120
7.7 Retrieval quality for top-100 retrieval with pruned TL+CTL and TL+CL settings. 120
7.8 Experimental results for top-100 retrieval with unpruned and pruned lists. 121
7.9 Costs for top-100 retrieval with unpruned and pruned lists. 121
7.10 Comparison: TopX with unpruned lists vs merge join on pruned lists. 124
8.1 GOV2: index tuning results for absolute index quality without index compression. 159
8.2 GOV2: index tuning results for absolute index quality with index compression. 161
8.3 Relative result quality for different values of α. 163
8.4 GOV2: index tuning results for relative index quality without index compression. 164
8.5 GOV2: index tuning results for relative index quality with index compression. 165
8.6 GOV2: query performance for absolute index quality without index compression. 168
8.7 GOV2: query performance for relative index quality without index compression. 168
8.8 GOV2: query performance for absolute index quality with index compression. 169
8.9 GOV2: query performance for relative index quality with index compression. 170
8.10 Efficiency Track: real system performance, merge join, various LRU cache sizes with a (310,0.05) full index. 171
8.11 Efficiency Track: real system performance, BMW, various LRU cache sizes. 172
8.12 Index tuning results with log-based pruning (t=1) for absolute index quality. 173
8.13 Query performance with log-based pruning (t=1) for absolute index quality. 174
8.14 ClueWeb09: index tuning results for absolute index quality and evaluation of query performance, size limit set to S=1TB. 175
8.15 ClueWeb09: index tuning results for relative index quality and evaluation of query performance, size limit set to S=1TB. 176
8.16 Results for the Ad Hoc Track: interpolated precision at different recall levels (ranks for iP[0.01] are in parentheses) and mean average interpolated precision. 178
8.17 Tuning results based on Type A queries with efficiency-oriented relative index tuning, uncompressed indexes. 179
8.18 Efficiency Track results, type A queries. 179
8.19 Efficiency Track results, type B queries. 179
8.20 Index sizes and build times for full (310,0.05) indexes. 186
B.1 TREC 2004 Terabyte Track, Ad Hoc Task topics. 209
B.2 TREC 2005 Terabyte Track, Ad Hoc Task topics. 210
B.3 TREC 2006 Terabyte Track, Ad Hoc Task topics. 211
B.4 TREC 2009 Web Track, Ad Hoc Task topics. 212
B.5 TREC 2010 Web Track, Ad Hoc Task topics ((1) indicates non-assessed topics). 213
C.1 INEX CO topics with relevance assessments, Ad Hoc Track 2006, part 1 215
C.2 INEX CO topics with relevance assessments, Ad Hoc Track 2006, part 2 216
C.3 INEX CO topics with relevance assessments, Ad Hoc Track 2006, part 3 217
C.4 Assessed INEX CO topics, Ad Hoc Track 2008 218
C.5 INEX 2009 - Type A queries, part 1 219
C.6 INEX 2009 - Type A queries, part 2 220
C.7 INEX 2009 - Type A queries, part 3 221
C.8 INEX 2009 - Type A queries, part 4 222

Bibliography

[AAS+09] James Allan, Javed A. Aslam, Mark Sanderson, ChengXiang Zhai, and Justin Zobel, editors. Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009. ACM, 2009.
[AM06] Vo Ngoc Anh and Alistair Moffat. Pruned query evaluation using precomputed impacts. In Efthimiadis et al. [EDHJ06], pages 372–379.
[BAYS06] Chavdar Botev, Sihem Amer-Yahia, and Jayavel Shanmugasundaram. Expressiveness and performance of full-text search languages. In EDBT, pages 349–367, 2006.
[BBS10] Andreas Broschart, Klaus Berberich, and Ralf Schenkel. Evaluating the potential of explicit phrases for retrieval quality. In Cathal Gurrin, Yulan He, Gabriella Kazai, Udo Kruschwitz, Suzanne Little, Thomas Roelleke, Stefan M. Rüger, and Keith van Rijsbergen, editors, ECIR, volume 5993 of Lecture Notes in Computer Science, pages 623–626. Springer, 2010.
[BC05] Stefan Büttcher and Charles L. A. Clarke. Indexing time vs. query time: trade-offs in dynamic information retrieval systems. In Otthein Herzog, Hans-Jörg Schek, Norbert Fuhr, Abdur Chowdhury, and Wilfried Teiken, editors, CIKM, pages 317–318. ACM, 2005.
[BC06] Stefan Büttcher and Charles L. A. Clarke. A document-centric approach to static index pruning in text retrieval systems. In Proceedings of the 2006 ACM CIKM International Conference on Information and Knowledge Management, pages 182–189, 2006.
[BCH03a] Peter Bailey, Nick Craswell, and David Hawking. Engineering a multi-purpose test collection for web retrieval experiments. Information Processing and Management, 39(6):853–871, 2003.
[BCH+03b] Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Y. Zien. Efficient query evaluation using a two-level retrieval process. In CIKM, pages 426–434. ACM, 2003.
[BCL06] Stefan Büttcher, Charles L. A. Clarke, and Brad Lushman. Term proximity scoring for ad-hoc retrieval on very large text collections. In Efthimiadis et al. [EDHJ06], pages 621–622.
[BCS06] Stefan Büttcher, Charles L. A. Clarke, and Ian Soboroff. The TREC 2006 Terabyte Track. In Ellen M. Voorhees and Lori P. Buckland, editors, TREC, volume Special Publication 500-272. National Institute of Standards and Technology (NIST), 2006.
[Bei07] Michel Beigbeder. ENSM-SE at INEX 2007: Scoring with proximity. In Preproceedings of the 6th INEX Workshop, pages 53–55, 2007.
[Bei10] Michel Beigbeder. Focused retrieval with proximity scoring. In Sung Y. Shin, Sascha Ossowski, Michael Schumacher, Mathew J. Palakal, and Chih-Cheng Hung, editors, SAC, pages 1755–1759. ACM, 2010.
[BFC10] Michael Bendersky, David Fisher, and W. Bruce Croft. UMass at TREC 2010 Web Track: Term dependence, spam filtering and quality bias. In Web Track Notebook of the 19th Text REtrieval Conference, 2010.
[BGM02] Nicolas Bruno, Luis Gravano, and Amélie Marian. Evaluating top-k queries over web-accessible databases. In ICDE 2002, pages 369–380, 2002.
[BMS+06] Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, and Gerhard Weikum. IO-Top-k: index-access optimized top-k query processing. In Umeshwar Dayal, Kyu-Young Whang, David B. Lomet, Gustavo Alonso, Guy M. Lohman, Martin L. Kersten, Sang Kyun Cha, and Young-Kuk Kim, editors, VLDB, pages 475–486. ACM, 2006.
[BP98] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107–117, 1998.
[BRL06] Christopher J. C. Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In Bernhard Schölkopf, John C. Platt, and Thomas Hoffman, editors, NIPS, pages 193–200. MIT Press, 2006.
[BS08a] Andreas Broschart and Ralf Schenkel. Effiziente Textsuche mit Positionsinformation. In Hagen Höpfner and Friederike Klan, editors, Grundlagen von Datenbanken, volume 01/2008 of Technical Report, pages 101–105. School of Information Technology, International University in Germany, 2008.
[BS08b] Andreas Broschart and Ralf Schenkel. Proximity-aware scoring for XML retrieval. In Sung-Hyon Myaeng, Douglas W. Oard, Fabrizio Sebastiani, Tat-Seng Chua, and Mun-Kew Leong, editors, SIGIR, pages 845–846. ACM, 2008.
[BS09] Andreas Broschart and Ralf Schenkel. Index tuning for efficient proximity-enhanced query processing. In Geva et al. [GKT10], pages 213–217.
[BS10] Andreas Broschart and Ralf Schenkel. MMCI at the TREC 2010 Web Track. In Ellen M. Voorhees and Lori P. Buckland, editors, TREC. National Institute of Standards and Technology (NIST), 2010.
[BS11] Andreas Broschart and Ralf Schenkel. A novel hybrid index structure for efficient text retrieval. In Ma et al. [MNBY+11], pages 1175–1176.
[BS12] Andreas Broschart and Ralf Schenkel. High-performance processing of text queries with tunable pruned term and term pair indexes. ACM Transactions on Information Systems, 30(1):5:1–5:32, 2012.
[BSM95] Chris Buckley, Amit Singhal, and Mandar Mitra. New retrieval approaches using SMART: TREC 4. In TREC, 1995.
[BST08] Andreas Broschart, Ralf Schenkel, and Martin Theobald. Experiments with proximity-aware scoring for XML retrieval at INEX 2008. In Geva et al. [GKT09], pages 29–32.
[BSTW07] Andreas Broschart, Ralf Schenkel, Martin Theobald, and Gerhard Weikum. TopX @ INEX 2007. In Fuhr et al. [FKLT08], pages 49–56.
[BWZ02] Dirk Bahle, Hugh E. Williams, and Justin Zobel. Efficient phrase querying with an auxiliary index. In SIGIR, pages 215–221. ACM, 2002.
[CCB95] James P. Callan, W. Bruce Croft, and John Broglio. TREC and Tipster experiments with Inquery. Information Processing and Management, 31(3):327–343, 1995.
[CCKS07] Surajit Chaudhuri, Kenneth Ward Church, Arnd Christian König, and Liying Sui. Heavy-tailed distributions and multi-keyword queries. In Kraaij et al. [KdVC+07], pages 663–670.
[CCT97] Charles L. A. Clarke, Gordon V. Cormack, and Elizabeth A. Tudhope. Relevance ranking for one to three term queries. In RIAO, pages 388–401, 1997.
[CHKZ01] W. Bruce Croft, David J. Harper, Donald H. Kraft, and Justin Zobel, editors. SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, September 9-13, 2001, New Orleans, Louisiana, USA. ACM, 2001.
[CMS10] Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines - Information Retrieval in Practice. Addison Wesley, 2010.
[CO07] Ronan Cummins and Colm O'Riordan. An axiomatic study of learned weighting schemes. In SIGIR Learning to Rank Workshop, 2007.
[CO09] Ronan Cummins and Colm O'Riordan. Learning in a pairwise term-term proximity framework for information retrieval. In SIGIR, pages 251–258, 2009.
[CP06] Matthew Chang and Chung Keung Poon. Efficient phrase querying with common phrase index. In Mounia Lalmas, Andy MacFarlane, Stefan M. Rüger, Anastasios Tombros, Theodora Tsikrika, and Alexei Yavlinsky, editors, ECIR, volume 3936 of Lecture Notes in Computer Science, pages 61–71. Springer, 2006.
[CSS05] Charles L. A. Clarke, Falk Scholer, and Ian Soboroff. The TREC 2005 Terabyte Track. In Ellen M. Voorhees and Lori P. Buckland, editors, TREC, volume Special Publication 500-266. National Institute of Standards and Technology (NIST), 2005.
[CTL91] W. Bruce Croft, Howard R. Turtle, and David D. Lewis. The use of phrases and structured queries in information retrieval. In SIGIR, pages 32–45, 1991.
[CwH02] Kevin Chen-Chuan Chang and Seung won Hwang. Minimal probing: supporting expensive predicates for top-k queries. In Michael J. Franklin, Bongki Moon, and Anastassia Ailamaki, editors, SIGMOD Conference, pages 346–357. ACM, 2002.
[DG06a] Ludovic Denoyer and Patrick Gallinari. The Wikipedia XML Corpus. In Norbert Fuhr, Mounia Lalmas, and Andrew Trotman, editors, INEX, volume 4518 of Lecture Notes in Computer Science, pages 12–19. Springer, 2006.
[DG06b] Ludovic Denoyer and Patrick Gallinari. The Wikipedia XML Corpus. SIGIR Forum, 40(1):64–69, 2006.
[DG08] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[dKM99] Owen de Kretser and Alistair Moffat. Effective document presentation with a locality-based similarity heuristic. In SIGIR, pages 113–120. ACM, 1999.
[dKM04] Owen de Kretser and Alistair Moffat. Seft: a search engine for text. Software - Practice and Experience, 34(10):1011–1023, 2004.
[dMNZBY00] Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo A. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113–139, 2000.
[DS11] Shuai Ding and Torsten Suel. Faster top-k document retrieval using block-max indexes. In Ma et al. [MNBY+11], pages 993–1002.
[EDHJ06] Efthimis N. Efthimiadis, Susan T. Dumais, David Hawking, and Kalervo Järvelin, editors. SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, August 6-11, 2006. ACM, 2006.
[Fag87] Joel L. Fagan. Automatic phrase indexing for document retrieval: An examination of syntactic and non-syntactic methods. In SIGIR, pages 91–101, 1987.
[Fag99] Ronald Fagin. Combining fuzzy information from multiple systems. Journal of Computer and System Sciences, 58(1):83–99, 1999.
[Fag02] Ronald Fagin. Combining fuzzy information: an overview. SIGMOD Record, 31(2):109–118, 2002.
[FKLT08] Norbert Fuhr, Jaap Kamps, Mounia Lalmas, and Andrew Trotman, editors. Focused Access to XML Documents, 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2007, Dagstuhl Castle, Germany, December 17-19, 2007. Selected Papers, volume 4862 of Lecture Notes in Computer Science. Springer, 2008.
[FLN03] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences, 66(4):614–656, 2003.
[GBK00] Ulrich Güntzer, Wolf-Tilo Balke, and Werner Kießling. Optimizing multi-feature queries for image databases. In Amr El Abbadi, Michael L. Brodie, Sharma Chakravarthy, Umeshwar Dayal, Nabil Kamel, Gunter Schlageter, and Kyu-Young Whang, editors, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 419–428. Morgan Kaufmann, 2000.
[GBK01] Ulrich Güntzer, Wolf-Tilo Balke, and Werner Kießling. Towards efficient multi-feature queries in heterogeneous environments. In ITCC, pages 622–628. IEEE Computer Society, 2001.
[GKT09] Shlomo Geva, Jaap Kamps, and Andrew Trotman, editors. Advances in Focused Retrieval, 7th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2008, Dagstuhl Castle, Germany, December 15-18, 2008. Revised and Selected Papers, volume 5631 of Lecture Notes in Computer Science. Springer, 2009.
[GKT10] Shlomo Geva, Jaap Kamps, and Andrew Trotman, editors. Focused Retrieval and Evaluation, 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2009, Brisbane, Australia, December 7-9, 2009, Revised and Selected Papers, volume 6203 of Lecture Notes in Computer Science. Springer, 2010.
[Haw00] David Hawking. Overview of the TREC-9 Web Track. In TREC, 2000.
[HBLH94] William Hersh, Chris Buckley, T. J. Leone, and David Hickam. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '94, pages 192–201, New York, NY, USA, 1994. Springer-Verlag New York, Inc.
[HCT98] David Hawking, Nick Craswell, and Paul B. Thistlewaite. Overview of TREC-7 Very Large Collection Track. In TREC, pages 40–52, 1998.
[Hie98] Djoerd Hiemstra. A linguistically motivated probabilistic model of information retrieval. In Christos Nikolaou and Constantine Stephanidis, editors, ECDL, volume 1513 of Lecture Notes in Computer Science, pages 569–584. Springer, 1998.
[HVCB99] David Hawking, Ellen M. Voorhees, Nick Craswell, and Peter Bailey. Overview of the TREC-8 Web Track. In TREC, 1999.
[IBS08] Ihab F. Ilyas, George Beskales, and Mohamed A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 40(4):11:1–11:58, 2008.
[JK02] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
[JM80] Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. Pattern Recognition in Practice, pages 381–397, 1980.
[KdVC+07] Wessel Kraaij, Arjen P. de Vries, Charles L. A. Clarke, Norbert Fuhr, and Noriko Kando, editors. SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007. ACM, 2007.
[KGT+08] Jaap Kamps, Shlomo Geva, Andrew Trotman, Alan Woodley, and Marijn Koolen. Overview of the INEX 2008 Ad Hoc Track. In Geva et al. [GKT09], pages 1–28.
[KPK+07] Jaap Kamps, Jovan Pehcevski, Gabriella Kazai, Mounia Lalmas, and Stephen Robertson. INEX 2007 evaluation measures. In Fuhr et al. [FKLT08], pages 24–33.
[KPSV09] Ravi Kumar, Kunal Punera, Torsten Suel, and Sergei Vassilvitskii. Top-k aggregation using intersections of ranked inputs. In Ricardo A. Baeza-Yates, Paolo Boldi, Berthier A. Ribeiro-Neto, and Berkant Barla Cambazoglu, editors, WSDM, pages 222–231. ACM, 2009.
[Liu11] Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer, Berlin Heidelberg, 2011.
[LLYM04] Shuang Liu, Fang Liu, Clement T. Yu, and Weiyi Meng. An effective approach to document retrieval via utilizing WordNet and recognizing phrases. In SIGIR, pages 266–272, 2004.
[LS05] Xiaohui Long and Torsten Suel. Three-level caching for efficient query processing in large web search engines. In Allan Ellis and Tatsuya Hagino, editors, WWW, pages 257–266. ACM, 2005.
[LT07] Mounia Lalmas and Anastasios Tombros. INEX 2002 - 2006: Understanding XML retrieval evaluation. In Costantino Thanos, Francesca Borri, and Leonardo Candela, editors, DELOS Conference, volume 4877 of Lecture Notes in Computer Science, pages 187–196. Springer, 2007.
[LZ01] John D. Lafferty and ChengXiang Zhai. Document language models, query models, and risk minimization for information retrieval. In Croft et al. [CHKZ01], pages 111–119.
[LZ09] Yuanhua Lv and ChengXiang Zhai. Positional language models for information retrieval. In Allan et al. [AAS+09], pages 299–306.
[MBG04] Amélie Marian, Nicolas Bruno, and Luis Gravano. Evaluating top-k queries over web-accessible databases. ACM Transactions on Database Systems, 29(2):319–362, 2004.
[MBSC97] Mandar Mitra, Chris Buckley, Amit Singhal, and Claire Cardie. An analysis of statistical and syntactic phrases. In RIAO, pages 200–217, 1997.
[MC05] Donald Metzler and W. Bruce Croft. A Markov random field model for term dependencies. In Ricardo A. Baeza-Yates, Nivio Ziviani, Gary Marchionini, Alistair Moffat, and John Tait, editors, SIGIR, pages 472–479. ACM, 2005.
[MdR05] Gilad Mishne and Maarten de Rijke. Boosting web retrieval through query operations. In David E. Losada and Juan M. Fernández-Luna, editors, ECIR, volume 3408 of Lecture Notes in Computer Science, pages 502–516. Springer, 2005.
[Met06a] Donald Metzler. Estimation, sensitivity, and generalization in parameterized retrieval models. In Philip S. Yu, Vassilis J. Tsotras, Edward A. Fox, and Bing Liu, editors, CIKM, pages 812–813. ACM, 2006.
[Met06b] Donald Metzler. Estimation, sensitivity, and generalization in parameterized retrieval models (extended version). Technical report, University of Massachusetts, 2006.
[MNBY+11] Wei-Ying Ma, Jian-Yun Nie, Ricardo A. Baeza-Yates, Tat-Seng Chua, and W. Bruce Croft, editors. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011. ACM, 2011.
[Mon04] Christof Monz. Minimal span weighting retrieval for question answering. In Rob Gaizauskas, Mark Greenwood, and Mark Hepple, editors, Proceedings of the SIGIR Workshop on Information Retrieval for Question Answering, pages 23–30, 2004.
[MOT11] Craig MacDonald, Iadh Ounis, and Nicola Tonellotto. Upper-bound approximations for dynamic pruning. ACM Transactions on Information Systems, 29(4):17:1–17:28, 2011.
[MRS08] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[MSC06] Donald Metzler, Trevor Strohman, and W. Bruce Croft. Indri TREC notebook 2006: Lessons learned from three Terabyte Tracks. In TREC, 2006.
[MSTC04] Donald Metzler, Trevor Strohman, Howard R. Turtle, and W. Bruce Croft. Indri at TREC 2004: Terabyte Track. In Ellen M. Voorhees and Lori P. Buckland, editors, TREC, volume Special Publication 500-261. National Institute of Standards and Technology (NIST), 2004.
[NC10] Dong Nguyen and Jamie Callan. Combination of evidence for effective web search. In Proceedings of the 19th Text REtrieval Conference, 2010.
[PA97] R. Papka and J. Allan. Why bigger windows are better than small ones. Technical report, CIIR, 1997.
[PC98] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In SIGIR, pages 275–281. ACM, 1998.
[PLM08] Riccardo Poli, William B. Langdon, and Nicholas Freitag McPhee. A Field Guide to Genetic Programming. lulu.com, 2008.
[PRL+07] Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, and Karl Aberer. Scalable peer-to-peer web retrieval with highly discriminative keys. In ICDE, pages 1096–1105. IEEE, 2007.
[RJ76] S. E. Robertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146, 1976.
[RS03] Yves Rasolofo and Jacques Savoy. Term proximity scoring for keyword-based retrieval systems. In Fabrizio Sebastiani, editor, ECIR, volume 2633 of Lecture Notes in Computer Science, pages 207–218. Springer, 2003.
[RW94] Stephen E. Robertson and Steve Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In W. Bruce Croft and C. J. van Rijsbergen, editors, SIGIR, pages 232–241. ACM/Springer, 1994.
[RWHB+95] Stephen E. Robertson, Steve Walker, Micheline Hancock-Beaulieu, Mike Gatford, and A. Payne. Okapi at TREC-4. In TREC, 1995.
[RZT04] Stephen E. Robertson, Hugo Zaragoza, and Michael J. Taylor. Simple BM25 extension to multiple weighted fields. In CIKM, pages 42–49, 2004.
[SBH+07] Ralf Schenkel, Andreas Broschart, Seung Won Hwang, Martin Theobald, and Gerhard Weikum. Efficient text proximity search. In Nivio Ziviani and Ricardo A. Baeza-Yates, editors, SPIRE, volume 4726 of Lecture Notes in Computer Science, pages 287–299. Springer, 2007.
[SC07] Trevor Strohman and W. Bruce Croft. Efficient document retrieval in main memory. In Kraaij et al. [KdVC+07], pages 175–182.
[SCC+01] Aya Soffer, David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, and Yoëlle S. Maarek. Static index pruning for information retrieval systems. In Croft et al. [CHKZ01], pages 43–50.
[SI00] Satoshi Sekine and Hitoshi Isahara. IREX: IR and IE evaluation-based project in Japanese. In The Second International Conference on Language Resources and Evaluation, 2000.
[SKK10] Krysta Marie Svore, Pallika H. Kanani, and Nazan Khan. How good is a span of terms?: exploiting proximity to improve web retrieval. In Fabio Crestani, Stéphane Marchand-Maillet, Hsin-Hsi Chen, Efthimis N. Efthimiadis, and Jacques Savoy, editors, SIGIR, pages 154–161. ACM, 2010.
[SSK07] Ralf Schenkel, Fabian M. Suchanek, and Gjergji Kasneci. YAWN: A semantically annotated Wikipedia XML corpus. In Alfons Kemper, Harald Schöning, Thomas Rose, Matthias Jarke, Thomas Seidl, Christoph Quix, and Christoph Brochhaus, editors, BTW, volume 103 of LNI, pages 277–291. GI, 2007.
[SSLM+09] Michal Shmueli-Scheuer, Chen Li, Yosi Mass, Haggai Roitman, Ralf Schenkel, and Gerhard Weikum. Best-effort top-k query processing under budgetary constraints. In ICDE, pages 928–939. IEEE, 2009.
[STW+08] Ruihua Song, Michael J. Taylor, Ji-Rong Wen, Hsiao-Wuen Hon, and Yong Yu. Viewing term proximity from a different perspective. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, ECIR, volume 4956 of Lecture Notes in Computer Science, pages 346–357. Springer, 2008.
[SWY75] Gerard Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[TAS08] Martin Theobald, Mohammed AbuJarour, and Ralf Schenkel. TopX 2.0 at the INEX 2008 Efficiency Track. In Geva et al. [GKT09], pages 224–236.
[TAS09] Martin Theobald, Ablimit Aji, and Ralf Schenkel. TopX 2.0 at the INEX 2009 Ad-Hoc and Efficiency Tracks. In Geva et al. [GKT10], pages 218–228.
[TJG09] Andrew Trotman, Xiangfei Jia, and Shlomo Geva. Fast and effective focused retrieval. In Geva et al. [GKT10], pages 229–241.
[TMO10] Nicola Tonellotto, Craig Macdonald, and Iadh Ounis. Efficient dynamic pruning with proximity support. In LSDS-IR Workshop, pages 31–35, 2010.
[TSW05] Martin Theobald, Ralf Schenkel, and Gerhard Weikum. An efficient and versatile query engine for TopX search. In Klemens Böhm, Christian S. Jensen, Laura M. Haas, Martin L. Kersten, Per-Åke Larson, and Beng Chin Ooi, editors, VLDB, pages 625–636. ACM, 2005.
[TWS04] Martin Theobald, Gerhard Weikum, and Ralf Schenkel. Top-k query evaluation with probabilistic guarantees. In Mario A. Nascimento, M. Tamer Özsu, Donald Kossmann, Renée J. Miller, José A. Blakeley, and K. Bernhard Schiefer, editors, VLDB, pages 648–659. Morgan Kaufmann, 2004.
[TZ07] Tao Tao and ChengXiang Zhai. An exploration of proximity measures in information retrieval. In Kraaij et al. [KdVC+07], pages 295–302.
[UIF+08] Yukio Uematsu, Takafumi Inoue, Kengo Fujioka, Ryoji Kataoka, and Hayato Ohwada. Proximity scoring using sentence-based inverted index for practical full-text search. In Birte Christensen-Dalsgaard, Donatella Castelli, Bolette Ammitzbøll Jurik, and Joan Lippincott, editors, ECDL, volume 5173 of Lecture Notes in Computer Science, pages 308–319. Springer, 2008.
[VH97] Ellen M. Voorhees and Donna Harman. Overview of the Fifth Text REtrieval Conference (TREC-5). In Proceedings of the 5th Text REtrieval Conference, pages 1–28, 1997.
[VH00] Ellen M. Voorhees and Donna Harman. Overview of the Eighth Text REtrieval Conference (TREC-8), 2000.
[W+99] Hugh E. Williams et al. What's next? Index structures for efficient phrase querying. In Australasian Database Conference, pages 141–152, 1999.
[Was05] Larry Wasserman. All of Statistics. Springer, 2005.
[Whi09] Tom White. Hadoop: The Definitive Guide. O'Reilly, 2009.
[WK09] Judith Winter and Gerold Kühne. Achieving high precisions with peer-to-peer is possible! In Geva et al. [GKT10], pages 242–253.
[WLM11] Lidan Wang, Jimmy J. Lin, and Donald Metzler. A cascade ranking model for efficient ranked retrieval. In Ma et al. [MNBY+11], pages 105–114.
[WMB99] I. H. Witten, A. Moffat, and T. Bell. Managing Gigabytes. Morgan Kaufmann, San Francisco, 1999.
[WZB04] Hugh E. Williams, Justin Zobel, and Dirk Bahle. Fast phrase querying with combined indexes. ACM Transactions on Information Systems, 22(4):573–594, 2004.
[XL07] Jun Xu and Hang Li. AdaRank: a boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007, pages 391–398, New York, NY, USA, 2007. ACM.
[YDS09a] Hao Yan, Shuai Ding, and Torsten Suel. Compressing term positions in web indexes. In Allan et al. [AAS+09], pages 147–154.
[YDS09b] Hao Yan, Shuai Ding, and Torsten Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 401–410, New York, NY, USA, 2009. ACM.
[YSZ+10] Hao Yan, Shuming Shi, Fan Zhang, Torsten Suel, and Ji-Rong Wen. Efficient term proximity search with term-pair indexes. In Jimmy Huang, Nick Koudas, Gareth Jones, Xindong Wu, Kevyn Collins-Thompson, and Aijun An, editors, CIKM, pages 1229–1238. ACM, 2010.
[Z+07] Wei Zhang et al. Recognition and classification of noun phrases in queries for effective retrieval. In CIKM, pages 711–720, 2007.
[ZL04] ChengXiang Zhai and John D. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214, 2004.
[ZM98] Justin Zobel and Alistair Moffat. Exploring the similarity space. SIGIR Forum, 32(1):18–34, 1998.
[ZM06] Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2):1–56, 2006.
[ZSLW07] Mingjie Zhu, Shuming Shi, Mingjing Li, and Ji-Rong Wen. Effective top-k computation in retrieving structured documents with term-proximity support. In Mário J. Silva, Alberto H. F. Laender, Ricardo A. Baeza-Yates, Deborah L. McGuinness, Bjørn Olstad, Øystein Haug Olsen, and André O. Falcão, editors, CIKM, pages 771–780. ACM, 2007.
[ZSYW08] Mingjie Zhu, Shuming Shi, Nenghai Yu, and Ji-Rong Wen. Can phrase indexing help to process non-phrase queries? In James G. Shanahan, Sihem Amer-Yahia, Ioana Manolescu, Yi Zhang, David A. Evans, Aleksander Kolcz, Key-Sun Choi, and Abdur Chowdhury, editors, CIKM, pages 679–688. ACM, 2008.
[ZY09] Jinglei Zhao and Yeogirl Yun. A proximity language model for information retrieval. In Allan et al. [AAS+09], pages 291–298.