High-Performance Annotation Tagging over Solr Full-text Indexes

Michele Artini, Claudio Atzori, Sandro La Bruzzo, Paolo Manghi, Marko Mikulicic, and Alessia Bardi

INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2014

ABSTRACT

In this work, we focus on the problem of annotation tagging over information spaces of objects stored in a full-text index. In such a scenario, data curators assign tags to objects for the purpose of classification, while generic end users perceive tags as searchable and browsable object properties. To carry out their activities, data curators need annotation tagging tools that allow them to bulk tag or untag large sets of objects in temporary work sessions, where they can virtually and in real time experiment with the effect of their actions before making the changes visible to end users. The implementation of these tools over full-text indexes is a challenge because bulk object updates in this context are far from real-time and in critical cases may slow down index performance. We devised TagTick, a tool that offers data curators a fully functional annotation tagging environment over the full-text index Apache Solr, regarded as a de facto standard in this area. TagTick consists of a TagTick Virtualizer module, which extends the API of Solr to support real-time, virtual, bulk-tagging operations, and a TagTick User Interface module, which offers end-user functionalities for annotation tagging. The tool scales optimally with the number and size of bulk tag operations without compromising index performance.

INTRODUCTION

Tags are generally conceived as nonhierarchical terms (or keywords) assigned to an information object (e.g., a digital image, a document, a metadata record) in order to enrich its description beyond the one provided by object properties. The enrichment is intended to improve the way end users (or machines) can search, browse, evaluate, and select the objects they are looking for.
Examples are qualificative terms, i.e., terms associating the object with a class (e.g., biology, computer science, literature), or qualitative terms, i.e., terms associating the object with a given measure of value (e.g., rank in a range, opinion).1 Approaches differ in the way tags are generated. In some cases users (or machines)2 freely and collaboratively produce tags,3 thereby generating so-called folksonomies.

Michele Artini (michele.artini@isti.cnr.it), Claudio Atzori (claudio.atzori@isti.cnr.it), Sandro La Bruzzo (sandro.labruzzo@isti.cnr.it), Paolo Manghi (paolo.manghi@isti.cnr.it), and Marko Mikulicic (marko.mikulicic@isti.cnr.it) are researchers at Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo," Consiglio Nazionale delle Ricerche, Pisa, Italy. Alessia Bardi (alessia.bardi@for.unipi.it) is a researcher at the Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Italy.
The natural heterogeneity of folksonomies calls for solutions, such as tag clouds, to harmonize their usage and make it more effective.4 In other approaches users can pick tags from a given set of values (e.g., vocabulary, ontology, range), or hybrid solutions can be found, where a degree of freedom is still permitted.5,6 A further differentiation is introduced by semantically enriched tags, which are tags contextualized by a label or prefix that provides an interpretation for the tag.7 For example, in the digital library world, scientific article objects could be annotated with subject tags according to two tag interpretations, the ACM scientific disciplines and the Dewey Decimal Classification, whose term ontologies are different.8

The action of tagging is commonly intended as the practice of end users or machines assigning or removing tags to or from the objects of an information space. An information space is a digital space a user community populates with information objects for the purpose of enabling content sharing and providing integrated access to different but related collections of information objects.9 The effect of tagging information objects in an information space may be private, i.e., visible to the users who tagged the objects or to a group of users sharing the same right, or public, i.e., visible to all users.10

Many well-known websites allow end users to tag web resources. For example, Delicious11 (http://delicious.com) allows users to tag web links with free and public keywords; Stack Overflow (http://stackoverflow.com), which lets users ask and answer questions about programming, allows tagging of question threads with free and public keywords; Gmail12 (http://mail.gmail.com) allows users to tag emails, and tags are at the same time transparently used to encode email folders.
In the digital library context, the portal Europeana (http://www.europeana.eu) allows authenticated end users to tag metadata records with free keywords to create a private set of annotations.

In this work we shall focus on annotation tagging, that is, tagging used as a manual data curation technique to classify (i.e., attach semantics to) the objects of an information space. In such a scenario, tags are drawn from controlled vocabularies whose purpose is classification.13,14 Unlike semantic annotation scenarios, where semantic tags may be semiautomatically generated and assigned to objects,15 in annotation tagging authorized data curators are equipped with search tools to identify the sets of objects they believe should or should not belong to a given category (identified by a tag), and to eventually perform the tagging or untagging actions required to apply the intended classification. In general, such operations may assign or remove tags to and from an arbitrarily large subset of objects of the information space. It is therefore hard to predict the quality and consistency of the combined effect of a number of such actions. As a consequence, data curators must rely on virtual tagging functionalities that allow them to bulk (un)tag sets of objects in temporary work sessions, where they can in real time preview and experiment with (do/undo) the effects of their actions before making the changes visible to end users.

Examples of scenarios that may require annotation tagging can be found in many fields of application. This is the case, for example, in several data infrastructures funded by the European Commission FP7 program, which share the common goal of populating very large information spaces by aggregating textual metadata records collected from several data sources.
Examples are the data infrastructures for DRIVER,16 Heritage of the People's Europe (HOPE),17 European Film Gateway (EFG and EFG1914),18 OpenAIRE19 (http://www.openaire.eu), and Europeana. In such contexts, the aggregated records are potentially heterogeneous, not sharing common classification schemes, and annotation tagging becomes a powerful means to make the information space more effectively consumable by end users.

There are two significant challenges to be tackled in the realization of annotation tagging tools. First is the need to support bulk-tagging actions in almost real time, so that data curators need not wait long for their actions to complete. Second, bulk-tagging actions need to be virtualized over the information space, so that data curators can verify the quality of their actions before committing them, and access to the information space is unaffected by such actions.

Naturally, the feasibility and quality of annotation tagging tools strictly depend on the data management system adopted to index and search objects of the information space. In general, so as not to compromise information space availability, bulk updates are based either on offline, efficient strategies, which minimize the update delay,20 or on virtualization techniques, which perform the update in such a way that users have the impression it has already completed.21

In this work, we target the specific problem of annotation tagging of information spaces whose objects are documents in a Solr full-text index (v3.6).22 Solr is an open-source Apache project delivering a full-text index whose instances are capable of scaling up to millions of records and benefit from horizontal clustering, replica handling, and production-quality performance for concurrent queries and bulk updates.
The index is widely adopted in the literature and often in contexts where annotation tagging is required, such as the aforementioned aggregative data infrastructures. The implementation of virtual and bulk-tagging facilities over Solr information spaces is a challenge, since bulk updates of Solr objects are fast, but far from real-time when large sets of objects are involved. In general, independently of the configuration, a re-indexing of millions of objects may take up to some hours, while for real-time previews even minutes would not be acceptable. Moreover, in critical cases, update actions may also slow down index performance and compromise access to the information space.

In this paper, we present TagTick, a tool that implements facilities for annotation tagging over Solr with no noticeable performance degradation with respect to the original index. TagTick consists of two main modules: the TagTick Virtualizer, which implements functionalities for real-time bulk (un)tagging in the context of work sessions for Solr, and the TagTick User Interface, which implements user interfaces for data curators to create, operate, and commit work sessions, so as to produce newly tagged information spaces. TagTick software can be demoed and downloaded from http://nemis.isti.cnr.it/product/tagtick-authoritative-tagging-apache-solr.

ANNOTATION TAGGING

Annotation tagging is a process operated by data curators whose aim is to improve the end user's search experience over an information space. Specifically, the activity consists of assigning searchable and browsable tags to objects in order to classify and logically structure the information space into further (and possibly overlapping) meta-classes of objects.
Moreover, when ontologies published on the web are used, for example ontologies available as linked data such as the GeoNames ontology (http://www.geonames.org/ontology/documentation.html) or the DBpedia ontology (http://dbpedia.org/Ontology), then tags become a means to link objects in the information space to external resources. In this section, we shall describe the functional requirements of annotation tagging in order to introduce the assumptions and nomenclature used in the remainder of the paper.

Information Space: Objects, Classes, Tags, and Queries

We define an information space as a set of objects of different classes C1 . . . Ck. Each class Ci has a structure (l1 : V1, . . . , ln : Vn), where the lj's are object property labels and the Vj's are the types of the property values. Types can be value domains, such as strings, integers, dates, or controlled vocabularies of terms. In its general definition, annotation tagging has to do with semantically enriched tagging, where a tag consists of a pair (i, t) made of a tag interpretation i and a tag value t from a term ontology T; as an example of interpretation consider the ACM subject classification scheme (e.g., i = ACM), where T is the set of ACM terms.

In this context, tagging is decoupled from the information space and can be configured a posteriori. Typically, given an information space, data curators set up the annotation tagging environment by (i) defining the interpretation/ontology pairs to be used for classification, and (ii) assigning to each class C the interpretations to be used to tag its objects. As a result, class structures are enriched with a set of interpretations (i1 : T1 . . . im : Tm), where the ij's are tag interpretation labels and the Tj's the relative ontologies. Unless otherwise specified, an object may be assigned multiple tag values for the same tag interpretation; e.g., scientific publication objects may cover different ACM scientific disciplines.
Finally, the information space can reply to queries q formed according to the abstract syntax in table 1, where Op is a generic Boolean operator (dependent on the underlying data management system, e.g., "=," "<," ">") and C ∈ {C1, . . . , Ck}. Tag predicates (i = t) and class predicates (class = C) represent exact matches, which mean "the object is tagged with the tag (i, t)" and "the object belongs to class C."

q ::= (q And q) | (q Or q) | (l Op v) | (i = t) | (class = C) | v | ε

Table 1. Solr Query Language.

Virtual and Real-time Tagging

In annotation tagging, data curators apply bulk (un)tagging actions with respect to a tag (i, t) over arbitrarily large sets of objects returned by queries q over the information space. Due to the potential impact that such operations may have on the information space, tools for annotation tagging should allow data curators to perform their actions in a protected environment called a work session. In such an environment curators can test sequences of bulk (un)tagging actions and incrementally shape an information space preview: they may view the history of such actions, undo some of them, add new actions, and pose queries to test the quality of their actions. To offer a usable annotation tagging tool, it is mandatory for such actions to be performed in (almost) real time. For example, curators should not wait more than a few seconds to test the result of tagging 1 million objects, an action which they might undo immediately after. Moreover, such actions should not conflict (e.g., slow performance) with the activities of end users running queries on the information space. Finally, when data curators believe the preview has reached its maturity, they can commit the work session, i.e., materialize the preview in the information space, and make the changes visible to end users.
APACHE SOLR AND ANNOTATION TAGGING

As mentioned in the introduction, our focus is on annotation tagging for Apache Solr (v3.6). This section describes the main information space features and functionalities of the Solr full-text index search platform. In particular, it explains the issues arising when using its native APIs to implement bulk real-time tagging as described previously.

Solr Information Spaces: Objects, Classes, Tags, and Queries

Solr is one of the most popular full-text indexes. It is an Apache open source Java project that offers a scalable, high-performance, and cross-platform solution for efficient indexing and querying of information spaces made of millions of objects (documents in Solr jargon).23 A Solr index stores a set of objects, each consisting of a flat list of possibly repeated and unordered fields, each associated with a value. Each object is referable by a unique identifier generated by the index at indexing time. The information spaces described previously can be modeled straightforwardly in Solr. Each object in the index contains field-value pairs relative to the properties and tag interpretations of all the classes it belongs to. Moreover, we shall assume that all objects share one field named class whose values indicate the classes (e.g., C1, . . . , Ck) to which the object belongs. Such an assumption does not restrict the application domain, since classes are typically encoded in Solr by a dedicated field. The Solr API provides methods to search objects by general keywords, field values, field ranges, fuzzy terms, and other advanced search options, plus methods for the bulk addition and deletion of objects. In our study, we shall restrict ourselves to the search method query(q, qf), where q and qf are CQL queries referred to as the "main query" and the "filter query," respectively.
In particular, in order to match the query language requirements described previously, we shall assume that q and qf are expressed according to the CQL subset matching the query language in table 1.

getDocset : RS → DS. Returns the docset relative to a result set.
intersectDocsets : DS × DS → DS. Returns the intersection of two docsets.
intersectSize : DS × DS → Integer. Returns the size of the intersection of two docsets.
unifyDocsets : DS × DS → DS. Returns the union of two docsets.
andNotDocsets : DS × DS → DS. Given two docsets ds1 and ds2, returns the docset {d | d ∈ ds1 ∧ d ∉ ds2}.
searchOnDocset : Q × DS → RS. Executes a query q over a docset and returns the relative result set.

Table 2. Solr Docset Management Low-Level Interface.

To describe the semantics of query(q, qf) it is important to make a distinction between the Solr notions of result set and docset. In Solr, the execution of a query returns a result set (i.e., QueryResponse in Solr jargon) that logically contains all objects matching the query. In practice, a result set is conceived to be returned at the time of execution to offer instant access to the query result, which is meanwhile computed and stored in memory in a low-level Solr data structure called a docset. Docsets are internal Solr data structures, which contain lists of object identifiers and allow for efficient operations such as union and intersection of very large sets of objects to optimize query execution. Table 2 illustrates some of the methods used internally by Solr to handle docsets. Method names have been chosen to be self-explanatory and therefore do not match the ones used in the libraries of Solr.
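The operations in table 2 can be sketched by letting plain Python sets of object identifiers stand in for Solr's internal docsets; this is an illustrative model only (Solr's real docsets are optimized bit-set structures), and the snake_case names merely mirror the table.

```python
# Illustrative model of the table 2 interface: a docset is a set of
# object identifiers, so the operations reduce to set algebra.

def intersect_docsets(ds1, ds2):
    """intersectDocsets: objects present in both docsets."""
    return ds1 & ds2

def intersect_size(ds1, ds2):
    """intersectSize: size of the intersection of two docsets."""
    return len(ds1 & ds2)

def unify_docsets(ds1, ds2):
    """unifyDocsets: union of two docsets."""
    return ds1 | ds2

def and_not_docsets(ds1, ds2):
    """andNotDocsets: {d | d in ds1 and d not in ds2}."""
    return ds1 - ds2
```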
⟦query(q, qf)⟧Solr = {d | ID(d) ∈ ⟦q⟧DS}                 if qf = null
⟦query(q, qf)⟧Solr = searchOnDocset(q, ⟦qf⟧Cache(ϕ))      if qf ≠ null

⟦qf⟧Cache(ϕ) = ds                                         if ϕ(qf) = ds
⟦qf⟧Cache(ϕ) = ⟦qf⟧Cache(ϕ[qf ← ⟦qf⟧DS])                  if ϕ(qf) = ⊥

⟦(q1 And q2)⟧DS = ⟦q1⟧DS ∩ ⟦q2⟧DS
⟦(q1 Or q2)⟧DS = ⟦q1⟧DS ∪ ⟦q2⟧DS
⟦(l Op v)⟧DS = {ID(d) | d.l Op v}
⟦(i = t)⟧DS = {ID(d) | d.i = t}

Table 3. Semantic Functions.

Informally, query(q, qf) returns the result set of objects matching the query q intersected with the objects matching the filter query qf, i.e., its semantics is equivalent to that of the command query(q And qf, null). In practice, the usage of a filter query qf is intended to efficiently reduce the scope of q to the set of objects whose identifiers are in the docset of qf. To this aim, Solr keeps in memory a filter cache ϕ : Q → DS. The first time a filter query qf is received, Solr executes it and stores the relative docset ds in ϕ, where it can be accessed to optimize the execution of query(q, qf). Once the docset ϕ(qf) = ds is available, query(q, qf) invokes the low-level method searchOnDocset(q, ds) (see table 2). The method executes q to obtain its docset, efficiently intersects such a docset with ds, and populates the result set relative to the query. Due to the efficiency of docset intersection and in-memory data structures, query execution time is closely limited to the one necessary to execute q. Table 3 shows the semantic functions ⟦.⟧Solr : Q × Q → RS, ⟦.⟧DS : Q → DS, and ⟦.⟧Cache : Q × ℘(Q × DS) → DS. The first yields the result set of query(q, qf); the second the docset relative to a query q (where d is an object); and the third resolves queries into docsets by means of a filter cache ϕ.

Limits to Virtual and Real-Time Tagging in Solr

Whilst Solr is a well-known and established solution for full-text indexing over very large information spaces, it poses challenges for higher-level applications willing to expose to users private, modifiable views of the same index.
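The role of the filter cache ϕ in table 3 can be mimicked as follows; here evaluate_docset is a hypothetical stand-in for computing ⟦q⟧DS against the index, and the dict-based cache is only a model of Solr's behaviour, not its actual API.

```python
# Sketch of query(q, qf) with a filter cache (table 3): the first
# time a filter query qf arrives, its docset is computed and cached;
# later queries reuse it and only pay for the main query q.

class FilterCache:
    def __init__(self, evaluate_docset):
        self._eval = evaluate_docset  # hypothetical q -> docset, i.e. [[q]]_DS
        self._phi = {}                # the cache phi: filter query -> docset

    def docset(self, qf):
        if qf not in self._phi:       # phi(qf) = bottom: evaluate once
            self._phi[qf] = self._eval(qf)
        return self._phi[qf]

def query(q, qf, evaluate_docset, cache):
    """Model of query(q, qf): the docset of q intersected with phi(qf)."""
    ds_q = evaluate_docset(q)
    if qf is None:
        return ds_q
    return ds_q & cache.docset(qf)    # searchOnDocset(q, phi(qf))
```

The point of the sketch is that repeating the same filter query costs one docset intersection, not a second index scan.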
This is the case for annotation tagging tools, which must provide data curators with work sessions where they can update a logical view of the information space with tagging and untagging actions, while still providing end users with search facilities over the last committed information space. Since the Solr API does not natively provide "view management" primitives, the only approach would be that of materializing tagging and untagging actions in the index while making sure that such changes are not visible to end users. Prefixing tags with work session identifiers, cloning tagged objects, or keeping index replicas may be valuable techniques, but all fail to deliver the real-time requirement described previously. This is due to the fact that when very large sets of objects are involved the re-indexing phase is generally far from real-time. In general, independently of the configuration, processing such requests may take up to some hours for millions of objects, while for real-time previews even minutes would not be acceptable.

TAGTICK VIRTUALIZER: VIRTUAL REAL-TIME TAGGING FOR SOLR

This section presents the TagTick Virtualizer module, the solution devised to overcome the inability of Apache Solr to support out-of-the-box real-time virtual views over information spaces. The Virtualizer API, shown in table 4, supports methods for creating, deleting, and committing work sessions, and, in the context of a work session, (1) performing tagging/untagging actions and (2) querying the information space modified by such actions. In the following we describe both the functional semantics and the implementation of the API, given in terms of a formal symbolic notation. The semantics defines the expected behaviour of the API and is provided in terms of the semantics of Solr.
The implementation defines the realization of the API methods in terms of the low-level docset management library of Solr. The right side of figure 1 illustrates the layering of functionalities required to implement the TagTick Virtualizer module. As shown, the realization of the module required exposing the Solr low-level docset library through an API.

Figure 1. TagTick Virtualizer: The Architecture.

TagTick Virtualizer API: the intended semantics

The command createSession() creates a new session s, intended as a sequence of (un)tagging actions over an initial information space I. The command deleteSession(s) removes the session s from the environment. We denote the virtual information space obtained by modifying I with the actions in s as I(s); note that I(ε) = I.

createSession(). Creates and returns a work session s.
deleteSession(s). Deletes a work session s.
commitSession(s). Commits a work session s.
action(A, rs, (i, t), s). Applies the action A with tag (i, t) to all objects in rs in session s.
virtQuery(q, s). Executes q over the information space I(s).

Table 4. TagTick Virtualizer API: The Methods.

The command action(A, rs, (i, t), s), depending on the value of A being tag or untag, applies the relative action for the tag (i, t) to all objects in rs in the context of the session s. (Un)tagging actions occur in the context of a session s, hence update the scope of the information space I(s). The construction of such an rs takes place in the annotation tagging tool user interface and may require several queries before all objects to be bulk (un)tagged are collected. Annotation tagging tools may, for example, provide web-basket mechanisms to support curators in this process. The command commitSession(s) makes the virtual information space I(s) persistent, i.e., materializes the bulk updates collected in session s. Once this operation is completed, the session s is deleted.
The command virtQuery(q, s) executes a virtual search whose semantics is that of Solr's method query(q, null) executed over I(s). More formally, let us extend the semantic function ⟦.⟧Solr to include the information space scope of the execution, that is, ⟦query(q, qf)⟧Solr^I is the semantics of query(q, qf) over a given information space I. Then we can define:

⟦virtQuery(q, s)⟧TV = ⟦query(q, null)⟧Solr^I(s)

TagTick Virtualizer API: The implementation

To provide its functionalities in real time, the TagTick Virtualizer avoids any form of update action on the index. The module emulates the application of bulk (un)tagging actions over the information space by exploiting the Solr low-level library for docset management, whose methods are shown in table 2. The underlying intuition is based on two considerations: (1) the action action(A, rs, (i, t), s) can be encoded in memory as an association between the tag (i, t) and the objects in the docset ds relative to rs in the context of s; and (2) the subset of objects ds should be returned to the query (i = t) if executed over I in the scope of s (i.e., as if I had been updated with such an action). By following this approach, the module may rewrite and execute calls of the form virtQuery(q And (i = t)) as calls searchOnDocset(q, ds), thereby emulating the real-time execution of the query over the information space I(s). More generally, any query of the form q And qtag-predicates, where qtag-predicates is a query combining tag predicates relative to tags touched in the session, can be rewritten as searchOnDocset(q, ds). In such cases, ds is obtained by combining the docsets relative to the tag predicates by means of the low-level methods intersectDocsets and unifyDocsets.
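The rewriting intuition can be sketched for the simple case virtQuery(q And (i = t), s): the tag predicate is answered from the in-memory session map rather than from the index. Here rho, evaluate_docset, and the tuple key (s, i, t) are illustrative assumptions, not the module's actual data structures.

```python
# Sketch: virtQuery(q And (i = t), s) emulated as searchOnDocset(q, ds),
# where ds is taken from the session map rho when the tag was touched
# in session s, and from the committed index otherwise.

def virt_query_conjunct(q, i, t, s, rho, evaluate_docset):
    ds = rho.get((s, i, t))
    if ds is None:
        # tag untouched in session s: fall back to the committed index
        ds = evaluate_docset((i, t))
    # searchOnDocset(q, ds): intersect q's docset with ds
    return evaluate_docset(q) & ds
```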
The TagTick Virtualizer module implements the aforementioned session cache by means of an in-memory map ρ : S × I × T → DS, which caches the tagging status of all active work sessions. To this aim, ρ maps triples (s, i, t) onto docsets ds, defined as the set of objects tagged with the tag (i, t) in the context of s at the time of the request. The TagTick Virtualizer is stateless with regard to the specific tag and session identifiers it is called to handle; such information is typically held in the applications using the module to take advantage of real-time, virtual tagging mechanisms.

Tagging and untagging actions

The method action(A, rs, (i, t), s) has the effect of changing the status ρ to reflect the action of tagging or untagging the objects in the result set rs with the tag (i, t) in the session s. Table 5 describes the effect of the command over the status ρ in terms of the semantic function ⟦.⟧M : C × ℘(S × I × T) → ℘(S × I × T), which takes a command C and a status ρ and returns the status ρ affected by C. In order to optimize the memory heap, ρ is populated following a lazy approach, according to which a new entry for the key (s, i, t) is created when the first tagging or untagging action with respect to the tag (i, t) is performed in the scope of s. When the user adds or removes a tag (i, t) for the first time in the session s (case ρ(s, i, t) = ⊥), the value of the entry ρ(s, i, t) is initialized to the docset relative to the query i = t:

ds = getDocset(⟦query((i = t), null)⟧Solr^I)

The function init(ρ, s, i, t) returns such a new ρ, over which the tag or untag action is eventually executed. If the action involves a tag (i, t) for which an entry ρ(s, i, t) = ds exists (case ρ(s, i, t) ≠ ⊥), the command returns the new ρ obtained by adding or removing the docset getDocset(rs) to or from ds. Such actions are performed in memory with minimal execution time.
⟦action(A, rs, (i, t), s)⟧M(ρ) = updateTag(ρ, rs, (i, t), s)                      if (A = tag And ρ(s, i, t) ≠ ⊥)
⟦action(A, rs, (i, t), s)⟧M(ρ) = updateUntag(ρ, rs, (i, t), s)                    if (A = untag And ρ(s, i, t) ≠ ⊥)
⟦action(A, rs, (i, t), s)⟧M(ρ) = ⟦action(A, rs, (i, t), s)⟧M(init(ρ, s, i, t))    if (ρ(s, i, t) = ⊥)

init(ρ, s, i, t) = ρ[ρ(s, i, t) ← getDocset(⟦query((i = t), null)⟧Solr^I)]
updateTag(ρ, rs, (i, t), s) = ρ[ρ(s, i, t) ← ρ(s, i, t) ∪ getDocset(rs)]
updateUntag(ρ, rs, (i, t), s) = ρ[ρ(s, i, t) ← ρ(s, i, t) ∖ getDocset(rs)]

Table 5. Semantics of tag/untag commands.

Queries over a Virtual Information Space

As mentioned above, the command virtQuery(q, s) is implemented by executing the low-level method searchOnDocset(q', ds). Informally, q' is the subpart of q whose predicates are not affected by actions in s, while ds is the subset of objects matching tag predicates affected by actions in s, to be calculated by means of the map ρ. To make this statement precise, two main issues must be addressed. The first one is syntactic: how to extract from q the subquery q' and the subquery to be filtered through ρ to generate ds. The second issue is semantic: the misalignment between the objects in the original information space I, where searchOnDocset is executed, and the ones in I(s), to be virtually queried over and returned by virtQuery.

Syntactic issue: To obtain q' and ds from q, the TagTick Virtualizer module includes a Query Rewriter module that is in charge of rewriting q as a query:

q' And qtags-in-session      (1)

Both queries are compliant with the query grammar in table 1, but the second is a query that groups all tag predicates in q which are affected by s. The reason for this restriction is that the method searchOnDocset(q', ds) performs an intersection between the docset ds and the docset obtained from the execution of q'. In principle, qtags-in-session may contain arbitrary combinations of tag predicates (i = t) combined with And and Or operators.
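The lazy initialization and the two update cases of table 5 can be sketched as below; rho is the session map modelled as a dict, and index_docset(i, t) is a hypothetical stand-in for getDocset(⟦query((i = t), null)⟧Solr).

```python
# Sketch of the table 5 semantics: the first action on (i, t) in a
# session seeds rho(s, i, t) from the committed index (init); later
# tag/untag actions are plain in-memory set updates.

def action(kind, rs_docset, i, t, s, rho, index_docset):
    if (s, i, t) not in rho:
        # init(rho, s, i, t): seed the entry from the committed index
        rho[(s, i, t)] = set(index_docset(i, t))
    if kind == "tag":
        rho[(s, i, t)] |= rs_docset      # updateTag: add the docset of rs
    elif kind == "untag":
        rho[(s, i, t)] -= rs_docset      # updateUntag: remove the docset of rs
    else:
        raise ValueError(f"unknown action {kind!r}")
    return rho
```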
To get a better understanding, refer to the examples in table 6, where we assume two tag interpretations, A with terms {a1, a2} and B with terms {b1, b2}, where ρ(s, A, a1) and ρ(s, B, b1) are defined in ρ; note that keyword searches, e.g., "napoleon," are not run over tag values. The first two queries can be executed, while the last one is invalid: there is no way to factor out the tag predicate (A = a1) so that it can be separated and joined with the rest of the query using an And operator.

Clearly, the ability of the Query Rewriter module to rewrite the query independently of its complexity may be crucial to increase the usability of the TagTick Virtualizer. In its current implementation, the TagTick Virtualizer assumes that q is provided to virtQuery already satisfying the expected query structure (1). As we shall see in the next section, this assumption is very reasonable in the realization of our annotation tagging tool TagTick and, more generally, in the definition of tools for annotation tagging. Indeed, such tools typically allow data curators to run Google-like free-keyword queries to be refined by a set of tags selected from a list. Such queries fall within our assumption and also match the average requirements of this application domain.

q = "napoleon" And (A = a1 Or B = b1)
where: q' = "napoleon"
       qtags-in-session = (A = a1 Or B = b1)

q = (A = a2 Or "napoleon") And (A = a1 Or B = b1)
where: q' = (A = a2 Or "napoleon")
       qtags-in-session = (A = a1 Or B = b1)

q = (A = a1 Or B = b2) And "napoleon"

Table 6. Query rewriting.

Semantic issue: The command searchOnDocset(q', ds) does not match the expected semantics of virtQuery(q, s). The reason is that searchOnDocset is executed over the original information space I, and objects in the returned result set may not reflect the new tagging imposed by actions in s.
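A rewriter covering the query shape assumed here (a top-level And between an index part and a group of session tag predicates) might be sketched as follows; queries are modelled as nested tuples purely for illustration, and the function names are not from the actual implementation.

```python
# Sketch of the Query Rewriter: factor q into (q', q_tags-in-session)
# when q is a top-level And whose one side contains only tag predicates
# touched in the session and whose other side contains none.
# Query encoding: ("And", q1, q2), ("Or", q1, q2), ("tag", i, t), or a
# plain string for a keyword search.

def only_session_tags(q, touched):
    if isinstance(q, str):
        return False
    if q[0] == "tag":
        return (q[1], q[2]) in touched
    return only_session_tags(q[1], touched) and only_session_tags(q[2], touched)

def contains_session_tag(q, touched):
    if isinstance(q, str):
        return False
    if q[0] == "tag":
        return (q[1], q[2]) in touched
    return contains_session_tag(q[1], touched) or contains_session_tag(q[2], touched)

def rewrite(q, touched):
    """Return (q_prime, q_tags), or raise if q cannot be factored."""
    if not isinstance(q, str) and q[0] == "And":
        for q_prime, q_tags in ((q[1], q[2]), (q[2], q[1])):
            if only_session_tags(q_tags, touched) and not contains_session_tag(q_prime, touched):
                return q_prime, q_tags
    raise ValueError("query does not match the expected shape q' And q_tags")
```

On the first query of table 6 this yields q' = "napoleon" and qtags-in-session = (A = a1 Or B = b1); on the third it fails, mirroring the invalid case.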
For example, consider an untagging action for the tag (i, t) and the result set rs in s. Although the objects in rs would never be returned for a query virtQuery((i = t), s), they could be returned for queries regarding other properties, and in this case they would still display the tag (i, t). To solve this problem, the function patchResultset : RS → RS in table 7 intercepts the result set returned by searchOnDocset and "patches" its objects by properly removing or adding tags according to the actions in s. To this aim, the function exploits the low-level function intersectSize, which efficiently computes and returns the size of the intersection between two docsets. For each object d in a given result set rs, the function verifies whether d belongs to the docsets ρ(s, i, t) relative to the tags touched by the session s: if this is the case (intersectSize returns 1), the object should be enriched with the tag (add(d, (i, t))); otherwise the tag should be removed from the object (remove(d, (i, t))).

patchResultset(rs, ρ, s) = {patchDocument(d, (i, t)) | d ∈ rs ∧ ρ(s, i, t) ≠ ⊥}

patchDocument(d, (i, t)) = add(d, (i, t))        if intersectSize({d}, ρ(s, i, t)) = 1
patchDocument(d, (i, t)) = remove(d, (i, t))     if intersectSize({d}, ρ(s, i, t)) = 0

Table 7. Patching result sets.

The TagTick Virtualizer also implements patching of results for browse queries. A Solr browse query is a CQL query q followed by the list of object properties l for which a group-by operation (in the sense of relational databases) is requested. The query returns two responses: the query result set rs and the group-by statistics (l, v, k(l, v)) calculated over the result set for the given properties, where k(l, v) is the number of objects featuring the value v for the property l in rs. As in the case of standard queries, the semantic issue affects browse queries when a group-by is applied over a tag interpretation i touched in the current work session.
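The patching step of table 7 can be sketched over a toy document model (dicts with an id and a set of (interpretation, term) tags); both this model and the function name are illustrative, not the module's actual representation.

```python
# Sketch of patchResultset (table 7): after searchOnDocset answers a
# query over the committed index I, each returned document's tags are
# fixed up against the session docsets in rho, so the result reflects
# I(s) rather than I.

def patch_resultset(rs, rho, s):
    touched = {(i, t): ds for (sid, i, t), ds in rho.items() if sid == s}
    patched = []
    for d in rs:
        doc = {"id": d["id"], "tags": set(d["tags"])}
        for (i, t), ds in touched.items():
            if doc["id"] in ds:              # intersectSize({d}, rho(s,i,t)) = 1
                doc["tags"].add((i, t))      # add(d, (i, t))
            else:                            # intersectSize({d}, rho(s,i,t)) = 0
                doc["tags"].discard((i, t))  # remove(d, (i, t))
        patched.append(doc)
    return patched
```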
Indeed, the relative statistics would be calculated over the information space I rather than the intended I(s). To solve this issue, when a browse query requests statistics over a tag interpretation i, the relative triples (i, t, k(i, t)) are patched as follows:

1. If (i, t, k(i, t)) is such that ρ(s, i, t) = ⊥, i.e., the tag was not affected by the session, then k(i, t) is left unchanged;
2. If (i, t, k(i, t)) is such that ρ(s, i, t) = ds, then k(i, t) = intersectSize(ds, getDocset(rs)). The operation returns the number of objects currently tagged with (i, t) that are also present in the result set rs.

Query execution: The implementation of virtQuery can therefore be defined as

⟦virtQuery(q, s)⟧TV = patchResultset(searchOnDocset(q′, ds), ρ, s)

where q is rewritten in terms of q′ and q_tags in session by the Query Rewriter module, and ds is the docset obtained by applying the function ⟦.⟧V : Q × S × ℘(S × I × T) → DS defined in table 8 to q_tags in session. The function, given a query of tag predicates, a session identifier, and the status map ρ, returns the docset of objects satisfying the query in the session’s scope.

⟦q1 Or q2⟧V(s, ρ) = unifyDocsets(⟦q1⟧V(s, ρ), ⟦q2⟧V(s, ρ))
⟦q1 And q2⟧V(s, ρ) = intersectDocsets(⟦q1⟧V(s, ρ), ⟦q2⟧V(s, ρ))
⟦(i = t)⟧V(s, ρ) = ρ(s, i, t)

Table 8. Evaluation of q_tags in session in session s.

The definition of ρ, the Query Rewriter module, the semantics of the commands action and virtQuery, the definition of searchOnDocset, and the function ⟦.⟧V guarantee the validity of the following claim, crucial for the correctness of the TagTick Virtualizer:

Claim (Search correctness). Given an information space I, a map ρ, and a session s, for any query q such that

1. q = q′ And q_tags in session
2. ds = ⟦q_tags in session⟧V(s, ρ)

we can claim that ⟦virtQuery(q, s)⟧TV = ⟦query(q, null)⟧Solr evaluated over I(s); hence the implementation of the command virtQuery matches its expected semantics.
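Putting tables 7 and 8 together, the virtual query path can be sketched in a few lines of Python. This is an illustrative model, not TagTick’s implementation: docsets are plain Python sets of object identifiers, ρ is a dictionary keyed by (session, interpretation, term), and result-set objects are dictionaries; real Solr docsets are compact bitset structures.

```python
def eval_tags(q, session, rho):
    """Evaluate a query of tag predicates to a docset (the function of table 8)."""
    op = q[0]
    if op == "Or":                        # unifyDocsets = set union
        return eval_tags(q[1], session, rho) | eval_tags(q[2], session, rho)
    if op == "And":                       # intersectDocsets = set intersection
        return eval_tags(q[1], session, rho) & eval_tags(q[2], session, rho)
    _, i, t = q                           # leaf ("tag", i, t) -> rho(s, i, t)
    return rho[(session, i, t)]

def intersect_size(ds1, ds2):
    """Size of the intersection of two docsets (low-level intersectSize)."""
    return len(ds1 & ds2)

def patch_resultset(rs, rho, session):
    """Patch a result set with the session's virtual tags (table 7)."""
    for doc in rs:                        # doc: {"id": ..., "tags": set of (i, t)}
        for (s, i, t), ds in rho.items():
            if s != session:
                continue                  # tag not touched by this session
            if intersect_size({doc["id"]}, ds) == 1:
                doc["tags"].add((i, t))      # virtually tagged in the session
            else:
                doc["tags"].discard((i, t))  # virtually untagged
    return rs
```

For instance, with ρ(s1, subject, biology) = {1, 2}, patching a result set containing objects 1 and 3 adds the tag to object 1 and strips it from object 3 if it carried the tag in I.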
Making a Virtual Information Space Persistent

The commitSession(s) command is responsible for updating the initial information space I according to the changes applied in s, i.e., adding and removing tags to and from objects in I according to the actions in s. To this aim, the module relies on the map ρ, which associates each tag (i, t) with the set of objects virtually tagged by (i, t) in s, and on the low-level function andNotDocsets. By properly matching the set of objects tagged by (i, t) in I and in I(s), the function derives the sets of objects to tag and untag in I. Overall, the execution of commitSession(s) consists in:

1. Identifying the set of tags affected by tagging or untagging actions in the session s: changedTags(s) = {(i, t) | ρ(s, i, t) ≠ ⊥}
2. For each (i, t) ∈ changedTags(s):
   a) fetching the result set relative to all objects in I with tag (i = t): rs = query((i = t), null);
   b) keeping in memory the relative docset ds = getDocset(rs);
   c) calculating in memory the set of objects in I to be untagged by (i = t): toBeUntagged = andNotDocsets(ds, ρ(s, i, t));
   d) calculating in memory the set of objects in I to be tagged with (i = t): toBeTagged = andNotDocsets(ρ(s, i, t), ds);
   e) updating the index to tag and untag all objects in the two sets; and
   f) removing the session s.

The TagTick Virtualizer module is also responsible for managing conflicts on commits and avoiding index inconsistencies. To this aim, only the first commit action is executed; once the relative actions are materialized into the index, all other sessions are invalidated, i.e., deleted.

TagTick User Interface: Annotation Tagging for Solr

The TagTick User Interface module implements the functionalities presented previously over a Solr index equipped with the TagTick Virtualizer module described in the section on Solr and annotation tagging (see figure 2).
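The per-tag diffing performed by commitSession can be sketched as follows, again with docsets as plain sets. Names are ours: query_index stands in for the query((i = t), null)/getDocset pair, and both toBeUntagged and toBeTagged are computed as set differences, so that objects already carrying the tag in I are not re-tagged.

```python
def and_not(ds1, ds2):
    """andNotDocsets: objects in ds1 that are not in ds2."""
    return ds1 - ds2

def commit_session(session, rho, query_index):
    """Compute, per changed tag, the docsets to tag and untag in the index.

    query_index(i, t) returns the docset of objects tagged (i, t) in I.
    rho maps (session, i, t) to the docset virtually tagged (i, t) in I(s).
    """
    plan = {}
    changed = [(i, t) for (s, i, t) in rho if s == session]  # changedTags(s)
    for (i, t) in changed:
        ds = query_index(i, t)             # tagged in I
        virtual = rho[(session, i, t)]     # tagged in I(s)
        plan[(i, t)] = {
            "untag": and_not(ds, virtual),  # tagged in I but not in I(s)
            "tag": and_not(virtual, ds),    # tagged in I(s) but not in I
        }
    return plan  # then applied as bulk index updates, and the session removed
```

With ρ(s1, subject, biology) = {2, 3} and objects {1, 2} tagged in the index, the plan untags object 1 and tags object 3, leaving object 2 untouched.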
The user interface offers authenticated data curators an annotation tagging environment where they can open work sessions, do and undo sequences of (un)tagging actions, and eventually commit the session into the current Solr information space. When data curators log out from the tool, the module stores on disk their pending work sessions and the relative (un)tagging actions. Such sessions are restored at the next access to the interface, allowing data curators to continue their work.

Figure 2. TagTick: User Interface.

The TagTick User Interface is a general-purpose module that can be configured to adapt to the classes and structure of the objects residing in the index. To this aim, the module acquires this information from XML configuration files where data curators can specify:

1. The names of the different classes, the values used to encode such classes in the index, and the index field used to contain such values;
2. The list of tag interpretations together with the relative ontologies: in the current implementation, ontologies are flat sets of terms, which can optionally be populated by curators during the tagging step; and
3. The intended use of interpretations: the association between classes and interpretations.

Once instantiated, the TagTick User Interface allows users to search for objects of all classes by means of free keywords and to refine such searches by class and by the tags relative to such class. This combination of predicates, which matches the query structure q = q′ And q_tags in session expected by the TagTick Virtualizer, is then executed by the module and the results presented in the interface. Users can then add or remove tags to the objects; the interface makes sure that the right interpretations are used for the given class.
As an example, we shall consider the real-case instantiation of TagTick in the context of the HOPE project, whose aim is to deliver a data infrastructure capable of aggregating metadata records describing multimedia objects relative to labour history and located across several data sources.24 Such objects are collected, cleaned, and enriched to form an information space stored in a Solr index. The index stores two main classes of objects: descriptive units and digital resources. Descriptive unit objects contain properties describing cultural heritage objects (e.g., a pin). Digital resource objects instead describe the digital material representing the cultural heritage objects (e.g., the pictures of a pin). TagTick is currently used in the HOPE project to classify the aggregated objects according to two tag interpretations: “historical themes,” to tag descriptive units with an ontology of terms describing historical periods, and “export mode,” to tag digital resources with an ontology describing the different social sites (e.g., YouTube, Facebook, Flickr) from which the resource must be made available. In particular, figure 3 illustrates the HOPE TagTick user interface. In the screenshot, a new tag “Communism . . .” of the tag interpretation “historical themes” is being added to a set of descriptive units obtained by a query. The TagTick User Interface offers the possibility to access the history of actions, in order to visualize their sequence and possibly undo their effects. Figure 4 shows the history of actions that led to the actual tag virtualization in the current work session. Curators can only roll back the last action they accomplished. This is because virtual tagging actions may depend on each other; e.g., an action may be based on a query that includes tag predicates whose tag has been affected by previous actions.
Other approaches may infer the interdependencies between the queries behind the tagging actions and expose dependency-based undo options.

Figure 3. TagTick User Interface: Bulk Tagging Action.

Figure 4. TagTick User Interface: Managing History of Actions.

STRESS TESTS

The motivations behind the realization of TagTick are to be found in the annotation tagging requirements of bulk and real-time tagging. In general, the indexing speed of Solr highly depends on the underlying hardware, the number of threads used for feeding, the average size of the objects and their property values, and the kind of text analysis adopted.25 However, even assuming the most convenient scenario, bulk indexing in Solr is comparably slow with respect to other technologies, such as relational databases,26 and far from real-time. In this section, we present the results of stress tests conceived to provide concrete measures of query performance, i.e., the real-time effect, the scalability of the tool, and how many tagging actions can be handled in the same session. The idea of the tests is to re-create worst-case scenarios and give evidence of the ability of TagTick to cope and scale in terms of response time and memory consumption. The experiments were run on a machine with an Intel(R) Xeon(R) CPU E5630 @ 2.53GHz (4 cores), 4 GB of memory, and 100 GB of available disk (used at around 52 percent). The machine runs an Ubuntu 10.04.2 LTS operating system, with a Java Virtual Machine configured as -Xmx1800m -XX:MaxPermSize=512m. In simpler terms, a medium-to-low-end server for a production index.
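As a rough, back-of-the-envelope estimate (ours, not the paper’s) of why such a heap limit can be sufficient: if each in-memory docset is held as an uncompressed bitset with one bit per indexed object, a session of K bulk-tagging actions over an index of N objects needs on the order of K · N/8 bytes.

```python
def docset_bitset_bytes(num_objects):
    """Bytes for one uncompressed bitset docset: one bit per indexed object."""
    return (num_objects + 7) // 8

def session_heap_bytes(num_objects, num_tags):
    """Approximate heap for a session holding num_tags docsets in memory."""
    return num_tags * docset_bitset_bytes(num_objects)
```

For an index of 10 million objects and 200 tagging actions (the largest configuration tested below), this gives 200 × 1,250,000 = 250,000,000 bytes, roughly 240 MB, comfortably within the configured -Xmx1800m heap. Actual consumption depends on the docset representation used (the Virtualizer keeps inverted lists of objects in memory), so this is only an order-of-magnitude sanity check.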
The index was fed with 10 million randomly generated objects with the following structure:

[identifier: String, title: String, description: String, publisher: String, URL: String, creator: String, date: Date, country: String, subject: Terms]

The tag interpretation subject can be assigned values from an ontology Terms of scientific subjects, such as “Agricultural biotechnology,” “Automation,” “Biofuels,” “Biotechnology,” “Business aspects.” The objects are initially generated without tags. Each test defines a new session s with K tagging actions of the form

action(tag, virtQuery(identifier <> ID, null), t, s)

where ID is a random identifier and t is a random tag (subject, term). In practice, the action adds the tag t to all objects in the index, thereby generating docsets of size 10 million. Once the K actions are executed, the test returns the following measures:

1. The size of the heap space required to store K tags in memory.
2. The minimal, average, and maximum time required to reply to two kinds of stress queries to the index (calculated out of 100 queries):
   a. The query identifier <> ID And ⋀(i,t)∈s (i = t): the query returns the objects in the index that feature all tags touched by the session.
   b. The query identifier <> ID Or ⋁(i,t)∈s (i = t): the query returns the objects in the index that feature at least one of the tags assigned in the session.
In both cases, since tagging actions were applied to all objects in the index, the result will contain the full index. However, in one case the response will be calculated by intersecting docsets, and in the other by unifying them. Note that by selecting a random identifier value (ID), the test makes sure that low-level Solr caching optimizations are not triggered, as this would compromise the validity of the test.
3.
The minimal, average, and maximum time required to reply to browse queries that involve all tags used in the session (calculated out of 100 queries).
4. The time required to reconstruct the session in memory whenever the data curator logs into TagTick.

The results presented in figure 5 show that the average time for the execution of search and browse queries always remains under 2 seconds, which we can consider under the “real-time” threshold from the point of view of the users. User tests have been conducted in the context of the HOPE project, where curators were positively impressed by the tool. HOPE curators can today apply sequences of tagging operations over millions of aggregated records by means of a few clicks. Moreover, independently of the number of tagging operations, queries over the tagged records take about 2 seconds to complete. The execution time increases sharply from 0 tags to 1 tag. This behavior is expected because, when there is 1 tag in the session, the 10 million records must be “patched.” From 1 tag onwards, the execution time increases as well, but not at the same rate: in the average case, patching 10 million records with 100 tags does not cost much more than patching them with 1 tag.

Figure 5. Stress Test for TagTick Search and Browse Functionality.

The results in figure 6 show that the amount of memory used does not exceed the limits expected on reasonable servers running a production system. The time required to reconstruct the sessions is comparatively long, ranging from 20 seconds for 50 tags up to 1.5 minutes for 200 tags. On the other hand, this is a one-time operation, required only when logging in to the tool.

Figure 6. Stress Test for Heap Size Growth and Session Restore Time.

CONCLUSIONS

In this paper, we presented TagTick, a tool devised to enable annotation tagging functionalities over Solr instances.
The tool allows a data curator to safely apply and test bulk tagging and untagging actions over the index in almost real time and without compromising the activities of end users searching the index at the same time. This is possible thanks to the TagTick Virtualizer module, which implements a layer over Solr that enables real-time, virtual tagging by keeping in memory the inverted list of objects associated with an (un)tagging action. The layer is capable of parsing user queries to intercept the usage of tags kept in memory and, in this case, of manipulating the query response to deliver the set of objects expected after tagging. Future developments may concern more complex query parsing, to handle the rewriting of a larger set of queries beyond the Google-like queries currently handled by the tool. Another interesting challenge is tag propagation. Curators may be interested in having the action of (un)tagging an object propagated to objects that are somehow related to it. Handling this problem requires the inclusion in the information space model of relationships between classes of objects and the extension of the TagTick Virtualizer module with the specification and management of propagation policies.

ACKNOWLEDGEMENTS

The work presented in this paper has been partially funded by the European Commission FP7 eContentplus-2009 Best Practice Networks project HOPE (Heritage of the People’s Europe, http://www.peoplesheritage.eu), grant agreement 250549.

REFERENCES

1. Arkaitz Zubiaga, Christian Körner, and Markus Strohmaier, “Tags vs Shelves: From Social Tagging to Social Classification,” in Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia, 93–102 (New York: ACM, 2011), http://dx.doi.org/10.1145/1995966.1995981.
2.
Meng Wang et al., “Assistive Tagging: A Survey of Multimedia Tagging with Human-Computer Joint Exploration,” ACM Computing Surveys 44, no. 4 (September 2012): 25:1–24, http://dx.doi.org/10.1145/2333112.2333120.
3. Lin Chen et al., “Tag-Based Web Photo Retrieval Improved by Batch Mode Re-tagging,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2010), 3440–46, http://dx.doi.org/10.1109/CVPR.2010.5539988.
4. Emanuele Quintarelli, Andrea Resmini, and Luca Rosati, “Information Architecture: Facetag: Integrating Bottom-Up and Top-Down Classification in a Social Tagging System,” Bulletin of the American Society for Information Science & Technology 33, no. 5 (2007): 10–15, http://dx.doi.org/10.1002/bult.2007.1720330506.
5. Stijn Christiaens, “Metadata Mechanisms: From Ontology to Folksonomy . . . and Back,” in Lecture Notes in Computer Science: On the Move to Meaningful Internet Systems 2006: OTM 2006 Workshops (Berlin Heidelberg: Springer-Verlag, 2006).
6. M. Mahoui et al., “Collaborative Tagging of Art Digital Libraries: Who Should Be Tagging?” in Theory and Practice of Digital Libraries, ed. Panayiotis Zaphiris et al., 162–72, vol. 7489, Lecture Notes in Computer Science (Springer Berlin Heidelberg, 2012), http://dx.doi.org/10.1007/978-3-642-33290-6_18.
7. Alexandre Passant and Philippe Laublet, “Meaning Of A Tag: A Collaborative Approach to Bridge the Gap Between Tagging and Linked Data,” in Proceedings of the Linked Data on the Web (LDOW2008) Workshop at WWW2008, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.6915.
8. Michael Khoo et al., “Towards Digital Repository Interoperability: The Document Indexing and Semantic Tagging Interface for Libraries (DISTIL),” in Theory and Practice of Digital Libraries, ed. Panayiotis Zaphiris et al., 439–44, vol. 7489, Lecture Notes in Computer Science (Springer Berlin Heidelberg, 2012), http://dx.doi.org/10.1007/978-3-642-33290-6_49.
9.
Leonardo Candela et al., “Setting the Foundations of Digital Libraries: The DELOS Manifesto,” D-Lib Magazine 13, no. 3/4 (March/April 2007), http://dx.doi.org/10.1045/march2007-castelli.
10. Jennifer Trant, “Studying Social Tagging and Folksonomy: A Review and Framework,” Journal of Digital Information (January 2009), http://hdl.handle.net/10150/105375.
11. Cameron Marlow et al., “HT06, Tagging Paper, Taxonomy, Flickr, Academic Article, To Read,” in Proceedings of the Seventeenth Conference on Hypertext and Hypermedia, 31–40 (New York: ACM, 2006), http://dx.doi.org/10.1145/1149941.1149949.
12. Andrea Civan et al., “Better to Organize Personal Information by Folders or by Tags? The Devil Is in the Details,” Proceedings of the American Society for Information Science and Technology 45, no. 1 (2008): 1–13, http://dx.doi.org/10.1002/meet.2008.1450450214.
13. Marianne Lykke et al., “Tagging Behaviour with Support from Controlled Vocabulary,” in Facets of Knowledge Organization, ed. Alan Gilchrist and Judi Vernau, 41–50 (Bingley, UK: Emerald Group, 2012).
14. Guus Schreiber et al., “Semantic Annotation and Search of Cultural-Heritage Collections: The MultimediaN E-Culture Demonstrator,” Web Semantics: Science, Services and Agents on the World Wide Web 6, no. 4 (2008): 243–49, http://dx.doi.org/10.1016/j.websem.2008.08.001.
15. Diana Maynard and Mark A.
Greenwood, “Large Scale Semantic Annotation, Indexing and Search at the National Archives,” in Proceedings of LREC, vol. 12 (2012).
16. Martin Feijen, “DRIVER: Building the Network for Accessing Digital Repositories Across Europe,” Ariadne 53 (October 2007), http://www.ariadne.ac.uk/issue53/feijen-et-al/.
17. Heritage of the People’s Europe (HOPE), http://www.peoplesheritage.eu/.
18. European Film Gateway Project, http://www.europeanfilmgateway.eu.
19. Paolo Manghi et al., “OpenAIREplus: The European Scholarly Communication Data Infrastructure,” D-Lib Magazine 18, no. 9–10 (September 2012), http://dx.doi.org/10.1045/september2012-manghi.
20. Panagiotis Antonopoulos et al., “Efficient Updates for Web-Scale Indexes over the Cloud,” in 2012 IEEE 28th International Conference on Data Engineering Workshops (ICDEW), 135–42, April 2012, http://dx.doi.org/10.1109/ICDEW.2012.51.
21. Chun Chen et al., “TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, 649–60 (New York: ACM, 2011), http://dx.doi.org/10.1145/1989323.1989391.
22. Rafal Kuc, Apache Solr 4 Cookbook (Birmingham, UK: Packt, 2013).
23. David Smiley and Eric Pugh, Apache Solr 3 Enterprise Search Server (Birmingham, UK: Packt, 2011).
24. The HOPE Portal: The Social History Portal, http://www.socialhistoryportal.org/timeline-map-collections.
25.
This assumes a stand-alone instance of Solr, hence one not relying on Solr sharding techniques with parallel feeding.
26. WhyUseSolr—Solr Wiki, http://wiki.apache.org/solr/WhyUseSolr.