Communications

Wikidata: From “an” Identifier to “the” Identifier

Theo van Veen

INFORMATION TECHNOLOGY AND LIBRARIES | JUNE 2019 72

Theo van Veen (theovanveen@gmail.com) is Researcher (retired), Koninklijke Bibliotheek.

ABSTRACT

Library catalogues may be connected to the linked data cloud through various types of thesauri. For name authority thesauri in particular I would like to suggest a fundamental break with the current distributed linked data paradigm: to make a transition from a multitude of different identifiers to using a single, universal identifier for all relevant named entities, in the form of the Wikidata identifier. Wikidata (https://wikidata.org) seems to be evolving into a major authority hub that is lowering barriers to access the web of data for everyone. Using the Wikidata identifier of notable entities as a common identifier for connecting resources has significant benefits compared to traversing the ever-growing linked data cloud. When the use of Wikidata reaches a critical mass, for some institutions, Wikidata could even serve as an authority control mechanism.

INTRODUCTION

Library catalogs, at national as well as institutional levels, make use of thesauri for authority control of named entities, such as persons, locations, and events. Authority records in thesauri contain information to distinguish between entities with the same name, combine pseudonyms and name variants for a single entity, and offer additional contextual information. Links to a thesaurus from within a catalog often take the form of an authority control number, and serve as identifiers for an entity within the scope of the catalog. Authority records in a catalog can be part of the linked data cloud when including links to thesauri such as VIAF (https://viaf.org/), ISNI (http://www.isni.org/), or ORCID (https://orcid.org/). However, using different identifier systems can lead to having many identifiers for a single entity.
A single identifier system, not restricted to the library world and bibliographic metadata, could facilitate globally unique identifiers for each authority and therefore improve discovery of resources within a catalog. The need for reconciliation of identifiers has been pointed out before.1 What is now being suggested is to use the Wikidata identifier as “the” identifier. Wikidata is not domain specific, has a large user community, and offers appropriate APIs for linking to its data. It provides access to a wealth of entity properties, it links to more than 2,000 other knowledge bases, it is used by Google, and the number of organisations that link to Wikidata is growing rapidly.2 The idea of using Wikidata as an authority linking hub was recently proposed by Joachim Neubert.3 But why not go one step further and bring the Wikidata identifier to the surface directly as “the” resource identifier, or official authority record? This has been argued before, and the implications of this argument will be considered in more detail in the remainder of this article.4

Figure 1. From linking everything to everything to linking directly to Wikidata.

Figure 1 illustrates the differences between a few possible situations that should be distinguished. On the left, the “everything links to everything” situation shows Wikidata as one of the many hubs in the linked data cloud. In the middle, the “Wikidata as authority hub” situation is shown, where name authorities are linked to Wikidata. On the right is the arrangement proposed in this article, where library systems and other systems for which this may apply share Wikidata as a common identifier mechanism. Of course, there is a need for systems that feed Wikidata with trusted information and provide Wikidata with a backlink to a rich resource description for entities.
In practice, however, many backlinks do not provide rich additional information and in such cases a direct link to Wikidata would be sufficient for the identification of entities. Figure 2 shows these two situations and other possible variations by means of dashed lines, i.e. systems that feed Wikidata, but use the Wikidata identifier as resource identifier for the outside world vs. systems that link directly to Wikidata, but keep a local thesaurus for administrative purposes. It is certainly not the intention to encourage institutions to give up their own resource descriptions or resource identifiers locally, especially not when they are an original or rich source of information about an entity. A distinction can be made between the URL of the description of an entity and the URL of the entity itself. When following the URL of a real-world entity in a browser, it is good practice to redirect to the corresponding description of the entity. This is known as the “HTTPRange-14” issue.5 This article will not go into any detail about this distinction other than to note that it makes sense to have a single global identifier for an entity while accepting different descriptions of that entity linked from various sources. WIKIDATA | VAN VEEN 74 https://doi.org/10.6017/ital.v38i2.10886 Figure 2. Feeding properties connecting collections to Wikidata (left) and direct linking to Wikidata using resource identifier (right). The dashed lines show additional connecting possibilities. 
THE MOTIVATING USE CASE

The idea of using the Wikidata identifier as a universal identifier was born at the research department of the National Library of the Netherlands (KB) while working on a project aimed at automatically enriching newspaper articles with links to knowledge bases for named entities occurring in the text.6 These links include the Wikidata identifier and, where available, the Dutch and English DBpedia (http://dbpedia.org) identifiers, the VIAF number, the Geonames number (http://geonames.org), the KB thesaurus record number, and the identifier used by the Parliamentary Documentation Centre (https://www.parlementairdocumentatiecentrum.nl/). The identifying parts of these links are indexed along with the article text in order to enable semantic search, including search based on Wikidata properties. For demonstration purposes the enriched “newspapers+” collection was made available through the KB Research Portal, which gives access to most of the regular KB collections (figure 3).7

In the newspaper project, linked named entities in search results are clickable to obtain more information. As most users are not expected to know SPARQL, the query language for the semantic web, the system offers a user-friendly method for semantic search: a query string entered between square brackets, for example “[roman emperor]”, is expanded by a “best guess” SPARQL query in Wikidata, in this case resulting in entities having the property “position held=roman emperor”. These in turn are used to do a search for articles containing one or more mentions of a Roman emperor, even if the text “roman emperor” is not present in the article. In another example, when a user searches for the term “[beatles]” the “best guess” search yields articles mentioning entities with the property “member of=The Beatles”.
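The expansion of a bracketed query into SPARQL can be illustrated with a minimal sketch. It is not the KB Research Portal implementation, whose "best guess" logic is not described here; the function names, the exact-label matching, and the result limit are all assumptions made for illustration. The sketch only builds the query text and sends nothing to Wikidata.

```python
from typing import Optional, Tuple


def parse_bracket_query(text: str) -> Optional[Tuple[Optional[str], str]]:
    """Split "[value]" or "[property=value]" into (property, value).

    Returns None when the text is not a bracketed query.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        return None
    inner = text[1:-1].strip()
    if not inner:
        return None
    prop, sep, value = inner.partition("=")
    if not sep:
        return None, inner  # plain form, e.g. "[roman emperor]"
    prop, value = prop.strip(), value.strip()
    if not prop or not value:
        return None
    return prop, value


def _literal(label: str, lang: str) -> str:
    """Render a label as a language-tagged SPARQL string literal."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def build_sparql(prop_label: Optional[str], value_label: str,
                 lang: str = "en", limit: int = 100) -> str:
    """Build a hypothetical "best guess" query (assumption: exact label match).

    Finds items that point to an entity labelled `value_label`, optionally
    only through the property labelled `prop_label`.
    """
    lines = [
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX wikibase: <http://wikiba.se/ontology#>",
        "SELECT DISTINCT ?item WHERE {",
        f"  ?value rdfs:label {_literal(value_label, lang)} .",
    ]
    if prop_label is not None:
        # Map the property entity to the predicate used in direct claims.
        lines.append(f"  ?prop rdfs:label {_literal(prop_label, lang)} ;")
        lines.append("        wikibase:directClaim ?p .")
    lines.append("  ?item ?p ?value .")
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


if __name__ == "__main__":
    parsed = parse_bracket_query("[architect=Willem Dudok]")
    if parsed is not None:
        print(build_sparql(*parsed))
```

Exact label matching is case-sensitive and can be slow on a large endpoint, so a production system would presumably resolve the labels to identifiers first, for instance through an entity search, before running the property query.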
For ambiguous items, as in the case of “Guernica,” which can be the place in Spain or Picasso’s painting, the one with the highest number of occurrences in the newspapers is selected by default, but the user may select another one. For the default or selected item, the user can select a specific property from a list of Wikidata properties available for that specific item. The possibilities of this semantic search functionality may inspire others to use the Wikidata identifier for globally known entities in other systems as well.

Figure 3. Screenshot of the KB Research Portal with a newspaper article as result of searching “[architect=Willem Dudok]”. The results are articles about buildings of which Willem Dudok is the architect. The name of the building meeting the query [architect=Willem Dudok] is highlighted.

USAGE SCENARIOS

Two usage scenarios can be considered in more detail: (1) manually following links between Wikidata descriptions and other resource descriptions, and (2) having the system perform a federated SPARQL query to automatically bring up linked entities.

In the first scenario, in which resource identifiers link to Wikidata, the user can follow the link to all resource descriptions having a backlink in Wikidata. But why would a user follow such a link? Reasons may include wanting more or context-specific information about the entity, or a desire to search in another system for objects mentioning a specific entity. In the latter case, the information behind the backlink should provide a URL to search for the entity, or the backlink should be the search URL itself. Wikidata provides the possibility to specify various URI templates. These can be used to specify a link for searching objects mentioning the entity, rather than just showing a thesaurus entry. When the backlink does not provide extra information or a way to search the entity, the backlink is almost useless.
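How such a URI template might be turned into a search link can be sketched as follows. The sketch assumes the "$1" placeholder convention used by Wikidata's formatter URL property; the catalog address is hypothetical and stands in for whatever search URL an institution would register.

```python
from urllib.parse import quote


def expand_uri_template(template: str, identifier: str) -> str:
    """Fill the "$1" placeholder of a URI template with an identifier.

    The identifier is percent-encoded so that it is safe inside a URL.
    """
    if "$1" not in template:
        raise ValueError("template has no $1 placeholder")
    return template.replace("$1", quote(identifier, safe=""))


if __name__ == "__main__":
    # Hypothetical template: a catalog search for objects mentioning an entity.
    search_template = "https://catalog.example.org/search?entity=$1"
    print(expand_uri_template(search_template, "Q937"))
```

Registering a search template of this kind, rather than a template that only resolves to a thesaurus entry, is what would let a backlink lead to the objects that mention the entity.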
When systems provide resource links to Wikidata, they give users access to a wealth of information about an entity in the web of data and, potentially, to objects mentioning a specific entity. Some systems only provide backlinks from Wikidata to their resource descriptions but not the other way around. Users of such systems cannot easily benefit from these links.

The second scenario of a federated SPARQL query applies when searching objects in one system based on properties coming from other systems. Formulating such a SPARQL query is not easy because doing so requires a lot of knowledge about the linked data cloud. The alternative is to put the complete linked data cloud in a unified (triple store) database. The technology of linked data fragments might solve the performance and scaling issues but not the complexity.8 Using a central knowledge base like Wikidata could reduce complexity for the most common situation of searching objects in other systems using properties from Wikidata. This use case requires these systems to take the user’s query and automatically formulate a SPARQL search. There are many systems that are linked to Wikidata that do not support SPARQL at all or only support it in a way that is not intended for the average user. Those systems can still let users benefit from Wikidata by offering a simple add-on to search in Wikidata for entities that meet some criteria and use the identifiers for a conventional search in the local system, as shown for the case of the historical newspapers.

These two use cases illustrate how the use of a Wikidata identifier can lower the barrier to accessing information about an entity and to finding objects related to an entity by minimizing the number of hubs, minimizing the required knowledge, and minimizing the required technology.
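The simple add-on mentioned for systems without SPARQL support can be sketched in two steps: collect the Wikidata identifiers from a query result, then look them up in a conventional local index. The sketch below assumes the result arrives in the standard SPARQL JSON results format and models the local index as a plain dictionary from Wikidata identifier to document identifiers; the function names, sample identifiers, and document names are illustrative only.

```python
from typing import Dict, Iterable, List

ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def qids_from_sparql_json(results: dict, variable: str = "item") -> List[str]:
    """Extract distinct Wikidata identifiers bound to `variable`."""
    qids: List[str] = []
    for binding in results.get("results", {}).get("bindings", []):
        cell = binding.get(variable)
        if not cell or cell.get("type") != "uri":
            continue
        value = cell.get("value", "")
        if value.startswith(ENTITY_PREFIX):
            qid = value[len(ENTITY_PREFIX):]
            if qid not in qids:
                qids.append(qid)
    return qids


def search_local_index(index: Dict[str, List[str]],
                       qids: Iterable[str]) -> List[str]:
    """Return local documents that mention any of the given entities."""
    hits: List[str] = []
    for qid in qids:
        for doc_id in index.get(qid, []):
            if doc_id not in hits:
                hits.append(doc_id)
    return hits


if __name__ == "__main__":
    # Sample data only: a hand-written result set and a toy local index.
    sample_results = {
        "results": {"bindings": [
            {"item": {"type": "uri", "value": ENTITY_PREFIX + "Q937"}},
            {"item": {"type": "uri", "value": ENTITY_PREFIX + "Q42"}},
        ]}
    }
    local_index = {"Q937": ["article-17", "article-42"], "Q42": ["article-42"]}
    print(search_local_index(local_index, qids_from_sparql_json(sample_results)))
```

In this arrangement the local system only needs the Wikidata identifier as an indexed field; all knowledge of properties stays on the Wikidata side.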
This lowering of barriers is achieved by linking resources to Wikidata and, even more so, by making objects searchable by means of the Wikidata identifier.

ADVANTAGES OF USING THE WIKIDATA IDENTIFIER AS UNIVERSAL IDENTIFIER

Summarizing the above, a number of significant advantages of using the Wikidata identifier as universal identifier can be seen. These include:

• Using the Wikidata identifier as resource identifier makes Wikidata the first hub. Applications therefore initially have to deal with only one description model. From there, it is easy to navigate further: most information is only “one hub away,” so less prior knowledge is required to link from one source to another.
• Wikidata identifiers can be used for federated search based on properties in Wikidata, so there is less need to know how to access properties in other resource descriptions.
• Wikidata identifiers facilitate generating “just in case” links to systems having the Wikidata identifier indexed.
• Complicated SPARQL queries using Wikidata as primary source for properties can be shared and reused more easily compared to a situation with many diverse sources for properties.
• Wikidata offers many tools and APIs for accessing and processing data.
• Some libraries and similar institutions may even decide to use Wikidata directly for authority control when it reaches a critical mass, relieving them from maintaining a local thesaurus.

IMPLEMENTATION

Institutions can gradually adopt the use of Wikidata identifiers without needing to make radical changes in their local infrastructure. A simple first step is automatically generating links to Wikidata in the presentation of an object or of its description to provide contextual information and navigation options. As a next step, the Wikidata Q-number of an entity could be indexed along with the descriptions containing it, so these objects become findable via a Wikidata identifier search, e.g.
of the form:

https://whatever.local/wdsearch?id=Q937

The Wikidata identifier could then be used in conventional as well as federated searches for a resource, regardless of the exact spelling of a resource name. A search may be refined using Wikidata properties without further requirements with respect to local infrastructures. Institutions having a SPARQL endpoint can allow for a federated SPARQL query for combining local data with data from Wikidata. As SPARQL is not easy for the end user, this requires a user interface that formulates the SPARQL query and shields the user from having to know SPARQL.

Those institutions willing to start using the Wikidata identifier as resource identifier can unify references in their bibliographic records. Currently, for example, a reference to Albert Einstein, in a simplified, RDF-like (https://www.w3.org/RDF/) XML fragment in a bibliographic record, could look quite different for different institutions, e.g.:

Albert Einstein
Albert Einstein
Albert Einstein
Albert Einstein

If the Wikidata identifier is used as resource identifier, this could for all institutions become the same:

Albert Einstein

In this case it becomes easy to navigate the web, to create common bookmarklets, and to provide additional functionality using the Wikidata identifier.

CATALOGUING PROCESS AND CRITERIA FOR NEW WIKIDATA ENTRIES

For institutions that decide to link their entities directly to Wikidata, their catalog software would have to be configured to support Wikidata lookups. Catalogers would not have to know about linked data or RDF to create links to Wikidata; they would simply have to query Wikidata and select the appropriate entry to link. The cataloging software would then add the selected identifier to the record being edited. If a query in Wikidata does not yield any results, the item would first have to be created by the cataloger.
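The lookup step in such catalog software could be sketched as follows. The sketch assumes the entity search module of the Wikidata API (`wbsearchentities`) as the lookup service and, purely for illustration, a `dc:creator` element with an `rdf:resource` attribute as the form of the reference written into the record; the element name and the helper names are assumptions, not the markup of any particular institution. The sketch builds the request URL and parses a response, but leaves the actual HTTP call to the surrounding software.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
ENTITY_BASE = "http://www.wikidata.org/entity/"
DC_NS = "http://purl.org/dc/elements/1.1/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def build_lookup_url(name: str, language: str = "en", limit: int = 5) -> str:
    """Build the URL for an entity search on the given name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def candidates_from_response(payload: dict) -> List[Tuple[str, str, str]]:
    """Reduce a search response to (identifier, label, description) rows
    that a cataloger can choose from."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def creator_reference(qid: str, display_name: str) -> str:
    """Serialize a reference that uses the Wikidata entity URI as resource
    identifier (element name is an assumption for illustration)."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("rdf", RDF_NS)
    element = ET.Element(
        f"{{{DC_NS}}}creator",
        {f"{{{RDF_NS}}}resource": ENTITY_BASE + qid},
    )
    element.text = display_name
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    print(build_lookup_url("Albert Einstein"))
    # Hand-written sample response; a real one would come from the API.
    sample = {"search": [{"id": "Q937", "label": "Albert Einstein",
                          "description": "sample description"}]}
    qid, label, _ = candidates_from_response(sample)[0]
    print(creator_reference(qid, label))
```

An empty candidate list is the signal that the cataloger may need to create the item in Wikidata before linking to it.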
Creating a new item using the Wikidata user interface (figure 4) is straightforward: create an account, add a new item, and add statements (fields) and values.

Figure 4. Data entry screen for entering a new item in Wikidata.

Catalogers must be aware of some rules when creating items. Wikidata editors may delete items that fall under one of Wikidata’s exclusion criteria, such as vandalism, empty descriptions, broken links, etc. In addition, the item must refer to an instance of a clearly identifiable conceptual or material “notable” entity. Notable means that the item must be mentioned by at least one reliable, third-party published source. Here, common sense is required: being mentioned in a telephone book or a newspaper does not in itself establish notability. Entities that are not notable enough to be entered into Wikidata would then remain identified by a link to a local or other thesaurus.

POSSIBLE OBJECTIONS TO WIKIDATA AS AUTHORITY CONTROL MECHANISM

Although it is not, at least at present, the intention of this article to propose the use of Wikidata as the primary local authority control mechanism, some institutions may nonetheless consider the opportunity to do so. There are numerous objections to this idea to note, including:

1) Institutions may consider themselves authoritative sources of information, and may therefore want to keep control over “their” thesaurus. The idea that the greater community can make changes to “their” thesaurus may not be tenable to them. Quality control and error detection certainly are important issues, but experts from outside the library can sometimes provide more and better information about a resource than cataloguing professionals. For misuse and erroneous input, the community can be relied on and trusted to correct and add to Wikidata entries. Information that is critical for local usage, such as access control, may still be managed locally.
Despite possible objections to using Wikidata for universal authority control, national libraries and other institutions can work together with Wikidata to share responsibility for maintaining the resource, to optimize and harmonize the shared use of Wikidata, and to maintain validity and authority. This might imply more rigorous quality control.

2) Existing systems like VIAF and ISNI currently still contain more persons than Wikidata, so why use Wikidata? VIAF and ISNI are domain specific and are more restrictive with respect to updates of their content and the availability of tools and APIs. In Wikidata both VIAF and ISNI are just one hub away, and for internal use the VIAF and ISNI identifiers remain available. The question here is whether there will be a moment that Wikidata reaches a critical mass and supersedes VIAF and ISNI.

3) There may be disagreement about a certain entity, especially when it concerns political events or persons whose role is perceived differently by different political parties. Wikidata contains neutral properties. The properties that may contain subjective qualifications or might suffer bias are mostly behind the backlinks, like the abstract in Wikipedia. A fundamental difference between Wikipedia and Wikidata is that Wikipedia doesn’t have to be consistent across languages. Wikidata is much more structured and therefore more useful for semantic applications. It doesn’t allow for the different nuances in descriptions as Wikipedia articles do, and therefore Wikidata doesn’t reflect different opinions in descriptions and is less subject to bias.9 Furthermore, the cataloguing practices in libraries are subject to bias and subjectivity too. Perception and political view may, for example, be reflected in some subject headings and may also change over time.10 It is debatable whether a cataloger is more neutral and less biased than a larger user community.
Although the use and acceptance of Wikipedia as a true source of information may be debatable, in the light of the current “fake news” discussion it is extremely important to guard the correctness of information in Wikipedia. In this context it is interesting to note that “according to a study in Nature, the correctness of Wikipedia articles is comparable to the Encyclopaedia Britannica, and a study by IBM researchers found that vandalism is repaired extremely quickly.”11

4) Some objections have to do with the discussion of “centralization versus decentralization.” Some institutions may not want a central system that is perceived as having control over their local data. The idea of using Wikidata as a common authority control mechanism is not that different from the use of any other thesaurus or identifier framework like ISBN, ISSN, etc., except for its use of a central resource description.

5) What if Wikidata disappears? There are solutions in terms of mirrors and a local copy of Wikidata. Moreover, national libraries and other, similar institutions that are already responsible for long-term preservation of digital content can take responsibility for keeping Wikidata alive to maximize its viability.

CONCLUSION

Reconciliation of linked data identifiers in general, and using the Wikidata identifier as universal identifier in particular, has been shown to have many advantages. Libraries and similar institutions can gradually start using the Wikidata identifier without needing to make radical changes in their local database infrastructure. When Wikidata reaches a critical mass, libraries and similar institutions may want to switch to using Wikidata identifiers as the default resource identifiers or authority records.
However, given the enormous growth that is already taking place in the number of collections that link entities to Wikidata, we might end up in a situation where the perception is that “if an item is not in Wikidata, it doesn’t exist,” stimulating putting more items in Wikidata and making local descriptions less relevant. From a strategic point of view, decision makers considering the adoption of Wikidata may pose the question: “Why do we have a local thesaurus when we already have Wikidata?” The next question, then, will probably not be “Should we go this way?” but rather “When should we go this way and start using the Wikidata identifier as The Identifier?”

REFERENCES

1 Robert Sanderson, “The Linked Data Snowball and Why We Need Reconciliation,” SlideShare, Apr. 4, 2016, https://www.slideshare.net/azaroth42/linked-data-snowball-or-why-we-need-reconciliation.
2 Karen Smith-Yoshimura, “The rise of Wikidata as a linked data source,” Hanging Together, Aug. 6, 2018, http://hangingtogether.org/?p=6775.
3 Joachim Neubert, “Wikidata as a Linking Hub for Knowledge Organization Systems? Integrating an Authority Mapping into Wikidata and Learning Lessons for KOS Mappings,” in Proceedings of the 17th European Networked Knowledge Organization Systems Workshop, 2017, 14-25, http://ceur-ws.org/Vol-1937/paper2.pdf.
4 Theo van Veen, “Wikidata as universal library thesaurus,” presented Oct. 2017 at WikidataCon 2017, Berlin, https://www.youtube.com/watch?v=1_NxKBnCOHM.
5 “HTTPRange-14,” Wikipedia, accessed Mar. 15, 2019, https://en.wikipedia.org/wiki/HTTPRange-14.
6 Theo van Veen et al., “Linking Named Entities in Dutch Historical Newspapers,” in Metadata and Semantics Research, MTSR 2016, ed. Emmanouel Garoufallou (Cham: Springer, 2016), 205–10, https://doi.org/10.1007/978-3-319-49157-8_18.
7 Video demonstration of “KB Research Portal,” KB | National Library of the Netherlands, http://www.kbresearch.nl/xportal, accessed Apr. 26, 2019, https://www.youtube.com/watch?v=J5mCem-hEMg.
8 Ruben Verborgh, “Linked Data Fragments: Query the Web of data on Web-scale by moving intelligence from servers to clients,” accessed Mar. 15, 2019, http://linkeddatafragments.org/.
9 Mark Graham, “The Problem with Wikidata,” Apr. 6, 2012, https://www.theatlantic.com/technology/archive/2012/04/the-problem-with-wikidata/255564/.
10 Candise Branum, “The Myth of Library Neutrality,” May 15, 2014, https://candisebranum.wordpress.com/2014/05/15/the-myth-of-library-neutrality/.
11 “The Reliability of Wikipedia,” Wikipedia, accessed Mar. 15, 2019, https://en.wikipedia.org/wiki/Reliability_of_Wikipedia.