April 2019
Text by Christian Lieske and Dr. Felix Sasaki

Image: © Radachynskyi/istockphoto.com

Christian Lieske is involved in SAP’s language-related technologies and production. He has worked with organizations such as the World Wide Web Consortium and has contributed to the European Commission’s MultilingualWeb initiative. He has a formal education in Computer Science, Natural Language Processing and Artificial Intelligence.


christian.lieske[at]sap.com
www.sap.com




Felix Sasaki's field of interest is the application of Web technologies for representation and processing of multilingual information. He has worked for the World Wide Web Consortium (W3C) and DFKI in the area of Artificial Intelligence. He recently joined the German publisher Cornelsen Verlag as content architect.


felix.sasaki[at]dfki.de
www.dfki.de



Wikidata at work

Wikidata is a large-scale, non-profit knowledge base that anyone can edit and use. Intuitive applications and powerful programming interfaces make it a versatile tool for a wide variety of usage scenarios – including knowledge discovery, content enrichment, terminology work, and translation.

Wikidata is an environment for collaborative work in the field of data and information. Built on the ideas of Linked Data/the Semantic Web, it currently provides information on more than 50 million items. Hosted by the Wikimedia Foundation, Wikidata offers an open data source for modern content creation. In this article, we look at how content creators can use Wikidata, the "magic" behind it, and the Wikidata tooling. Furthermore, the article touches on some loose ends and calls to action. The goal is to stimulate an open discussion on content creation processes as exemplified by Wikidata.

How Wikidata can help

We live in a world of constant change, and for many of us there is a need for lifelong learning. Learning often means exploring parts of the world that we have not really seen before.

For example, let’s say that you want to explore the world of Machine Learning (ML) and Artificial Intelligence (AI). To get a basic understanding, you might look at the disciplines or sub-fields related to ML and AI. An ontological view like the one in Figure 1 – generated by a so-called Wikidata query – could help.

Figure 1: Restricted ontological view of the domain "Machine Learning/Artificial Intelligence"

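Queries like this can be run at query.wikidata.org. The following is a minimal sketch of such a query; it assumes Q11660 as the item identifier for "artificial intelligence" and uses the property P279 ("subclass of") to collect direct sub-fields:

# Direct sub-fields of Artificial Intelligence (assumed item: Q11660)
SELECT ?field ?fieldLabel WHERE {
  ?field wdt:P279 wd:Q11660 .   # subclass of: artificial intelligence
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}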

Naming things in different languages

If English is not your native language, you might prefer translations for the English terms used in the ontology. Figure 2 shows a bilingual list in English and German that you can generate by running another query on Wikidata.

Figure 2: Restricted bilingual list of terms for the domain "Machine Learning/Artificial Intelligence"

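Such a bilingual list can be generated by requesting labels in both languages explicitly – a minimal sketch, again assuming Q11660 as the item for "artificial intelligence":

# English and German labels for sub-fields of AI
SELECT ?field ?labelEn ?labelDe WHERE {
  ?field wdt:P279 wd:Q11660 .
  ?field rdfs:label ?labelEn . FILTER(LANG(?labelEn) = "en")
  ?field rdfs:label ?labelDe . FILTER(LANG(?labelDe) = "de")
}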

Digging deeper

Wikidata also allows you to dig deeper. Let’s suppose you would like to know more about Stuttgart, the venue of the tekom and tcworld conferences. By running a query on Wikidata, you could generate a file like the one in Figure 3.

Figure 3: Spreadsheet with facts and relationships for "Stuttgart"

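A query of this kind might look as follows – a sketch that lists all direct statements for Stuttgart (Q1022) with human-readable labels; the query service lets you download the results, e.g. as a CSV file:

# All direct statements about Stuttgart (Q1022)
SELECT ?propLabel ?valueLabel WHERE {
  wd:Q1022 ?p ?value .
  ?prop wikibase:directClaim ?p .   # map direct predicates to properties
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}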

Blend and enrich

Wikidata also offers the option to combine information from different sources. The map in Figure 4 was created by blending different types of geographical data.

Figure 4: Location of certain Automated Teller Machines (ATMs) in Stuttgart

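The Wikidata query service can render such results directly on a map. The following sketch assumes Q133235 as the item identifier for "automated teller machine" (the identifier should be verified on wikidata.org) and uses the properties P31 ("instance of"), P131 ("located in the administrative territorial entity") and P625 ("coordinate location"):

#defaultView:Map
# ATMs located in Stuttgart (Q1022), plotted by coordinates
SELECT ?atm ?atmLabel ?coord WHERE {
  ?atm wdt:P31 wd:Q133235 ;    # instance of: automated teller machine
       wdt:P131 wd:Q1022 ;     # located in: Stuttgart
       wdt:P625 ?coord .       # coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}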

The magic behind Wikidata

Just like Wikipedia, Wikidata can be consumed (read) or modified (written) by anyone. Key differences from Wikipedia are:

 

  • Wikidata stores information in a structured manner, while information in Wikipedia is stored largely unstructured – the semi-structured info boxes are the exception.
  • There is only one Wikidata instance, while there are approximately 300 single-language Wikipedia instances.

Structure is not added to information via tables, lists or specific markup. Instead, it emerges from what is called the Wikidata data model. This model defines what can be stored and how it is stored.

The central elements of the Wikidata data model are called items. They represent concrete or abstract entities. Items can be described using statements. Statements are composed of properties and their values, which may refer to other items. Statements can be refined by qualifiers and documented by references, as shown in Figure 5.

Figure 5: A high-level view on the Wikidata data model
Source: www.mediawiki.org/wiki/Wikibase/DataModel/Primer

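This model becomes visible when statements are addressed directly in a query. The following sketch retrieves Stuttgart's population statements – assuming P1082 as the property for "population" – together with their "point in time" qualifier (P585):

# Population statements for Stuttgart (Q1022), with qualifier
SELECT ?population ?date WHERE {
  wd:Q1022 p:P1082 ?statement .       # the full statement node
  ?statement ps:P1082 ?population ;   # the statement's value
             pq:P585 ?date .          # qualifier: point in time
}
ORDER BY DESC(?date)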

For the representation of items, statements, etc., Wikidata has adopted many approaches known from Linked Data/the Semantic Web. To capture information, Wikidata specifically uses "Subject-Predicate-Object" arrangements:

 

Q1022, P361, Q8172 (Stuttgart, part of, Stuttgart Government Region)

Q1022, P361, Q451619 (Stuttgart, part of, Stuttgart Metropolitan Region)

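A triple like the first one can be checked directly against Wikidata – a minimal sketch:

# Is Stuttgart (Q1022) "part of" (P361)
# the Stuttgart Government Region (Q8172)?
ASK { wd:Q1022 wdt:P361 wd:Q8172 . }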

Furthermore, Wikidata distinguishes between items/concepts (language-agnostic) and labels/terms (language-specific). This subject-predicate-object data model is closely related to the idea of graphs – collections of nodes and edges, drawn as circles and lines (see Figure 6).

Figure 6: Subject-Predicate-Object arrangement in graphs (left: language-agnostic abstract identifiers; right: language-specific human-readable labels)


Another attractive feature of Wikidata is that it allows you to work not only with text, but also with pictures and sounds.
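Media assets are attached to items via properties such as "image" (P18). The following sketch collects pictures of items located in Stuttgart and displays them in the query service's image grid:

#defaultView:ImageGrid
# Images (P18) of items located in Stuttgart (Q1022)
SELECT ?item ?itemLabel ?image WHERE {
  ?item wdt:P131 wd:Q1022 ;   # located in: Stuttgart
        wdt:P18 ?image .      # image
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50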

Technology and tools

The main components of the Wikidata infrastructure are a database management system (Wikibase) and a wiki component (MediaWiki). Both are open source, and you can set up your own Wikidata system (e.g. behind a firewall).

Programming interfaces supporting JavaScript Object Notation (JSON) and the Resource Description Framework (RDF) with its query language SPARQL help to make Wikidata as versatile as possible. This enables the creation of JavaScript libraries such as qLabel and the following end-user tools:

 

  • Reasonator
  • Ask Wikidata
  • Wikidata Translate (including disambiguation)

As items may include ontological information, taxonomies and other knowledge organization systems can be generated. "Wikipedia and Wikidata tools" demonstrates how to gather translations, synonyms, category information, links to media assets, and much more.

Implementations such as "Wikidata-Taxonomy" support useful yet limited usage scenarios related to Wikidata. The full power of Wikidata is accessible via SPARQL, the standard query language for Linked Data/the Semantic Web. The Wikidata example queries demonstrate this and illustrate how to work in domains such as medicine, computer science, history or sports. An interesting feature of the SPARQL query interface to Wikidata is the range of options for visualizing results, including tables, diagrams, timelines, etc.

Content

Wikidata's usefulness relies heavily on its content. Two factors are at play here:

 

  1. The coverage as such
  2. The interconnection between data sets

All content is created and maintained by humans or machines (bots). Guiding principles are:

 

  • Information shall meet the criteria for notability as specified by Wikidata
  • Information can be contradictory

All content in Wikidata can be used under the terms of the Creative Commons CC0 license.

As an alternative to storing content in Wikidata itself, it is possible to establish connections from Wikidata content to other data sets. The most important step in this process is to "match" data items in both Wikidata and the other data set. The matching not only allows an import, but also enables automatic content enrichment. For example: Wikidata contains identifiers from the "Gemeinsame Normdatei" (GND), Germany's integrated authority file, and links them to Wikipedia. The GND identifier for an author can thus be linked to their biography in Wikipedia.
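As an illustration, the following sketch lists a few people for whom Wikidata records a GND identifier – assuming P227 as the property for "GND ID":

# People with a GND identifier recorded in Wikidata
SELECT ?person ?personLabel ?gndId WHERE {
  ?person wdt:P31 wd:Q5 ;     # instance of: human
          wdt:P227 ?gndId .   # GND ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20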

Loose ends and calls to action

Caveats

Like most large collections of information, Wikidata is not perfect regarding general coverage, completeness of information for certain domains and items, or correctness and accuracy. A case in point is the domain of terminology.

Wikidata is related to terminological data categories as defined by ISO 12620 and ISOcat or its successor DatCatInfo.

Currently, however, data categories related to terminology are hardly used. A Wikidata query on the use of terminological data categories yields only 462 hits, most of them relating to linguistics and literary studies. Other domains are hardly represented at all.

Loose ends

Initially, Wikidata scored low with regard to lexicographic information – the world of lexemes, senses, variety, etc. "Terminology" was lacking as well: The data model only allowed preferred terms/names to be used effectively in Wikidata. Synonyms, abbreviations, unauthorized names, etc., could only be categorized as "also known as". They also seemed to be unsupported by Wikidata's search. An extension that was put in place recently addresses this issue and should help to improve the usability of Wikidata in lexicographic contexts. The aim, among other things, is to be able to deal more comprehensively with data categories in areas such as etymology. It is assumed that this in turn will yield benefits for Wiktionary – the "lexicon" of the Wikimedia ecosystem.

The previous two sections touched on areas where interested constituencies and individuals could become active to improve Wikidata. Here are some specific ideas:

 

  • Examine possible shortcomings in the data model.
  • Systematically integrate data categories relevant to a certain domain into Wikidata or adapt existing Wikidata data categories to the needs of that domain.
  • Systemize the mapping between Wikidata properties and domain-specific data categories.
  • Explain the added value of mapping for a certain domain (e.g. access to multimedia assets).
  • Make the added value of the mapping clear for Wikidata (e.g. make term variants such as unauthorized spellings attributable in Wikidata).

We hope that this article can inspire other content creators to explore Wikidata and become active participants in improving this useful knowledge base.

 

Further reading