Linked Data and Schema.org: Crossing the language chasm with terminological assets
Since 2011, the Schema.org initiative has set a standard for structured information. The technical communication community can already benefit and deliver valuable input for further progress.
The results pages returned by search engines often contain more than just links to websites presumably relevant to your search. They also reveal various types of information such as reviews, and sometimes even a visual preview of related content on websites (see Figure 1).
The trick behind this is a concept called "Linked Data", together with "Schema.org", an endeavor driven by Google and other major players on the Web.
This article explores the state of affairs of Linked Data with regard to technical documentation in general, and terminology in particular. The idea is to answer the question “What role can terminological assets play in today’s Web, especially in the search context?”
Let’s start with the following scenario (see Figure 2): A Chinese technician familiar with Chinese terminology in his field of expertise is searching the Web for a product related to "" (Easy Graphics Framework). A vendor offers this product in his portfolio but only provides corresponding information in English and German. Because the Web content is not available in Chinese, the technician remains unaware of the offer, and the vendor thus loses a potential sales opportunity.
So-called "tailor-made cross-lingual snippets" are an important ingredient in crossing the language chasm on the Web. These blocks of content are derived from a dedicated, curated, bilingual, company-owned asset such as a terminology database (see Figure 3). They are sensitive to the domain-specific language and are therefore far more accurate than the bilingual texts generated by generic machine translation solutions.
Figure 3: "Tailor-made cross-lingual snippet"
Today’s web search
A web search involves far more than a search engine and a keyword being entered into a search box by a user. Rather, it relies on the infrastructure on which any distributed application on the Web is based (see Figure 4): a client (e.g. a Web browser), a network, and a server. Behind the scenes, this infrastructure makes use of a set of technologies such as the Hypertext Transfer Protocol (HTTP), which are sometimes collectively referred to as the Open Web Platform.
In a similar vein, the complexity of a search engine results page (SERP) is not visible. SERPs have evolved from simple lists of links into rich, interactive assemblies of content blocks called "snippets" or "cards". In many cases, these content blocks are organized in different categories such as organic or sponsored results, rich snippets, informational boxes, or knowledge graph cards (see Figure 5). As mentioned in the introduction, an approach named "Linked Data" supports most of this. Linked Data can also be described as a kind of "universal application programming interface (API)" that dramatically facilitates information handling on the Web.
Search engines are not the only ones that make use of Linked Data. Many websites are also distributed applications: Their content is not written by the website owner but rather aggregated dynamically from a variety of sources. This aggregation makes ample use of Linked Data. The music landing page of the British Broadcasting Corporation, for example, uses Linked Data to tap into information sources such as Discogs, DBpedia, last.fm, and MusicBrainz.
Close-up 1: Linked Data
As Linked Data is behind Web searching and websites, it is a major force in today’s Web. Its fundamental ideas are:
- Put interesting data entities (e.g. facts) and the relations between them on the Web and identify each of them via a Globally Unique Identifier (GUID).
- Use Semantic Web technologies to create mashups of these entities (e.g. attractive websites or powerful search applications).
Quite a number of concepts and technologies related to Linked Data originate from the so-called Semantic Web – the notion of a Web-based network of units with explicit meaning. The Semantic Web tackles semantic issues such as the following: A program called a "screen scraper" looks at the string "Call me on 2016-07-27" but misinterprets "2016-07-27" as a phone number. In contrast, the Semantic Web extracts the correct meaning by using structured information. It thus explicitly encodes "2016-07-27" as a date (see Figure 6).
Figure 6: Text with explicitly encoded/structured information
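The contrast between guessing and explicit encoding can be illustrated with a minimal Python sketch. Only the example string comes from the text above; the crude phone-number pattern, the choice of the Schema.org type "Event" with its "startDate" property, and all variable names are assumptions made for illustration.

```python
import json
import re
from datetime import date

text = "Call me on 2016-07-27"

# A deliberately crude screen scraper: any run of digits and hyphens
# is taken to be a phone number (illustrative rule, not a real tool).
PHONE_LIKE = re.compile(r"\d[\d-]{6,}\d")
scraped = PHONE_LIKE.search(text).group(0)

# Structured alternative: the value is explicitly declared as a date
# via the Schema.org "startDate" property, so nothing has to be guessed.
structured = {
    "@context": "http://schema.org",
    "@type": "Event",
    "name": "Phone call",  # hypothetical label
    "startDate": "2016-07-27",
}

parsed = date.fromisoformat(structured["startDate"])

print("Scraper guess (phone number?):", scraped)
print("Explicitly encoded date:", parsed.isoformat())
print(json.dumps(structured, indent=2))
```

The scraper has no way of knowing what the digits mean, whereas a consumer of the structured version can rely on the property name to interpret the value.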
A pivotal concern regarding the Semantic Web is the explicit representation of information. The basic model behind this representation structures information into a subject, predicate, and object. For example: A product (=subject) has the name (=predicate) Easy Graphics Framework (=object).
This model can easily be visualized as a graph (see Figure 7). Here, both entities (= subjects) and their characteristics (= predicates) are identified via GUIDs. This basic model relies strongly on the concept of "vocabulary": A vocabulary can be understood as the set of values that can serve as subject, predicate or object. A vocabulary thus describes an area of relations, as it comprises all terms that are relevant to that area. For example: the "friend-of-a-friend vocabulary" (FOAF) describes relationships between persons (e.g. "friend", "father", "sister", etc.).
Figure 7: Visualization of Semantic Web
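The subject-predicate-object model can be sketched in a few lines of Python. The product identifier under example.com and the helper function `objects` are hypothetical; the predicate and type identifiers follow the Schema.org and RDF naming conventions.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical identifier for the vendor's product (not from the article).
PRODUCT = "http://example.com/products/egf"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# A tiny graph: each statement is one subject-predicate-object triple.
graph = [
    Triple(PRODUCT, RDF_TYPE, "http://schema.org/Product"),
    Triple(PRODUCT, "http://schema.org/name", "Easy Graphics Framework"),
]


def objects(graph, subject, predicate):
    """Return all objects stated for a given subject and predicate."""
    return [t.obj for t in graph if t.subject == subject and t.predicate == predicate]


print(objects(graph, PRODUCT, "http://schema.org/name"))
```

Because subjects and predicates are identifiers rather than free text, statements from different sources about the same identifier can be merged into one graph.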
Close-up 2: Vocabularies and semantic annotations
The Linked Open Vocabularies (LOV) site maintains an overview of popular Linked Data vocabularies, including:
- "Friend-of-a-friend (FOAF)" for persons and their relationships
- "Dublin Core" for general metadata (e.g. titles of works of art)
- "Basic Geo Vocabulary" for geographical information (like latitude and longitude)
- "Schema.org" for popular search areas (e.g. events, products, reviews…)
Schema.org is particularly relevant with regard to Web searches and thus deserves a closer look. In 2011, the world’s biggest search engine providers realized that their users could benefit from Linked Data, and that they as key players could assist in lowering the entry barrier to the world of Linked Data. Therefore, they took the core concepts of the Linked Data technology stack and agreed on simple mechanisms (markups and vocabularies) to encode (linked) information in Web pages. The outcome was Schema.org, an effort to bring the idea of linked and structured information to the Web at large.
Early on, these search engine providers understood that Schema.org could only achieve broad adoption if their work was handled and informed by an open community rather than a closed club. Thus, in 2015, a dedicated Community Group was formed within the World Wide Web Consortium (W3C) to provide an open and transparent forum.
The basic ideas behind Schema.org are:
- Provide a way to identify unique concepts and relations via Uniform Resource Identifiers (URIs). For example, the concept of a person is identified via http://schema.org/Person.
- Provide simple means to embed the aforementioned information into Web content.
- Use just one vocabulary for Schema.org definitions, and one domain name, viz. http://schema.org, for all vocabulary definitions.
This single-vocabulary approach is a key difference between Schema.org and the wider Semantic Web and Linked Data landscape, where vocabularies like FOAF or Dublin Core are maintained independently and rely on different domain names. The benefit of independent vocabularies is flexibility; the disadvantage is uncertainty for the user about how the vocabularies work together, and what to use when.
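How such information can be embedded into Web content is sketched below, using JSON-LD, one of the syntaxes Schema.org supports. The helper name `jsonld_script` and the person's details are hypothetical; the sketch only shows the general shape of the markup.

```python
import json


def jsonld_script(data: dict) -> str:
    """Wrap a JSON-LD document in a script element for embedding in HTML."""
    body = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "</" so that a string value cannot close the script element early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


person = {
    "@context": "http://schema.org",
    "@type": "Person",  # resolves to http://schema.org/Person
    "name": "Jane Doe",  # hypothetical example values
    "jobTitle": "Technical Writer",
}

print(jsonld_script(person))
```

The resulting element can be placed in the page's HTML without changing the visible content, which is one reason why this form of markup has a low entry barrier.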
According to the Schema.org website, today, more than ten million sites use Schema.org markup to encode Linked Data information – without requiring in-depth knowledge of the complex Semantic Web/Linked Data technology stack. The coverage of domains differs according to the type:
- Markup for persons appears on more than one million Web site domains.
- Markup for news articles appears in fewer than 50,000 domains.
- Markup for technical articles appears in fewer than 1,000 domains.
Crossing the language chasm
So let’s go back to the scenario described at the beginning of this article – an unsuccessful search in Chinese and a lost sales opportunity – and find out if the result might be altered by turning to Linked Data and Schema.org (see Figure 9). How could the vendor’s terminological knowledge be leveraged?
Implementing Linked Data and Schema.org for this scenario involves the following steps:
- Get terminology out of the terminology database (e.g. as TermBase eXchange (TBX) – a standard for the exchange of terminological data)
- Map the vocabulary of the terminology database to that of Schema.org (e.g. "Pr" to "http://schema.org/Product")
- Get the mapped information serialized (e.g. as JSON-LD)
- Annotate content automatically with your serialized data
Steps 2, 3 and 4 are often combined and basically create a view of terminological data that is of high value in the context of Search Engine Optimization (SEO). The first tools are emerging to facilitate these steps. Relying on technology from the FREME project, these tools assist with the conversion from TBX to JSON-LD (see Figure 10).
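The first three steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the FREME-based tooling: the TBX fragment, the entry ID, the German term, the use of a "descrip" element to carry the code "Pr", and the function name `tbx_to_jsonld` are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Step 1: a simplified TBX-like export. The entry ID, the German term and the
# "descrip" element carrying the code "Pr" are assumptions for illustration.
TBX = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="c1">
      <descrip type="subjectField">Pr</descrip>
      <langSet xml:lang="en"><tig><term>Easy Graphics Framework</term></tig></langSet>
      <langSet xml:lang="de"><tig><term>Einfaches Grafik-Framework</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Step 2: map the termbase's own codes to Schema.org types.
TYPE_MAP = {"Pr": "http://schema.org/Product"}


def tbx_to_jsonld(tbx_xml: str) -> list:
    """Convert each termEntry into one JSON-LD document (as a dict)."""
    root = ET.fromstring(tbx_xml)
    docs = []
    for entry in root.iter("termEntry"):
        code = entry.findtext("descrip[@type='subjectField']")
        # One language-tagged name per term in each language section.
        names = [
            {"@value": term.text, "@language": lang_set.get(XML_LANG)}
            for lang_set in entry.iter("langSet")
            for term in lang_set.iter("term")
        ]
        docs.append({
            "@context": "http://schema.org",
            "@type": TYPE_MAP.get(code, "http://schema.org/Thing"),
            "name": names,
        })
    return docs


# Step 3: serialize the mapped information as JSON-LD.
print(json.dumps(tbx_to_jsonld(TBX), ensure_ascii=False, indent=2))
```

Step 4, annotating content, would then consist of attaching the serialized output to the Web pages that mention the corresponding terms.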
Unfortunately, translations cannot be realized in a straightforward way: Schema.org includes neither an "is_translation_of" relation nor a way to express that something is a term, let alone a whole term database entry. Schema.org’s "sameAs" is not tailored towards cross-lingual links, and its "CreativeWork" is not tailored towards terms but only towards books, movies, etc. Thus, the tailor-made cross-lingual snippet as outlined in Figure 3 cannot be easily realized yet. This is likely to change if the Schema.org Community Group receives corresponding requests and input at the W3C.
Nevertheless, with a bit of twisting and bending, there is a way to realize these cross-lingual snippets using Schema.org: Schema.org provides a "sameAs" relation and the ability to assign a language identifier to a concept via the "inLanguage" property. Taken together, these two Schema.org constructs can encode the translation relation between "" and "Easy Graphics Framework" that is needed in our scenario (see Figure 11).
Figure 11: "Translation relation" in current Schema.org
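One possible encoding of such a translation relation is sketched below in Python. It is an assumption about how "sameAs" and "inLanguage" might be combined, not a reproduction of the figure: the function name `cross_lingual_snippet` is hypothetical, the Chinese term is represented by a placeholder, and nesting a node under "sameAs" (which formally expects a URL) is part of the twisting and bending.

```python
import json


def cross_lingual_snippet(term, lang, same_as_term, same_as_lang):
    """Link two language variants of a term via "sameAs" and "inLanguage"."""
    return {
        "@context": "http://schema.org",
        "@type": "CreativeWork",  # bent usage: Schema.org has no type for terms here
        "name": term,
        "inLanguage": lang,
        "sameAs": {  # bent usage: "sameAs" normally points to a URL
            "@type": "CreativeWork",
            "name": same_as_term,
            "inLanguage": same_as_lang,
        },
    }


# "<Chinese term>" is a placeholder for the technician's search term.
snippet = cross_lingual_snippet("<Chinese term>", "zh", "Easy Graphics Framework", "en")
print(json.dumps(snippet, ensure_ascii=False, indent=2))
```

A search engine that understands both constructs could, in principle, connect a query in Chinese with content that is only available in English.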
As TermBase eXchange (TBX) is used widely in terminological contexts, a TBX-to-Schema.org converter can be used to generate Schema.org markup that facilitates cross-lingual searching (e.g. retrieving English content based on a query in Chinese).
A more comprehensive mapping of TBX to Linked Data can be realized through the OntoLex Linked Data vocabulary. OntoLex has already emerged as an industry standard for publishing terminological information as Linked Data. For example, IATE, the terminology database of the European Union, has already been converted to OntoLex. A general converter from TBX to OntoLex is available.
Based on Semantic Web concepts and technologies, Schema.org enhances the search experience for users, and helps content owners and website administrators to make their content more findable. The approach benefits from standards-based information provisioning: no custom specifications are necessary, and no proprietary API is involved. Furthermore, modeling is easy and universal.
However, as of today, Schema.org does not include everything that is needed to realize cross-lingual scenarios easily. As Schema.org evolves on the basis of community feedback, input from the technical documentation and terminology community could help to change this.
This article was supported by the FREME project and co-funded by the Horizon 2020 Framework Program of the European Union, Grant Agreement Number 644771.