February 2017
By Joe Pairman

Image: © erhui1979/istockphoto.com

Joe Pairman is Lead Consultant at Mekon Ltd., helping clients from healthcare to software realize the full potential of structured content and innovative information delivery with taxonomy. Before joining Mekon, he led the implementation of a DITA XML-based component content management system at HTC, including designing the support content architecture for HTC’s help app and responsive website.




The secret life of taxonomies: Web findability beyond browsing and facets

A taxonomy is essential infrastructure. Its visible forms, such as menus and facets, are only one application. It can also provide the crucial plumbing and wiring – the metadata and logic that guide your customers from Google to your site and, finally, to the actual content they need.

Most people have never heard of taxonomy. Others associate it exclusively with plants and animals. In tech comm, however, taxonomies relate to the classification of things in general – not just in biology, but for information of any kind. Most visible in the digital world are the tree-structured taxonomies that we see in website menus.

When trying to tidy up the implicit taxonomies in your current tools and content – the directory structures and tables of contents – you might uncover ambiguities and conflicts in the ways you or your colleagues categorize and name things. In fact, if the classification tree is all you know of taxonomy for information delivery, you may well become disillusioned with its dogmatism and exclusivity, where each thing belongs in one container only.

As information delivery catches up with e-commerce, we are starting to realize the exciting potential of faceted browsing, where customers filter content based on the criteria they themselves deem most important – the products it relates to, the tasks it enables, the date it was written. Without learning our taxonomy, they can quickly combine criteria to find the exact piece of content they need.

Image 1: Faceted search on Cray’s portal, built on the DITAweb delivery platform.
Source: www.cray.com

However, faceted search relies on customers finding their way to your site. The big search engines – Google, Bing, Baidu, and Yandex – are now everyone’s home page, everyone’s top-level navigation. Few people will start their hunt for information on your site, unless you operate in a highly specialized sector. If you are documenting a nuclear power station or a particular medical device, search engines may not concern you much. But most customers of consumer equipment will start their search on the open Web, and if you don’t put your information there, they will take it from other users or even your competitors.

Of course, websites still need top-level navigation, just as offices still need phones and photocopiers. But if trees and facets are all you see of taxonomy, you may relegate it to a position of mere infrastructure – necessary but not really a strategic asset to nurture and grow. Out of sight, however, below the floorboards, taxonomies provide some essential plumbing to help your content show up in search results and to make those results more appealing. They also link your content to related pieces of information, not only on your site but also in the overgrown but prolific garden of the Web.

Synonyms: Driving traffic to your site

An American colleague of mine once reminded me that our respective home countries were "divided by a common language". In London, I take the lift, walk into my flat, and turn on the tap, whereas Americans ride the elevator to their apartments, which have faucets.

Organizations adopt their own vocabularies in a similar way and, sometimes, even teams on different floors will coin their own micro-dialects. Without specific guidance on terminology, when writing for end users, we tend to use the words we feel most comfortable with – words that are clear to us, but may be completely vague to the uninitiated. If you are not using the words that your customers would use to search, it is unlikely that they will find your content.

Admittedly, Google is getting better at understanding synonyms and showing results for common search terms even when those terms do not appear in the listed pages. However, this works less well in specialized domains, for which there is less content available and hence less for Google’s algorithms to work on. Even when Google does successfully show results for synonyms, it does not highlight those synonyms in the search snippet. When searchers don’t see the term they’ve used and are familiar with, they are less likely to follow that result.

Here is Google’s own SEO advice:

Users who know a lot about the topic might use different keywords in their search queries than someone who is new to the topic… Anticipating these differences in search behavior and accounting for them while writing your content (using a good mix of keyword phrases) could produce positive results.


Image 2: Effective use of synonyms displayed on a Google search results page
Source: Google.com


Methods to help you manage synonyms and find new ones

By using a taxonomy with a thesaurus structure, you can organize and keep track of synonyms. Like a traditional, book-based thesaurus, it groups phrases with identical meanings. However, as we will see later, it provides richer relationships. Taxonomy management tools let you browse the thesaurus and export it in various formats, so it can act as a useful reference for authors who need to look up synonyms when writing.

Image 3: Concept within a thesaurus structure, in the PoolParty taxonomy management tool
Source: Illustrative example devised for this article


A thesaurus needs to be populated with synonyms that customers actually recognize and use. Various forms of taxonomy research can help with this task: One popular example is the card sort. An open card sort exercise prompts participants to group sample content into categories that they create. These user-generated categories are an excellent source of new synonyms, although they should be corroborated with wider evidence wherever possible.

For a broader validation of these new terms, and to get ideas for others, you can perform a corpus analysis, i.e. the statistical analysis of a body of text. To do this, you need to collect a large number of documents and pages from your own site or documentation set, from user postings, and even from competitors. Corpus analysis techniques show you which terms are most frequent across that body of documents, and which words tend to go together to form particularly relevant phrases. For example, an analysis of a large amount of tech comm content showed me that "metadata" and "DITA" are salient terms, and "business case around single sourcing" is also recognized as a term by the algorithm.
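The counting at the heart of such an analysis can be sketched in a few lines of Python. This is a minimal illustration only, not the algorithm of any particular tool: the stopword list, the three sample sentences, and the function names are all assumptions made for this example, and a real analysis would use a far larger corpus and a statistical measure of salience rather than raw counts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for this illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "your"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(documents):
    """Count how often each non-stopword appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return counts

def bigram_frequencies(documents):
    """Count adjacent word pairs in which neither word is a stopword."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[(first, second)] += 1
    return counts

# Invented sample corpus standing in for site pages and user postings.
corpus = [
    "Reset your phone with a hard reset.",
    "A hard reset restores the factory settings.",
    "Back up your contacts before a factory reset.",
]

terms = term_frequencies(corpus)
pairs = bigram_frequencies(corpus)
print(terms.most_common(3))
print(pairs.most_common(2))
```

In this toy corpus, "reset" is the most frequent term and "hard reset" the most frequent word pair, which hints at how multi-word candidate terms emerge from plain counts.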

Comprehensive taxonomy management tools provide easier ways to perform corpus analyses, and allow you to match up this data with your own taxonomies. This helps to get new potential synonyms for the terms you already use, and you can even see which of your terms are not frequently used in the wild. This might be a sign that your terminology is not in sync with that of your actual customers. However, it could also indicate that the subjects you deem important are not well represented in the currently available information. You will have to do further research to determine whether this is because content is missing regarding these subjects, or because these topics are not relevant and you shouldn’t bother writing about them!
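Matching corpus data against an existing taxonomy can be sketched as follows. The taxonomy, the labels, and the sample postings are invented for illustration, and the function names are hypothetical; commercial taxonomy tools do this with their own, more sophisticated matching.

```python
import re

def count_phrase(phrase, documents):
    """Count whole-word, case-insensitive occurrences of a phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(doc)) for doc in documents)

def label_usage(taxonomy, documents):
    """Map each concept to {label: occurrence count in the corpus}."""
    return {
        concept: {label: count_phrase(label, documents) for label in labels}
        for concept, labels in taxonomy.items()
    }

def unused_labels(usage):
    """List (concept, label) pairs that never occur in the corpus."""
    return [
        (concept, label)
        for concept, counts in usage.items()
        for label, count in counts.items()
        if count == 0
    ]

# Hypothetical taxonomy: concept name -> labels in use for it.
taxonomy = {
    "Hard reset": ["hard reset", "factory reset", "master reset"],
    "Contacts": ["contacts", "address book"],
}
# Invented user postings standing in for a real corpus.
corpus = [
    "How do I do a factory reset?",
    "A hard reset deleted my contacts.",
    "Factory reset keeps failing.",
]

usage = label_usage(taxonomy, corpus)
print(usage)
print(unused_labels(usage))
```

Labels that come back with a count of zero are only candidates for review. As noted above, they may point to out-of-sync terminology or simply to a subject that the collected content does not cover.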

Keeping users on your site

So, taxonomies can help you maintain a richer, more relevant variety of terms in your content and thus increase the chance that the content will rank well for searches in your domain. You might say that this is all you really need to worry about in terms of findability. If your content is good and uses relevant terms, the big search engines will help users find it, and all is well.

But what if your users land on your relevant-looking page with its high-quality content, but it is just not quite what they are looking for? A single page or topic can only cover so much information. It may be that users need to fine-tune their search to find related content within your site. This is where your own site’s search feature becomes useful.

These days, few users start their search on your site. But many still use your search to take them the last step of the way to the information they need. And once again, taxonomy can help. The same synonyms you made available to your authors can be integrated with your site search to display content, even when authors did not include these synonyms in the content itself.

In fact, thesauri can contain a special kind of synonym – a hidden term – that should never be displayed directly to users but nonetheless should be recognized when they search. For customers who have recently switched to your product from a competitor, you might want to let them find information using the competitor’s terms for the features of this product, to ease the transition. Or, if you can set a standard for terminology, you might want to allow non-standard terms in searches without encouraging their use by featuring them directly in your content. In the example in Image 3, the concept of a hard reset has a hidden label "wipe", which may be a usage that the organization does not want to promote.
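A rough sketch of how a site search might use preferred, alternative, and hidden labels is shown below. The hard reset concept and its hidden label "wipe" come from the example in this article; the alternative labels, the page titles, and all class and function names are assumptions made for illustration. The matching is deliberately naive (plain substring search), whereas a real search engine would handle this through its own synonym configuration.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A thesaurus concept with three kinds of label."""
    preferred: str
    alternatives: list = field(default_factory=list)
    hidden: list = field(default_factory=list)

    def search_labels(self):
        """All labels that a query is allowed to match, hidden ones included."""
        return [self.preferred, *self.alternatives, *self.hidden]

    def display_labels(self):
        """Labels that may be shown to users; hidden labels are left out."""
        return [self.preferred, *self.alternatives]

def expand_query(query, concepts):
    """Return the query plus every label of any concept the query names."""
    q = query.strip().lower()
    terms = {q}
    for concept in concepts:
        labels = [label.lower() for label in concept.search_labels()]
        if q in labels:
            terms.update(labels)
    return terms

def search(query, pages, concepts):
    """Return titles of pages whose body contains any expanded term."""
    terms = expand_query(query, concepts)
    return [
        title
        for title, body in pages.items()
        if any(term in body.lower() for term in terms)
    ]

# "factory reset" and "master reset" are assumed alternative labels.
hard_reset = Concept(
    preferred="hard reset",
    alternatives=["factory reset", "master reset"],
    hidden=["wipe"],
)
# Hypothetical pages: title -> body text.
pages = {
    "Performing a hard reset": "A hard reset restores factory settings.",
    "Adding contacts": "Import contacts from your SIM card.",
}

results = search("wipe", pages, [hard_reset])
print(results)
```

A search for "wipe" finds the hard reset page even though the word appears nowhere in its text, while `display_labels()` keeps "wipe" out of anything shown to the user.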

Automatic link suggestion

To cover the last part of the journey from Google to useful information, present links that users may need to get them from a page that’s merely in the vicinity of the searched-for subject to one that fits exactly. Users are accustomed to seeing these links under headings such as "Related Content", often in a sidebar next to the main page content. If we only have a few topics or pages, we can probably create those links by hand, and update them as necessary when the destination pages have been changed or removed. As our content increases, however, this becomes less manageable.

It is possible to completely automate the process not only of creating these links but also of driving the logic behind them. If you have a very large quantity of content, natural language processing techniques such as Latent Dirichlet Allocation can identify the subject matter of each document or page in relation to the whole, enabling automatic link generation between pages on the same subject. For example, on a site offering information on financial services, pages about insurance might be linked together, even if some of them did not feature the word "insurance" but rather "protection". However, apart from the need for a very large quantity of content to get high-quality results, these techniques do not offer the ability to tailor the links to specific user needs.

Metadata powered by taxonomy strikes the perfect balance between manual and automated linking. By carefully defining the required metadata fields for each page or topic as well as the sections of your taxonomy that are used for each of these fields, you can control the pool of links that will be automatically generated. You can also prioritize links based on the role the linked resources play, not just the general subject matter they cover, for example by highlighting the link to a conceptual overview that introduces the current page’s subject matter.

Let's say that you are designing the support content for a range of task and project management tools. In your metadata framework, you mandate that each page is tagged according to the general goal or task it enables as well as the products it relates to. In addition, you have a field for the page’s general role within the support content as a whole – scenario, troubleshooting, overview, or lookup table. Using the first two fields, you establish the pool of related links for a given page – links to all the pages that relate to the same task and apply to the same product. Within that pool, if there is a page with a role of "Scenario", the link to that is placed at the top, to allow users to easily jump to a realistic example, putting the more specific instructions on other pages into context. (Authoring guidelines and editorial process ensure that there is at least one scenario for each general task area.)

In the example shown in Image 4, a number of example page titles are shown in black. Some pages have a Task domain metadata value of "Working with contacts", picked from the options in the Task section of the taxonomy. Some have a Product value of "Task Buster Pro". And some have the "Scenario" role, as those pages walk the user through a prepared example. As a result of this metadata tagging, the three bolded titles would be the pool of related links – each page of those three would present suggested links to the other two. On the two non-scenario pages, the link to "Scenario: A project with contacts from multiple sources" would be presented at the top.

Image 4: How relevant metadata values can provide a prioritized list of suggested links
Source: Pairman


This may seem technically daunting, but in fact it relies on the same technique used in faceted browsing, i.e. taking the intersection of various metadata fields to arrive at a pool of results. The only difference is in the presentation.
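The logic can be sketched in a few lines of Python, assuming each page carries task, product, and role values as described above. Two page titles, the task value, and the product name are taken from the example; the remaining titles, the "Task Buster Lite" product, the roles assigned to non-scenario pages, and the `Page` and `related_links` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Page:
    title: str
    task: str      # value from the Task section of the taxonomy
    product: str   # value from the Product section of the taxonomy
    role: str      # e.g. "Scenario", "Troubleshooting", "Overview"

def related_links(current, pages):
    """Pages sharing the current page's task and product, scenarios first."""
    pool = [
        p for p in pages
        if p != current and p.task == current.task and p.product == current.product
    ]
    # sorted() is stable, so non-scenario pages keep their original order.
    return sorted(pool, key=lambda p: p.role != "Scenario")

pages = [
    Page("Assigning a task to multiple contacts",
         "Working with contacts", "Task Buster Pro", "Overview"),
    Page("Importing contacts from a spreadsheet",       # hypothetical title
         "Working with contacts", "Task Buster Pro", "Overview"),
    Page("Scenario: A project with contacts from multiple sources",
         "Working with contacts", "Task Buster Pro", "Scenario"),
    Page("Sharing contacts with your team",             # hypothetical title
         "Working with contacts", "Task Buster Lite", "Overview"),
    Page("Exporting a project report",                  # hypothetical title
         "Reporting", "Task Buster Pro", "Overview"),
]

links = [p.title for p in related_links(pages[0], pages)]
print(links)
```

For the page on assigning a task to multiple contacts, the scenario page is listed first, followed by the other page with the same task and product values. Pages that differ in task or product never enter the pool.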

The links created in this way are far more useful and focused than those created automatically without any guidance from taxonomy. Yet they are much easier to maintain than those created directly by authors. If you delete or replace the page on "Assigning a task to multiple contacts", any links pointing to it are removed or updated automatically. Should you add more content that relates to working with contacts in the Task Buster Pro product, new links to that content will be added from the relevant pages.


Taxonomies not only support the obvious navigation elements of a site, but also provide an underlying logic to automatically suggest related links. They enable you to maintain a rich variety of real user language in your content that users can easily identify and recognize in search results. This variety of terms improves your chances of ranking well in relevant searches. Far from becoming irrelevant, taxonomy is more important to findability than ever.