March 2020
Text by Jang Graat

Image: © trigga/istockphoto.com

Jang F.M. Graat is a philosopher and self-taught programmer, with more than three decades of experience in technical communication. He lives in Amsterdam, where he founded his company Smart Information Design.


jang[at]smartinfodesign.com


www.smartinfodesign.com

Localizing measurements, "automagically"

Meters, feet or yards – there are still no common global measurement units, and perhaps there never will be. Using DITA, we can achieve automated localization for figures and their units.

In 1999, the Mars Climate Orbiter spacecraft crashed into the surface of the red planet. This put a very premature end to a highly technical product that had taken US$125 million and many years to develop. It also set back another project that was going to use the orbiter as a communications relay station. The reason for the crash was a simple human error: The navigation team had assumed that thruster impulse data was given in metric units (newton-seconds), when the contractor's software had supplied it in pound-force seconds. Obviously, the data was not marked up in XML (although the standard already existed back then).

This is a perfect example of the added value that semantic markup can bring. With markup that identifies a measurement as a specific unit, any mismatch in the data could have easily been detected and corrected before the spacecraft was launched.

Without a doubt, in the space industry, the effects of such a mistake are tremendous. But whichever industry you work in, assuming that this problem does not apply to you could turn out to be a costly error. Using incorrect measurements may cause accidents and result in liability lawsuits. At the very least, it will cause a waste of time, effort, and materials as parts will have to be produced again with correctly converted measurements.

Various measurement systems

The world’s leading measurement system today is metric, but this is a relatively new system. For a long time, almost every country or even region had its own way of expressing distance, weight, temperature, etc. Since the meter (the French mètre) was introduced in 1791, there has been a steady increase in the adoption of this system. One of the main reasons may be that fractions are easier to represent in a decimal system than in yards, feet and inches.

With the industrial revolution, the need for more precision in measurements arose, as parts needed to have precise dimensions to fit together in machines, and gauges needed to be set at specific levels to make for a smooth production process. This demanded precise definitions of standard units. Also, new units were defined as science and engineering opened up new domains of knowledge. In many cases, the new units were named after scientists and inventors (such as Watt, Fahrenheit, Curie).

With the second wave of globalization (no longer just moving people and consumer goods around but also machines, machine parts, and production materials – plus, of course, their documentation) came the need for a common measurement language. Attempts were made to unify the various measurement units across the globe. Still, for various reasons (conservatism, heavy investment in existing systems, not-invented-here syndrome), worldwide adoption of a single unifying measurement system was never achieved and it seems unlikely that it ever will be. Today, apart from the metric system (or SI, the "Système international d’unités"), there are two other main measurement systems: the imperial system (used in large parts of what used to be the British Empire), and the U.S. customary system (which redefines several British measurement units while keeping the same names).

For the occasional or even routine international traveler, this is not a huge problem. We can usually convert the units in our heads to get a rough estimate for things such as the local speed limit or the weather forecast to determine which clothes to pack. This can be compared to estimating the cost of products when expressed in a foreign currency: Generally, you do not need the exact value to the second decimal to determine if the product is too expensive for your budget. But in technical documentation, the diversity in measurement units is a real problem to solve, especially with the increased need for content reuse.

 

Unified Code for Units of Measurement

Without precise information about the measurement units in which technical information is expressed – and what the target units should be – it is virtually impossible for a localization expert to carry out a correct conversion. If the original units are defined in the metric system, the source values are clear in most cases. Keep in mind, though, that in some cases the value of a unit differs between the imperial and the U.S. customary systems (for example, a gallon: about 4.55 liters in the imperial system but about 3.79 liters in the U.S. customary system). You have to know which system was used for the original content. Also, you need to know which system to use for the target content.

Although this type of information could theoretically be added as metadata outside of the content itself, it is obviously much better to have this information embedded. This greatly increases content reuse opportunities. Assuming the content is already marked up in one of the available semantic markup languages, adding markup for measurements is really the only good method of allowing controlled localization – whether this is automated or not. The localization expert simply needs correct information about the original and target measurement systems to be able to do the job. Automated localization is virtually impossible without it.

But which markup to use? The markup should allow any unit in any system to be expressed unambiguously. There have been various attempts at defining such a universal system of units, but most of them were only concerned with a subset of the full spectrum. The most comprehensive definition to date is called Unified Code for Units of Measurement – UCUM. UCUM defines a unique identifier for all known measurement units, along with definitions of the main characteristics of each unit (symbols, metric or non-metric, relationship to other units).

Converting the source unit into the target unit

UCUM defines seven (metric) base units: meter, second, gram, radian, kelvin, coulomb and candela. A defined set of prefixes allows scaling these units (e.g. km, cm, mm etc.). All other measurement units are defined in relation to one or more other units, except for so-called arbitrary units (e.g. tuberculin unit – for expressing the biological activity of tuberculin) that cannot be localized at all.

As an example, this is the definition of an imperial foot, as expressed in XML:

To get to the base unit, the definition in the @Unit of <value> can be followed to the definition of imperial inch:

The unit "cm" is defined by the prefix "c" for 1/100 and the base unit "m" for meter, which cannot (and need not) be derived further. To convert a meter value to U.S. customary inches, the conversion factor from the [in_us] unit to meter must be calculated and the meter value divided by that factor. With the metric base units as the common reference, any non-arbitrary unit can be converted into any other non-arbitrary unit that measures the same kind of quantity.
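The lookup chain described here can be sketched in Python. This is a minimal illustration, not the author's XSL: the XML fragment is a simplified stand-in whose layout is an assumption (only the @Unit attribute on <value> is taken from the text), and the names to_base and convert are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: the layout is an assumption modeled on the
# description in the text, not a copy of the UCUM XML file.
UCUM_FRAGMENT = """
<units>
  <prefix Code="k" value="1000"/>
  <prefix Code="c" value="0.01"/>
  <prefix Code="m" value="0.001"/>
  <base-unit Code="m"/>
  <unit Code="[in_i]"><value Unit="cm" value="2.54"/></unit>
  <unit Code="[ft_i]"><value Unit="[in_i]" value="12"/></unit>
  <unit Code="[yd_i]"><value Unit="[ft_i]" value="3"/></unit>
</units>
"""

root = ET.fromstring(UCUM_FRAGMENT)
PREFIXES = {p.get("Code"): float(p.get("value")) for p in root.iter("prefix")}
BASE_UNITS = {b.get("Code") for b in root.iter("base-unit")}
UNITS = {
    u.get("Code"): (float(u.find("value").get("value")), u.find("value").get("Unit"))
    for u in root.iter("unit")
}


def to_base(code):
    """Return (factor, base_code) such that 1 <code> equals factor <base_code>."""
    if code in BASE_UNITS:
        return 1.0, code
    if code in UNITS:
        value, next_code = UNITS[code]
        factor, base = to_base(next_code)  # follow @Unit one level down
        return value * factor, base
    # Otherwise try to read the code as prefix + known unit, e.g. "cm" = c + m.
    for prefix, scale in PREFIXES.items():
        rest = code[len(prefix):]
        if code.startswith(prefix) and (rest in BASE_UNITS or rest in UNITS):
            factor, base = to_base(rest)
            return scale * factor, base
    raise KeyError(f"unknown unit code: {code}")


def convert(value, source, target):
    """Convert a value from source to target via the shared base unit."""
    src_factor, src_base = to_base(source)
    tgt_factor, tgt_base = to_base(target)
    if src_base != tgt_base:
        raise ValueError(f"{source} and {target} do not share a base unit")
    return value * src_factor / tgt_factor


print(to_base("[ft_i]"))               # factor to meters, roughly 0.3048
print(convert(2, "[yd_i]", "[ft_i]"))  # roughly 6
```

Dividing by the target unit's factor is the step the paragraph describes: both units are first expressed in meters, and the meter value is then divided by the target unit's size in meters.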

A more challenging conversion would take a “psi” (pounds per square inch) value and express it in base units, and subsequently in “bar” (all of the below values and formulas are taken from the UCUM definitions in the XML file):

[psi] = [lbf_av] / [in_i]²

[lbf_av] = [lb_av] × [g]

[lb_av] = 7000 [gr] = 7000 × 64.79891 mg = 453.59237 g

[g] = 9.80665 m/s²

[in_i]² = (2.54 cm)² = 6.4516 cm²

[psi] = (453.59237 × 9.80665 / 6.4516) g·m/(s²·cm²) = 689.4757 g·m/(s²·cm²) = 6894757.3 g/(m·s²) = 0.0689475729 bar
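The arithmetic of this derivation can be checked with a few lines of Python. The constants are the ones quoted above; working in kilograms and meters, with the pascal as an intermediate step, is a convenience of this sketch rather than something the UCUM file prescribes.

```python
# Values as quoted in the derivation above.
GRAIN_IN_MG = 64.79891        # [gr] in milligrams
GRAINS_PER_POUND = 7000       # [lb_av] in grains
STANDARD_GRAVITY = 9.80665    # [g] in m/s^2
INCH_IN_CM = 2.54             # [in_i] in centimeters

pound_in_kg = GRAINS_PER_POUND * GRAIN_IN_MG * 1e-6      # mg -> kg
pound_force_in_newton = pound_in_kg * STANDARD_GRAVITY   # kg*m/s^2
square_inch_in_m2 = (INCH_IN_CM * 1e-2) ** 2             # cm -> m, then squared

psi_in_pascal = pound_force_in_newton / square_inch_in_m2
psi_in_bar = psi_in_pascal / 1e5                         # 1 bar = 100000 Pa

print(f"{psi_in_pascal:.3f} Pa")    # about 6894.757 Pa
print(f"{psi_in_bar:.10f} bar")     # about 0.0689475729 bar
```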

 

Making it work in technical content

Having worked with DITA from the start – DITA is the only semantic markup language that allows adding my own semantics without breaking any of the tools – I chose this as the basis for this article. Of course, adding similar semantics to DocBook or other XML markups is possible too, but it would require several years of hard work in the standards committees to make it become part of the next version of the standard.

My new element <measurement>, specialized from <ph>, allows identifying a number as a measurement with a specific unit – given in the @ucum attribute. This allows marking up dimensions as follows:

Via CSS (for online content) or rules in the transformation to PDF, I can add the symbol for the measurement as defined by the UCUM XML file, so that there cannot be a mistake in the visible result. This also removes the need to find and replace a text string representing the measurement unit. In the markup, the symbol for the measurement unit is implied.

Based on the markup and the UCUM XML file, the localization is done by an XSL transform, which follows the logic explained above: Each non-base unit is converted to the next level using the <value> in the UCUM XML file. Once a base unit is reached, scaling is done to determine the required prefix. This process converts the above markup into this line:
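In Python rather than XSL, the same two steps (reduce to the base unit, then pick a prefix) might look like the sketch below. The sample sentence, the conversion table, the prefix preference and the function names are all assumptions made for illustration; only the <measurement> element and its @ucum attribute come from the article.

```python
import xml.etree.ElementTree as ET

# Assumed factors to the base unit meter (1 [in_i] = 2.54 cm, 1 [ft_i] = 12 [in_i]).
TO_METER = {"[in_i]": 0.0254, "[ft_i]": 0.3048}
# Assumed prefix preference for lengths, largest first.
LENGTH_PREFIXES = [("k", 1000.0), ("", 1.0), ("c", 0.01), ("m", 0.001)]


def scale_to_prefix(meters):
    """Pick the largest listed prefix that keeps the number at 1 or above."""
    for prefix, scale in LENGTH_PREFIXES:
        if abs(meters) >= scale:
            return meters / scale, prefix + "m"
    return meters / 0.001, "mm"  # below 1 mm: stay in millimeters


def localize_to_metric(element):
    """Rewrite every non-metric <measurement> under element in place."""
    for m in element.iter("measurement"):
        factor = TO_METER.get(m.get("ucum"))
        if factor is None:
            continue  # already metric or unknown: leave untouched
        value, unit = scale_to_prefix(float(m.text) * factor)
        m.text = f"{round(value, 6):g}"  # at most six significant digits
        m.set("ucum", unit)


topic = ET.fromstring(
    '<p>The shaft is <measurement ucum="[ft_i]">6</measurement> long.</p>'
)
localize_to_metric(topic)
print(ET.tostring(topic, encoding="unicode"))
```

With these assumptions, the 6 ft value ends up as 1.8288 with @ucum set to "m", while a 12 in value would be scaled to 30.48 with @ucum set to "cm".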


As some measurements are non-metric and are supposed to be represented in units and sub-units, a single <measurement> will not always be sufficient. Still, when converting to another measurement system, multiple measurements may turn into a single one, or vice versa. To allow this, the definition for <measurement> allows nested <measurement> elements, each with its own @ucum:

The above markup converted to metric would give this result:

An obvious addition is an attribute to control the rounding precision of the outcome, as we want the automatic localization to work without the need for further manual corrections. The @round attribute only really makes sense when it occurs once in a <measurement>, i.e. it would not be repeated on sub-measurements:

This becomes the following in metric:

Converting an imperial length into a U.S. customary length takes two steps: first, the imperial <measurement> is converted into metric and, second, the metric value is converted into U.S. customary length units. In this case, the @round attribute controls the decimals on the smallest length unit. It takes (much) more XSL code but it is certainly doable. The same applies to complex units (as in the above example of converting psi into bar).
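The nesting and rounding behavior described in this section can be sketched in Python as well. Two points are assumptions of this sketch: that @round holds the number of decimals, and that a missing @round falls back to two decimals. The function name nested_to_metric is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed factors to meters for the units used in this sketch.
TO_METER = {"[ft_i]": 0.3048, "[in_i]": 0.0254}


def nested_to_metric(measurement):
    """Sum the nested <measurement> parts and return one metric element.

    Assumption: @round on the outer element gives the number of decimals.
    """
    meters = sum(
        float(part.text) * TO_METER[part.get("ucum")]
        for part in measurement.findall("measurement")
    )
    decimals = int(measurement.get("round", "2"))
    result = ET.Element("measurement", {"ucum": "m"})
    result.text = f"{meters:.{decimals}f}"
    return result


source = ET.fromstring(
    '<measurement round="1">'
    '<measurement ucum="[ft_i]">5</measurement>'
    '<measurement ucum="[in_i]">7</measurement>'
    "</measurement>"
)
print(ET.tostring(nested_to_metric(source), encoding="unicode"))
```

Here 5 ft 7 in adds up to 1.7018 m, which the single @round on the outer element reduces to 1.7.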

 

Closing remarks

This article is the result of some pioneering work I have been doing in my scarce spare time, which resulted in a presentation at the DITA Europe conference in Brussels in November 2019. It is by no means a complete and final product, and there are obvious omissions in the code as it exists today. In the following section, I want to outline some of the things I am still working on.

Metric to non-metric conversion

So far, only non-metric to metric conversion for simple units is automated. However, complex units are not too hard to add. Conversion into non-metric units is another matter: in such cases, a single metric <measurement> may have to be separated into multiple nested <measurement> components, each with its own @ucum. In some cases, the smallest child unit is not supposed to have a decimal point but must be expressed as a fraction such as 5/8 or 13/16. The challenge is not only making the code produce these fractions but also defining the markup to control the process: allowing the author to define whether decimals or fractions are to be used on resulting non-metric units.
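To give an impression of the fraction handling, here is a small Python sketch that splits a metric length into feet, whole inches and a fraction of an inch. It is one possible approach, not the planned implementation: the function names and the default of sixteenths are assumptions, and negative lengths are not handled.

```python
from fractions import Fraction

INCH_IN_METERS = 0.0254  # 1 [in_i] = 2.54 cm


def meters_to_feet_inches(meters, denominator=16):
    """Split a metric length into feet, whole inches and a fraction of an inch."""
    # Count whole steps of 1/denominator inch, so that rounding carries
    # correctly from the fraction into inches and from inches into feet.
    steps = round(meters / INCH_IN_METERS * denominator)
    feet, steps = divmod(steps, 12 * denominator)
    inches, steps = divmod(steps, denominator)
    return feet, inches, Fraction(steps, denominator)


def format_length(feet, inches, fraction):
    """Render the parts as text, omitting a zero fraction."""
    inch_text = str(inches) if fraction == 0 else f"{inches} {fraction}"
    return f"{feet} ft {inch_text} in"


print(format_length(*meters_to_feet_inches(1.70)))
```

Fraction reduces automatically, so 8/16 is shown as 1/2. Whether the author wants decimals or fractions would still have to come from the markup, as described above.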

Suppressing localization

Having worked in the machinery business for more than a decade, I know that some units, e.g. sizes of nuts and bolts, should never be localized. It seems a shame to keep those measurements from being marked up (as this markup might serve other purposes as well), so there is a need for an attribute that controls whether or not a value should be localized.

This is equivalent to the @translate attribute in DITA, which allows excluding part of the content from the translation process (e.g. when mentioning text that appears untranslated on the screen).
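How such a switch could behave is sketched below in Python. The attribute name @localize and its value "no" are purely hypothetical placeholders, since the article does not define the attribute; the sample sentence is invented as well.

```python
import xml.etree.ElementTree as ET

INCH_IN_CM = 2.54  # 1 [in_i] = 2.54 cm


def localize_inches(element):
    """Convert inch measurements to centimeters unless localization is suppressed."""
    for m in element.iter("measurement"):
        if m.get("localize") == "no":  # hypothetical attribute, see lead-in
            continue
        if m.get("ucum") == "[in_i]":
            m.text = f"{float(m.text) * INCH_IN_CM:g}"
            m.set("ucum", "cm")


step = ET.fromstring(
    '<p>Fit the <measurement ucum="[in_i]" localize="no">0.5</measurement> bolt '
    'into the <measurement ucum="[in_i]">4</measurement> bracket.</p>'
)
localize_inches(step)
print(ET.tostring(step, encoding="unicode"))
```

The bolt size keeps its markup and its original value, while the bracket dimension is converted.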

Further specialization

Specialization of <measurement> to <length>, <pressure>, <weight>, etc. adds more specific semantics to the content. It can also help to run sanity checks on the @ucum values that are being used. And with reused content, the extra semantics may help to avoid conrefs to incorrect measurements. Once the measurement domain is put to use, experience with the markup will quickly show which further specializations make sense. After all, DITA is supposed to adapt to business domains. Evolution is based on natural selection, which in turn is based on introduction of new features and putting them to the test.
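A sanity check of this kind could be as simple as the following Python sketch. The two lookup tables are assumptions made for illustration (in practice, the kind of quantity for each unit would be read from the unit definitions), and check_units is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumed lookup from unit code to the kind of quantity it measures.
UNIT_PROPERTY = {
    "[ft_i]": "length", "[in_i]": "length", "cm": "length",
    "[psi]": "pressure", "bar": "pressure",
    "[lb_av]": "mass", "g": "mass",
}
# Assumed mapping from specialized element name to the expected kind.
ELEMENT_PROPERTY = {"length": "length", "pressure": "pressure", "weight": "mass"}


def check_units(element):
    """Return messages for specialized elements whose @ucum does not fit."""
    problems = []
    for child in element.iter():
        expected = ELEMENT_PROPERTY.get(child.tag)
        if expected is None:
            continue  # not a specialized measurement element
        ucum = child.get("ucum")
        if UNIT_PROPERTY.get(ucum) != expected:
            problems.append(f"<{child.tag}> uses {ucum!r}, expected a {expected} unit")
    return problems


sample = ET.fromstring(
    '<p>A <length ucum="[ft_i]">6</length> hose rated at '
    '<pressure ucum="[in_i]">30</pressure>.</p>'
)
for message in check_units(sample):
    print(message)
```

In this sample only the <pressure> element is reported, because its @ucum names a length unit.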