December 2012
By Claudia Oberle and Wolfgang Ziegler

Image: © Buchachon Petthanya/123rf.com

Claudia Oberle is a visiting faculty member at the University of Karlsruhe and a doctoral candidate at the University of Mainz. She completed her bachelor’s degree in “Translation Studies for Information Technologies” at the University of Heidelberg and the University of Mannheim, and her master’s degree in Technical Documentation in Karlsruhe. She has also been working for icms GmbH since 2010.


claudia.oberle[at]hs-karlsruhe.de




Prof. Dr. Wolfgang Ziegler is a graduate physicist and has been professor for Information and Content Management at the University of Karlsruhe since 2003. He heads the master’s course in Communication and Media Management. Companies interested in the REx evaluations can contact him directly.


wolfgang.ziegler[at]hs-karlsruhe.de
www.hs-karlsruhe.de

Content Intelligence for Content Management Systems

How do we use our content management system? Do we work efficiently and is the reuse as expected? Up until now it has been difficult to find answers to such questions, and wishes for quantitative indicators for content management applications have remained unfulfilled. This could change with the Report Exchange Format.

Content management systems (CMS) are standard tools for document creation in many companies. Complex creation processes such as multilingual information management and the collection of information across diverse locations would be very difficult to manage without system support.

The methodological basis for using a CMS is modularization and the systematic, i.e. controlled, reuse of content. Ideally, a module concept, which can be represented with modularization matrices, for example, is developed as part of “content engineering” when planning the introduction of a system and the corresponding documentation processes [1]. In principle, these preliminary studies make it possible to estimate the potential for reusing modular content and thus to develop basic arguments for the efficiency and benefits of using a system [2].

However, the methodological and quantitative efficiency of a system in use is seldom reported afterwards. (System) indicators are hardly ever collected, either before or after the system is implemented, because suitable internal data that could be analyzed is lacking. The exceptions are quantifiable external costs such as translation expenses or other services used to create and publish content.

Definitions of key indicators for documentation processes are available [3], but until now it has not been easy to apply them in content management systems. As part of their process support, the systems do offer functions such as module (re)use reports and controlled change management. However, reports on module usage frequency, or project and routine statistics, for example, are collected only in individual cases.

From process management to content intelligence

What has been missing so far is a standardized way to make the real efficiency and working methods of a content management system transparent: a “monitoring” function. For this purpose, stronger technological support from the system would be helpful in the following phases:

  • Pre-documentation (“Plan”): Planning module-based documentation projects for working groups and the distributed creation of content
  • Documentation process (“Do”): Creation of documentation with project monitoring and management; active (system) support of the modular reuse and content creation
  • Post-documentation (“Check”): Indicators and usage statistics for monitoring efficiency and classification.

From the perspective of quality and process management, this corresponds to the requirements of the classic PDCA cycle (“Plan-Do-Check-Act”) according to Deming [4]. The post-documentation phase (“Check”) should be followed by the optimization (“Act”) of the content management processes and methodologies (see figure 1). This article primarily looks at the post-documentation phase and at obtaining CMS indicators, without which optimization would hardly be possible.

Figure 1: PDCA cycle as per Deming applied to the documentation phases; optimization with the help of “content intelligence” methods.

 

From a scientific viewpoint, too, it is desirable to introduce a standardized definition of indicators in the sense of quantitative metrics for CMS. Systematic analysis of the greatest possible number of implementations should yield statistical insights into long-term CMS usage, e.g. for different sectors and organizational sizes. The Report Exchange (REx) mechanism presented in the following sections was introduced to make this possible. It should also provide evidence for empirical statements that have not yet been verified statistically, e.g. on reuse in relation to module size or on the possibilities of fine-grained variant management at module level [1].

The starting point for the CMS metrics is a set of simple reuse and system indicators that convey a static view of the CMS at a fixed point in time (post-documentation) [5]. The analysis of these indicators has already been implemented technically. As an extension, dynamic indicators are to be added as planning indicators (pre-documentation) and as process and project management indicators (documentation process). Conceptually, the technical (system) analyses are also to be combined with semantic or linguistic methods such as those already used in controlled-language checkers and authoring memory functions. The entirety of the linguistic and technical analysis and the reporting process can therefore be called a “content intelligence” method, analogous, for example, to business intelligence for commercial company data.

CMS data in the REx format

At present, the technological basis for collecting indicators is an XML export file that contains a set of basic CMS data (figure 2). In particular, this data describes the characteristics of CMS objects (documents, modules and media) and shows which sub-objects each object references. In addition, information about language characteristics, versions, object sizes (e.g. word count) or time stamps can also be provided. The format structure of such a Report Exchange (REx) file was defined as an XML schema [6, 7] and made publicly available [8]. A number of CMS providers have already implemented a REx interface, and their customers use a “RExport” (REx export) for the initial collection of indicators.
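To illustrate the idea of deriving indicators from exported reference data, the following minimal Python sketch counts in how many documents each module is referenced. The element and attribute names (export, object, ref, type, words, target) are invented for this illustration and are not taken from the published REx schema; the actual analysis described in this article uses XSL scripts.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical export: all element and attribute names are assumptions
# made for this sketch, NOT the names used in the REx schema.
SAMPLE = """
<export>
  <object id="d1" type="document"><ref target="m1"/><ref target="m2"/></object>
  <object id="d2" type="document"><ref target="m1"/></object>
  <object id="m1" type="module" words="120"/>
  <object id="m2" type="module" words="40"/>
  <object id="m3" type="module" words="300"/>
</export>
"""


def module_usage(xml_text):
    """Return {module id: number of documents that reference it}."""
    root = ET.fromstring(xml_text)
    # Start every module at zero so unused modules stay visible.
    usage = Counter(
        {o.get("id"): 0 for o in root.iter("object") if o.get("type") == "module"}
    )
    for obj in root.iter("object"):
        if obj.get("type") != "document":
            continue
        # A set makes a module count once per document.
        for target in {r.get("target") for r in obj.iter("ref")}:
            usage[target] += 1
    return dict(usage)


usage = module_usage(SAMPLE)
print(usage)
```

In this toy data, module m1 is referenced by two documents, m2 by one, and m3 by none.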

Figure 2: Logic for exporting CMS data in the REx format (level 1) and analysis (level 2) with visualization in a REx report.

 

The actual statistical and graphical analysis is carried out outside the CMS with the help of XSL scripts. The scripts originated as part of a master’s thesis at the University of Karlsruhe [7]. The scope of the REx definition and the analysis options are being developed continuously in projects. For instance, as an extension in REx version 1.2, which is currently under development, all other existing object metadata can now be analyzed as well.

As a result, system users get an overview of absolute and average system figures, e.g. the number of modules and documents, the level of reuse of documents or the number of module usages. In addition, a graphical distribution of the averages and other figures is created for more precise analysis. The complete range of analyzed parameters, i.e. the CMS metrics, is defined in the REx specification [8].

Initial projects with cooperation partners (CMS providers and users) are already delivering interesting results. These projects work with anonymous exports or even anonymous analyses. This is important, as no personal data should be recorded or analyzed in REx. Only data contained in the CMS, such as records of the use of objects, is statistically aggregated and visualized. Under certain circumstances, this is an important point in discussions with the works council.

Typical results

As already described, REx users obtain, among other things, an overview of some basic system indicators. Figure 3 shows a typical analysis result for a real CMS with a total of 492 created documents and an average reuse rate of 99 percent. According to the data available so far, such high rates between 80 and 100 percent are not unusual for CMS applications in mechanical and plant engineering. Almost all documents then consist of modules that are also used in other documents.

The system we looked at has a total of 66,771 modular reuses (11 million reused words) compared with a relatively low number of 1,914 original modules used (out of a total of 2,275 modules and 353,000 words in the system). The efficiency gain of an implemented system quickly becomes clear here: without a CMS, all of this controlled, reused content would have been produced by “uncontrolled copy & paste”, including a lack of translation control. On average, the original modules in the example were used in 30 documents and consisted of 155 words.
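As a rough plausibility check, the reported totals can simply be divided by one another. The short Python sketch below does this; the variable names are chosen for this illustration, the word totals are rounded in the text, and the article does not state which denominator underlies its averages, so the results only indicate the order of magnitude.

```python
# Totals as reported in the text (word counts are rounded there).
total_modules = 2_275
original_modules_used = 1_914
modular_reuses = 66_771
words_in_system = 353_000
reused_words = 11_000_000

# Plain ratios of the totals; which of these correspond to the averages
# quoted in the article is an assumption, not something the source states.
words_per_module = words_in_system / total_modules
reuses_per_module = modular_reuses / total_modules
reuses_per_original_module = modular_reuses / original_modules_used
words_per_reuse = reused_words / modular_reuses

print(f"words per module:           {words_per_module:.1f}")
print(f"reuses per module:          {reuses_per_module:.1f}")
print(f"reuses per original module: {reuses_per_original_module:.1f}")
print(f"words per reuse:            {words_per_reuse:.1f}")
```

The words-per-module ratio rounds to the 155 words quoted above, and the reuse ratios fall between about 29 and 35 depending on the denominator, which is in the same range as the quoted 30 documents per module.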

A number of other indicators can be analyzed in detail, for example the distribution of the reuse of all existing modules (figure 5) or the distribution of module sizes (figure 4). Similar analyses are possible for the media used and for the various document-specific figures.

Figure 4: Distribution of module sizes W in counted words; the height of the bars indicates the number N of modules of a given size: concentration of smaller modules and a sharp decline in the frequency of larger modules.

 

Figure 5: Distribution of usage numbers Z of modules in the CMS; the height of the bars indicates the number N of modules for a given reuse number: concentration of modules that are used fewer than 30 times; the distribution extends up to 272 reuses.

 

A typical quantitative distribution of module sizes could be identified from the data of the first cooperation projects. The trend towards small modules identified in earlier surveys [2] was also confirmed in practice by the REx data available so far (figure 4). In addition, the reuse counts of the individual modules were widely spread. Broadly speaking: many modules were used rarely, some were used frequently (figure 5).

If the two quantities are correlated, we obtain a representation like the one in figure 6. In the analyzed system, we find many small modules that are used in only a few documents (large circles and clustering in the lower left corner). The reason could be complex variant management at a fine-grained module level. Large modules are usually used less frequently. Specific, relatively small modules are used most often; these include standardized safety instructions or other generally applicable sections, for instance. Individual objects can also be analyzed further by combining the REx data with tracking in the CMS.

Figure 6: Correlation of module size W (horizontal) and usage number N of modules in documents (vertical); the larger the circle, the more modules share the same size and usage number.
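The aggregation behind such a bubble chart can be sketched in a few lines of Python: each module contributes one (size, usage number) pair, and the circle size is the number of modules sharing that pair. The sample pairs below are invented for illustration and are not taken from the analyzed system.

```python
from collections import Counter

# Hypothetical (size in words, usage number) pairs, one per module.
modules = [(40, 2), (40, 2), (40, 2), (120, 35), (120, 35), (300, 1)]


def bubble_sizes(pairs):
    """Count how many modules share each (size, usage number) pair."""
    return Counter(pairs)


bubbles = bubble_sizes(modules)
for (size, usage), count in sorted(bubbles.items()):
    print(f"size={size:>4} words, used {usage:>3}x: {count} module(s)")
```

Each key of the resulting counter is one circle position, and its value determines the circle size.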

 

Thus, the present data appears to quantitatively support known statements that the reuse of modules decreases with increasing module size [1, p. 313]. A broader data basis should enable further statistical studies and conclusions about CMS usage in the future. However, the analysis also shows that some of the data is difficult to analyze or that its interpretation is not clear-cut due to the following factors:

  • different system philosophies and special technical features of the individual CMS (meaning of “versions”; limited export options for object information such as object size)
  • the individual customizations of the systems as per respective customer requirements
  • individual process design and system usage in the company (versioning is not implemented or cannot be tracked)
  • the use of variant management (some of the document variants created are not considered in the RExport)

For standardized and comparative analysis (benchmarks), these factors must be taken into account precisely in the future and, if required, specifically implemented in the RExports or the REx analysis.

Outlook

To summarize, the use of the presented CMS metrics pursues two objectives:

  • To provide tools for process management to optimize CMS use
  • To gain scientific and methodical insight into the use of CMS and to formulate sector-specific best practices and benchmarks for CMS

The technology used in RExports is relatively simple to implement as an interface and, along with a number of linguistic tools, can be seen as a way to introduce the “content intelligence” process in CMS. The initial assessments show interesting options for analysis with a focus on the efficiency of modular reuse and many other system indicators.

Upcoming extensions of the methodology include the analysis of dynamic, i.e. time-related and project-related, data. Furthermore, all object information should be considered, for instance to determine the use of classification systems (metadata) or of semantics (information structures). Technically, the REx methods and the metric analysis could be established in the future as an application that is implemented as an integrated or add-on component for CMS or as an external service.

In the long term, “content intelligence” methods could contribute to (more) active system support for easier reuse of modules. A comparable example at the language level is language control and authoring memory, which check content during the creation process and make active suggestions. In this sense, developments towards “controlled reuse” (checking the quality and efficiency of reuse) and a “reuse memory” (suggestions for reuse) would be a parallel approach.

Interested companies can already determine relevant indicators of the CMS implementation with the REx method and quantify the technical level of efficiency after the implementation of the system. This creates the foundation for a wider assessment and research related to CMS implementations.

References

[1]    Drewer, Petra; Ziegler, Wolfgang (2011): Technische Dokumentation – Eine Einführung in die übersetzungsgerechte Texterstellung und in das Content-Management. Vogel Verlag.

[2]    Straub, Daniela; Ziegler, Wolfgang (2008): Effizientes Informationsmanagement durch spezielle Content-Management-Systeme. 2. Auflage, tekom, Stuttgart.

[3]    Straub, Daniela; Grau, Michael; Fritz, Michael (2008): 101 Kennzahlen für die Technische Kommunikation. tekom, Stuttgart.

[4]    Deming, W. Edwards (2000): Out of the Crisis, MIT Press.

[5]    Ziegler, Wolfgang (2008): Metrische Untersuchung der Wiederverwendung im Content Management, Hochschule Karlsruhe.

[6]    Knopf, Dominik (2009): Report Exchange Format “Rex” – Metrische Untersuchung der Wiederverwendung im Content Management. Master’s thesis, Hochschule Karlsruhe.

[7]    Oberle, Claudia (2010): Indicators für das Content Management – Auswertung und Visualisierung von Daten im Report Exchange Format (REx-Format). Master’s thesis, Hochschule Karlsruhe.

[8]     Further information