Algorithms in Structured Writing: Processing Structured Text

This article is part 10 of 12 in the series Understanding and Mastering Structured Writing

boxes-of-sand-eggs-goldStructured writing involves separating content from formatting, as we move content out of the media domain. But to publish content, we have to get it back into the media domain by reuniting content and formatting. To do this, we process the structured text using algorithms.

One of the uses of structured writing is to publish in different combinations of content to different media. To do this we may need different algorithms to create different combinations of content and target different media. Understanding the basics of these algorithms is important to mastering structured writing even if you don’t intend to code the algorithms yourself.

Two into one: reversing the factoring out of invariants

Moving content from the media domain to the document domain and the subject domain involves progressively factoring out invariants in the content. Each step in this process creates two artifacts: the structured content, and the invariant piece that was factored out.

Processing structured text is about putting the pieces back together to produce the desired output; combining the structured content with the invariants that were factored out. If factoring out the invariants moves content toward the document or subject domains, combining the content with the invariants moves it in the opposite direction, toward the media domain. This could mean moving the content from the subject domain to the document domain, or from the document domain to the media domain, or simply from a more abstract form in the media domain to a more concrete form (which will be our first example).

Adding back style information

The first example of separating content from formatting that we looked at involved factoring out the style information from this structure:

{font: 10pt "Open Sans"}The box contains:
{font: 10pt "Open Sans"}[bullet][tab]Sand
{font: 10pt "Open Sans"}[bullet][tab]Eggs
{font: 10pt "Open Sans"}[bullet][tab]Gold

We replaced the style information with style names:

{style: paragraph}The box contains:
{style: bullet-paragraph}Sand
{style: bullet-paragraph}Eggs
{style: bullet-paragraph}Gold

And then we defined the styles:

paragraph = {font: 10pt "Open Sans"}
bullet-paragraph = {font: 10pt "Open Sans"}[bullet][tab]

To unite the styles with the appropriate paragraphs, we write a set of rules that describes how they are combined. We can start with simple search and replace rules:

find {style: paragraph}
    replace {font: 10pt "Open Sans"}
find {style: bullet-paragraph}
    replace {font: 10pt "Open Sans"}[bullet][tab]

I said that the basic processing algorithm was to combine two sources of information to create a new one. Where are these two sources? The first source is the structured text. The second source is the style definitions. In the example above, the style definitions are embedded in the rules themselves. Embedding one of the two sources to be combined in the rules themselves is common practice. In some cases, though, the rules may pull content from a third source. We will see cases of this later.

The result of applying these rules is that we get back the original content:

{font: 10pt "Open Sans"}The box contains:
{font: 10pt "Open Sans"}[bullet][tab]Sand
{font: 10pt "Open Sans"}[bullet][tab]Eggs
{font: 10pt "Open Sans"}[bullet][tab]Gold

If we want to change the styles, we can apply a different set of rules:

find {style: paragraph}
    replace {font: 12pt "Century Schoolbook"}
find {style: bullet-paragraph}
    replace {font: 12pt "Century Schoolbook"}[em dash][tab]

Applying these rules will result in a change in the formatting of the original content:

{font: 12pt "Century Schoolbook"}The box contains:
{font: 12pt "Century Schoolbook"}[em dash][tab]Sand
{font: 12pt "Century Schoolbook"}[em dash][tab]Eggs
{font: 12pt "Century Schoolbook"}[em dash][tab]Gold

Rules based on structures

The tools that do this sort of processing do not literally use search and replace. Rather, they parse the source document to pull out the structures and allow you to specify your processing rules by referring to the structures.

We are not concerned at this level with the actual mechanism by which a processing tool recognizes structures. We are concerned with what to do when a structure is found. So let’s rewrite our rules to match structures rather than find literal strings in the text:

match paragraph
    apply style {font: 12pt "Century Schoolbook"}
match bullet-paragraph
    apply style {font: 12pt "Century Schoolbook"}
    output "[em dash][tab]"

carton-of-eggsI have written these rules in what is called pseudocode. Pseudocode is a way for human beings to sketch out an algorithm to make sure that they understand what they are trying to do before they write actual code. Pseudocode has no formal grammar or syntax. It is intended for humans, not computers, so you can use whatever approach you like as long as it is clear to your intended audience. But pseudocode should clearly lay out a set of logical steps for accomplishing something. For structured writing algorithms, it should make clear exactly how the pieces go together.

Writing algorithms in pseudocode is a great way to make sure that we understand the algorithms we are creating without worrying about the details of code — or even learning how to code. They are also a great way to communicate to actual coders what we need an algorithm to do.

The result of applying these rules is just the same as before:

{font: 12pt "Century Schoolbook"}The box contains:
{font: 12pt "Century Schoolbook"}[em dash][tab]Sand
{font: 12pt "Century Schoolbook"}[em dash][tab]Eggs
{font: 12pt "Century Schoolbook"}[em dash][tab]Gold

The only real difference is that we have factored out the details of how the structures are recognized. (Yes, code and pseudocode are examples of structured writing at work!)

The order of the rules does not matter

You may have noticed that what these rules are doing is pretty much what style sheets do in an application like Word or FrameMaker. In fact, it is exactly what a style sheet does. If you understand style sheets, you understand a good deal of how structured writing algorithms work.

Note that when you create a style sheet in Word or FrameMaker, you don’t specify the order in which styles will be applied to the document. The same is true when you create a CSS stylesheet for the Web. The style sheet is just a flat list of rules. The order in which the rules are applied to the document depends entirely on the order in which the various text structures occur in the document.

This may seem very obvious, but it is key to understanding how structured text is usually processed. It is a subject that is sometimes quite confusing to people who have been trained to write procedural computer programs, which is why I am making a point of calling it out.

Things get a tiny bit more complex when we move into processing the nested structures of the document domain and subject domain, but the basic pattern of a set of unordered rules to describe a transformation algorithm still applies.

Applying rules in the document domain

Suppose we have a piece of document domain structured text that contains this title structure:

title: Moby Dick

We want to transform this document into HTML (which, as we discussed before, kind of sits on the boundary between the media domain and the document domain). When our rule matches a structure in the source document, it outputs the equivalent HTML structure. Here is the pseudocode for this rule (it is in a slightly different format from the pseudocode above):

match title
    create h1
         continue

This says, when you see a title structure in the source, create an h1 structure in the output, and then continue applying rules to the content of the title structure.

The continue instruction is indented under the create h1 instruction to indicate that the results of continuing will appear inside the h1 structure.

In our pseudocode, we are assuming that the text content of each structure will be output automatically (as is the case in many tools), so the output of this rule (expressed in HTML) is:

<h1>Moby Dick</h1>

But suppose that there is another structure inside the title in our source. In this case it is an annotation of part of the title text:

title: Review of {Rio Bravo}(movie)

Here the annotated text is set off with curly braces and the annotation itself in in parentheses immediately after it. So the annotation says that the words “Rio Bravo” refer to a movie. (I really will explain this markup eventually.) The annotation is a content structure just like the title structure, and is nested inside the text of the title.

So what do we do with our rule for processing titles to make it deal with movie annotations embedded in the title text? Absolutely nothing. Instead, we write a separate rule for handling movie annotations no matter where they occur:

match movie
    create i
        continue

When the processor hits continue in the title rule, it processes the content of the title structure. In doing so, it encounters the movie structure and executes the movie rule. The result is output that looks like this:

<h1>Review of <i>Rio Bravo</i></h1>

The continue instruction is really all we need to add to our rules to allow them to deal with nested structures. They remain an unordered collection of rules, just like a style sheet. (In fact, XSLT, a language that implements this model, calls a set of processing rules a “stylesheet”.)

Processing based on context

When we move to the document domain, we can use context to reduce the number of structures that we need. For example, where HTML has six different heading structures, h1 through h6), DocBook has only one—title—which can occur in many different contexts. So how do we apply the right formatting to a title based on its context? We create different rules for the title structure in each of its contexts. We express the context by listing the parent structure names separated by slashes:

match book/title
    create h1
         continue
match chapter/title
    create h2
        continue
match section/title
    create h3
         continue
match figure/title
    create h4
        continue

Now here is the clever bit. You don’t have to change the movie rule to work with any of these versions of the title rule. Suppose our title is the title of a section, like this:

section:
    title: Review of {Rio Bravo}(movie)</title>

When we process this with our rules, the section/title rule will be executed to deal with the title structure, and the movie rule will be executed when the movie structure occurs in the course of processing the content of the title structure, with the following result:

<h3>Review of <i>Rio Bravo</i></h3>

This is the basic pattern for most structured writing algorithms. An algorithm consists of a set of rules.

  • For each structure, you create a rule that says how to transform that structure into the structure you want.
  • Each rule specifies the new structures to create and where to place the content and any nested structures.
  • In each rule, you specify where the processor should process any nested structures and apply any rules that apply to them.
  • If you want a different rule for a structure occurring in different contexts, write a separate rule for each context.

Why is it important to understand this? Because when you are going through the process of abstracting out invariants to move content to the document domain or subject domain, it is really useful to understand how those invariants will be factored back in. In fact, understanding how this works can help you recognize invariants in your source and give you the confidence to factor them out. Writing down the pseudocode for processing the structure you are creating can help you validate that you have factored things out correctly and that the structures you are creating will be easy to process and that the processing rules will be clear, consistent, and reliable.

Obviously there is far more involved in a complete processing system, and we are not going to get into the gritty details here, but let’s look at few a cases that come up frequently.

Processing container structures

When we move content to the document domain or the subject domain, we often create container structures to provide context. These container structures don’t have any analog in the media domain, so what do we do with them when it is time to publish? We obviously use them to provide context for the rest of our processing rules, but what do we do with the containers themselves?

In the previous example the content was contained in a section structure. So how does the section structure get processed?

match section
    continue

Yes, it’s that simple. Just don’t output any new structure in its place. The section container has done its work at this point so we simply discard it. We still want the stuff inside it though, so we use the continue instruction to make sure the contents get processed. In short, the container element is a box. We unpack the contents and discard the box.

Restoring factored-out text

Sometimes when we factor out the invariants in content, we are not only factoring out styles, we are also factoring out text. To process the content we need to put the text back (obviously we can put back different text depending on our needs — which was why we factored it out in the first place).

As we saw, a simple example of factoring out text is numbered and bulleted lists, where we factor out the text of the numbers and bullets. Let’s look at how we create rules to put them back.

Suppose we have a document that contain these two different kinds of lists:

paragraph: To wash hair:
ordered-list:
    list-item:Lather
    list-item:Rinse
    list-item:Repeat
paragraph: The box contains:
unordered-list:
    list-item:Sand
    list-item:Eggs
    list-item:Gold

Let’s write a set of rules to deal with this document. Converting this to HTML lists won’t tell us much, since HTML handles list numbering and bullets itself, so we’ll create instructions for printing on paper. We won’t use real printing instructions (they get tediously detailed). Instead we will use the same style specification shorthand we used above. The paragraph rule is simple enough:

match paragraph
    apply style {font: 10pt "Century Schoolbook"}
    continue

Now let’s deal with the ordered-list. The ordered list structure is just a container, so we don’t need to create an output structure for it. But because this is an ordered list, we need to start a count to number the items in the list. That means we need a variable to store the current count. We will use a `$` prefix to indicate that we are creating a variable:

match ordered-list
    $count=1
    continue

Then the rule for each ordered-list item will output the value of the variable and increment it by one:

match ordered-list/list-item
    apply style {font: 12pt "Century Schoolbook"}
    output $count
    output ".[tab]"
    $count=$count+1
    continue

Every time the ordered-list/list-item rule is fired, the count will increase by one, resulting in the list items being numbered sequentially.

If a new numbered list in encountered, the ordered-list rule will be fired, resetting the count to 1.

This rule will not match list-item elements that are children of an unordered-list element, so we need a separate set of rules of unordered lists. Because unordered-list is just a container and does not produce any formatted output, its rule is just a continue, as we saw with section above.

match unordered-list
    continue

match ordered-list/list-item
    apply style {font: 12pt "Century Schoolbook"}
    output "[em dash][tab]"
    continue

Applying these rules will produce output like this:

{font: 10pt "Century Schoolbook"}To wash hair:
{font: 10pt "Century Schoolbook"}1.[tab]Lather
{font: 10pt "Century Schoolbook"}2.[tab]Rinse
{font: 10pt "Century Schoolbook"}3.[tab]Repeat

{font: 10pt "Century Schoolbook"}The box contains:
{font: 10pt "Century Schoolbook"}[em dash][tab]Sand
{font: 10pt "Century Schoolbook"}[em dash][tab]Eggs
{font: 10pt "Century Schoolbook"}[em dash][tab]Gold

Note how the structure has been flattened and all of the abstractions of document structure have been removed. We are back in the media domain, with a flat structure that specifies formatting and text.

Processing in multiple steps

We do not always want to apply final formatting to our content in a single step. When we separated content from formatting, we did the separation in several stages. It is often desirable to put them back together in several steps. Not only are the algorithms involved easier to write and maintain if they only do one step of the process, we can often reuse some of the downstream steps (nearer the media domain) for many different types of document domain and subject domain content.

So far, we have looked at examples from the media domain and the document domain. Let’s look at one from the subject domain. We used an example of completing the separation of content from formatting by moving a labeled list from the document domain to the subject domain.

address:
    street: 123 Elm Street
    town: Smallville
    country: USA
    code: 12345

Now let’s look at the algorithm (the set of rules) for getting it back to the document domain, where it should look like this:

labeled-list:
    list-item:
        label: Street
        contents: 123 Elm Street
    list-item:
        label: Town
        contents: Smallville
    list-item:
        label: Country
        contents: 123 USA
    list-item:
        label: Code
        contents: 12345

Here is the algorithm (set of rules) to accomplish this transformation:

match address
    create labeled-list
        continue
match street
    create list-item
        create label
             output "Street"
        create contents
            continue
match town
    create list-item
         create label
             output "Town"
        create contents
            continue
match country
    create list-item
         create label
             output "Country"
        create contents
            continue
match code
    create list-item
         create label
             output "Code"
        create contents
            continue

Notice that the text of the labels, which we factored out when we moved to the subject domain, are being factored back in here, and are specified in the processing rules. As we moved the content from the media domain to the document domain to the subject domain, we first factored out invariant formatting and then invariant text. In the algorithms, we put back the text and the formatting, each at a different processing stage.

Processing content in multiple steps can save us a lot of time, even though this may sound counter-intuitive. The subject domain address structure is specific to a single subject and we might have many similar structures in our subject domain markup. But it is presented as a labeled-list structure. A labeled list is a document domain structure that can be used to present all kinds of information, and that can be formatted for many different media. By transforming the address structure into a labeled-list structure, we avoid having to write any code to format the address structure directly. We can format the address correctly for multiple media using the existing labeled-list formatting rules.

Query-based processing

The rule-based approach shown here is not the only way to process structured writing. There is another approach which we could call the query-based approach. In this approach, you write a query expression that reaches into the structure of a document and pulls out a structure or a set of structures from the middle of the document.

This is a useful technique if you want to radically rearrange the content of a document, or if you want to pull content out of one document to use in another. (The rule-based and query-based approaches are often called “push” and “pull” methods respectively, but I sometimes find it hard to remember which is which. I find rule-based and query-based more descriptive.) We will look at algorithms that use the query-based approach in later articles.

Other articles in the Structured Writing Series:

box-of-coinsWhat is Structured Writing?

Writing in the Media Domain

Writing in the Document Domain

Backsliding into the Media Domain

Writing in the Subject Domain

The Intrusion of the Management Domain

Quality in Structured Writing

 

Series Navigation<< Algorithms: Separating Content from FormattingThe Single Sourcing Algorithm >>
Mark Baker

Mark Baker helps organizations improve the impact of their content by focusing their design, writing, and production processes on producing content that matches the way people seek information on the Web today. He is the author of Every Page is Page One: Topic-based Writing for Technical Communication and the Web. He blogs at everypageispageone.com. You can reach him through his company, Analecta Communications Inc.

Read more articles from Mark Baker