Algorithms: Separating Content from Formatting

This article is part 9 of 13 in the series Understanding and Mastering Structured Writing

At this point in this series, I am going to start looking at the algorithms of structured writing. An algorithm is a formalized method for performing a task. We often associate algorithms with computers, because to get a computer to do anything you have to formalize the algorithm and capture it as a program. But human beings can execute algorithms as well. Indeed, computer programs often replace human beings as the performers of algorithms. This is one of the reasons we turn to structured writing, so that we can hand over the tedious and exacting algorithms of writing and publishing to machines.

But before you hand over a process to a machine, you not only have to define and program the algorithm, you also have to structure and constrain the data so that the program can apply the algorithm to it. This is a large part of why we structure our writing. There is symmetry here: structured writing is about factoring out the invariants in content. Publishing is about factoring those invariants back in. The algorithms for factoring them out mirror the algorithms for factoring them back in.

This does not mean that writers are factoring content out of the media domain into the document or subject domains whenever they write. All content passes from the subject domain to the document domain to the media domain. Structured writing is about recording the content earlier in the process. A well-designed structured writing system, therefore, allows the writer to record their content without having to think about how it will be presented in the document domain or formatted in the media domain.

The algorithm for factoring out the invariants of content, therefore, falls largely on the person who designs the structures that writers write in. A very common mistake in markup design is to think of the process as simply creating a model that reflects the structure of the content. Such an approach often fails to consistently factor out the invariants in the content and leaves us with a format that is both difficult to write in and difficult to process. It is much better to think in terms of an algorithm for factoring out invariants in content, and the corresponding algorithms for factoring them back in.

In the next several articles, we will look at the algorithms for factoring content in both directions. We will start with what we might think of as the original structured writing algorithm: separating content from formatting, or, as it is sometimes expressed, separating content from presentation. As we will see, exactly what counts as “formatting” and “presentation” is not a simple matter.

Separate out style instructions

Let’s start with a simple bit of content that includes a description of its format. This is not a real file format, just something I made up to illustrate the point. I’m using CSS syntax to describe the format and representing certain characters by their names in square brackets, just so we can see exactly where everything is going:

         {font: 10pt "Open Sans"}The box contains:
         {font: 10pt "Open Sans"}[bullet][tab]Sand
         {font: 10pt "Open Sans"}[bullet][tab]Eggs
         {font: 10pt "Open Sans"}[bullet][tab]Gold

This file contains content and formatting, so let’s separate the two. Of course, when we remove the formatting from the content we are going to have to add something in its place so we can add this (or other) formatting back later. (The algorithm should not really be called “separate content from formatting” but “separate formatting from content and replace it with something else”.)

The simplest thing to replace the formatting information with is a style:

         {style: paragraph}The box contains:
         {style: paragraph}[bullet][tab]Sand
         {style: paragraph}[bullet][tab]Eggs
         {style: paragraph}[bullet][tab]Gold

Then, of course, we need to record the formatting information in the style (we are separating it, not eliminating it altogether):

         paragraph = {font: 10pt "Open Sans"}

Now that they are separated, we have the choice of substituting different formatting by changing the definition of the style, rather than the content:

         paragraph = {font: 12pt "Century Schoolbook"}
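The indirection can be sketched in a few lines of Python. This is only an illustration: the `STYLES` table and the `render` function are hypothetical names invented here, not part of any real tool or of the made-up format above. The content carries only a style name, and the formatting is looked up when the output is produced.

```python
# Hypothetical style table. The names STYLES, CONTENT, and render() are
# illustrative assumptions, not part of any real file format or tool.
STYLES = {"paragraph": '{font: 10pt "Open Sans"}'}

# The content refers to formatting only by style name.
CONTENT = [
    ("paragraph", "The box contains:"),
    ("paragraph", "[bullet][tab]Sand"),
    ("paragraph", "[bullet][tab]Eggs"),
    ("paragraph", "[bullet][tab]Gold"),
]

def render(content, styles):
    """Factor the formatting back in by resolving each style name."""
    return [styles[name] + text for name, text in content]

# Same content, two different style definitions.
original = render(CONTENT, STYLES)
restyled = render(CONTENT, {"paragraph": '{font: 12pt "Century Schoolbook"}'})
```

Changing the font means changing one entry in the style table; the content list is never touched.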

Separate out formatting characters

Cool, but suppose we would like to change the style of the bullet we use for lists. The style of bullet used is certainly part of what we would consider “formatting”, but bullets are text characters. To change them you don’t just have to change the font applied to the characters, you have to change the characters themselves.

So, it turns out that sometimes the typed characters in your text are part of the content, and sometimes they are part of the formatting. So now we need to extend our idea of a style to include content.

         paragraph = {font: 12pt "Century Schoolbook"}
         bullet-paragraph = {font: 12pt "Century Schoolbook"}[bullet]

Now our content looks like this:

         {style: paragraph}The box contains:
         {style: bullet-paragraph}[tab]Sand
         {style: bullet-paragraph}[tab]Eggs
         {style: bullet-paragraph}[tab]Gold

Except that now the writer will be starting the bulleted lines with a tab, which is awkward and probably error-prone, so we move that character to the style as well.

         paragraph = {font: 12pt "Century Schoolbook"}
         bullet-paragraph = {font: 12pt "Century Schoolbook"}[bullet][tab]

Now our content looks like this:

         {style: paragraph}The box contains:
         {style: bullet-paragraph}Sand
         {style: bullet-paragraph}Eggs
         {style: bullet-paragraph}Gold

And now you can change the bullet style:

         bullet-paragraph = {font: 12pt "Century Schoolbook"}[em dash][tab]
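As a rough Python sketch of a style that carries characters as well as font information (the `make_styles` and `render` helpers are hypothetical, and the bracketed character names are kept as plain text just as in the samples above):

```python
FONT = '{font: 12pt "Century Schoolbook"}'

def make_styles(marker):
    """Build a hypothetical style table; each style is (font, prefix text)."""
    return {
        "paragraph": (FONT, ""),
        "bullet-paragraph": (FONT, marker + "[tab]"),
    }

# The content no longer contains the marker or the tab.
CONTENT = [
    ("paragraph", "The box contains:"),
    ("bullet-paragraph", "Sand"),
    ("bullet-paragraph", "Eggs"),
    ("bullet-paragraph", "Gold"),
]

def render(content, styles):
    """Emit font, then the style's own characters, then the content."""
    lines = []
    for name, text in content:
        font, prefix = styles[name]
        lines.append(font + prefix + text)
    return lines

bulleted = render(CONTENT, make_styles("[bullet]"))
dashed = render(CONTENT, make_styles("[em dash]"))
```

The only difference between the two outputs is the marker supplied by the style, which is exactly what makes the name "bullet-paragraph" start to look wrong.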

And then we maybe realize that “bullet-paragraph” is not the best name any more, because the style is now a dash, not a bullet. In other words, we discover that we have not done as good a job as we thought of separating content from formatting, because the content still contains formatting information in the form of a style named for a particular piece of formatting.

Name your abstractions correctly

When we separate formatting from content, we have to insert something in its place, and it matters what that something is and what it is called. If we call it the wrong thing we set up a false expectation, and that will lead to authors using it incorrectly, which will mean we can’t format it reliably.

So the first lesson about the algorithm of separating content from formatting is that it matters what you call things. When you do this, you are creating an abstraction, and you need to figure out what that abstraction is and name it appropriately.

So what is the abstraction here? It is a list, of course. The bulleted paragraphs are list items. So maybe we do this:

         {style: paragraph}The box contains:
         {style: list-item}Sand
         {style: list-item}Eggs
         {style: list-item}Gold

and

         list-item = {font: 12pt "Century Schoolbook"}[em dash][tab]

Make sure you have the right set of abstractions

But then, of course, we run into this problem:

         {style: paragraph}To wash hair:
         {style: list-item}Lather
         {style: list-item}Rinse
         {style: list-item}Repeat

Here our list should have numbers, not dashes or bullets. So we realize that the abstraction we want is not as broad as “all list items.” We look at the differences between the different kinds of list items we use and try to group them into abstract types and come up with names for those types. Maybe we come up with “ordered-list-item” and “unordered-list-item”. Then we have:

         {style: paragraph}The box contains:
         {style: unordered-list-item}Sand
         {style: unordered-list-item}Eggs
         {style: unordered-list-item}Gold

and

         {style: paragraph}To wash hair:
         {style: ordered-list-item}Lather
         {style: ordered-list-item}Rinse
         {style: ordered-list-item}Repeat

And the style for ordered-list-items now looks something like this:

         ordered-list-item = {font: 12pt "Century Schoolbook"}<count>.[tab]

And then we realize that we need a way to increment the count and to reset it to 1 for a new list. So we have:

         {style: paragraph}To wash hair:
         {style: first-ordered-list-item}Lather
         {style: ordered-list-item}Rinse
         {style: ordered-list-item}Repeat

and

         first-ordered-list-item = {font: 12pt "Century Schoolbook"}<count=1>.[tab]
         ordered-list-item = {font: 12pt "Century Schoolbook"}<++count>.[tab]

(++count here means add one to count and then display it.)
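To make the counter behavior concrete, here is a small Python sketch. The `render` function is a hypothetical stand-in for whatever applies the styles; it treats `first-ordered-list-item` as resetting the count to 1 and `ordered-list-item` as incrementing it before display.

```python
FONT = '{font: 12pt "Century Schoolbook"}'

CONTENT = [
    ("paragraph", "To wash hair:"),
    ("first-ordered-list-item", "Lather"),
    ("ordered-list-item", "Rinse"),
    ("ordered-list-item", "Repeat"),
]

def render(content):
    """Apply styles in document order, carrying one running count."""
    count = 0
    lines = []
    for style, text in content:
        if style == "first-ordered-list-item":
            count = 1                  # corresponds to <count=1>
            prefix = f"{count}.[tab]"
        elif style == "ordered-list-item":
            count += 1                 # corresponds to <++count>
            prefix = f"{count}.[tab]"
        else:
            prefix = ""                # plain paragraphs get no number
        lines.append(FONT + prefix + text)
    return lines

numbered = render(CONTENT)
```

Note that the count is state carried across lines, so the numbers are correct only if the writer applies the "first" style in the right place.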

And this is pretty much how you do lists in FrameMaker today, as well as other tools. But the reason for going through it in such detail is to point out what is involved in even this simple bit of separation. We began by simply removing formatting commands, but then started to remove characters as well, which forced us to include characters in our style definitions, and then to be able to actually calculate characters in our style definitions. And we saw that in performing these separations, we were creating abstractions, and that it was important to consider all the cases we might run into and create the appropriate abstractions to handle them.

Create containers to provide context

One problem with this approach is that the writer has to remember to apply a different style to the first item of a list. It would be better if they could use the same style for each list item and have the numbering just work. But this is hard to do because there is nothing in the content to say where one numbered list ends and the next begins. For this we need a new abstraction. So far we have abstractions for two kinds of list items: ordered and unordered list items. But we don’t have an abstraction for lists themselves.

To this point, we have been separating content from formatting purely in the media domain. We have replaced direct formatting definitions with indirect definitions through styles. The only thing that abstracts any of this beyond the media domain is the names that we have given to the styles that we have created. But now we start to venture into the document domain, creating the abstract idea of a list and inserting that abstract idea into our content.

         paragraph: To wash hair:
         list:
            ordered-list-item: Lather
            ordered-list-item: Rinse
            ordered-list-item: Repeat

We must deal with two significant changes here. First, our structure is no longer flat—we have introduced the idea of a container. A list is a container for list items. In creating this container we have added something to the content that was not there before. Previously it was a series of paragraphs with different styles attached. Now we have a container, which, as far as the formatting is concerned, simply never existed in the original. The writer and reader knew that the sequence of bulleted paragraphs formed a list, but that was an interpretation of the formatting. Now we have taken that interpretation and instantiated it in the content itself.

By creating the idea of a list, we are able to further separate list formatting from the content of the list, because now an algorithm, one I will call the formatting algorithm, can recognize it as a list and can make formatting decisions based on that knowledge.

The second important thing that happened is the content no longer contains references to style names. Instead we have structures. List, paragraph, and ordered-list-item are all structures.

We have replaced styles with structures because the same structure may get a different style depending on where it is in the document. The formatting algorithm is responsible for determining if an ordered-list-item is the first one inside a list and formatting it accordingly. (Which is just how list formatting works in CSS.)

Now authors no longer apply styles to content, even ones with abstract names. Rather they place content in structures and allow the formatting algorithm to apply styles appropriately. The result: content is separated even further from the formatting.
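A minimal Python sketch of such a formatting algorithm follows. The nested `(name, body)` representation and the `format_document` function are assumptions made for illustration, and the second list is invented here to show the numbering restarting.

```python
# Hypothetical nested representation: each node is a (name, body) pair.
# body is text for leaf structures and a list of nodes for containers.
# The "To dry hair" list is made up for this illustration.
DOCUMENT = [
    ("paragraph", "To wash hair:"),
    ("list", [
        ("ordered-list-item", "Lather"),
        ("ordered-list-item", "Rinse"),
        ("ordered-list-item", "Repeat"),
    ]),
    ("paragraph", "To dry hair:"),
    ("list", [
        ("ordered-list-item", "Towel"),
        ("ordered-list-item", "Comb"),
    ]),
]

def format_document(document):
    """Choose list numbering from an item's position in its container."""
    lines = []
    for name, body in document:
        if name == "list":
            count = 0  # every list container starts a fresh count
            for item_name, text in body:
                if item_name == "ordered-list-item":
                    count += 1
                    lines.append(f"{count}.[tab]{text}")
                else:
                    lines.append(f"[bullet][tab]{text}")
        else:
            lines.append(body)
    return lines

formatted = format_document(DOCUMENT)
```

Because the container marks where each list begins, the writer uses one structure for every item and the numbering restarts without any special "first item" markup.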

Move the abstractions to the containers

But there is an obvious problem here. What if an author inadvertently does this:

         paragraph: To wash hair:
         list:
            ordered-list-item: Lather
            unordered-list-item: Rinse
            ordered-list-item: Repeat

To avoid this, we move the abstraction outward. Instead of ordered and unordered list items, we create ordered and unordered lists:

         paragraph: To wash hair:
         ordered-list:
            list-item: Lather
            list-item: Rinse
            list-item: Repeat

and

         paragraph: The box contains:
         unordered-list:
            list-item: Sand
            list-item: Eggs
            list-item: Gold

And, of course, the list-item structure can be used in either an unordered list or an ordered list, because it is a list item in either case, and the formatting algorithm can tell the difference based on which type of list it belongs to. The structure name “list-item” describes its role in the document (within its context in the document) in a way that is entirely separated from how it will be styled.

Moving the abstraction out to the container is an important part of the algorithm of separating content from formatting. It keeps things consistent and reduces the number of things authors have to remember.
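Sketched in Python, with a hypothetical `format_list` function standing in for the formatting algorithm, the decision about numbers versus bullets is now made once per container rather than once per item:

```python
def format_list(kind, items):
    """Format list items according to the type of their container.

    kind is the container name; items is a list of list-item texts.
    Both the function and its signature are illustrative assumptions.
    """
    if kind == "ordered-list":
        return [f"{n}.[tab]{text}" for n, text in enumerate(items, start=1)]
    if kind == "unordered-list":
        return [f"[bullet][tab]{text}" for text in items]
    raise ValueError(f"unknown list type: {kind}")

steps = format_list("ordered-list", ["Lather", "Rinse", "Repeat"])
contents = format_list("unordered-list", ["Sand", "Eggs", "Gold"])
```

With this shape, a list that mixes ordered and unordered items cannot even be expressed, so the error is ruled out by the structure instead of being left to the author's attention.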

Creating containers and abstracting out the differences between their contents is an important piece of separating content from formatting. For example, HTML and Markdown both provide six different levels of headings. But content under an H2 or an H5 heading is not in any container. The content simply comes after the heading. This means that it is perfectly possible and legal in these languages to place different heading elements in any order you want. Writers have to pay attention to which heading level they are creating and how it fits in the hierarchy of the document they are creating.

Furthermore, writers understand that the lower the number of the heading element (H1 through H6), the larger the font will appear in the output. They don’t know which font it will be or how big it will be, but larger/smaller is still a formatting distinction. Formatting has not been completely separated from content.

By contrast, in DocBook, we have a section element. Like a list, a section is part of the writer’s interpretation of what they are creating in the document, but in formats such as HTML it is only implied by the formatting, not instantiated in the content. By creating a section element, DocBook instantiates the concept of a section. And once we have the instantiation of a section, we don’t need six levels of heading. We can have one element called title. Sections can be nested inside other sections, and the formatting algorithm can apply the correct style to the title based on context:

         section:
             title:
             paragraph:
             section:
                title:

This eliminates incorrect heading choices, ensuring that the headings in the output consistently reflect the section and subsection structure of the document.
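The context-based choice of heading style can be sketched as a short recursive function in Python. This is not DocBook's actual processing model, only an illustration: the `(name, body)` representation, the `format_section` function, the `heading-N` style names, and the placeholder titles are all assumptions.

```python
# A section is represented here as a list of (name, body) children.
# The titles and paragraph text are placeholders for illustration.
SECTION = [
    ("title", "Outer title"),
    ("paragraph", "Some text."),
    ("section", [
        ("title", "Inner title"),
    ]),
]

def format_section(children, depth=1):
    """Assign a heading style to each title from its nesting depth."""
    styled = []
    for name, body in children:
        if name == "title":
            styled.append((f"heading-{depth}", body))
        elif name == "section":
            # A nested section's title sits one level deeper.
            styled.extend(format_section(body, depth + 1))
        else:
            styled.append((name, body))
    return styled

styled = format_section(SECTION)
```

The writer only ever writes "title"; the level comes entirely from where the section sits in the hierarchy.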

(Now it must be said that not everyone necessarily holds to the view that headings in a text do or should reflect a hierarchy of sections. Sometimes they may be simply signposts along the way, and like any signpost, the size of the sign reflects the size of the town, not a strict hierarchy of sign sizes. So if that is how you look at document structures, you should choose a different way to separate content from formatting in your content.)

Separate out abstract formatting

We noted that in the case of ordered and unordered lists, separating content from formatting actually involves separating out some of the content as well. Or rather, that it involves separating out some of the characters. In other words, the distinction between what is represented in a document using character codes and what is represented in other data structures is not necessarily the same as the logical distinction between content and formatting from a structured writing point of view.

Consider a structure that we might call a labeled list:

Street: 123 Elm Street

Town: Smallville

Country: USA

Code: 12345

The generic structure of a labeled list might look like this:

         labeled-list:
            list-item:
                label: Street
                content: 123 Elm Street

But what if you have hundreds of addresses, all with the same labels? In this case, are the labels really content, or are they formatting? Since they don’t change from one list to another, we could look at them as being part of how the content is presented, rather than being part of the content itself. So we look for a way to separate them from the content.

As always, when we separate something from our content, we have to replace it with something else, and that something is generally an appropriately named structure. So that gets us a structure like this:

         address:
            street: 123 Elm Street
            town: Smallville
            country: USA
            code: 12345

Now, of course, we have moved our content into the subject domain. Notice that in our previous attempts at separation we separated out the formatting of a list and replaced it with an abstract form of a list. This certainly separated out some of the formatting, but it still left us with a list. And deciding to list items in a document is still a formatting/presentation decision.

Moving content from the media domain to the document domain actually separates out the style information, but still leaves a lot of general formatting and presentation decisions, such as the decision to use lists or tables, firmly attached to the content. Separating style from content is useful in a number of ways, but it does not achieve a complete separation of content from formatting.

In the subject domain example, however, we moved our content away from being a labeled list and turned it into an abstract record. We could now have an algorithm turn that record into several document structures. This algorithm, which I will call the “presentation” algorithm, could turn it into a labeled list, a table, a paragraph (with the fields separated by commas), or the address labels for an envelope.
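A rough Python sketch of a presentation algorithm follows. The function names and the `LABELS` table are hypothetical; the point is only that the labels and the choice of document structure live in the algorithm, not in the record.

```python
# The subject-domain record: field names, no labels, no layout.
ADDRESS = {
    "street": "123 Elm Street",
    "town": "Smallville",
    "country": "USA",
    "code": "12345",
}

# The labels are supplied by the presentation algorithm, not the content.
LABELS = {"street": "Street", "town": "Town", "country": "Country", "code": "Code"}

def as_labeled_list(record):
    """Present the record as label/value lines."""
    return [f"{LABELS[field]}: {value}" for field, value in record.items()]

def as_paragraph(record):
    """Present the record as one line with comma-separated fields."""
    return ", ".join(record.values())

def as_table(records):
    """Present many records as a header row followed by value rows."""
    header = [LABELS[field] for field in records[0]]
    return [header] + [list(record.values()) for record in records]

labeled = as_labeled_list(ADDRESS)
sentence = as_paragraph(ADDRESS)
table = as_table([ADDRESS])
```

Each function reads the same record, so adding a new presentation never requires touching the content.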

In the subject domain, with the content entirely separated from formatting, we also gain the ability to query and reorganize the content in various interesting and useful ways (which we will explore in further articles).

This is as far as we can go in separating content from formatting, and we can’t separate all content from formatting to quite this extent. It should be clear at this point that separating content from format is not a binary thing. We can achieve various stages of separation for various reasons. It is important to understand exactly which degree of separation will best serve your needs.

We haven’t said all there is to say about separating out content, however. The phrase “separating content from formatting” originated in an age when the only medium of concern was paper. Modern electronic media not only have formatting, they have behavior. Separating content from behavior is just as important as separating content from formatting. We will look at how to do it in future articles.

Mark Baker

Mark Baker helps organizations improve the impact of their content by focusing their design, writing, and production processes on producing content that matches the way people seek information on the Web today. He is the author of Every Page is Page One: Topic-based Writing for Technical Communication and the Web. He blogs at everypageispageone.com. You can reach him through his company, Analecta Communications Inc.
