Editor’s note: Mark Baker completes his discussion on working in the document domain with a look at how “backsliding” to the media domain occurs and choosing the document domain language that supports your needs. Read the introduction to writing in the document domain.
In the last article, we looked at how you can move your writing out of the media domain and into the document domain. Moving to the document domain can allow you to factor out many of your media domain constraints, creating greater consistency at less cost, as well as providing a range of automation and validation options for your content. But it is all too easy for authors to backslide into the media domain, undoing all of these benefits. This article will look at why this happens and what we can do to resist it.
HTML is the prime example of a document domain language backsliding into the media domain. HTML was originally designed for sharing scientific papers. It was not designed to strictly control the organization and presentation of scientific papers (it was designed more to accommodate requirements than to constrain them), but it does have features that betray its origins. For example, definition lists, a common feature of scientific papers, precisely define how the key terms will be used. But as the web adopted HTML as the generic language of web pages, the definition list (<dl>) structure has come to be treated as a generic labeled list structure, used for all kinds of things other than definition lists.
This highlights one of the challenges of structured writing, which is to make sure that structures do not get used for purposes other than what they were intended for. We see this most when writers look for an easy way to create an effect in the media domain. If the writer wants a piece of text formatted in a particular way, but the only document domain structure that formats that way is intended for something different, it is easy for them to use the structure incorrectly to get the formatting effect they are after.
But now the document domain structure that is being misused no longer expresses the constraints it was designed to express. This means you lose functionality. For instance, you can’t find all the definitions in a set of HTML documents by looking for all the dl elements. You will get all kinds of things that are not definitions. You might also miss a lot of definitions that were not created using the dl structure.
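To make the lost functionality concrete, here is a minimal sketch in Python of the "find all the definitions" query described above. The sample page and its contents are invented for illustration, and the helper names (DefinitionTermCollector, find_dl_terms) are hypothetical. The sketch assumes that collecting the text of every dt inside a dl is a fair stand-in for "looking for all the dl elements".

```python
from html.parser import HTMLParser


class DefinitionTermCollector(HTMLParser):
    """Collect the text of every <dt> element found inside a <dl>."""

    def __init__(self):
        super().__init__()
        self.dl_depth = 0    # how many <dl> elements we are currently inside
        self.in_dt = False   # True while reading the text of a <dt>
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag == "dl":
            self.dl_depth += 1
        elif tag == "dt" and self.dl_depth > 0:
            self.in_dt = True
            self.terms.append("")

    def handle_endtag(self, tag):
        if tag == "dl" and self.dl_depth > 0:
            self.dl_depth -= 1
        elif tag == "dt":
            self.in_dt = False

    def handle_data(self, data):
        if self.in_dt:
            self.terms[-1] += data


def find_dl_terms(html_text):
    """Return the stripped text of each <dt> in the given HTML."""
    collector = DefinitionTermCollector()
    collector.feed(html_text)
    collector.close()
    return [term.strip() for term in collector.terms]


# Hypothetical page: the first <dl> is a real definition list,
# the second is a <dl> misused as a generic labeled list.
SAMPLE_PAGE = """
<dl>
  <dt>Backsliding</dt>
  <dd>Using a document domain structure for its formatting effect.</dd>
</dl>
<dl>
  <dt>Phone</dt>
  <dd>555-0100</dd>
  <dt>Office hours</dt>
  <dd>9 to 5</dd>
</dl>
"""

print(find_dl_terms(SAMPLE_PAGE))
# ['Backsliding', 'Phone', 'Office hours']
```

The query returns the contact labels alongside the one genuine term, and nothing in the markup lets the algorithm tell them apart.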
What is clearly true for HTML can potentially affect any document domain language. It can slip back into the media domain if people start using it for the media domain effects produced by the default transformation algorithms rather than adhering to its document domain structures. Today, structured writing advocates often dismiss HTML as an unstructured language. They point to other languages, such as DocBook or DITA, as being structured by contrast, despite the fact that all three languages are document domain languages at heart, with many similar structures between them.
And while it was not always so, HTML has largely become a set of basic document domain structures on which authors can hang styles (using CSS). In other words, it has come to be used much like traditional media-domain word processing and desktop publishing applications. When people write in HTML, they frequently do so in WYSIWYG environments, using style-oriented tools that mimic traditional word processing very closely. This usually results in an HTML document that formats more or less correctly, but that is coded very inconsistently from the point of view of the document domain, and which is thus very hard to work with—either to edit by hand or manipulate with algorithms.
No document domain language is immune from backsliding. Organizations usually apply a common formatting algorithm for each of the currently supported media. Writers quickly learn how certain document structures are rendered in each medium. If they consider their writing task in terms of the formatting they want to produce, they know what document domain structures will produce it.
The growing use of visual (WYSIWYG) editors for structured writing encourages this tendency to backslide. While these editors make it easier for people to create structured documents in XML by hiding the verbose XML tags, they also give the writer a media-domain view of the content. Once you invite the writer to think in the media domain, backsliding into the media domain is much more likely.
While preventing this altogether is extremely difficult, the extent to which backsliding occurs in your content is directly related to how well the document domain language you are using fits the types of documents you produce and how easily the writers understand that language.
Many organizations fall into the trap of selecting the document domain language based on whether it is the most widely supported or most popular at the moment. But that language (or any document domain language) may not be the one that best fits your document needs or your writers’ capabilities. This can, and frequently does, lead to backsliding and to the accumulation of invalid structures in your content that can come back to haunt you over time.
Choosing a common and well-supported system has numerous benefits, of course. But you should always remember that the point of structured writing is to impose a specific set of constraints on your content that help you meet specific business needs. If a popular system does not express those constraints or cannot capture them consistently without backsliding, it will fail to meet your objectives over the long run.
Resisting Backsliding
As we noted above, the use of visual editors is a significant factor in authors backsliding into the media domain. So one way to get back to writing in the document domain is to avoid such editors. But writing XML or HTML tags by hand is painful, and the result is hard to read. One solution to this has emerged recently in the form of a new syntax called Markdown. The idea behind Markdown is to represent the major document domain structures of HTML using the kind of formatting people use in text-only email messages. This approach removes many of the difficulties associated with typing raw HTML. Here’s a sample of Markdown (courtesy of Wikipedia):
Heading
=======

Sub-heading
-----------

### Another deeper heading

Paragraphs are separated
by a blank line.

Leave 2 spaces at the end of a line to do a  
line break

Text attributes *italic*, **bold**,
`monospace`, ~~strikethrough~~ .

A [link](http://example.com).

Shopping list:

  * apples
  * oranges
  * pears

Numbered list:

  1. apples
  2. oranges
  3. pears

The rain---not the reign---in
Spain.
It translates into the following HTML (again, courtesy of Wikipedia):
<h1>Heading</h1>
<h2>Sub-heading</h2>
<h3>Another deeper heading</h3>
<p>Paragraphs are separated by a blank line.</p>
<p>Leave 2 spaces at the end of a line to do a<br />
line break</p>
<p>Text attributes <em>italic</em>, <strong>bold</strong>, <code>monospace</code>, <s>strikethrough</s>.</p>
<p>A <a href="http://example.com">link</a>.</p>
<p>Shopping list:</p>
<ul>
  <li>apples</li>
  <li>oranges</li>
  <li>pears</li>
</ul>
<p>Numbered list:</p>
<ol>
  <li>apples</li>
  <li>oranges</li>
  <li>pears</li>
</ol>
<p>The rain&mdash;not the reign&mdash;in Spain.</p>
Markdown was not designed to be a pure document domain language. It was designed to let you write HTML quickly in a text editor. But the net effect of using Markdown is that you no longer work in a strictly WYSIWYG view—you see the structure of the document you are creating.
Many Markdown editors use a split-screen view that shows the formatted version in one pane as the writer writes Markdown syntax in the other. But even here, the writer is still working in the document domain because they still see the structure in the view they are working on. A Markdown editor is never going to produce the kind of messy HTML that a pure WYSIWYG HTML editor can produce.
[Figure: Markdown editor]
Another interesting factor is at work here. A list in Markdown is just a sequence of paragraphs that start with asterisk characters. On the face of it, this is just like a media domain editor creating lists by styling paragraphs. But if you look at the resulting HTML, you see that it creates a proper list wrapper element around the list. The Markdown interpreter infers the hierarchical structure of the document domain from the essentially flat Markdown syntax.
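The inference step can be sketched in a few lines of Python. This is not how any real Markdown processor is implemented. It is a deliberately tiny illustration that handles only asterisk list items and one-line paragraphs, and the function name convert_flat_lines is hypothetical. It also assumes that a blank line ends a list, which is stricter than real Markdown.

```python
import html


def convert_flat_lines(lines):
    """Turn flat text lines into HTML, wrapping each run of '* ' lines in one <ul>."""
    output = []
    in_list = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("* "):
            if not in_list:
                output.append("<ul>")   # first item of a run: open the inferred wrapper
                in_list = True
            output.append("<li>" + html.escape(stripped[2:]) + "</li>")
            continue
        if in_list:
            output.append("</ul>")      # the run has ended: close the wrapper
            in_list = False
        if stripped:
            # Simplification: every non-blank, non-list line is its own paragraph.
            output.append("<p>" + html.escape(stripped) + "</p>")
    if in_list:
        output.append("</ul>")
    return output


source = ["Shopping list:", "", "* apples", "* oranges", "* pears"]
print("\n".join(convert_flat_lines(source)))
```

The writer never types a list wrapper. The converter creates it because the first asterisk line opens a run and the first non-list line closes it.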
The author works in something that looks and feels in some ways like the media domain, though they have no actual styles and cannot change the formatting at all. But they use abstract formatting notation (underlines for headings, asterisks for unordered lists) to create document domain objects. The beauty of this is that the document domain constraints are preserved, while the author can work in a simple format that is easy to type, and reasonably easy to read.
Markdown is a fairly simple language with far fewer document domain structures than HTML or other document domain languages. But there are other similar languages that are significantly broader in the structures they support, such as AsciiDoc and reStructuredText.
This is an important reminder that XML and its applications are not the only route to structured writing. In fact, there are many other ways to create structured texts that obey the appropriate constraints for a particular use case. We will look at some of them in later articles.
Extending the Document Domain
Another important factor in preventing backsliding is to create document domain structures that are specific to the kinds of documents that you are writing. This can mean creating them yourself or choosing an existing language that supports the ones you need, and then perhaps extending or constraining its support as needed.
HTML provides only a fairly basic set of document domain structures, which hastened its slide into the media domain. As we have seen, enforcing or factoring out media domain constraints requires specific document domain structures. But the possible list of such structures is quite large. A few basic features are common to all documents, such as paragraphs, lists, and titles. But these structures alone are not enough to hang meaningful and useful document domain constraints on, which is why, as we noted in an earlier article, extensibility is an important part of all structured writing domains.
For example, think about a bibliography—a document structure for listing works cited in or recommended by a document. It generally consists of a heading “Bibliography” followed by a set of paragraphs listing the cited works. In the media domain, it is not a particularly complicated structure. Just a sequence of paragraphs with some bold and italic formatting for author names, book titles, etc.
Your media domain stylesheet may define some character styles that arguably belong to the document or subject domains, such as author-name or book-title. You may even have a specific paragraph style for bibliography entries, but it is unlikely to be more complicated than that.
But these few media domain styles don’t really cover all the rules for creating bibliographies that your institution or publisher is likely to insist on. Different organizations have different rules for the presentation of a bibliography entry, which go into detail about how each work and its authors are listed and how the listings are presented. Authors must follow these constraints on the writing of the bibliography, but the constraints are not modeled by the media domain styles the authors are working with. The authors must learn and follow these constraints for themselves, and when they have finished writing, these constraints are not explicit in the content in a machine-readable way.
How would you write a bibliography in a document domain language? You could use paragraphs and inline bold and italic markup (or document domain equivalents such as strong and emphasis) for titles and authors. But this would simply be applying the media domain approach with document domain structures. Even if you use the nominally document domain strong and emphasis instead of bold and italic, you are still backsliding because in a bibliography bold and italic are used to distinguish different parts of an entry, not to emphasize part of the text. In fact, there really isn’t a generic document domain way of creating a bibliography that is not effectively backsliding into the media domain. The only way to create a bibliography in the document domain is with an explicit bibliography structure.
That means you are either going to have to extend your document domain language to include one, or use a document domain language that already includes one.
One such language is DocBook. Here’s an example:
<biblioentry id="bib.xsltrec">
<abbrev id="bib.xsltrec.abbrev">REC-XSLT</abbrev>
<editor><firstname>James</firstname><surname>Clark</surname></editor>
<title><ulink url="http://www.w3.org/TR/xslt">XSL Transformations
(XSLT) Version 1.0</ulink></title>
<publishername>W3C Recommendation</publishername>
<pubdate>16 November 1999</pubdate>
</biblioentry>
The example is in XML, which can be hard to read. Here is the same structure in a simpler notation that is easier for humans to read (I’ve used this notation for earlier examples, and I’ll talk more about it later):
biblioentry:(#bib.xsltrec)
    abbrev:(#bib.xsltrec.abbrev) REC-XSLT
    editor:
        firstname: James
        surname: Clark
    title: XSL Transformations (XSLT) Version 1.0
    publishername: W3C Recommendation
    pubdate: 16 November 1999
This structure not only constrains how bibliography entries are presented and formatted, it actually factors out many of those constraints by breaking down the components of a bibliography entry into separate labeled fields. Given a biblioentry structure like this, you could create an algorithm to present and format a bibliography entry almost any way you wanted to.
In fact, you could write an algorithm to extract bibliography information from a document by looking for biblioentry structures and selecting the desired information from them. For instance, if you want to build a list of authors cited in the document, you could do so by searching the biblioentry records and extracting the author name structures.
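A minimal sketch of such an extraction, written in Python with the standard library XML parser, might look like the following. The bibliography wrapper, the second entry (a placeholder, not a real work), and the function name cited_people are assumptions made for illustration. The sketch also assumes that names appear as firstname and surname children of an author or editor element, as in the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the first entry follows the example above,
# the second is an invented placeholder, not a real publication.
SAMPLE = """
<bibliography>
  <biblioentry id="bib.xsltrec">
    <abbrev>REC-XSLT</abbrev>
    <editor><firstname>James</firstname><surname>Clark</surname></editor>
    <title>XSL Transformations (XSLT) Version 1.0</title>
    <publishername>W3C Recommendation</publishername>
    <pubdate>16 November 1999</pubdate>
  </biblioentry>
  <biblioentry id="bib.placeholder">
    <author><firstname>Jane</firstname><surname>Example</surname></author>
    <title>A Placeholder Title</title>
  </biblioentry>
</bibliography>
"""


def cited_people(xml_text):
    """List the authors and editors named in every biblioentry, in document order."""
    root = ET.fromstring(xml_text)
    names = []
    for entry in root.iter("biblioentry"):
        for person in entry:  # direct children of the entry only
            if person.tag in ("author", "editor"):
                first = person.findtext("firstname", default="")
                last = person.findtext("surname", default="")
                names.append(f"{first} {last}".strip())
    return names


print(cited_people(SAMPLE))
# ['James Clark', 'Jane Example']
```

Because each name is a labeled field, the query needs no guesswork about which bold or italic run happens to be a person.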
Labeled fields illustrate another important way to cut down the number of document domain structures we need. If we capture the individual pieces of information that make up a bibliography entry, we only need one bibliography entry structure even if we want to present bibliography entries differently in different publications (organize them differently, that is, as opposed to merely formatting them differently). This gives your content independence from the bibliography standards of any one institution, since you can create output that complies with many different standards from this abstract bibliography structure.
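As a rough illustration, the Python sketch below reads the biblioentry from the example above once and presents it in two different arrangements. Both layouts are invented for this sketch and do not correspond to any real citation standard. The function names (text_of, read_fields, surname_first, title_first) are hypothetical.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<biblioentry id="bib.xsltrec">
  <abbrev id="bib.xsltrec.abbrev">REC-XSLT</abbrev>
  <editor><firstname>James</firstname><surname>Clark</surname></editor>
  <title><ulink url="http://www.w3.org/TR/xslt">XSL Transformations
  (XSLT) Version 1.0</ulink></title>
  <publishername>W3C Recommendation</publishername>
  <pubdate>16 November 1999</pubdate>
</biblioentry>
"""


def text_of(parent, path):
    """Return the whitespace-normalized text of the first match, or '' if none."""
    node = parent.find(path)
    if node is None:
        return ""
    return " ".join("".join(node.itertext()).split())


def read_fields(xml_text):
    """Pull the labeled fields out of a single biblioentry."""
    entry = ET.fromstring(xml_text)
    return {
        "first": text_of(entry, "editor/firstname"),
        "last": text_of(entry, "editor/surname"),
        "title": text_of(entry, "title"),
        "publisher": text_of(entry, "publishername"),
        "date": text_of(entry, "pubdate"),
    }


def surname_first(f):
    # Invented layout: editor's surname leads the entry.
    return f"{f['last']}, {f['first']} (ed.). {f['title']}. {f['publisher']}, {f['date']}."


def title_first(f):
    # Invented layout: title leads, editor follows.
    return f"{f['title']} ({f['date']}), edited by {f['first']} {f['last']}. {f['publisher']}."


fields = read_fields(ENTRY)
print(surname_first(fields))
print(title_first(fields))
```

The entry is stored once. Supporting another institution's rules means adding another small formatting function, not rewriting the content.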
(What we are seeing here is a little foretaste of the subject domain, for while bibliographies are a common document feature, regardless of the subject matter of the document, a bibliography itself is always about the same subject: books and other information sources. So when we model a bibliography entry this way, we are really abstracting out a document domain constraint by moving the content to a subject domain structure.)
Specialized Document Types
If providing the specific document structures that your authors need is part of avoiding backsliding, providing specific document types for the different types of documents they produce takes this one step further.
So far we have looked at moving individual elements of a document such as lists, graphics, and bibliographies into the document domain to introduce constraints on how they are structured and how they are formatted. But within the document domain, many distinct types of documents exist, each of which has its unique patterns and requirements. We can create multiple document types in the document domain for these different document types.
Some public markup languages support more than one document type. For instance, DocBook supports book and article, DITA supports concept, task, and reference types (but these are actually topic types rather than document types – documents in DITA are assemblies of topics), and SPFE provides a range of more specific document types for different purposes. And as each of these systems is extensible, you can add more types to meet your needs.
Some of these document types sit squarely in the document domain. For example, manual, quick-reference card, article, web-page, picture-book, novel, and catalog are all distinct document types that are distinguished by the kind of reading task they are used for, regardless of the subject matter. Thus a quick reference card can be a quick reference to any subject, a manual can be a manual for any product or service.
Many more document types are specific to certain subjects. A recipe is specific to the preparation of food. A telephone directory is specific to finding telephone numbers. A knitting pattern is specific to creating knitted fabrics.
Once we get into document types that are specific to a particular subject, however, we are starting to get into the subject domain. We will go there in the next article.