Editor’s note: Mark Baker explores the subject domain in this latest in his ongoing series about Structured Writing. Comments and questions are always welcome.
In our examination of the document domain, we saw that while there are document types that are independent of any particular subject, such as manual, article, or report, there are also many document types that are specific to particular subjects. For instance, a recipe is a document type specific to preparing individual dishes.
You can write a recipe in the document domain. For instance, you could write it in reStructuredText, like this:
    Hard Boiled Eggs
    ================

    A hard boiled egg is simple and nutritious. Prep time, 15 minutes. Serves 6.

    Ingredients
    -----------

    ======  ========
    Item    Quantity
    ======  ========
    eggs    12
    water   2qt
    ======  ========

    Preparation
    -----------

    1. Place eggs in pan and cover with water.
    2. Bring water to a boil.
    3. Remove from heat and cover for 12 minutes.
    4. Place eggs in cold water to stop cooking.
    5. Peel and serve.
However, there are specific constraints on the format of a recipe that this approach neither follows nor records. To impose and record these constraints, you need a recipe document type.
A recipe follows a well-known pattern. It has an introduction, a list of ingredients, and a set of preparation steps. The simplest form of a recipe document type might look something like this (in the markup I have used in other articles, and will explain in a later one):
    recipe: Hard Boiled Egg
        introduction:
            A hard boiled egg is simple and nutritious.
            Prep time, 15 minutes. Serves 6.
        ingredients:
            * 12 eggs
            * 2qt water
        preparation:
            1. Place eggs in pan and cover with water.
            2. Bring water to a boil.
            3. Remove from heat and cover for 12 minutes.
            4. Place eggs in cold water to stop cooking.
            5. Peel and serve.
This simple structure expresses the basic constraint of the well-known recipe pattern and records that the author followed it. Because these structures are specific to the subject matter (rather than specifying document structures or formatting) this markup is in the subject domain.
One of the common patterns of structured writing relates to the factoring out of invariants. One of the invariants of the recipe pattern is that it contains sections titled “Ingredients” and “Preparation” (or words to that effect). Notice that these titles have been factored out here. Since we have structures specifically for ingredients and preparation, we can factor out the actual titles and have the presentation algorithm add them in the transformation to the document domain.
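As a rough sketch of what such a presentation algorithm might look like, the Python below turns a recipe record into reStructuredText and supplies the factored-out titles itself. The dictionary layout and the function names `underline` and `to_rst` are assumptions made for illustration; they are not part of any existing tool.

```python
# Hypothetical in-memory form of the recipe markup shown above.
recipe = {
    "recipe": "Hard Boiled Egg",
    "introduction": "A hard boiled egg is simple and nutritious. "
                    "Prep time, 15 minutes. Serves 6.",
    "ingredients": ["12 eggs", "2qt water"],
    "preparation": [
        "Place eggs in pan and cover with water.",
        "Bring water to a boil.",
        "Remove from heat and cover for 12 minutes.",
        "Place eggs in cold water to stop cooking.",
        "Peel and serve.",
    ],
}


def underline(title, char):
    """Return a reStructuredText title: the text plus a matching underline."""
    return f"{title}\n{char * len(title)}"


def to_rst(recipe):
    """Build a document domain (reStructuredText) version of one recipe."""
    parts = [
        underline(recipe["recipe"], "="),
        recipe["introduction"],
        # The two section titles are invariants, so the algorithm adds them.
        underline("Ingredients", "-"),
        "\n".join(f"* {item}" for item in recipe["ingredients"]),
        underline("Preparation", "-"),
        "\n".join(f"{n}. {step}"
                  for n, step in enumerate(recipe["preparation"], start=1)),
    ]
    return "\n\n".join(parts) + "\n"


print(to_rst(recipe))
```

Changing a section title for every recipe then means editing one string in the algorithm rather than every source file.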
If your organization publishes a lot of recipes, you probably have a lot more constraints on the content of your recipes. For instance, you might have a constraint that every recipe must state its preparation time and the number of people it serves. In our subject domain markup, we can enforce and record that constraint by moving the information from the introduction section to separate fields:
    recipe: Hard Boiled Egg
        introduction:
            A hard boiled egg is simple and nutritious.
        ingredients:
            * 12 eggs
            * 2qt water
        preparation:
            1. Place eggs in pan and cover with water.
            2. Bring water to a boil.
            3. Remove from heat and cover for 12 minutes.
            4. Place eggs in cold water to stop cooking.
            5. Peel and serve.
        prep-time: 15 minutes
        serves: 6
Enforcing and recording these additional fields not only makes sure your recipes consistently meet your corporate constraints, it also offers some interesting publishing possibilities. For instance, with this markup in place, you could easily query your set of recipes to create a cookbook of recipes you can make in 30 minutes or less.
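Such a query might look something like the following Python sketch. It assumes the recipes have already been parsed into dictionaries keyed by field name, and that prep-time is always written as a whole number of minutes. The recipes other than the hard boiled egg are made up for the example.

```python
import re

# Hypothetical parsed recipes; only the first comes from this article.
recipes = [
    {"recipe": "Hard Boiled Egg", "prep-time": "15 minutes", "serves": 6},
    {"recipe": "Beef Stew", "prep-time": "120 minutes", "serves": 4},
    {"recipe": "Green Salad", "prep-time": "10 minutes", "serves": 2},
]


def prep_minutes(recipe):
    """Read the number of minutes from a value such as '15 minutes'."""
    match = re.fullmatch(r"\s*(\d+)\s*minutes?\s*", recipe["prep-time"])
    if match is None:
        raise ValueError(f"unrecognized prep-time: {recipe['prep-time']!r}")
    return int(match.group(1))


def quick_recipes(recipes, limit=30):
    """Return the names of recipes that take `limit` minutes or less."""
    return [r["recipe"] for r in recipes if prep_minutes(r) <= limit]


print(quick_recipes(recipes))
```

The query is only possible because prep-time is a field of its own. With the time buried in the introduction, the algorithm would have to guess at the wording of each sentence.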
Does this mean that the preparation time and the number of servings will now be displayed as separate fields in the output, rather than inline? Not necessarily. It might be a good idea to call them out in separate fields so that readers can find the information more easily, but if you really wanted that information at the end of the introduction in every recipe, it would be a simple matter for the presentation algorithm to construct the sentences “Prep time, 15 minutes. Serves 6.” from the prep-time and serves field values.
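A minimal sketch of that step, assuming the recipe is available as a dictionary keyed by field name (the function name `intro_with_details` is invented for the example):

```python
def intro_with_details(recipe):
    """Append prep time and servings to the introduction as running text."""
    details = f"Prep time, {recipe['prep-time']}. Serves {recipe['serves']}."
    return f"{recipe['introduction']} {details}"


recipe = {
    "introduction": "A hard boiled egg is simple and nutritious.",
    "prep-time": "15 minutes",
    "serves": 6,
}

print(intro_with_details(recipe))
```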
The specification of constraints could get more detailed still. For instance, the specification of ingredients in a recipe generally requires three pieces of information: the name of the ingredient, the quantity, and the unit of measure used to express this quantity:

    ingredients:
        ingredient:
            name: eggs
            quantity: 12
            unit: each
        ingredient:
            name: water
            quantity: 2
            unit: qt
We can take some shortcuts to make this markup less verbose. The following markup accomplishes the same thing, relying in part on the ability of an algorithm to break “2qt” into quantity and unit fields without the author having to do it explicitly (I’ll talk about this markup in a later article):

    ingredients:: ingredient, quantity
        eggs, 12
        water, 2qt

(Note: This markup defines the ingredients as a database-style table with fields for ingredient and quantity. Two colons separate the name of the record from the names of the fields in the first line, and the field names are separated by commas. In the following lines, the field values are separated by commas.)
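To make the shortcut concrete, here is one way an algorithm might read the compact form and recover the three separate fields. This is a sketch under assumptions, not the parser for this markup: the function names are invented, and treating a bare number such as “12” as having the unit “each” simply mirrors the verbose example above.

```python
import re


def parse_records(block):
    """Parse a 'name:: field, field' header line and its comma-separated rows."""
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    name, _, field_list = lines[0].partition("::")
    fields = [field.strip() for field in field_list.split(",")]
    records = [
        dict(zip(fields, (value.strip() for value in line.split(","))))
        for line in lines[1:]
    ]
    return name.strip(), records


def split_quantity(text):
    """Split '2qt' into (2, 'qt'); a bare number gets the unit 'each'."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z]*)", text.strip())
    if match is None:
        raise ValueError(f"cannot split quantity: {text!r}")
    number, unit = match.groups()
    value = float(number) if "." in number else int(number)
    return value, unit or "each"


def expand(record):
    """Turn a compact record into explicit name, quantity, and unit fields."""
    quantity, unit = split_quantity(record["quantity"])
    return {"name": record["ingredient"], "quantity": quantity, "unit": unit}


block = """
ingredients:: ingredient, quantity
    eggs, 12
    water, 2qt
"""

name, records = parse_records(block)
for record in records:
    print(expand(record))
```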
By adding and recording these constraints, we get benefits similar to those we got before. We can better enforce any constraints we have about how ingredient lists are structured and formatted, and we gain access to the specific data involved. This means, for example, that we could write an algorithm to convert our units from imperial to metric for publication in markets where metric units are preferred.
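A conversion algorithm of that kind could be sketched as follows. The conversion table is an assumption: it treats “qt” as the US liquid quart (about 0.946 litres), while an imperial quart is about 1.137 litres, so a real system would need to know which one its authors mean.

```python
# ASSUMPTION: "qt" is the US liquid quart. Extend the table as needed.
TO_METRIC = {
    "qt": ("l", 0.946353),
}


def to_metric(ingredient):
    """Return a copy of the ingredient with its quantity in metric units."""
    unit = ingredient["unit"]
    if unit not in TO_METRIC:
        # Units such as "each" have no metric equivalent and pass through.
        return dict(ingredient)
    metric_unit, factor = TO_METRIC[unit]
    return {
        **ingredient,
        "quantity": round(ingredient["quantity"] * factor, 2),
        "unit": metric_unit,
    }


ingredients = [
    {"name": "eggs", "quantity": 12, "unit": "each"},
    {"name": "water", "quantity": 2, "unit": "qt"},
]

for ingredient in ingredients:
    print(to_metric(ingredient))
```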
Cost and opportunity profiles: key differences between subject and document domains
At this point we can see that with subject domain markup, we have a lot of choices about how documents are constructed. This highlights an important difference between document domain and subject domain markup. A document domain markup language specifies the content and order of a document. We expect that the document domain markup will specify exactly what content is to appear on the page and in what order. This is necessary in the document domain because the document domain does not record any information about the specific subject matter of individual pieces of text. We can’t write an algorithm to publish certain pieces of a document domain file, because the markup does not record which are which. (More on this in the next article.)
Once we introduce subject domain markup, however, that changes. With subject domain markup in place, we can write algorithms that select certain pieces of the content to display or not display. This means that the recorded content no longer specifies exactly what content is to appear on the eventual rendered page and in what order. Rather it is a collection of identifiable pieces of content that you can select from or reorder for publication.
Let’s suppose we run a publishing company that publishes a number of magazines. We want to create a common store of recipes for use in all the magazines. But different magazines have different requirements. Wine Weenie magazine needs to have a wine match with every recipe. The Teetotaler’s Trumpet, naturally, wants a non-alcoholic suggestion.
Here is how that might be handled in the document domain:
    <section publication="Wine Weenie">
        <title>Wine match</title>
        <p>Pinot Noir</p>
    </section>
    <section publication="The Teetotaler's Trumpet">
        <title>Suggested beverage</title>
        <p>Lemonade</p>
    </section>
This is an example of what we call conditional text. The “publication” attribute says, display this text only in this publication. (This makes it management metadata, which we will talk about in a future article.)
By contrast, this is how this might be handled in the subject domain:
    <wine-match>Pinot Noir</wine-match>
    <beverage-match>Lemonade</beverage-match>

This markup says nothing about which documents should contain either of these pieces of information. Nor does it contain the subheadings that would introduce either of them in the appropriate publication. This produces a number of interesting consequences:
- We must take an additional step to publish this content. Any actual publication requires a specification of what content will appear where, so we need to create a document domain representation of this content for each publication we want to publish it in. Selecting the appropriate content for each document is the task of what I will call the synthesis algorithm.
- The document domain versions that we create for each publication don’t need the conditional markup that is required when we write directly in the document domain. We apply conditions externally in the synthesis algorithm and create different document domain files for each publication.
- The author does not have to know anything about the mix of publications in which this recipe will appear. They just write a recipe and supply all the pieces of information they are asked for.
- If we add a new publication to our stable called Family Dining, we can write a new synthesis algorithm to create recipes with both an alcoholic and non-alcoholic beverage suggestion. If our content was recorded in the document domain, we would have to go back through all the recipes and add new conditional markup to describe the content to appear in Family Dining. Having our content in the subject domain could thus save us major costs, and could make new forms of publication easier and more economically attractive. For instance, if we wanted to suppress some content for mobile publication, we could do so, without editing any of our existing content.
- The subject domain markup is specific to recipes. Only recipes have wine-match and beverage-match fields. By contrast, the document domain markup is much more general. It could be used to drive any publishing process that publishes the same content to multiple publications. This reduces the cost of developing the markup and the algorithms to process it. But, it also means that authors have to know a lot more about how the publishing system works, and what content is appropriate for each publication. This requirement reduces your pool of authors and makes authoring more difficult and therefore more expensive. Finally, it means that it is much more expensive to maintain the content if you change your roster of publications and need to implement new rules.
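The synthesis step described in the points above could be sketched in Python like this. The rule table, the function name `synthesize`, and the reuse of the “Wine match” and “Suggested beverage” titles for Family Dining are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical rules: for each publication, which subject domain fields to
# include and the subheading each one gets in the document domain output.
PUBLICATION_RULES = {
    "Wine Weenie": [("wine-match", "Wine match")],
    "The Teetotaler's Trumpet": [("beverage-match", "Suggested beverage")],
    "Family Dining": [
        ("wine-match", "Wine match"),
        ("beverage-match", "Suggested beverage"),
    ],
}


def synthesize(recipe, publication):
    """Build the document domain sections one publication should receive."""
    sections = []
    for field, title in PUBLICATION_RULES[publication]:
        if field in recipe:
            sections.append(
                f"<section><title>{title}</title>"
                f"<p>{escape(recipe[field])}</p></section>"
            )
    return "\n".join(sections)


recipe = {"wine-match": "Pinot Noir", "beverage-match": "Lemonade"}

for publication in PUBLICATION_RULES:
    print(publication)
    print(synthesize(recipe, publication))
```

Adding Family Dining took one new entry in the rule table. The recipe itself was not touched, and the output contains no publication attributes because the conditions live in the algorithm.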
As the above discussion illustrates, writing in the document domain and writing in the subject domain have very different cost and opportunity profiles. Writing in the document domain is usually cheaper to start with, but runs into maintenance costs and opportunity costs when your needs change. Writing in the subject domain incurs more up-front costs, but can save you money and create more opportunities down the road.
In the document domain article, we also noted that document types are not universal. Some work better in one medium, and some in another. While content recorded in the document domain can be formatted for any medium, it is not necessarily structured perfectly for every medium. Different media have different properties, such as behavior and navigation, which demand a different approach to document structure and even to the way content is written. For hypertext environments, where readers are likely to arrive at a page by search and navigate onwards by links, a page is a hub, not a leaf, and this can make a big difference to writing style, document structure, and the subject matter you include inline (versus linking to it).
If we want to create different document structures for different media, recording our content in the subject domain gives us that flexibility.
Using subjects to establish context
In the discussion of the document domain, we noted that we can use context to identify the role that certain structures play in a document, which allows us to get away with fewer structures. For instance, we can use a single title tag for all titles because we can tell what kind of title each one is from the context in which it occurs. The same is true with subject domain structures. They can provide context that allows us to treat basic text structures differently.
Back to our markup language for recipes:
    recipe: Hard Boiled Egg
        introduction:
            A hard boiled egg is simple and nutritious.
        ingredients:: ingredient, quantity
            eggs, 12
            water, 2qt
        preparation:
            1. Place eggs in pan and cover with water.
            2. Bring water to a boil.
            3. Remove from heat and cover for 12 minutes.
            4. Place eggs in cold water to stop cooking.
            5. Peel and serve.
        prep-time: 15 minutes
        serves: 6
If we want to have a separate formatting for the list of steps in a recipe procedure, we can do so without having to create a separate recipe-preparation list type. The list of steps here is in an ordinary ordered-list structure, but that structure is the child of a preparation element which is the child of a recipe element. We can write a rule that creates special formatting just for ordered lists that are the children of preparation elements that are children of recipe elements.
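A rule like this can be sketched in Python by walking the content tree and looking at each list’s ancestry. For convenience the sketch assumes the recipe has been parsed into XML-style elements. The element names `ol`, `li`, and `notes`, and the class names it returns, are made up for the example.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: an XML rendering of the recipe. The <notes> list is invented
# to show an ordered list that should NOT get the special formatting.
SOURCE = """
<recipe>
  <notes>
    <ol><li>Use older eggs for easier peeling.</li></ol>
  </notes>
  <preparation>
    <ol>
      <li>Place eggs in pan and cover with water.</li>
      <li>Bring water to a boil.</li>
    </ol>
  </preparation>
</recipe>
"""


def list_class(path):
    """Choose formatting for an ordered list from the elements above it."""
    if path[-3:] == ("recipe", "preparation", "ol"):
        return "recipe-steps"
    return "plain-list"


def classify_lists(element, ancestors=()):
    """Yield (path, formatting class) for every ordered list in the tree."""
    path = ancestors + (element.tag,)
    if element.tag == "ol":
        yield "/".join(path), list_class(path)
    for child in element:
        yield from classify_lists(child, path)


for path, css_class in classify_lists(ET.fromstring(SOURCE.strip())):
    print(path, "->", css_class)
```

Both lists use the same ordinary structure. Only the context differs, and that is all the rule needs.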
Future-proofing content
One important motivation for structured writing is what is often called “future-proofing”. Future-proofing means building a system or product with a view to making it able to survive future changes in environments or requirements. However, future proofing is difficult, because you cannot know with certainty what changes will occur, how likely they are, or what they will cost.
Building a future-proof platform can increase up-front costs and delay your time to market (possibly causing you to miss a window of opportunity). Nor can you be sure that your investment will actually pay off, since the future you prepared for may not be the future you get.
On the other hand, if you choose not to build a future-proof platform, you risk not being able to keep up with developments in a market and losing your early lead. You may eventually face massive and expensive changes when future events render your current system obsolete. Instances of both problems abounded when traditional publication systems were confronted with the rapid rise of the Web.
Rather than anticipating the particular way in which the future will develop, you can take the safest approach to future proofing: create features that will be of value no matter what happens in the future. Creating content in the subject domain is the best way to practice this kind of future proofing for content, because it creates metadata that contains only true statements about the subject matter itself. Those statements are going to remain true as long as the subject matter itself remains unchanged. That is as future proof as you can make your content.
For example, suppose you write your ingredient list in reStructuredText as a table:
    ======  ========
    Item    Quantity
    ======  ========
    eggs    12
    water   2qt
    ======  ========
Later you decide that you want to present ingredients as a list instead. To do this, you will have to go back to your content and change the markup. Doing this across a whole collection of recipes will be expensive.
Suppose instead that you use subject domain markup:
    ingredients:: ingredient, quantity
        eggs, 12
        water, 2qt
Now you don’t have to change the content to make the change in presentation. You just change the presentation algorithm. Thus the subject domain markup has future proofed your content against this change of layout. The document domain reStructuredText markup specified the use of a table, which is not a truth about the subject matter, but a decision about layout that can change independent of the subject matter. The subject domain markup simply specifies that “eggs” is an ingredient and “12” is a quantity. These are truths about the subject matter that will not change. Thus they are invulnerable to future changes outside of the subject matter itself.
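To illustrate, here are two interchangeable presentation functions over the same ingredient records, one producing a reStructuredText table and one a bullet list. This is a sketch: the record layout and the function names are assumptions.

```python
# The same subject domain data feeds both presentations.
ingredients = [
    {"ingredient": "eggs", "quantity": "12"},
    {"ingredient": "water", "quantity": "2qt"},
]


def as_rst_table(ingredients):
    """Render the ingredients as a reStructuredText simple table."""
    headers = ("Item", "Quantity")
    rows = [(i["ingredient"], i["quantity"]) for i in ingredients]
    widths = [max(len(row[col]) for row in (headers, *rows)) for col in range(2)]
    rule = "  ".join("=" * width for width in widths)

    def fmt(row):
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        return "  ".join(cells).rstrip()

    return "\n".join([rule, fmt(headers), rule, *(fmt(r) for r in rows), rule])


def as_rst_list(ingredients):
    """Render the same ingredients as a reStructuredText bullet list."""
    return "\n".join(f"* {i['quantity']} {i['ingredient']}" for i in ingredients)


print(as_rst_table(ingredients))
print()
print(as_rst_list(ingredients))
```

Switching every recipe from tables to lists is then a matter of calling a different function. No source content changes.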
Moving your content from the media domain to the document domain provides a degree of future proofing. By factoring out the formatting details, you protect your content against changes in formatting rules. Moving your content from the document domain to the subject domain provides additional future proofing. By factoring out the content and organization of documents, you can target different publications and create different document designs for different media.
Simplicity and Clarity
One of the biggest benefits of subject domain markup for authors is a much higher degree of simplicity and clarity, compared with a typical document domain language.
While a general document domain language like DocBook needs to have elements that address a wide range of document structures, a recipe markup language such as the one we have developed in this article has only a few simple elements. Better still, it generates very few permutations of those elements. Because subject domain languages do not specify document order, we don’t need to allow for many possible document orderings in the language, which reduces the permutations we have to allow for and deal with. The synthesis algorithm can take the named structures of the subject domain markup and order them in any way you like.
Because subject domain structures describe the subject matter they contain, they are also much clearer to authors, who may not understand complex document structures (or, more often, the subtle distinctions between several similar document structures), but who do (we hope) understand their subject matter.
The combination of simplicity and clarity means that in many cases you can get authors to create subject-domain structured content with little or no training. For instance, even if we add some additional fields to our recipe markup, you could still hand a sample like the one below to an author and ask them to follow it as a template, without giving them any training or any special tools.

    recipe: Hard Boiled Egg
        introduction:
            A hard boiled egg is simple and nutritious.
        ingredients:: ingredient, quantity
            eggs, 12
            water, 2qt
        preparation:
            1. Place eggs in pan and cover with water.
            2. Bring water to a boil.
            3. Remove from heat and cover for 12 minutes.
            4. Place eggs in cold water to stop cooking.
            5. Peel and serve.
        prep-time: 15 minutes
        serves: 6
        wine-match: champagne and orange juice
        beverage-match: orange juice
        nutrition:
            serving: 1 large (50 g)
            calories: 78
            total-fat: 5 g
            saturated-fat: 0.7 g
            polyunsaturated-fat: 0.7 g
            monounsaturated-fat: 2 g
            cholesterol: 186.5 mg
            sodium: 62 mg
            potassium: 63 mg
            total-carbohydrate: 0.6 g
            dietary-fiber: 0 g
            sugar: 0.6 g
            protein: 6 g
Of course, the downside is that recipe markup is only good for one thing: recipes. A general document domain language can be used to write all kinds of documents. It will not enforce or record nearly as many constraints, or enable nearly as many options for validation or publishing, and it won’t be nearly as clear and simple for authors to use. But neither will it require you to create subject domain languages for each of the subjects you write about. At first glance, that may seem like a slam dunk case for sticking with the document domain, as the idea of inventing subject domain languages and the synthesis and presentation algorithms to go with them may seem daunting. But as we will see, the decision is not so clear cut, as sticking with the document domain comes with a lot of complexity, and sometimes custom development, that may not be apparent at first.
Using available subject domain languages
However, moving to the subject domain does not necessarily mean having to develop all your subject domain structures and their accompanying synthesis and presentation algorithms yourself. Subject domain languages already exist in many fields. For instance, there are at least three recipe markup languages out there: REML, RecipeML, and CookML.
Some of the most commonly used subject domain languages are those for documenting software APIs. These include Sphinx (Python), JavaDoc (Java), and Doxygen (multiple languages). We will look at some of these in more detail later.
Wikipedia contains a long list of XML markup languages (though note that not all subject domain languages use XML). Many of these are subject domain languages. Does this mean that if a subject domain language already exists for your subject matter, you should use it rather than developing your own? Not necessarily. It is pretty much a universal rule of markup languages (and a lot of other things) that the more needs they try to serve, the more complex they become. If you take a look at REML and CookML you will see that both are more complex than the recipe markup language we developed here.
The upside of these existing languages is that they may have algorithms and systems associated with them that you can take advantage of. However, they are often more complex than you need, sometimes less clear, and often don’t enforce or record all the constraints you want for your business.
You don’t have to adopt these languages directly, however, to take advantage of the systems that use them. Since all recipe languages describe the same subject matter, it is easy to transform content from one to another where they have equivalent structures. This allows you to create your own language to maximize simplicity and clarity for your authors, and to enforce and record all the constraints that are important to you, and still take advantage of existing functionality with a simple transformation.
The same is true of existing document domain languages. If you have your own subject domain language, you do not have to create an entire publishing chain for it. You only need to create an algorithm to transform it to an existing document domain language that has the publishing features you need.
Limits of Subject Domain Markup
While all content is specific to its subject matter, not all content breaks down into such easily identifiable fields as a recipe. A generic essay document format fits equally well for an essay on radishes as an essay on asteroids. Subject-specific structures are much easier to discern for reference works. The format of a telephone directory, an API reference, or a parts list is always specific to the subject matter. In fact, we find that the format of an API reference for one programming language can be different from the format of another language because of differences between the languages.
On the other hand, many document types today are typically written as unstructured text when they could well be written in specific subject domain structures. The content could be separated into distinct fields, as we separated prep time and number served into distinct fields in our recipe example, but is instead presented ad hoc as part of general text. Developing and applying subject domain templates that pull out this content into distinct and accessible fields can serve to make the content more consistent, more accessible to the reader, and more adaptable for various publishing scenarios. We will look at some examples in later articles.
This only scratches the surface of what you can do when you record your content in the subject domain. In future articles, we will look at the major algorithms of structured content, the benefits they provide, and how they work with content in each of the domains.