Linking Algorithms and Structured Writing

Few readers read content straight through. Unless the content is perfectly matched to their experience and their goals, they reach points where they need more information, points where they need less, or points where they decide that they need something else altogether. These are deflection points in the content, points at which the reader’s “next” may not be the thing that comes next in the linear order of the work. That may mean deflecting to other content or to a different way of finding information such as asking a friend or posting a question on a forum. So let’s take an in-depth look at deflection and how linking algorithms in structured writing support it.

Deflections are a natural part of information seeking, or what is also known as information foraging. The reader, in pursuit of their individual ends, will follow where the scent of information leads and that will not always be the next paragraph of the current text.

Writers know they can’t meet everyone’s needs perfectly every time, so they use links and other devices to help readers deflect when they need to. For example: if you have Model A, do this; if you have Model B, do that. Supporting deflection helps readers achieve their goals and helps keep the reader in the writer’s own content, or other preferred content, rather than deflecting elsewhere.

Writers can handle deflection points in various ways. The writer may choose to do nothing, leaving it up to the reader to look up a word if they don’t understand it, for instance. They may provide footnotes, cross references, sidebars, parenthetical material, or hypertext links as deflection choices. They may use tables or flowcharts to allow readers to choose different paths through content. They may even attempt to anticipate and forestall deflection by using information about the individual reader to dynamically reorder the content to suit the reader’s needs.

Deflection costs the reader more on paper than online. For paper, we may design content to minimize the need to deflect, or to keep deflections inside the local work. In contrast, we may organize content for the web with deflection in mind, allowing different readers to choose their own course, rather than trying to optimize one course for all readers. This difference in the ease of deflection between paper and the Web and other hypertext media is one of the main reasons we want to practice differential single sourcing.

Deflection also enters into the discussion of content reuse. We reuse content in multiple documents so that readers don’t have to deflect from one document to another to find it.

If we reuse content in different media, we might want to have different reuse strategies for paper and hypertext outputs. We may want to include the same chunk of content in multiple paper documents but link to a single copy of it when creating a hypertext. (Linking, in other words, is a kind of reuse: reuse by reference rather than copying.)

Thus we should not be thinking solely in terms of managing links in our content. We should be thinking about implementing the right deflection strategy in each of our outputs. To see how that works, let’s analyze deflection and linking algorithms in each of the structured writing domains.

Deflection in the media domain

In the media domain, we simply record the various deflection devices as such: cross references, tables, links, etc. For example, in HTML a link simply specifies a page to load:

<p>In Rio Bravo, <a href="">the Duke</a> plays
an ex-Union colonel out for revenge.</p>

The phrase “the Duke” is a deflection point. The reader may not know who “the Duke” is, or may want more information on him. The link supports the reader at the deflection point. The reader can either deflect by clicking the link or stay the course and read on.

But if the HTML page gets printed, the link is lost. The phrase “the Duke” is still a deflection point. The reader can still deflect, by doing a search for “the Duke”, perhaps, or asking a friend what it means. But the printed version lacks any support for that deflection.

If the content had been written for paper, the deflection point might be supported in a different way. For example, it might be supported by adding an explanation in parentheses. (Parenthetical material is a type of deflection; it may be read or skipped.):

In Rio Bravo, the Duke (John Wayne) plays an ex-Union colonel out for revenge.

Or it might be handled with a footnote:

In Rio Bravo, the Duke* plays an ex-Union colonel out for revenge.

* “The Duke” is the nickname of the actor John Wayne.

Clearly this is a case in which differential single sourcing should work well, where we need to handle a deflection point differently in different media. To accomplish this, we need to move the content out of the media domain.

Deflection in the document domain

When we move to the document domain, we factor out the formatting-specific structures of the media domain. But a link is not really a piece of formatting, so conventional refactoring into abstract document structures is not going to apply. For this reason, people working in the document domain often enter hypertext links exactly the way they would in the media domain: by specifying a URL. Thus in DITA you might enter a link as:

<p>In Rio Bravo, <xref href="" format="html">The Duke</xref>
 plays an ex-Union colonel out for revenge.</p>

The difference from HTML is slight here. The link element is called xref rather than a. But the meaning of xref is a bit more general. The HTML a element is saying, create a hypertext link to this address. The DITA xref element is saying, create some sort of reference to this resource. (As we will see in a moment, it is capable of linking to things other than HTML pages, which is why it requires the format attribute to specify that in this case the target is an HTML page.) This generality gives us a little more leeway in processing. We can legitimately create print output from this markup that looks like this:

In Rio Bravo, the Duke (see: plays an ex-Union
colonel out for revenge.

This is not the way we would handle the deflection point if we were designing for paper, but it is a small improvement from a differential single sourcing point of view. At least the link is now visible to the reader. (Technically we could do this from the HTML markup as well, but that would be cheating. The HTML markup doesn’t really give us permission to do this. Rather, it tells us to create a hypertext link and nothing else. The problem with cheating is that we assume constraints that are not being promised or enforced, and this can fail in ways we may not expect or catch. Some cheats are more reliable than others, but it’s probably best to avoid that habit.)

Fundamentally, though, this is not a satisfactory differential single sourcing solution. Unless there were no alternative, we would not normally direct a reader of paper to the web for more information, nor vice versa. Linking to an already published file, such as an HTML page, commits us to a particular format for the link target. If we link to content that has not yet been published, we gain the freedom to link to any format of that content that we choose to publish. The simplest way is to link to a source file rather than an output file.

In DITA, we can link to another DITA file (the default format, so we don’t need the format attribute):

<p>In Rio Bravo, <xref href="John_Wayne.dita">The Duke</xref> plays an ex-Union colonel
out for revenge.</p>

We don’t yet know if that DITA file will be published to paper or the Web, what the address of the published topic will be, or if that topic will stand alone or be assembled into a larger page or document for publication. This means that the publishing system is taking on responsibility for both ends of the link. It must make sure that the target page is published in a way the source page can link to, and that the source page links to the right address. But taking on this responsibility gives us the leeway to publish this link as we see fit.

If we publish as a book on paper and the target resource ends up as part of a chapter in the same book, we can render the xref as a cross reference to the page that resource appears on. We could format that cross reference inline or as a footnote. These are all legitimate interpretations of the xref’s instruction to create a reference to a resource.

If we publish to a help system and the target resource ends up as a topic in the same help system, we could render the xref as a hypertext link to that topic.
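As a rough illustration of that leeway, here is a minimal Python sketch of a presentation step that renders the same xref differently per output. The lookup tables, page number, and help URL are invented for the example; a real publishing system would compute them during the build.

```python
import xml.etree.ElementTree as ET

# Hypothetical build-time lookup tables: where each source file ended up
# in each output. These values are assumptions made up for this sketch.
PAGE_NUMBERS = {"John_Wayne.dita": 42}
HELP_URLS = {"John_Wayne.dita": "topics/john_wayne.html"}


def render_xref(xref: ET.Element, medium: str) -> str:
    """Render one xref element as text suited to the output medium."""
    target = xref.get("href")
    text = xref.text or ""
    if medium == "paper":
        # Cross reference to the page the target resource landed on.
        return f"{text} (see page {PAGE_NUMBERS[target]})"
    if medium == "help":
        # Hypertext link to the published topic.
        return f'<a href="{HELP_URLS[target]}">{text}</a>'
    raise ValueError(f"unknown medium: {medium}")


xref = ET.fromstring('<xref href="John_Wayne.dita">The Duke</xref>')
print(render_xref(xref, "paper"))
print(render_xref(xref, "help"))
```

Both renderings still point at the same resource. Only the form of the reference changes.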

This is a big step forward, but it still does not let us do this:

In Rio Bravo, the Duke (John Wayne) plays an ex-Union colonel out for revenge.

In other words, we can render the xref as a cross reference or a link or a footnote, but we can only handle the deflection point as a reference to the specified resource. We can’t decide to link to a different resource or handle it by parenthetical clarification instead. To give ourselves the ability to link to different resources, we can turn to the management domain.

Deflection in the management domain

Linking to a source file rather than to an address gives us more latitude about how the link or cross reference is published, but we still always link to the same resource. If we are doing content reuse, this is a problem because we do not know if the same resource will be available everywhere we reuse our topic. We need to be able to link to different resources when our topic is used in different places.

To accommodate this, we can factor out the file name and replace it with an ID or a key. IDs and keys are management domain structures that we looked at in the article on the reuse algorithm. They allow us to refer to resources indirectly. Using IDs lets us use an abstract identifier rather than a file name to identify a resource. Using keys lets us remap the resources we point to. This makes keys the more efficient way to address this problem. So instead of referring to a specific resource on John Wayne, we refer to the key John_Wayne using a keyref attribute:

<p>In Rio Bravo, <xref keyref="John_Wayne">The Duke</xref> plays an ex-Union colonel
out for revenge.</p>

Somewhere in the DITA map for each publication, the key John_Wayne points to a topic. Publications link the keyref to the resource pointed to by that key in each of their DITA maps. This allows us to link to different resources in each publication.
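The indirection can be pictured with a small Python sketch. This is not how a DITA processor is implemented; the publication names and file paths are hypothetical, and each dictionary stands in for the key definitions of one publication's map.

```python
from typing import Optional

# Hypothetical key definitions, one dictionary per publication map.
KEY_MAPS = {
    "westerns_guide": {"John_Wayne": "actors/john_wayne_westerns.dita"},
    "hollywood_bios": {"John_Wayne": "bios/wayne_john.dita"},
    "directors_guide": {},  # this publication defines no John_Wayne key
}


def resolve_keyref(publication: str, key: str) -> Optional[str]:
    """Return the resource a key points to in one publication, if any."""
    return KEY_MAPS.get(publication, {}).get(key)


for publication in KEY_MAPS:
    print(publication, "->", resolve_keyref(publication, "John_Wayne"))
```

The same keyref in the source resolves to a different resource in each publication, and to nothing at all where the map does not define the key.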

The problem with IDs and Keys

However, we face another problem with linking based on IDs and keys. Keys will let us vary which resource a keyref resolves to, but what happens when there is no resource to which that key can reasonably be assigned?

The xref demands that a reference to a resource be created. But if there is no resource to link to, we will have a broken link, and fixing it is not easy. We can’t simply go in and remove the xref from the source for one publication, because that defeats the purpose of content reuse if we have to edit the content every time we reuse it. Removing the key reference would fix the broken link in one publication, but that would result in the link being removed from all the publications, even where the resource does exist and the link ought to be created.

Relationship tables

One approach we can use to solve the link-only-when-resource-available problem is the relationship table. In a conventional linking approach, the source page contains an embedded link structure pointing to the target page. The source knows it is pointing to the target, but the target does not know it is being pointed to.


The idea that the target resource does not know it is being pointed to is important because it means it does not have to do anything in order for other resources to point to it. The fact that only the source and not the target has to know about the link is fundamental to the rapid growth of the Web. If the target resource had to participate in the link process, the Web could never grow as explosively and organically as it has.

A relationship table takes this one step further. When we create a link using a relationship table, we factor the link out of the source document and place it in a separate table. The relationship table says resource A links to resource B, but neither resource A nor resource B knows anything about it. (Think of it like being introduced to a stranger by a third party because you share a common interest. I collect china ducklings. You make china ducklings. We don’t know each other, but our mutual friend Dave introduces us. You and I are the source and destination resource; Dave is the relationship table.)


Notice that Dave has three choices about how he does the introduction. He can tell me about you as a seller of ducklings, or you about me as a buyer of ducklings, or he can introduce us both to each other. In the same way a relationship table can describe a link A to B, B to A, or both ways.

Once we factor the links out of a piece of content, we can reuse it anywhere we like. If a suitable resource exists to link to, we enter it in a relationship table for that build and have the presentation algorithm create the link at build time. If no suitable resource is available for a different publication, no entry is made in the relationship table for that publication, and the presentation algorithm does not create a link.
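A minimal Python sketch of the idea, assuming a relationship table is simply a list of source and target pairs kept outside the content. The Relation type and the topic names are hypothetical and not taken from any particular tool.

```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    """One row of a relationship table (hypothetical structure)."""
    source: str
    target: str
    both_ways: bool = False


def links_for(topic: str, table: List[Relation]) -> List[str]:
    """Collect the links the build should generate for one topic."""
    links = []
    for relation in table:
        if relation.source == topic:
            links.append(relation.target)
        elif relation.both_ways and relation.target == topic:
            links.append(relation.source)
    return links


# One table per publication. The topics themselves contain no links.
film_guide_table = [Relation("rio_bravo", "john_wayne", both_ways=True)]
plot_summaries_table: List[Relation] = []  # no suitable resource, so no row

print(links_for("rio_bravo", film_guide_table))
print(links_for("rio_bravo", plot_summaries_table))
```

The same topic gets a link in one publication and none in the other, and nothing in the topic source had to change.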

The problems with link tables

But, while link tables address the problem of link management in reused content and allow us to link to different targets in different publications, they separate the link from the deflection point it supports. The link that marked the deflection point has been factored out of the content so there is no way to put it back inline. Links generated by relationship tables end up in a block, usually at the end of the page.

Since the end of a page is a deflection point in a hypertext system, there is a legitimate case for creating and managing page-level links. But if the page is well designed to fulfill a discrete purpose for the reader, the end of the page is actually the point at which the writer knows least about what the reader might want to read next. The relationship table approach does not support the full array of foreseeable reader deflection points which occur in the body of the page rather than its end.

The other problem with the relationship table approach is that it is time consuming. We have to rewrite the links for each content set. And because the deflection points are not recorded in the content source, we have to figure out the appropriate links each time. This goes against the spirit of recording something once and using it many times. A mechanism intended to help us reuse content ends up forcing us to redo the work of linking for each publication we create.

Conditional linking

Before we leave the management domain, it is worth mentioning a management domain approach that we could use to address our differential single sourcing problem and get the appropriate deflection strategy for online and paper publishing. We could use conditional structures to define both options in the source file. With a little specialization to support media as a conditional attribute, we could do this in DITA:

<p>In Rio Bravo, <ph media="online"><xref keyref="John_Wayne">The Duke</xref></ph><ph media="paper">
The Duke (John Wayne)</ph> plays an ex-Union colonel out for revenge.</p>

In DITA, the ph element is used to delineate an arbitrary phrase in the content to which we want to apply management domain attributes. Here we define two different versions of the phrase “the Duke”, each with different forms of deflection support, and each with a corresponding media condition. The synthesis algorithm would then choose the appropriate version of the phrase for each publication based on the conditions set for the build.
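To make the selection step concrete, here is a small Python sketch of a filter that keeps only the phrases whose media attribute matches the build. It is an illustration under simplifying assumptions, not a DITA processor: it treats media as the only condition, and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<p>In Rio Bravo, <ph media="online"><xref keyref="John_Wayne">The Duke</xref></ph>'
    '<ph media="paper">The Duke (John Wayne)</ph>'
    ' plays an ex-Union colonel out for revenge.</p>'
)


def apply_media_condition(parent: ET.Element, medium: str) -> None:
    """Drop children whose media attribute excludes this build, in place."""
    previous = None
    for child in list(parent):
        media = child.get("media")
        if media is not None and media != medium:
            # Keep the text that followed the dropped element.
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
        else:
            apply_media_condition(child, medium)
            previous = child


def build(medium: str) -> ET.Element:
    """Parse the source and apply the media condition for one build."""
    paragraph = ET.fromstring(SOURCE)
    apply_media_condition(paragraph, medium)
    return paragraph


paper = build("paper")
online = build("online")
print("".join(paper.itertext()))
print("".join(online.itertext()))
```

The paper build keeps the parenthetical version and no xref. The online build keeps the xref and drops the parenthetical.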

The obvious problems with this approach include the fact that it requires twice the work for authors to create every link, and it doubles the maintenance cost of the content as well. It also flies in the face of the idea of creating formatting-independent content.

Unfortunately, in a general purpose document domain tagging language with management domain support, it is pretty much impossible to prevent writers from doing things like this in order to achieve the effects they want. And in practice writers do end up using conditional markup like this for all kinds of differential single sourcing and reuse problems that are not easy to solve in the document and management domains. In some cases this can lead to tangles of conditions that are hard to maintain and debug.

For an alternate approach to this problem, and the others we have discussed, we can turn to the subject domain.

Deflection in the subject domain

The management domain approach to links poses several big problems:

  • Like all management domain structures, they are artificial. They don’t correspond to things in the author’s everyday world, which makes them harder to learn and use.
  • We can’t link to a key or an ID that does not exist. This means that as we are developing a set of content, the first pages we write have very few other pages to link to. Authors cannot enter links to content that has not been written yet.
  • In reuse scenarios, the use of IDs and keys does not solve the whole problem because it cannot guarantee that the resource that an ID or key refers to will be present in the final publication. We can use relationship tables to address this problem, but they create additional complexity for authors and have the disadvantage that we can’t use them to create inline links.
  • Unless we resort to ugly conditional structures, we can’t use media-appropriate deflection mechanisms for differential single sourcing.

As we have seen before, we can often remove the need for management domain structures by moving content to the subject domain. The same is true with deflection points.

In the document domain we handled a deflection point by specifying a resource to link to, specifying both that the deflection mechanism would be a link and that the link target would be a particular page.

In the management domain we used keys to factor out the target resource but not the deflection mechanism (it was still an xref).

In the subject domain, we can factor out both the target resource and the deflection mechanism. We do this by marking up the subject of the deflection point:

<p>In <movie>Rio Bravo</movie>, <actor name="John Wayne">the Duke</actor> plays an ex-Union
colonel out for revenge.</p>

This markup clarifies that the phrase “the Duke” refers to the actor named John Wayne. These are respectively the type of the subject (actor) and its value (John Wayne).

Given this markup, we can easily create the paper-style deflection mechanisms we have been looking for. We simply have the presentation algorithm take the value of the name attribute and output it between parentheses:

<p>In Rio Bravo, The Duke (John Wayne) plays an ex-Union colonel out for revenge.</p>
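A presentation step of this kind might be sketched in Python as follows. The function name is hypothetical, and the sketch assumes the flat paragraph structure shown in the example, with annotations that contain only text.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<p>In <movie>Rio Bravo</movie>, <actor name="John Wayne">the Duke</actor> '
    'plays an ex-Union colonel out for revenge.</p>'
)


def render_for_paper(paragraph: ET.Element) -> str:
    """Flatten a paragraph to text, clarifying actor mentions in parentheses."""
    parts = [paragraph.text or ""]
    for child in paragraph:
        text = child.text or ""
        if child.tag == "actor" and child.get("name"):
            # Output the value of the name attribute between parentheses.
            parts.append(f"{text} ({child.get('name')})")
        else:
            # Other annotations, such as movie, pass through as plain text.
            parts.append(text)
        parts.append(child.tail or "")
    return "".join(parts)


print(render_for_paper(ET.fromstring(SOURCE)))
```

Nothing in the source told the algorithm to produce parentheses. That choice lives entirely in the presentation step and could be swapped for a footnote or a link.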

The subject domain markup is not link markup. Unlike the document domain markup, it does not insist that a reference should be created nor does it specify any resource to link to. This markup is a subject annotation. It clarifies that the phrase “the Duke” refers to the actor named John Wayne (and not the Duke of Wellington or the Duke of Earl) and that the phrase “Rio Bravo” refers to the movie (and not to the city in Texas or the nature reserve in Belize). That clarification is what allows us to produce the parenthetical explanation of the phrase in the example above. It also allows us to create a link if we want to. We’ll look at how in a moment. But first we should look at the implications of subject annotation more deeply.

Subject annotation markup says, “this is an important subject that we care about in this context”. How is this an appropriate way to handle a deflection point? Writers cannot know with certainty what the deflection points will be for an individual reader. But they can anticipate that important related subjects are likely deflection points. This is what they are doing when they create links in the document domain and it is what they are doing in the subject domain. The difference is that in the document domain, they handle the mention of an important subject by creating markup that says “create a link to resource X”, and in the subject domain they handle it by creating markup that says “this is a mention of important related subject Y”. This leaves us with more options about how to handle the deflection point, and that is what we have been looking for.

Marking up a phrase as a significant subject does not oblige the publishing algorithm to create a link. If we decide to have the publishing algorithm create a link on the Web and a cross reference on paper, nothing in the markup obliges us to use any particular formatting or target any particular resource. There is no question of cheating here if we decide to create one kind of deflection device or another, or not to create one at all. The markup is giving us the information to make our own decisions rather than forcing us to create a particular structure.

In all our previous examples, mentions of “Rio Bravo” were not marked up, even though it is clearly an important subject and a potential deflection point. This reflects the author’s decision not to create a link to support this deflection point. But what if we want to make a different choice later? By marking up “Rio Bravo” as a significant subject, we keep our options open. Now we can tell the presentation algorithm to create links on the names of movies if we want to, or not if we don’t want to.

But there are additional reasons to annotate Rio Bravo as a significant subject, because that annotation can be used for other purposes as well. The subject annotation says that “Rio Bravo” is the title of a movie. In the media domain, the titles of movies are commonly printed in italics. We can use the subject domain movie tags to generate media domain italic styling. We could also use this subject annotation to generate document domain index markers so that we can automatically build an index of all mentions of movies in a work.

Subject annotation thus serves multiple purposes, and correspondingly reduces the amount of markup that is required to support all these different publishing functions. This is a common feature of subject domain markup. None of it is directly tied to specific document domain or media domain structures which will be required to publish the content. Each piece of subject-domain markup may be used to generate multiple document domain and media domain structures. For example, we could generate the following document domain markup from the subject domain markup above (the example is in DocBook):

<para>In
    <indexterm>
        <primary>Rio Bravo</primary>
    </indexterm>
    <citetitle pubwork="movie">Rio Bravo</citetitle>,
    <indexterm>
        <primary>John Wayne</primary>
    </indexterm>
    <ulink url="">The Duke</ulink>
    plays an ex-Union colonel out for revenge.
</para>

This sample contains index markers, formatting of movie titles, and links on actors’ names, all generated based on the subject annotations in the source text. It should be clear how much less work it is for an author to create the subject domain version of this content than the DocBook version. Yet all the same publishing ability is maintained in both versions.
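One way to picture that generation step is the Python sketch below. It is an assumption-laden illustration, not a real toolchain: the KNOWN_PAGES table, the find_url helper, and the target URL are all hypothetical, and the sketch handles only the two annotation types used in the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

SOURCE = (
    '<p>In <movie>Rio Bravo</movie>, <actor name="John Wayne">The Duke</actor> '
    'plays an ex-Union colonel out for revenge.</p>'
)

# Hypothetical lookup of published pages by subject type and value.
KNOWN_PAGES = {("actor", "John Wayne"): "actors/john_wayne.html"}


def find_url(subject_type: str, value: str) -> Optional[str]:
    return KNOWN_PAGES.get((subject_type, value))


def to_docbook(paragraph: ET.Element) -> str:
    """Generate DocBook-style markup from subject annotations."""
    parts = ["<para>", escape(paragraph.text or "")]
    for child in paragraph:
        text = escape(child.text or "")
        if child.tag == "movie":
            # One annotation drives both an index marker and title formatting.
            parts.append(f"<indexterm><primary>{text}</primary></indexterm>")
            parts.append(f'<citetitle pubwork="movie">{text}</citetitle>')
        elif child.tag == "actor":
            name = child.get("name") or (child.text or "")
            parts.append(f"<indexterm><primary>{escape(name)}</primary></indexterm>")
            url = find_url("actor", name)
            # Create a link only when a page for this actor is known.
            parts.append(f"<ulink url={quoteattr(url)}>{text}</ulink>" if url else text)
        else:
            parts.append(text)
        parts.append(escape(child.tail or ""))
    parts.append("</para>")
    return "".join(parts)


docbook = to_docbook(ET.fromstring(SOURCE))
print(docbook)
```

The author wrote two short annotations. The index markers, the title formatting, and the link are all decisions made by the algorithm.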

Generating links from subject annotations offers other advantages:

  • In a reuse scenario, we never have to worry about broken links or creating relationship tables. We generate whatever links are appropriate to whatever topics are available in the presentation algorithm.
  • In a differential single sourcing scenario, we are never tied to one deflection mechanism. We can generate any mechanism we like in whatever media we like.
  • We don’t have to worry about maintaining the links in our content because our source content does not contain any links. The subject annotations in our content are objective statements about our subject matter, so they don’t change. All the links in the published content are generated by the presentation algorithm, so no management is required.
  • We avoid any issues with wanting to link to content that has not been written yet. The subject annotation refers to the subject matter, not a resource. Links to content that is written later will appear once that content becomes available to link to.
  • It is much easier for authors to write because they do not have to find content to link to or manage complex link tables or keys. They just create subject annotations when the text mentions a significant subject. This requires no knowledge of the publishing or content management system. It does not even require knowledge of any other resources in the content set. It only requires knowledge of the subject matter, which the author already has.

Finding resources to link to

Of course, the question remains, what resources do we link to, since they are not specified in the text? If we choose to translate subject annotations into links, we need a way to find resources to link to. We do this by looking up resources based on the subject information (type and value) captured by the subject annotation. For this we need content that is indexed using those types and values (or their semantic equivalents). So naturally this means that we need to index our content. If we have a page on John Wayne, we can index it like this:

topic:
    title: Biography of John Wayne
    index:
        type: actor
        term: John Wayne
    John Wayne was an American actor known for westerns.

Now the linking algorithm looks like this:

match actor
    $target = find href of topic with index where type = actor and term = @name
    create xref
        attribute href = $target
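The same lookup can be sketched in Python. This is only an illustration of the pseudocode above: the TOPICS list stands in for an indexed content set, and the href value is made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical indexed content set. Each topic lists the subjects it covers.
TOPICS = [
    {
        "href": "bios/john_wayne.html",
        "title": "Biography of John Wayne",
        "index": [{"type": "actor", "term": "John Wayne"}],
    },
]


def find_target(subject_type: str, term: str) -> Optional[str]:
    """Find the href of a topic indexed under this type and term."""
    for topic in TOPICS:
        for entry in topic["index"]:
            if entry["type"] == subject_type and entry["term"] == term:
                return topic["href"]
    return None


def link_actor(actor: ET.Element) -> str:
    """Turn an actor annotation into an xref, or plain text if nothing matches."""
    target = find_target("actor", actor.get("name", ""))
    if target is None:
        return actor.text or ""
    xref = ET.Element("xref", href=target)
    xref.text = actor.text
    return ET.tostring(xref, encoding="unicode")


print(link_actor(ET.fromstring('<actor name="John Wayne">the Duke</actor>')))
print(link_actor(ET.fromstring('<actor name="Dean Martin">Dino</actor>')))
```

When no topic is indexed under the subject, the sketch simply emits the text without a link, so there is never a broken link to repair.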

However, content stored in the subject domain may already be indexed effectively enough by its inherent subject domain structures:

actor:
    name: John Wayne
    John Wayne was an American actor known for westerns.
    film: Rio Bravo
    film: The Shootist

Here the topic type is actor, and the name field specifies the name of the actor in question. This is all the information we need to identify this topic as a source of information on the actor John Wayne.

Only very minor changes to the linking algorithm are required to use this:

match actor
    $target = find href of actor topic where name = @name
    create xref
        attribute href = $target

There is a lot more to how this mechanism works in practice, including how to handle imperfect matches and what happens when the query returns multiple resources. But that takes us into the specifics of individual systems and that is more detail than we need for present purposes.

Indexing of topics may also be done by a content management system, in which case the linking algorithm would query the CMS to find topics to link to.

A useful feature of this approach is that we can have the publishing algorithm fall back to creating a link to an external resource if an internal one is not available. If a search of the index of our own content fails, we can search indexes of external content. We can build such an index ourselves, but some external sites may also provide indexes, APIs, or search facilities that we can use to locate appropriate pages to link to.
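A minimal Python sketch of that fallback, assuming each index can be modeled as a dictionary keyed by subject type and value. Both indexes and both addresses are hypothetical. A real external lookup would more likely be an API call or a search query.

```python
from typing import Dict, List, Optional, Tuple

Index = Dict[Tuple[str, str], str]

# Hypothetical indexes, searched in order of preference.
INTERNAL_INDEX: Index = {
    ("actor", "John Wayne"): "bios/john_wayne.html",
}
EXTERNAL_INDEX: Index = {
    ("actor", "John Wayne"): "https://example.org/people/john-wayne",
    ("movie", "Rio Bravo"): "https://example.org/films/rio-bravo",
}


def find_link(subject_type: str, value: str, indexes: List[Index]) -> Optional[str]:
    """Return the first match, trying each index in order."""
    for index in indexes:
        target = index.get((subject_type, value))
        if target:
            return target
    return None


search_order = [INTERNAL_INDEX, EXTERNAL_INDEX]
print(find_link("actor", "John Wayne", search_order))
print(find_link("movie", "Rio Bravo", search_order))
print(find_link("movie", "The Shootist", search_order))
```

Our own content wins when it exists, the external resource fills the gap when it does not, and a subject nobody covers produces no link.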

Deferred Deflection

Readers don’t always deflect the moment they reach a deflection point. In some cases, they choose to set the alternate material aside for later reading. This is particularly easy to do on the Web, where they can simply open pages in new browser tabs for reading later.

The idea of the deferred deflection also occurs in document design. A document design that gathers a set of links together at the end of a document, rather than including them inline, is recommending deferred deflection to the reader. It attempts to keep the reader following the writer’s default course to the end of the document before they go off to other things. The relationship table approach to link management that we mentioned earlier can only produce deferred links.

The merits of deferred links are debatable. Some argue that inline links are a distraction, that they actually encourage deflection. But the lack of links does not stop the reader from deflecting if they want to. And if they do deflect, the lack of a link means they may leave our content set and land on a competitor’s content, or on content that is of poor quality or that contradicts what we have been saying. The fact that the debate exists suggests that we may want to factor this design choice out of our source content so that we can choose between inline and deferred links later.

To leave open the option of deferring or not deferring links, we have to record links at the deflection points they belong to. We can choose to defer them at publishing time if we wish, but if we defer at writing time, we can’t put the links back inline at publishing time because we don’t know where they belong.

For this strategy to work, we need to be able to tell the difference between links that can be deferred and those that cannot. A simple example of a link that cannot be deferred is one that says “For more information, click here.” Obviously this link has to remain on the words “click here”.

But there is a more subtle issue as well. For a link to be deferred on publishing, it must be possible to contextualize the link in the deferred location. In other words, when the deflection point occurs inline in a paragraph the reader should be able to infer where the link will lead from the paragraph and from the text the link is applied to. But lifting the same link text out of the paragraph and putting it somewhere else doesn’t necessarily provide the same context.

For example, a link marked up like this is hard to defer algorithmically:

<p>In Rio Bravo, <xref href="">The Duke</xref>
plays an ex-Union colonel out for revenge.</p>

We could generate a list of links and insert it later in the document. It might look like this:

<p>For more information, see:</p>
<ul>
    <li><a href="">The Duke</a></li>
</ul>

But will it be clear out of the context of the original text what the words “the Duke” refer to? (The answer here is maybe, but it is not hard to imagine cases where it would be a definite no.)

On the other hand, if we mark up the deflection point in the subject domain like this:

<p>In <movie>Rio Bravo</movie>, <actor name="John Wayne">The Duke</actor>
plays an ex-Union colonel out for revenge.</p>

Then, given that we know what the subject of the deflection point is, we could use it to create a list of links that are categorized by type and use the real names of actors even when the original text uses a nickname:

<p>For more information, see:</p>
<ul>
    <li><a href="">John Wayne</a></li>
    <li><a href="">Rio Bravo</a></li>
</ul>

In short, algorithmically deferring document domain links is always tricky, but we can comfortably defer linking of subject annotations.
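To show why subject annotations defer so comfortably, here is a Python sketch that collects them into a categorized list of links. The PAGES table and its addresses are hypothetical, and the sketch knows only the two annotation types used in the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

SOURCE = (
    '<p>In <movie>Rio Bravo</movie>, <actor name="John Wayne">The Duke</actor> '
    'plays an ex-Union colonel out for revenge.</p>'
)

# Hypothetical lookup of published pages by subject type and value.
PAGES = {
    ("actor", "John Wayne"): "bios/john_wayne.html",
    ("movie", "Rio Bravo"): "movies/rio_bravo.html",
}


def deferred_links(paragraph: ET.Element) -> Dict[str, List[Tuple[str, str]]]:
    """Group (label, url) pairs by subject type for an end-of-page list."""
    grouped: Dict[str, List[Tuple[str, str]]] = {}
    for element in paragraph.iter():
        if element.tag not in ("actor", "movie"):
            continue
        # Prefer the real name over the text as written (e.g. a nickname).
        label = element.get("name") or (element.text or "")
        url = PAGES.get((element.tag, label))
        if url:
            grouped.setdefault(element.tag, []).append((label, url))
    return grouped


links = deferred_links(ET.fromstring(SOURCE))
for subject_type, entries in links.items():
    for label, url in entries:
        print(subject_type, label, url)
```

Because the label comes from the annotation rather than from the running text, the deferred list reads “John Wayne” even though the paragraph says “The Duke”.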

Different domain, different algorithm

What linking algorithms illustrate perhaps better than any other is that the movement from one domain to another changes the algorithms in fundamental ways. While the algorithm has the same purpose in each domain, the way it achieves that end can differ significantly.

One point I have tried to emphasize about structured writing algorithms is that they always start with the content structures. How we design the content structures — the way the author records the content — determines everything we can do with the content. We create content structures to support algorithms. We create algorithms to improve content quality or streamline content management and publishing.

In the document domain, the data structures tend to have a one-to-one correspondence with their algorithms. As system designers determine they need a particular algorithm, they create structures to support that algorithm. Thus document domain languages that require support for linking, reuse, indexing, and single sourcing have data structures for linking, for reuse, for indexing, and for single sourcing. (Some of these may be management domain structures, of course.)

In the subject domain, though, the data structures reflect the subject matter. If we go looking for a one-to-one correspondence between a structure and the algorithm it supports, we won’t find it. Thus we will not find link markup or reuse markup or index markup or single sourcing markup in the subject domain. We will find markup that clarifies and delineates the subject matter of the content it contains. Any algorithm we want to apply has to interpret that subject domain annotation and use it as the basis for creating whatever kind of document or media domain structure we want for publishing.

System designers must still think about what algorithms they want to apply, but that is to make sure that the aspects of the subject matter needed to drive the algorithms are captured. Since every subject structure can potentially drive many publishing algorithms, however, we often find our subject domain content already supports any new algorithms we want to apply. This helps future proof our content.

Moving from the document domain to the subject domain is not a matter of asking what the subject domain equivalent of a document domain structure is. Rather it is a matter of asking what information in the subject domain drives the creation of document domain structures. Subject domain content can look very different from its document domain counterpart and will often be starkly simpler and easier to understand.
