Editor’s note: In the latest installment of his series on structured writing, Mark Baker explores quality, from the perspective of robots that read, and the real role of the machine versus the writer.
When I talk to programmers about what I do, they often ask me why structured writing is still important. Machines are getting so good at reading human language, they argue, that semantic markup to assist the machine is increasingly becoming pointless. But structured writing is not about assisting the machine. It is about enlisting the machine to assist the writer. And where the writer needs assistance most of all is with quality.
Robots that read
Machines are indeed getting better and better at understanding human language. An approach called Deep Learning is increasingly becoming a key technology for companies like Facebook, Google, and Baidu for both language comprehension and speech recognition.
The semantic web initiative has long sought to create a Web that is not just people talking to people but also machines talking to machines. This has traditionally involved an essentially separate communication channel: semantic markup embedded in texts but not presented to the human reader. It has also involved the creation of specialized semantic data stores, with query languages to match, to teach computers to understand relationships that humans would express in ordinary language. Content management systems have implemented metadata schemes, often involving elaborate taxonomies, in an attempt to make content more findable when regular text-based searches don’t work as well as we would like.
But this two-channel approach — one text for the human, another for the machine — only makes sense if we assume that the machine cannot read human language. If machine and human can both read the same text and understand it with the same level of sophistication (or if the machine’s understanding is actually more sophisticated than the human’s) then we shouldn’t need two channels. The human Web becomes the semantic Web.
After all, the human text always was semantic. Semantics is simply the study of meaning. All meaningful texts have semantics. It is just that it has been difficult to build computers that could read and understand like humans do. Semantic technologies are about dumbing the semantics down for the machine because the machine is not bright enough to read the regular semantics.
Dumbing it down for the robots
This dumbing down necessarily involves omitting a great deal of the semantics of the text. Fully expressing all the meaning and implications of even the simplest text in RDF triples would be daunting, for instance. This has always created a problem for semantic technologies: which semantics do you select to dumb down to the machine’s level, and for what purpose? This is why there is no universal approach to structured writing that works for all purposes and all subject matter. You can only represent a fraction of the human semantics to the machine, and which you choose depends on what specific functions you want to perform.
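To make the scale of that omission concrete, here is a minimal sketch in Python that records a single sentence as subject-predicate-object triples. The sentence, the predicate names, and the use of plain tuples are all illustrative assumptions; this is not a real RDF vocabulary or library.

```python
# Hypothetical example: a few facts a reader takes from one short sentence,
# "Mary baked a cake for Tom's birthday."
# Plain tuples stand in for RDF triples; every name here is invented.
triples = [
    ("Mary", "baked", "cake_1"),
    ("cake_1", "is_a", "Cake"),
    ("cake_1", "baked_for", "birthday_1"),
    ("birthday_1", "is_a", "Birthday"),
    ("birthday_1", "celebrates", "Tom"),
]


def facts_about(subject, store):
    """Return every (predicate, object) pair recorded for a subject."""
    return [(p, o) for s, p, o in store if s == subject]


print(facts_about("cake_1", triples))
```

Five triples capture only the bare facts. They say nothing of what a human reader also infers from the sentence (that Mary probably knows Tom, that the cake is meant for a celebration), and someone has to decide which of those inferences are worth encoding.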
But if the machine can read the text as well as you can, then these limitations vanish. Deep learning is moving us in that direction.
Why then should we bother with structured writing? Quite simply because while machines are rapidly learning to read human text better than most humans, that text is still written by humans, and most humans are not good writers.
Making humans better writers
By that I don’t just mean that they use poor grammar or spelling or that they create run-on sentences or use the passive voice too much, though all those things may be true, and annoying. I mean something more fundamental than that: that they don’t say the right things in the right way. They leave out stuff that needs to be said, or they say stuff in a way that is hard to understand.
We all suffer from a malady called The Curse of Knowledge which makes it difficult for us to understand what it is like not to understand something we know. We take shortcuts, we make assumptions, we say things in obscure ways, and we just plain leave stuff out.
This is not a result of mere carelessness. The efficiency of human communication rests on our ability to assume that the person we are communicating with shares a huge collection of experiences, ideas, and vocabulary in common with us. Laboriously stating the obvious is as much a writing fault as omitting the necessary. Yet what is obvious to one reader is necessary to another. The curse of knowledge is that as soon as something becomes obvious to us, we can no longer imagine it being necessary to someone else.
Thus much of human-to-human communication fails. The recipient of the communication simply does not understand it or does not receive the information they need because the writer left it out. Machines may learn to be better readers than we are, but even machines are not going to learn to read information that just isn’t there.
We write better for robots than we do for humans
Actually, one of the advantages of the relative stupidity of computers is that it forces us to be very careful in how we create and structure data for machines to act on. We quickly hit on the phrase “garbage in, garbage out”, because the machines we were talking to were too stupid to know when the information they were taking in was garbage, and did not have the capacity, unlike human beings, to seek clarification or consult other sources. They just spit out garbage.
This meant that we had to put a huge emphasis on improving the quality and precision of the data going in. We diligently worked out its structures and put elaborate audit mechanisms in place to make sure that it was complete and correct before we fed it to the machine.
We have never been as diligent at improving the quality of the content that we have fed to human beings. Faced with poor content, human beings do not halt and catch fire; they either lose interest or do more research. Given our adaptability as researchers and our tenacity in pursuing things that really matter to us, we often manage to muddle through bad content, though at considerable economic cost. And the distance that often separates writers from readers means that the writers often have no notion of what the poor reader is going through. If readers did halt and catch fire, we might put more effort and attention into content quality.
Even today, when a huge emphasis is being placed on enterprise content management and the ability to make the store of corporate knowledge available to all employees, most of the emphasis is on making content easier to find, not on making it more worth finding. (This despite the fact that the best thing you can do to make content easy to find is to make it more worth finding.) People trying to build the semantic web spend a lot of time trying to make the data they prepare for machines correct, precise, and complete. We don’t do nearly as much for humans. Until we do, deep learning alone may not be enough to make the human web the semantic web.
Part of the problem has always been that improving content quality runs up against the curse of knowledge. Both the authors who create the content and most of the subject matter experts who review it suffer from the curse, meaning that there are few effective ways to audit written content. Style guides and templates can help remind authors of what is needed, but they are difficult to remember and to audit, meaning there is little feedback for an author who strays. Also, the long-form content typical of the paper era did not lend itself to obvious auditable patterns. The short-form content more prevalent in the Web era more naturally falls into repeatable and auditable patterns, which we can express through structured writing.
Structure and quality
Structured writing provides a way to both guide and audit content for quality. While you don’t need computers to define a structure for content, paper-based processes always had to be built around the publishing process, and thus largely stayed in the media domain. But most of the valuable structure that guides and audits writers writing about a specific subject for a specific audience lies in the subject domain. Without computers capable of processing subject domain markup into publishable media domain markup, the ability to apply structured writing to the problem of quality was limited.
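As a rough illustration of processing subject domain markup into media domain markup, the Python sketch below takes a recipe described purely in terms of its subject (title, ingredients, steps) and produces HTML. The field names, the sample recipe, and the function `render_recipe_html` are hypothetical, chosen only for this example.

```python
from html import escape


def render_recipe_html(recipe):
    """Turn subject-domain fields into media-domain (HTML) markup."""
    lines = [f"<h1>{escape(recipe['title'])}</h1>", "<ul>"]
    for item in recipe["ingredients"]:
        lines.append(f"  <li>{escape(item['quantity'])} {escape(item['name'])}</li>")
    lines.append("</ul>")
    lines.append("<ol>")
    for step in recipe["steps"]:
        lines.append(f"  <li>{escape(step)}</li>")
    lines.append("</ol>")
    return "\n".join(lines)


# Hypothetical subject-domain record: it names what each piece of content is,
# and says nothing about headings, lists, or fonts.
recipe = {
    "title": "Boiled egg",
    "ingredients": [{"name": "egg", "quantity": "1"}],
    "steps": ["Boil the egg for 7 minutes.", "Peel and serve."],
}

print(render_recipe_html(recipe))
```

The writer works only with the subject-domain record. Decisions about headings and list styles live in the rendering function, so they can change without touching the content.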
My reply to the people who ask me whether structured writing is relevant, therefore, is “garbage in, garbage out”. Structured writing is not about making content readable to machines; it is about making content better. Making content readable to machines is something we do so that we can use the machine to help us make the content better.
Structure, art, and science
To many writers, this is controversial. Many see quality writing as a uniquely human and individual act, an art, not a science, something immune to the encroachment of algorithms and robots. But I would suggest that the use of structures and algorithms as tools does not diminish the human and artistic aspects of writing. Rather, it supplements and enhances them.
And I would suggest that this is a pattern we see in all the arts. Music has always depended on the making and the perfecting of instruments as tools of the musician. Similarly, the mathematics of music theory gave us well-tempered tuning, on which much of Western music is based.
Computer programming is widely regarded as an art among its practitioners, but the use of sound structures is recognized as an inseparable part of that art. Art lies not in the rejection of structure but in its mature and creative use. As noted computer scientist Donald Knuth observes in his essay, Computer Programming as an Art, most fields are not either an art or a science, but a mixture of both.
Apparently most authors who examine such a question come to this same conclusion, that their subject is both a science and an art, whatever their subject is. I found a book about elementary photography, written in 1893, which stated that “the development of the photographic image is both an art and a science.” In fact, when I first picked up a dictionary in order to study the words “art” and “science,” I happened to glance at the editor’s preface, which began by saying, “The making of a dictionary is both a science and an art.”
As writers we can use structures, patterns, and algorithms as aids to art, just like every other profession.
Of course, few writers would claim that there is no structure involved in writing. We have long recognized the importance of grammatical structure and literary structure in enhancing communication. The question is, can the type of structures that structured writing proposes improve our writing, and if so, in what areas? Traditional poetry is highly structured, but it is doubtful that using an XML schema would help you write a better sonnet. On the other hand, it is clear that following the accepted pattern of a recipe would help you write a better cookbook, and using structured writing to create your recipes can help you both improve their consistency and produce them more efficiently.
The question then becomes, how much of our work is like recipes and would benefit from structured writing, and how much is like sonnets and would not. The answer, I believe, is that a great deal of business and technical communication, at least, can benefit greatly. If you look at much of that communication and see no obvious structure, I would suggest that this is not evidence that structure is inappropriate, but that appropriate structure has not been developed and applied to the content.
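To make the recipe case concrete, here is a minimal sketch in Python of how a machine might audit a recipe against an agreed pattern. The required fields, the quantity rule, and the sample draft are assumptions made up for illustration; a real pattern would be worked out by the writers for their own subject and audience.

```python
# Hypothetical subject-domain pattern for a recipe; the field names are
# assumptions chosen for illustration, not an established schema.
REQUIRED_FIELDS = ["title", "servings", "ingredients", "steps"]


def audit_recipe(recipe):
    """Return a list of problems found; an empty list means the recipe passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not recipe.get(field):
            problems.append(f"missing or empty field: {field}")
    # Every ingredient should state a quantity, the kind of detail a writer
    # who knows the dish well is likely to leave out.
    for item in recipe.get("ingredients", []):
        if not item.get("quantity"):
            problems.append(f"ingredient without quantity: {item.get('name', '?')}")
    return problems


draft = {
    "title": "Boiled egg",
    "ingredients": [
        {"name": "egg", "quantity": "1"},
        {"name": "salt"},
    ],
    "steps": ["Boil the egg for 7 minutes.", "Season and serve."],
}

for problem in audit_recipe(draft):
    print(problem)
```

Run on this draft, the audit reports the missing servings field and the salt with no quantity. Neither omission would trouble the author, who already knows the dish, which is the curse of knowledge at work. A sonnet offers no comparable checklist.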
We must also acknowledge that many writers have had a bad experience with structured writing. In many of these cases, the structured writing system was not chosen or designed by the writers to enhance their art; it was imposed externally for some other purpose, such as to facilitate the operation of a content management system or make it easier to reuse content. In some cases, these systems actively interfere with the author’s art and directly hinder the production of high quality content.
In the opening article of this series, I noted that structure exists to serve a particular purpose:
[A] piece of structured content is structured for a particular purpose that you thought of at the time you created it. The content is structured for that purpose or set of purposes you thought of, but is unstructured for other purposes. Just as a hat can be the right size for Tom and the wrong size for Harry, a piece of content can be structured for Mary and unstructured for Jane. It all depends on context.
Writers who have had bad experiences with structured content have usually been faced with structures that were not designed for the writer’s purpose. But such content is not merely unstructured for these authors’ purposes; it is actually contra-structured. It has an enforced structure that actively gets in the way of the author doing their best work.
I talk to authors all the time who show me page designs and layouts that make no sense, lamenting that the system does not give them any other choices. Content structure is not generic, and you cannot expect to simply install the flavor of the month CMS or structured writing system and get a good outcome.
Properly applied, however, as a means to guide and enhance the work of authors, structured writing can substantially improve content quality. In upcoming articles, we will look at the algorithms of structured content, many of which relate directly to the enhancement of content quality.
Until the robots take over
Of course, this all supposes that the machines are not becoming better writers than us as well. Companies like Narrative Science are working on that, but I don’t think they are nearly as far along that path as the deep learning folks are in teaching computers to read.
Do robots suffer from the curse of knowledge? Maybe not. But current writing robots certainly work with highly structured data, so structured writing is still key to quality content even when the robots do come for our keyboards.
Actually, according to James Bessen’s recent article in The Atlantic, The Automation Paradox, automation does not decimate white collar jobs the way we have been told to fear. By reducing costs, it increases demand, resulting in net growth of jobs, at least for people who learn to use the new technology effectively.
That said, all the semantic technology and content management in the world is not going to make the difference it should until we improve the quality of content on a consistent basis. Structured writing, particularly structured writing in the subject domain, is one of our best tools for doing that.