Regular expressions on shallow and deep data, Part I

May the 13th 2010 at 12 o’clock pm

This post is about something I have had in mind for a while but have never heard anyone else say, which is a little surprising considering how much of my life is structured in XML. One of the big rallying cries of those who want an extensible internet is that ‘machine to machine communication should be a machine language; written by machines to be guaranteed valid for other machines’. For the internet and many other situations, the middleware communication language is XML, used for content, application data, scripts, styles, and many, many linking languages.

Background

There is usually a small number of culprits named for the breakdown of the model. The number one group to be blamed is users. They type their content by hand, badly, in HTML (or XHTML, but really without the ‘X’), which is unsafe because they cannot guarantee that what they type is valid; the content is then processed by tools, and more tools, and works its way downstream, infecting the whole ecosystem of mashups, feed aggregators, social commenting tools, and so on with faulty markup. It is not really the users’ fault, because we the programmers forced them into the situation where they had to mark up their content by hand, but user education is at the root of many problems.

The other big wrongdoers are the software authors. They have simply not understood the model of XML authoring needed to be safe. All over the internet, millions of programmers, bloggers, and standards experts are shouting that web authoring tools are simply not using the right paradigm. If there is the slightest chance that the person on the other end is going to need you to pass parsable content (that is, at the least well-formed and with valid characters), then it is not right to program in a way that leaves this in doubt. The wrong way of doing it is how almost every single CMS works:

<html>
    <head>...</head>
    <body>
        <?php foreach ($posts as $post) echo "<div>$post</div>\n"; ?>
    </body>
</html>

This is a broken way of writing code, as we all know full well (or else we are too under-informed to publish a CMS). The problem is that there is no guarantee that the output is well-formed. The solution in PHP is to use the XMLWriter class. Even good systems like Drupal fail on this one, because however well checked and inspected they are, writing out tags manually in code, as above, cannot guarantee correct transfer. On the other hand, a good library for XML output will ensure that well-formed data is produced. To re-formulate the example:

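<?php
// $posts is assumed to be an array of post bodies, as in the example above.
$x = new XMLWriter();
$x->openMemory();                  // build the document in memory
$x->startElement("html");
$x->startElement("body");
foreach ($posts as $post) {
    $x->writeElement("div", $post); // text content is escaped by the writer
}
$x->endElement();                  // </body>
$x->endElement();                  // </html>
print $x->outputMemory();
?>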
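To make the difference concrete, here is a minimal sketch comparing the two approaches on a single post. The post text and the helper name is_well_formed are made up for illustration; the check simply asks PHP's DOMDocument whether each string parses as XML.

```php
<?php
// Hypothetical helper: true if $xml parses as a well-formed XML document.
function is_well_formed(string $xml): bool {
    $previous = libxml_use_internal_errors(true); // keep parse errors quiet
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// Made-up post body containing characters that are special in XML.
$post = 'Fish & chips <b>today';

// Manual concatenation copies the raw text straight into the markup.
$manual = "<div>$post</div>";

// XMLWriter escapes the text as it writes the element.
$x = new XMLWriter();
$x->openMemory();
$x->writeElement("div", $post);
$written = $x->outputMemory();

var_dump(is_well_formed($manual));  // the concatenated string does not parse
var_dump(is_well_formed($written)); // the XMLWriter output does
```

The stray ampersand and the unclosed tag are exactly the kind of thing a hand-typed post contains, and only the second string survives a parser.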

So, this or a similar very under-documented, under-used class should be used by every major CMS, at least for the production of Atom. What I would like to suggest is that there is in fact a good reason the wrong architecture has been followed thus far.

How did CMSs begin avoiding XML libraries?

The analysis I would like to make is my own. The problem is a failure to use deep and shallow content models appropriately. I call a content model deep if, in a serialization, start indicators can be output a long way away from their end indicators, with no output relating to the higher-level object in between. Alternatively, when stored internally as a tree, a deep model corresponds to a deep tree. HTML is a classic deep model, because the DOM has many levels of nesting, and in its text serialization tags like <body> are opened a long way away from their end tags, with no reminder in between that the tag is open.

On the other hand, I describe a content model as shallow if the context of higher levels is output alongside lower-level output, which naturally limits the depth of nesting that can be efficiently handled. Email quoting is a well-known shallow content model. For example:

Hi Bill, Sorry to take so long to reply.

> Do you have the document? I asked
> for it last week.
No; my bad.

Here, a quotation is opened, but, unlike a <blockquote> tag in HTML, which once opened can be forgotten until you close it, the quotation indicator has to be repeated on each line. This is the key feature of a shallow model’s serialization.
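Because every line carries its own context, a single line can be interpreted without looking at any other line. As a rough sketch (the function name quote_depth and the exact pattern are my own choices for illustration, not a standard for email quoting), one regular expression per line is enough to recover the quotation level:

```php
<?php
// Hypothetical helper: splits one line of a plain-text email into its
// quotation depth and the remaining text.
function quote_depth(string $line): array {
    // A leading run of ">" markers, each optionally followed by one space.
    preg_match('/^((?:> ?)*)(.*)$/', $line, $m);
    return [
        'depth' => substr_count($m[1], '>'),
        'text'  => $m[2],
    ];
}

print_r(quote_depth('> Do you have the document? I asked')); // depth 1
print_r(quote_depth('No; my bad.'));                         // depth 0
```

Nothing in the function needs to remember what was opened earlier in the message, which is what makes the format a comfortable fit for line-based tools.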

With these definitions, my contention is twofold: firstly, that hand-authoring content in deep models is flawed from the start; secondly, that whether or not CMSs use XML libraries in templates to wrap content is irrelevant if inappropriate functions are applied to the content itself before output. This is argued on the basis that regular expressions only work on shallow content models. These two points then allow us to answer the question, ‘How did CMSs begin avoiding XML libraries?’ with the response that content which should have been stored in a shallow format all along was authored in a deep structure, which was then further processed by shallow tools like regular expressions, in the end leaving applications with no choice but to positively avoid tools guaranteeing good output, on the grounds that they could not guarantee good input to the tool.
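As one small illustration of that contention, and not a proof of it, consider the kind of pattern typically used to pull out the contents of a tag. The helper names and the sample markup below are invented for the example; the comparison is with PHP's DOMDocument parser.

```php
<?php
// Hypothetical helper: contents of the first <div>, found with a typical regex.
function first_div_by_regex(string $html): string {
    // Non-greedy match: stops at the first closing tag it meets.
    preg_match('/<div>(.*?)<\/div>/s', $html, $m);
    return $m[1] ?? '';
}

// Hypothetical helper: text of the root element, found with an XML parser.
function root_text_by_parser(string $xml): string {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    return $doc->documentElement->textContent;
}

$nested = '<div>outer <div>inner</div> tail</div>';
echo first_div_by_regex($nested), "\n";  // outer <div>inner
echo root_text_by_parser($nested), "\n"; // outer inner tail
```

The pattern has no way of knowing that a second <div> was opened before the first closing tag, so it returns a fragment that is no longer well-formed, while the parser tracks the nesting and sees the whole element.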