With these definitions in place, my contention is twofold: first, that hand-authoring content in deep models is flawed from the start; second, that whether CMSs use XML libraries in their templates to wrap content is irrelevant if inappropriate functions are applied to the content itself before output. The second point rests on the observation that regular expressions only work reliably on shallow content models. Together, these allow us to answer the question, ‘Where did CMSs begin avoiding XML libraries?’ with the response that content which should have been stored in a shallow format all along was authored in a deep structure, which was then further processed by shallow tools like regular expressions, in the end leaving applications with no choice but to positively avoid tools guaranteeing good output, on the grounds that they could not guarantee good input to those tools.
Firstly, the historic adoption of deep models over shallow, and its disadvantages
The first assertion, that authors should use shallow markup, is fairly easy to maintain. After all, GUIs are the shallowest markup of all for text, showing the formatting as it is entered. With emboldened text, each character is marked visually as having the property ‘bold’ by being displayed in a different font weight. No context is needed, no flicking to the start or end of a sans-serif passage, to determine that any given piece of it is sans-serif. The hugely widespread use of GUIs for entering text should make us suspect that the dominance of this paradigm is not unfounded. In fact, ordinary books themselves, by inspection, represent data linearly with very little nesting of meaning. To a programmer, that is a little like saying that arguments which we might like to have written out in continuation-passing style are instead invariably expressed in natural language in the flattest terms possible. The analogy is in fact particularly good, because it highlights the different ways machines and humans are expected to interact with data. Unless one planned to represent it on a computer, I do not think it would occur to anyone to store a novel in a tree, nor would users of word-processing software think that way. In the early days of the internet, there was no option but to manually write data in forms like HTML which are semantically low level, and it seems tenable that it is only as a legacy of this that authors currently have any expectation of trees representing the data.
The question, then, of how data should be stored is not so much one of how it should be exchanged, where XML is employed, but of which representation is canonical. What is to be avoided, I would conclude, is allowing the tree representation of content to be the primary reference; instead, the tree should be compiled from a flat markup which is closer to the author’s meaning.
So, there are two conclusions. The first serves the argument at hand: part of the mess HTML is in, as a neither human-level nor machine-level interchange format, is the result of an early misunderstanding of whether the way the machine captures and stores human expression ought to be shallow or deep. Because machines favour deep formats, and those had to be manually entered, a problematic legacy was introduced. The second conclusion concerns where to proceed. Clearly, my leaning towards shallow content models is normative, even if purely on the grounds of aiming to capture an author’s thought better semantically. Such representations have the further advantage that they can be produced directly, a genuine answer to the problem of trusting even oneself to efficiently produce, automatically validate, and correct direct XML markup. The verbosity of newer markup like RDF hardly helps the case that it is at all practical for the DOM representation of content to continue to be generated, or even tuned, manually.
The relationship between deep formats and regular expressions
All of this is, strictly speaking, background to the new factor being discussed: the relation between the historic use of regular expressions and the continued use of unsafe processing paths in modern CMSs.
Regular expressions as paradigmatically shallow processors
To begin with, I must defend the assertion that the deep/shallow distinction in markup formats is relevant to regular expressions at all. There are two things which might be meant by regular expressions in this context. Firstly, the theory studied in university courses mainly covers strictly regular languages. In this sense, only the very shallowest of formats may be parsed, though they still present far fewer problems than deep ones. Regular languages have some strong limitations, and their inability to correctly process arbitrarily nested markup is well known: a regular expression cannot count, so it cannot match opening tags to closing tags at unbounded depth.
On the other hand, almost all programming languages provide more powerful syntax than what is strictly covered by a computer scientist’s use of the term ‘regular’. With callback functions, back-references, and so on, the expressive power of modern regular expression dialects does permit proper parsing in principle. I know of no industrial-quality libraries which attempt to build complete parsers this way for anything of significance, though (they would be much slower than traditional hand-written parsers). More to the point, these extensions are not used in practice in a way that safely processes and guarantees nesting in deep models. As far as the contention goes that deeply structured data should not be manipulated with regular expressions, there is wide agreement among information architects and schema designers.
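The failure mode is easy to demonstrate. Below is a minimal sketch in Python (the pattern and the `DivExtractor` class are my own illustrative inventions, not drawn from any CMS): a non-greedy regular expression trying to capture the contents of a `<div>` truncates at the first closing tag, because it has no way to track depth, whereas even a simple event-based parser maintaining a counter gets it right.

```python
import re
from html.parser import HTMLParser

# A naive, hypothetical regex attempting to grab a <div>'s contents.
# Non-greedy matching stops at the FIRST closing tag it sees: a regular
# expression cannot count open/close pairs, so nesting breaks it.
naive = re.compile(r"<div>(.*?)</div>", re.S)

nested = "<div>outer <div>inner</div> tail</div>"
print(naive.search(nested).group(1))  # 'outer <div>inner' -- truncated

# A real parser keeps a depth counter (effectively a stack), which is
# exactly the capability regular languages lack.
class DivExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:          # re-emit inner divs only
                self.text.append("<div>")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1
            if self.depth:
                self.text.append("</div>")

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)

p = DivExtractor()
p.feed(nested)
print("".join(p.text))  # 'outer <div>inner</div> tail'
```

The counter in the parser is the whole point: it is state that grows with input depth, which is precisely what the formal definition of a regular language rules out.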
Regular expressions as affecting attitudes over time to data handling
Finally, I would like to apply the reasoning that regular expressions only work effectively on shallow models to the sort of input that CMSs seem committed to producing. There is a vicious cycle between a tag-soup CMS producing tag soup because the author is inappropriately writing it directly, and letting the author write it directly because his output is only being treated as tag soup anyway, so it is safe (even before taking into account that the information structure of deep models is unsuited to canonically capturing the author’s intention). What I would like, unusually, to add to that is the distinct role regular expressions have played in this development.
Since the very beginning, regular expressions have been used to do dodgy things to input strings in web applications. Rather than seeing that as just a symptom of the tag soup model, perhaps we can turn the reasoning on its head and view this trend as a cause of the problem. It was very early on that database schemas and output templates began relying on regular expressions, and in turn the constant bugs in very simple functions locked the model in place. Most systems now have some functionality for automatically adding paragraphs to text users enter, and this is a notoriously buggy process. [WordPress has introduced malformed output on two of the drafts of this page alone, with no exotic markup at all and well-nested input to the filter, just some small typos.] Users’ data becomes trapped in these ad-hoc formats, not useful ways of storing data in themselves, with no solid way of transforming them into something good. This fuels a secondary vicious cycle between design and content.
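To see how easily such a filter corrupts well-formed input, here is a deliberately minimal sketch of an auto-paragraphing function. This is my own toy, in the spirit of but far simpler than WordPress’s actual wpautop filter: it splits on blank lines and wraps each chunk in `<p>` tags, which works for plain prose and immediately mangles any block-level element containing a blank line.

```python
import re

def add_paragraphs(text):
    # Hypothetical auto-paragraphing filter: split on blank lines
    # and wrap each chunk in <p>...</p>. Because the split is purely
    # textual, it knows nothing about tag nesting.
    chunks = re.split(r"\n\s*\n", text.strip())
    return "\n".join("<p>" + c.strip() + "</p>" for c in chunks)

# Plain prose is handled fine:
print(add_paragraphs("One.\n\nTwo."))
# <p>One.</p>
# <p>Two.</p>

# A blank line inside a blockquote yields overlapping, malformed tags:
print(add_paragraphs("<blockquote>quote\n\nstill the quote</blockquote>"))
# <p><blockquote>quote</p>
# <p>still the quote</blockquote></p>
```

The second output is exactly the kind of malformed nesting a downstream XML library would reject, which is why systems built on such filters end up routed around strict tooling rather than through it.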
So, while the historic role of regular expressions is often overlooked, I would suggest it is a part of the thinking which led to the current practices most CMSs follow.
The argument and discussion have been mainly historical and polemical. I do, however, have some goals, which should at least be clear. I favour strongly semantic, readily parsable formats for data exchange using small and common ontologies. I want to be able to use RDF to represent objects in a way that can be parsed and processed for search and storage. Common namespaces are vital to that. I also see the value in the portability and interaction of technologies that formal data transfer allows. Without following this route a long way down, is there any realistic alternative, at the end of the line, for embedded MathML in feeds? Browsers will move fast to adopt HTML5, but the whole model of linked and aggregated consumption will either move down that line only very slowly, or not at all.
To be fair, the deployment of HTML5 parsers does go a long way towards fixing the problem in the medium term for the specific application of web browsing. What was in HTML 4 an undefined, unreliable, and incomplete way for machines to swap page data has been tweaked very effectively into a highly reliable one, redeeming a production paradigm that is unable to guarantee more stringent output. This is actually very good for users in the current environment of websites, and an excellent way of making something worth using even for its own sake. For the benefit of extensibility with namespaces, I still consider it worth targeting XHTML5 as a possibility, but the semantic features HTML5 has built in are very much the way to go. For this reason, this week’s news that Firefox 4 development builds now have the parser turned on is very welcome.
In the long term, however, it would take a huge change in attitude to produce content according to a stricter separation: maintaining data in shallow formats for human interaction and in deeper formats for reliable machine interchange.