Blogging an XML-driven DMS, I

July the 12th 2010 at 2 o’clock pm

Prefatory note

There have been hints over the last few months that I would treat myself to the fun of re-working this site, as I have intended for a while. That has finally happened, and this week I neatened up a few odds and ends that were needed before the site goes live. While I tidy a few things, I will document my code in my little chunks of spare time over the next fortnight and explain the system itself a little. So, my non-CompSci friends will have to bear with me. To open a little window into my mood, I have been very personally challenged recently, and this is somewhat a retreat from that, something to keep my hands busy in between bouts of maths and melancholy. Excessive code here is therefore to some extent a deliberate ploy to distract myself while I do some thinking. I am not very hasty, and I would prefer a bit of industry to lengthy introspection to draw out some decisions.

Introduction

I have had some requests for websites this summer, and as part of my continuing experiments with web technologies, I also ought to build myself a decent system for my own site to keep myself in the game of evolving techniques and ideas. There are various reasons why I have concluded that I should roll my own:

It is not hard. The last two iterations of this and some other sites I have set up were on my own CMSes.
It is the done thing: half of the technology blogs I subscribe to are run on custom code. Indeed, this might be because many of the top people whose blogs I follow are involved in creating and speccing new technologies, so had to build their own blog software to work through the prototype system before any large players had adopted it.
More substantially, I have significant XML requirements that will not go away, because I use at least MathML. HTML5 may allow certain content in text/html (even without namespaces, which is one of Ian Hickson’s persistent mistakes in my opinion), but it is a long way down the line before it will be easy to exchange polyglot documents using non-XML content in XML wrappers.

That is, suppose I want to use Atom syndication, which is an XML wrapper. If I were to work with my data not in XML, how would I embed it in the Atom? This is actually a popular open problem in some circles. For example, how exactly ought valid HTML5 to be transferred in Atom? As technically unparsed data on the XML layer, with some attribute on the container to indicate the second parser to use on the contents? That is the ugly route many people are taking at the moment (for rebuttal, see Norman Walsh, Escaped markup considered harmful). It will certainly not be an option to use namespace-aware parsing according to the two standard syntaxes of XML and HTML5 to allow just one parser layer, because an XML UA might not recognize your custom namespace. That leaves only even clunkier possibilities, like some new version of XML with processing instructions to indicate parser switching (this has genuinely been suggested and fiddled with by some people).
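The two routes can be put side by side using Atom's own content types: type="html" carries the markup as an escaped string, while type="xhtml" embeds real elements inside a single XHTML div. What follows is only a minimal sketch using Python's standard library. The fragment and the function names are made up for illustration and are not part of my system.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"

# Illustrative fragment (invented for this sketch): a paragraph with inline MathML.
FRAGMENT = f'<p>Let <math xmlns="{MATHML}"><mi>x</mi></math> be real.</p>'


def escaped_content(markup: str) -> ET.Element:
    """The escaped route: markup carried as an opaque string (type="html")."""
    content = ET.Element(f"{{{ATOM}}}content", {"type": "html"})
    # The angle brackets are escaped on serialisation, so the XML layer
    # sees only character data here.
    content.text = markup
    return content


def embedded_content(markup: str) -> ET.Element:
    """Genuine embedding: markup parsed into real elements (type="xhtml")."""
    content = ET.Element(f"{{{ATOM}}}content", {"type": "xhtml"})
    # Atom wraps xhtml content in a single XHTML div. Parsing only succeeds
    # if the fragment is well-formed XML in the first place.
    div = ET.fromstring(f'<div xmlns="{XHTML}">{markup}</div>')
    content.append(div)
    return content


escaped = escaped_content(FRAGMENT)
embedded = embedded_content(FRAGMENT)

# Number of descendant elements the XML layer can see in the escaped version.
print(len(list(escaped.iter())) - 1)
# In the embedded version the MathML is directly addressable by namespace.
print(embedded.find(f".//{{{MATHML}}}mi").text)
```

The second function is the whole argument in miniature: it cannot be written at all unless the content is already close enough to XML to parse.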

It is clear to me, though, that if you put your content into an XML wrapper, it needs to be close enough to XML that you can do the transformation properly and embed it genuinely. So, I am really sold on end-to-end XML work. In fact, this is a significant departure from most current thinking, which still revolves around essentially opaque strings rather than strongly structured information (as the prevalence of HTML flattened in Atom, mentioned above, shows). To put that in context, Eliot Kimber of the original 1996–8 XML committee notes:

The key is to understand that XML is still the best available solution for persistent data. I think a lot of people who use XML day to day forget (or never were told) that XML, via SGML, was originally designed to facilitate search and long-term, application-independent archiving of data. It is almost coincidence that makes that same application-independence useful for communication of transient data. Convenient but not optimal.

Kimber, 2008

So, to develop this long bullet, there is a need for closer coupling using real XML tools to see how much leverage can be applied to data which is strongly marked up from the start. Putting this as a criterion in a DMS narrows the competition down enough (certainly eliminating all the big webpage CM systems) to be a justification for building my own. While there are XML CMSs, the field is small and still growing, which gives me grounds to build one myself and do something innovative while I am at it.
Fourthly, I like small systems. Drupal is generally fantastic, and its information architecture in particular is superb. It could be coaxed into doing what I want, even though it is HTML-centric rather than XML-centric, but it is simply too big for my tastes. If I can handle my site in a few thousand lines, I would much prefer to do so.
Fifthly, I am still exploring data semantics and want to play around more than I could with some established framework. Building in DocBook and RDF will be useful to me as a learning process.
Finally, I am extremely picky. Getting the fine detail right to an extreme degree is a lure that other people’s code is unable to offer.

Aims

What then are the defining features of what I am building?

In terms of size, it will be fairly small, smaller than almost every established project, but more over-architected than some personal sites and minimal systems, like Hixie’s or Blosxom.

It also must be able to correctly handle and output XML. At least some of the content will have to be stored in XML, such as MathML, and whether or not the rest is, the whole pipeline must make sure that nothing is mangled.
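One way to make "nothing is mangled" testable is to compare what went into the pipeline with what came out, as parsed trees rather than as strings. The sketch below is an assumption about how such a check might look, not code from my system, and the names same_tree and survives are hypothetical. It uses only the standard library, which discards comments and processing instructions on parsing, so those are not covered.

```python
import xml.etree.ElementTree as ET

MATHML = "http://www.w3.org/1998/Math/MathML"


def same_tree(a: ET.Element, b: ET.Element) -> bool:
    """True if two elements agree in name, attributes, text and children.

    ElementTree stores names as {namespace}local, so a change of prefix
    alone does not count as a difference, but a lost namespace does.
    """
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and (a.text or "") == (b.text or "")
        and (a.tail or "") == (b.tail or "")
        and len(a) == len(b)
        and all(same_tree(x, y) for x, y in zip(a, b))
    )


def survives(before: str, after: str) -> bool:
    """Compare a stored document with the pipeline's output for it."""
    return same_tree(ET.fromstring(before), ET.fromstring(after))


stored = f'<math xmlns="{MATHML}"><mi>x</mi></math>'
reprefixed = f'<m:math xmlns:m="{MATHML}"><m:mi>x</m:mi></m:math>'
mangled = "<math><mi>x</mi></math>"  # namespace dropped somewhere on the way

print(survives(stored, reprefixed))
print(survives(stored, mangled))
```

Comparing trees rather than bytes means harmless differences such as a changed prefix pass, while a dropped namespace is caught.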

The input must be natural and flexible. There should be no requirement to hand-author XML directly, but equally no dodgy, inscrutable, or complex transformations taking control away from my power-hungry posting. The data stored must be as richly marked up as possible, in a language that is interchangeable, widely accepted, clearly defined, and mappable to other formats. Nothing too demanding or heavy on buzzwords though. I am also interested in semantic relations between items of data. These are probably best represented using RDF/OWL, but no-one has yet worked out how to make any effective use of this sort of data on the internet. As a long-term goal, I want to follow this work and see what could be done with it.

It should integrate well with existing tools, so the preferable way of storing data will be in flat XML files that can be versioned, processed, and searched from the shell. For lightness, I would prefer to have no database at all.
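To give a feel for how little machinery a flat-file store needs, here is a minimal sketch of a lookup with no database at all. The entry format (an entry element with title and tag children), the file names, and the function find_entries are all invented for illustration. The real store uses richer markup.

```python
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET


def find_entries(store: Path, tag: str) -> list:
    """Return the titles of entries carrying the given tag, in file-name order."""
    titles = []
    for path in sorted(store.glob("*.xml")):
        entry = ET.parse(path).getroot()
        tags = {t.text for t in entry.iter("tag")}
        if tag in tags:
            # Fall back to the file name if an entry has no title element.
            titles.append(entry.findtext("title", default=path.stem))
    return titles


# Demonstration against a throwaway directory holding two made-up entries.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    (store / "2010-07-12-dms.xml").write_text(
        "<entry><title>Blogging an XML-driven DMS, I</title>"
        "<tag>xml</tag><tag>dms</tag></entry>",
        encoding="utf-8",
    )
    (store / "2010-07-01-maths.xml").write_text(
        "<entry><title>A maths note</title><tag>maths</tag></entry>",
        encoding="utf-8",
    )
    print(find_entries(store, "xml"))
```

Because each entry is an ordinary file, the same directory can be put under version control and searched with the usual shell tools without any export step.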

The information architecture of storing pieces of data is also important, including reasonable structuring in terms of content-types, metadata, taxonomies, and so on. These are small things, but I may as well get them right.

Progress

So far, I have built the guts of the system and am ready to go with it. It meets the aims well, in ways I had not always anticipated. It still needs some styling and frontend work to make it pretty, and I have some concerns with the feeds which are holding me back from jumping on it right away. Expect more soon.