Blogging an XML-driven DMS, II

July the 12th 2010 at 2 o’clock pm

If you had ambitious plans for a DMS with some experimental edges and a new and unusual twist on things, where would you start? There is a long list of technologies to use and integrate, and various core things to work in. Over the years I have written some systems of my own for various commissions, and used off-the-shelf software like WordPress and Drupal with my own plugins and patches for various tasks, so I have some experience of where things go well and badly in this sort of project. So, take this as a worked example of how to go about building something worthwhile: a little like the many toy tutorials on the internet, but better in that it describes a real and useful design process.

Key design decisions: document storage format

For a start, some of the key decisions will involve your information architecture. A DMS project is doomed to be at best a mediocre waste of time if you do not do some pen and paper work before you code to sort out how you will store and manage information. For me, the first choice was data storage itself.

Will I use XML for storing the internal semantics of my content, or a custom language? I think there are a lot of attractions to a very simple format like Markdown (helpfully its genesis is documented by its creators Aaron Swartz and John Gruber). On the other hand, lightweight formats of this kind tend to confuse a good authoring syntax with a good storage format. Markdown is very lean in what it stores, and does not allow for even the full range of HTML semantics, which is itself already too limited for long-term storage of much of the content I would like to produce. Parsing it is also a little unstable, which is potentially a danger area if I end up wanting to drop things like SVG safely into my documents. A worse route still is creating my own lightweight markup language like Camen’s ReMarkable, which is a nightmare in terms of long-term portability and sharing.

The other option, strongly typed markup, is very rich in solid, mature choices. The main contenders are XHTML, DocBook, and DITA. There is more literature comparing DocBook and DITA than I can reference or survey, but my conclusion is that the tools for working with DocBook are more stable, its format is closer to what I want, and the large number of elements already in existence with well-defined and useful semantics is beneficial to me. If this is all new, basically think of DocBook as XHTML with a huge number (literally hundreds) of extra elements allowing richer semantics in a wide variety of applications, but especially technical science, maths, or computing documents.

This was therefore the backend format, and the DMS problem is now rephrased in terms of ‘How do I store, serve, and work with content in a load of DocBook-based units?’

DocBook is handled with XSLT. To get a flavour of good use of it (hopefully), you will need to read through the source code to my site when I publish it soon, as a very un-noddy way of giving you code examples in the tutorial. Basically we use specially designed languages to transform and easily work with the DocBook data. Obviously, while there are various possible output targets for the system (Atom, JSON, plain text), the key one for a website is HTML. DocBook out of the box has systems for transforming to HTML, and writing customisation layers for that was my first real coding job.

Coding task: DocBook to outline-compliant HTML

HTML5 updates HTML in various ways, and the first new little experiment of my project consisted in massaging the HTML output of the DocBook filters to conform to the HTML5 outline algorithm. If this is new to you, the highly useful Mozilla Developer Network happens to have a good article explaining it. In brief, document structure can now be more closely defined in the HTML source for the benefit of search, tables of contents, assistive technologies, navigation, and so on. No one, it seems, has yet made the HTML output of DocBook-XSL compatible with the HTML5 outline algorithm, so I will make that code available shortly.
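To make the idea concrete, here is a minimal sketch of the kind of mapping involved. It is not my XSLT customisation layer: it is a Python illustration, and it assumes un-namespaced DocBook input containing only section, title, and para elements. The function name to_html_section and the sample document are made up for the example. Each DocBook section becomes an HTML section element with its own h1, so that nesting, rather than the heading number, carries the depth.

```python
import xml.etree.ElementTree as ET


def to_html_section(db_section):
    """Convert a DocBook <section> into a nested HTML <section>.

    Assumption: un-namespaced DocBook with only <title>, <para>
    and nested <section> children.
    """
    html = ET.Element("section")
    for child in db_section:
        if child.tag == "title":
            # Every section gets an h1; its depth comes from the nesting.
            ET.SubElement(html, "h1").text = child.text
        elif child.tag == "para":
            ET.SubElement(html, "p").text = child.text
        elif child.tag == "section":
            html.append(to_html_section(child))
        # Any other DocBook element is ignored in this sketch.
    return html


# Hypothetical sample input, written on one line to avoid stray whitespace.
SOURCE = (
    "<section><title>Storage</title><para>Why XML.</para>"
    "<section><title>DocBook</title><para>Details.</para></section>"
    "</section>"
)

result = ET.tostring(to_html_section(ET.fromstring(SOURCE)), encoding="unicode")
print(result)
```

A real customisation layer would do the same job as XSLT templates overriding the stock DocBook-XSL section handling, and would have to cope with the hundreds of other elements this sketch skips.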

Design decision: various details

URLs

Back on the planning board, then, the question of how to store and tag the content arises. When I was a beginner, my first CMSs made some big errors of the kind that need to be sorted out early. Firstly, think through your URL structure very carefully indeed, and make sure you have virtual URLs or some sort of flexible handling system built into the core. My solution to this is fairly standard. Content handlers register support for various URLs, and get to manage them (if you are a beginner and want to know how to actually turn a sentence like that into code, follow along in my source; I happen to be using PHP). The URL structure I adopted is ..me.uk/[<target>/]<resource>/?, that is, target can be specified to describe what format the resource should be served in (e.g. feed, source, and so on, or blank for the default HTML), and resource is the document’s identifier such as 2010/05/post or projects/computing/tests. I have also finally satisfied my inner peeve about too-common trailing slashes on URLs by consistently following the more pleasing convention of putting no slash on URLs representing a single resource, and having one on the end of a page representing multiple documents such as tags/xhtml/.
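As a rough illustration of handlers registering for URLs, here is a minimal sketch. My real code is PHP; this is Python purely for illustration, and every name in it (TARGETS, parse_path, Router, the two handlers, and the regular expressions) is an assumption made up for the example rather than taken from my source.

```python
import re

# Targets named in the text; anything else is treated as part of the resource.
TARGETS = {"feed", "source"}


def parse_path(path):
    """Split '/[<target>/]<resource>[/]' into (target, resource, is_collection)."""
    is_collection = path.endswith("/")  # trailing slash marks a multi-document page
    parts = [p for p in path.split("/") if p]
    target = "html"  # blank target means the default HTML
    if parts and parts[0] in TARGETS:
        target = parts.pop(0)
    return target, "/".join(parts), is_collection


class Router:
    """Content handlers register a pattern and get to manage matching resources."""

    def __init__(self):
        self._handlers = []  # (compiled pattern, handler), in registration order

    def register(self, pattern, handler):
        self._handlers.append((re.compile(pattern), handler))

    def dispatch(self, path):
        target, resource, is_collection = parse_path(path)
        for pattern, handler in self._handlers:
            match = pattern.fullmatch(resource)
            if match:
                return handler(target, is_collection, *match.groups())
        return None  # the caller would turn this into a 404


def post_handler(target, is_collection, year, month, slug):
    return ("post", target, year, month, slug)


def tag_handler(target, is_collection, tag):
    return ("tag", target, tag, is_collection)


router = Router()
router.register(r"(\d{4})/(\d{2})/([\w-]+)", post_handler)
router.register(r"tags/([\w-]+)", tag_handler)
```

With this in place, router.dispatch("/feed/2010/05/post") reaches post_handler with the target "feed", while router.dispatch("/tags/xhtml/") reaches tag_handler with the default "html" target and the collection flag set. One limitation of the sketch is that a resource whose identifier begins with a target name would be misread, so a real system needs to reserve those names.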

Taxonomy

Secondly, after making sure you know what URLs you are minting, think through taxonomy. One of the keys to isolating information correctly for storage is disentangling the links between data from the data being linked. Suppose I have a document which has some information about Tacitus and Pliny. Our first instinct is to give it tags with values “Tacitus” and “Pliny”, so we can display the document and put the usual little link at the bottom ‘tagged with …’. This is actually a mistake though, because we have muddled two things: the piece of data ‘Pliny’, and the link between that data and the document. By duplicating the value of the tag in every document that references Pliny, we make it harder later to change the displayed value of the tag to ‘Pliny the younger’ without going through each document to change the value of the tag, possibly breaking any permalinks to category indices and so on as well.

This is a classic rookie mistake in information storage. For an example see Camen Design, picked not out of vindictiveness but because it is such a good site, and so very open about its code, to browse and be inspired by. However, I suspect Mr Camen would not now lock himself into fixed category names in quite the same way if he were starting again.

All good CMSs implement the solution to this as a matter of course, in the form of taxonomies. You can read up on this on Google if the idea is new. Somewhere, you need to store a map between keys and values, and label the document with the key, not the value, so that it can be resolved to the value when the document is processed. Not hard, but it is important to get this sort of thing well thought through from the start. Drupal is the most impressive CMS I am aware of in terms of working very hard on the interface to make taxonomies and content types really accessible to the average administrator.
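The key-not-value idea can be sketched in a few lines. This is a Python illustration rather than my PHP, and the names TERMS, DOCUMENTS, and tag_links, along with the sample data, are all hypothetical.

```python
# One central map from stable term keys to display labels (hypothetical data).
TERMS = {"tacitus": "Tacitus", "pliny": "Pliny"}

# Documents are labelled with keys only, never with the display label.
DOCUMENTS = {
    "2010/05/post": {"title": "Roman historians", "tags": ["tacitus", "pliny"]},
}


def tag_links(doc_id, terms=TERMS, documents=DOCUMENTS):
    """Resolve a document's tag keys to (url, label) pairs at render time."""
    return [(f"tags/{key}/", terms[key]) for key in documents[doc_id]["tags"]]
```

Changing the label is then a single assignment, TERMS["pliny"] = "Pliny the younger", after which every document shows the new label while the tags/pliny/ permalink stays exactly as it was, and no document has to be touched.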

Design decision: code structure

Unfortunately, by this point a good number of code files will need to be written. The exact layout your project uses is up to you, but some idea of what classes you will write, how they interact, and so on, will be necessary now. Plan a little in advance and everything will fall into place. As you go, the API of core functions you need will become more apparent, and it should be possible to refactor as you go unless you have very big plans.

Finishing off

After much of the basic code is written, the rest of the code becomes details really and can be worked out later. On the other hand, all the frontend stuff sticks its head in, and a whole load of time will get spent working out the layout, design, and styling of the components, as well as all the finicky things like ETags, caching, scrupulously making sure we follow the HTTP spec, and more. At the very least, make sure all the points covered by Yahoo! YSlow and Google’s PageSpeed are followed, but there is plenty more to get right too that comes under the banner of ‘doing it right’ more than ‘making it fast’. As a benchmark, I checked my code against WordPress, and found that with page caching enabled I was spending 1ms in the script compared to 190ms for WP, or 5ms without caching. You should easily be able to get similar speedups over larger projects.
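As one example of the finicky HTTP details, here is a minimal sketch of ETag handling for conditional GET requests. It is a Python illustration, not my PHP, and the function names make_etag and respond, the choice of a truncated SHA-256 hash, and the Cache-Control value are all assumptions for the example.

```python
import hashlib


def make_etag(body):
    """Derive a quoted entity tag from the response body (assumed scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body, if_none_match=None):
    """Return (status, headers, body), answering 304 when the client copy is current."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "no-cache"}  # cache, but revalidate
    if if_none_match is not None:
        candidates = []
        for candidate in if_none_match.split(","):
            candidate = candidate.strip()
            if candidate.startswith("W/"):
                candidate = candidate[2:]  # If-None-Match uses weak comparison
            candidates.append(candidate)
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # no body is sent with Not Modified
    return 200, headers, body
```

A first request gets a 200 with the ETag header; when the browser sends that value back in If-None-Match and the page has not changed, the reply is an empty 304. Hashing the body on each request is the simplest scheme to show, though it saves bandwidth rather than script time; a page cache would more likely store the tag alongside the cached page.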