XML is not a Markup Language!

I read an article recently in which the author implied that structured XML applications like OpenXML are somehow derivatives of XML itself. I can see how he might have arrived at that simplistic but mistaken premise. The name that XML stands for, eXtensible Markup Language, sounds as though it describes a root document type from which all other XML-based applications are extended, but that is not actually the case. XML is not a Markup Language!

It is easy to get the wrong idea because so many of the current applications of XML–true markup languages–actually have “ML” in their names: WebML, GovML, XGMML, OPML… (these and many more are listed at Cover Pages XML Applications and Initiatives). Indeed, markup languages are applications of XML, but are not derivations of XML.

So what is XML, if not a markup language? XML is basically a standard for how to read markup. Formally, the XML W3C Recommendation defines the grammar rules for a parser generator. That is, it defines the syntax and rules for configuring a parser to recognize the parts of a document. It does not help that the spec’s home page itself describes XML as “a simple, very flexible text format derived from SGML (ISO 8879)”–this wording only augments the perception that XML is somehow a format from which further designs are derived.

The W3C XML specification describes certain default aspects of the parsing rules for XML-based documents: the character delimiters required for markup recognition (most notably angle brackets), the role of spaces, equal signs, and quotes for recognizing attributes, the character encodings allowed for conforming data streams, and rules for accepting short forms of empty elements, among others. In SGML systems, those rules were externally declared; XML systems typically internalize many of those behaviors, although features such as encodings can be changed for documents that have a proper XML declaration up front. With just this much information built into it, an XML parser can at least work its way through an XML document and report whether the document is Well Formed–the most minimal form of validation in XML.
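As a minimal sketch of that last point, the Python snippet below uses the standard library's non-validating parser, `xml.etree.ElementTree`, to check well-formedness only. The `recipe` and `memo` vocabularies and the `is_well_formed` helper are invented for illustration; the parser has never heard of either vocabulary, yet it can still tell which samples obey the built-in markup rules.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser can read the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Two invented vocabularies; the parser knows nothing about either one.
recipe = '<recipe name="toast"><step>Toast the bread.</step><serve/></recipe>'
memo = '<memo to="staff"><body>Meeting at noon.</body></memo>'

# Mismatched end tag: violates the nesting rules.
broken = '<memo to="staff"><body>Meeting at noon.</memo>'

# Attribute value without quotes: violates the attribute syntax.
unquoted = '<memo to=staff/>'

for sample in (recipe, memo, broken, unquoted):
    print(is_well_formed(sample))
```

Note that nothing here says whether `<step>` belongs inside `<recipe>`; the parser only confirms that the angle brackets, quotes, and nesting follow the rules.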

What makes DocBook, DITA, and XHTML truly different from each other, though, is a further feature, the real “X” of XML: the meta-language or rules for describing a formal schema for a type of document. Specific types of schema notation that you may have seen include Document Type Definitions (DTD), XML Schema Definition (XSD) and RELAX NG. The schema defines the constraints for the structure, content, and often the semantics of a particular model of markup. What that means is that the schema, not XML itself, sets forth the actual rules for what markup a document is allowed to contain. In that sense, XML is not actually extensible; the term was stretched from the beginning to mean something more like “openly definable.” In fact, most XML vocabularies are dead ends, extension-wise; they are expressions of finite models, constrained to meet particular business requirements for automated processing and presentation. In most cases, you cannot take an existing XML markup application and extend it further within the XML processing architecture, other than by committee process to modify the current schema.
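To make the division of labor concrete, here is a rough sketch in Python. The `MEMO_MODEL` dictionary is a toy stand-in for a schema, not a real DTD, XSD, or RELAX NG grammar, and the `memo` vocabulary and `violations` helper are hypothetical. Both sample documents are well-formed XML; only the content model decides which one is an allowed memo.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> names of allowed child elements.
# A toy stand-in for a schema; real schema languages express far more.
MEMO_MODEL = {
    "memo": {"to", "body"},
    "to": set(),
    "body": {"para"},
    "para": set(),
}

def violations(root, model):
    """List the places where a parsed document departs from the model."""
    problems = []
    for parent in root.iter():
        if parent.tag not in model:
            problems.append(f"unknown element <{parent.tag}>")
            continue
        for child in parent:
            if child.tag not in model[parent.tag]:
                problems.append(f"<{child.tag}> not allowed in <{parent.tag}>")
    return problems

good = "<memo><to>staff</to><body><para>Noon.</para></body></memo>"
bad = "<memo><para>Noon.</para></memo>"  # well formed, but para is misplaced

print(violations(ET.fromstring(good), MEMO_MODEL))
print(violations(ET.fromstring(bad), MEMO_MODEL))
```

Changing what a memo may contain means editing the model, which mirrors the point above: the vocabulary changes only when its schema does.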

There is an historical path around this constraint. The mechanism is simply to formalize a processing expectation for an element’s role, which you might think of as ascribing a synonym to an element name and then associating the processing based on the shared relationship. This mechanism, called Architectural Forms, finds use in a number of contemporary XML implementations.

DITA, the Darwin Information Typing Architecture, is an enhanced example of the principle of using architectural role attributes to extend the roles of elements (msgph is a subclass of ph, for example). It goes further by associating those roles with schema shells that match or further constrain the structure and content and perhaps even the meaning of the original element declarations. In effect, a DITA specialization is always a more constrained example of the previous schema from which it was derived. Therefore it is actually possible to generalize specialized DITA content back to its base schema declaration rules: the generalized content is always a fully allowed subset of the base type. In a critical distinction, Architectural Forms usage generally does not correlate the ancestry of roles as well as DITA, and this can affect the reliability of content inclusions from other roles-based sources. It’s fair to say that while Architectural Forms informed DITA specialization, DITA specialization defines its own unique rules for the validation and processing of derived vocabularies.
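The generalization idea can be sketched in a few lines of Python. This is an illustration under assumptions, not DITA tooling: the `class` values below follow the ancestry pattern as I understand it (base type first, most specialized name last) and are written directly into the sample for visibility, whereas in practice they would normally be supplied as schema defaults. The `ancestry` and `generalize` helpers are hypothetical names.

```python
import xml.etree.ElementTree as ET

def ancestry(class_value):
    """Split a DITA-style class value into element names, base type first."""
    # Skip the leading "+" or "-" flag; the remaining tokens are module/name pairs.
    tokens = class_value.split()[1:]
    return [token.split("/", 1)[1] for token in tokens]

def generalize(root):
    """Rename every element that declares an ancestry to its base type name."""
    for elem in root.iter():
        class_value = elem.get("class")
        if class_value:
            elem.tag = ancestry(class_value)[0]
    return root

# Assumed sample: msgph declares that it is a specialization of ph.
sample = (
    '<p class="- topic/p ">Error: '
    '<msgph class="+ topic/ph sw-d/msgph ">File not found</msgph></p>'
)

root = generalize(ET.fromstring(sample))
print(ET.tostring(root, encoding="unicode"))
```

Because the class value is left in place, the record of the original role survives the renaming, which is what lets a processor treat the generalized element by its inherited role.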

So it’s a misconception that XML-based languages such as DocBook, OpenXML and DITA are restricted subsets of some archetype starter set or data model. Rather, they are fully defined models, in an XML syntax, of structure and content as constrained by their respective XML schemas. DITA itself is just an application of XML that starts off defining a topic and a map, in simple terms. But it goes on to define role-based processing and rules for specializing those content models into the rich panoply of DITA specializations, which I submit is a better example of true derivation from an archetype.

In summary, XML provides the rules for creating your schema, and your schema defines the markup language for your content. Hence XML is not a markup language. All clear?


2 Responses to XML is not a Markup Language!

  1. Pingback: XML is not a Markup Language! | Learning by Wrote - xml

  2. martin says:

    A very interesting and strong argument, but I would still argue that essentially XML is still a markup language which attempts to classify bits of data in a non-semantic way, thus making storage and transfer of data easier.
