XML Basics
I do not pretend to be an XML expert, and this page has been written primarily as background reading for my article on Office Open XML [link to the main article at OOXML.php#XMLBasics], Microsoft’s latest file format. I try to present a little history and enough basic information to make that article readable. I hope, however, that this page will also prove to be readable on its own, with as little requirement for foreknowledge as possible. As always, please let me know [link to e-mail the author at mailto:Tony@WordArticles.com] if I have failed.
XML stands for eXtensible Markup Language, and it is the latest in a long line of markup languages stretching back to the ’60s. Markup languages have their origins in the publishing industry, where manuscripts would be ‘marked up’ in preparation for printing, but XML has gone far beyond its humble beginnings and can now be used to store any, and, it seems, all data. This is not the place for a history lesson, nor for a full XML tutorial (The World Wide Web Consortium’s web-site [link to http://www.w3.org/XML/] is always a good place to start), and what follows is just enough to enable you, I hope, to understand Word documents. There isn’t a huge amount of terminology, and what there is, is introduced as it becomes relevant.
In a markup language, a, theoretically, clear distinction is made between data and metadata, metadata being, loosely, data that describe the data. To take a simple example, to have a word in a Word document in bold type, the word itself would be the data, and the instruction that it should be in bold type would be the descriptive metadata.
A typical XML construct, and the sort of trivial example that one might find in a text book, could be:
<person>
  <name>
    <firstName> Tony </firstName>
    <surname> Jollans </surname>
  </name>
</person>
This is one way of holding some data about me in XML. The data, “Tony” and “Jollans”, are bounded by descriptive metadata or markup, in turn bounded by what are commonly called angle brackets, mathematical less than and greater than symbols.
The first point to note is that the markup tags are paired: “<person>” means “this is the start of the person information” and “</person>” means “this is the end of the person information”. Everything in XML must be started and ended properly, and this is a basic criterion of being what is called well formed. One of the rules of XML is that programs that process it, consumers in the jargon, must refuse to do so if it is not well formed. The end tag is always the same as the start tag except that it is preceded by a solidus, or forward slash, as shown in this example, but there is a special abbreviated form for a single tag that is both start and end in one: “<MVP/>”, for example, might be a single tag that says that the person is an MVP.
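As a rough illustration of a consumer refusing XML that is not well formed, here is a minimal sketch using Python and its standard xml.etree.ElementTree module. Python is my choice for the sketch, not something this page otherwise relies on, and the helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it refuses it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser stops at the first well-formedness error it meets.
        return False

# Every start tag has a matching end tag.
good = "<person><name><firstName>Tony</firstName><surname>Jollans</surname></name></person>"
# <name> is started but never ended.
unclosed = "<person><name><firstName>Tony</firstName></person>"
# The abbreviated form: a single tag that is both start and end.
empty_tag = "<person><MVP/></person>"

print(is_well_formed(good))       # True
print(is_well_formed(unclosed))   # False
print(is_well_formed(empty_tag))  # True
```

The parser makes no attempt to guess where the missing end tag should have been; it simply gives up, which is the behaviour described above.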
The next thing, perhaps, to know is that the names of tags are case-sensitive: “name” is not the same as “Name”. It seems to be becoming common for tags to be written in what is sometimes called camel case, beginning with a lower case letter, as I have shown with “firstName” in the example above. There are some very formal rules, but for most practical purposes all you need to know is that tag names may not contain spaces, and must begin with a letter or an underscore (not a digit or other punctuation).
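The naming rules can be checked in the same way. This is a small sketch, again assuming Python’s standard xml.etree.ElementTree module; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the text is accepted by the parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

person = ET.fromstring("<person><name>Tony</name></person>")

# Look-ups are case-sensitive: "name" is found, "Name" is not.
print(person.find("name") is not None)  # True
print(person.find("Name") is None)      # True

# An end tag that differs only in case does not match its start tag.
print(parses("<name>Tony</Name>"))      # False

# A tag name may not begin with a digit.
print(parses("<1name/>"))               # False
```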
Any XML consumer should be able to work with the above XML, as it is well-formed, but it has no real meaning by itself: only an application that understands the tags, and knows what a surname, for example, is, can do anything useful with it. Provided the person, or program, producing the XML, and the person, or program, interpreting the XML both know the particular rules that apply in the particular case, there is nothing more that is required for XML to be used.
There is a formal way of stating the rules for the particular markup being used, in what is called a schema, but schemas say nothing of the meaning of the data. The XML, the tags, the schema: they provide a structure that can be used, but it is down to the users and the application software to make sure those data are sensible. There is an old acronym: GIGO, Garbage In, Garbage Out; the following construct is the same as the one above, but the data are garbage and any action taken with them, given the assumption that they should be names, will produce garbage.
<person>
  <name>
    <firstName> 2.71828 </firstName>
    <surname> 3.14159 </surname>
  </name>
</person>
So it is with the XML that makes up a Word Document: only a word-processing application can really work with it.
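To make the point concrete, the following sketch (Python, standard library only) shows a parser happily accepting both the sensible and the garbage versions; only an application-level rule can tell them apart. The rule used here, looks_like_name, is a deliberately crude, made-up check for illustration and not a real validation scheme.

```python
import xml.etree.ElementTree as ET

sensible = "<person><name><firstName> Tony </firstName><surname> Jollans </surname></name></person>"
garbage = "<person><name><firstName> 2.71828 </firstName><surname> 3.14159 </surname></name></person>"

def read_first_name(text):
    """Parse the XML and return the first name with surrounding spaces removed."""
    person = ET.fromstring(text)  # succeeds for both inputs: both are well formed
    return person.findtext("name/firstName").strip()

def looks_like_name(value):
    """Hypothetical application rule: a name consists of letters only."""
    return value.isalpha()

for text in (sensible, garbage):
    first = read_first_name(text)
    print(first, looks_like_name(first))
# Tony True
# 2.71828 False
```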
You have just seen some trivial XML. Here is some more:
<bookmark> <name> BookMark01 </name> </bookmark>
This does not come out of a Word document, but, in another universe, it could have: superficially, it looks like it could be a way of storing information about a bookmark in a document. This XML, just like the previous example, contains a “name” tag, used for the name. The name of a bookmark, however, is not the same as the name of a person: bookmarks do not have first names and surnames, so the markup is different.
A program that worked with both people and bookmarks might have difficulty understanding the markup if it couldn’t distinguish between the names of people and the names of bookmarks. A way of avoiding this is provided by the use of what are called namespaces. A namespace is a sub environment in which, in this example, name always means the same thing. You might, for example, have one namespace called ‘realworld’, which included people, and one called ‘cyberspace’, which included concepts, and you might then code:
<person> <realworld:name> <firstName> Tony </firstName> <surname> Jollans </surname> </realworld:name> </person>
<bookmark> <cyberspace:name> BookMark01 </cyberspace:name> </bookmark>
Of course this makes no real sense, but the special form of tag name, using a prepended namespace name, with a colon separator, can be used to distinguish, or disambiguate, the two different name tags. With this, an application that knew about both bookmarks and people would easily be able to know which tag was which.
In reality namespaces have names like “http://schemas.openxmlformats.org/wordprocessingml/2006/main”. To me, and probably to you, this looks like a web address; do not be fooled: it is used purely as a name and is, in the jargon, a Uniform Resource Identifier (URI). Web addresses, or Uniform Resource Locators (URLs), are one type of URI; Uniform Resource Names (URNs), which begin with “urn:”, are another. You will note that this namespace name has a scheme (the bit at the beginning that, roughly, specifies what type of identifier it is) of “http:”. This stands for HyperText Transfer Protocol and normally indicates a URL related, unsurprisingly, to the transfer of hypertext, which, in normal English, means transferring the stuff web pages are made of (“hypertext”) to your computer. Namespace names have nothing to do with hypertext, or transfer of anything, and this form of name seems designed to confuse. There might be a valid web address with the same name as the namespace, and it might, or might not, have something to do with the namespace, but there are no rules: an XML consumer treats the name as nothing more than a string of characters and never needs to fetch anything from it. In practice, namespace names rarely mean anything at all, and just happen to look like web addresses.
When namespace names are long they can make the XML very unwieldy, and there is a special way of providing a short alias to use instead of the full name. This works in much the same way as a table alias does in, say, SQL, and will be familiar to those of you who have worked with relational databases. Sticking with the same nonsense I have used thus far, you could code:
<allTogetherNow xmlns:w = "realworld" xmlns:s = "cyberspace">
  <person>
    <w:name> <firstName> Tony </firstName> <surname> Jollans </surname> </w:name>
  </person>
  <bookmark>
    <s:name> BookMark01 </s:name>
  </bookmark>
</allTogetherNow >
Here, “w” is defined as an alias (a prefix, in the jargon) for the “realworld” namespace, and “s” as one for the “cyberspace” namespace. You will notice the use of “xmlns” on the containing “allTogetherNow” tag. This is a special, reserved prefix (guess what it stands for) that belongs to the granddaddy of all XML, as it were, and should be understood by all XML consumers. The aliases it defines can be used on the tag that declares them and on everything inside that tag.
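A consumer sees through the aliases to the full namespace names. The sketch below assumes Python’s standard xml.etree.ElementTree module, which reports a namespaced tag as the namespace name in braces followed by the local name; the prefix “c” in the final look-up is an arbitrary one of my own choosing, to show that the alias itself carries no meaning.

```python
import xml.etree.ElementTree as ET

doc = (
    '<allTogetherNow xmlns:w="realworld" xmlns:s="cyberspace">'
    "<person><w:name><firstName>Tony</firstName><surname>Jollans</surname></w:name></person>"
    "<bookmark><s:name>BookMark01</s:name></bookmark>"
    "</allTogetherNow>"
)
root = ET.fromstring(doc)

# Tags that belong to a namespace are reported as "{namespace name}local name".
namespaced = [el.tag for el in root.iter() if el.tag.startswith("{")]
print(namespaced)  # ['{realworld}name', '{cyberspace}name']

# The two "name" tags are now distinct; a plain, un-namespaced "name" matches neither.
print(root.find("person/{realworld}name") is not None)  # True
print(root.find("person/name") is None)                 # True

# The alias is only shorthand: any prefix mapped to the same namespace name will do.
bookmark_name = root.find("bookmark/c:name", {"c": "cyberspace"})
print(bookmark_name.text)  # BookMark01
```

Note that “firstName” and “surname” carry no prefix here and so are not reported as being in either namespace.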
In all the XML examples I have used thus far, I have used colour and space to try to make them clear for the human reader. The colours are typical of those used by XML editors but they have no inherent meaning, and the spaces (more generally, white space: any combination of spaces, tabs, carriage returns, and linefeed characters) may be used freely, except where explicitly forbidden (generally within names and immediately after the “<” or “</” that begins a tag). Strictly speaking, white space between tags is handed on to the application, which decides whether it matters; in examples such as these it does not. Without it, the XML is far less understandable by people, but it makes little difference to machines, and this means the same as the previous snippet:
<allTogetherNow xmlns:w="realworld" xmlns:s="cyberspace"><person><w:name><firstName>Tony</firstName><surname>Jollans</surname></w:name></person><bookmark><s:name>BookMark01</s:name></bookmark></allTogetherNow>
My apologies for throwing a long line to your browser; unless you have a very wide screen or are viewing at less than 100%, you will need to scroll right should you wish to see the complete line. This is typical of Word documents, which may well contain single lines that are millions of characters long.
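For anyone who would like to check the claim rather than take my word for it, here is a sketch in Python (standard library only) that parses the spaced-out and the compact forms and compares their structure once surrounding white space has been trimmed from the text. Both forms use the “w” and “s” aliases declared on the containing tag, and the helper name shape is made up for the example.

```python
import xml.etree.ElementTree as ET

spaced = """<allTogetherNow xmlns:w = "realworld" xmlns:s = "cyberspace">
  <person> <w:name> <firstName> Tony </firstName> <surname> Jollans </surname> </w:name> </person>
  <bookmark> <s:name> BookMark01 </s:name> </bookmark>
</allTogetherNow >"""

compact = (
    '<allTogetherNow xmlns:w="realworld" xmlns:s="cyberspace"><person><w:name>'
    "<firstName>Tony</firstName><surname>Jollans</surname></w:name></person>"
    "<bookmark><s:name>BookMark01</s:name></bookmark></allTogetherNow>"
)

def shape(element):
    """Describe an element as (tag, trimmed text, shapes of its children)."""
    text = (element.text or "").strip()
    return (element.tag, text, [shape(child) for child in element])

# With white space trimmed, the two documents have the same structure and data.
print(shape(ET.fromstring(spaced)) == shape(ET.fromstring(compact)))  # True

# The parser itself does not trim anything; it is the application that decides to.
raw = ET.fromstring(spaced).find("person/{realworld}name/firstName").text
print(repr(raw))  # ' Tony '
```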
There is, of course, more than this to XML and any items not explained here will be explained, as much as necessary for a basic understanding, when they appear in the main discussion. For the moment, that is all I wish to say: it should be enough to enable you to follow my meanderings through a Word document, and you are encouraged to return to the main article [link to the main article at OOXML.php#XMLBasics] (when I have written it, that is).