XML White Paper
June 23, 2016
Note: Use of the code examples in this paper, including this document's source code, is absolutely free.
Although visual and user interface standards are a necessary layer, they are insufficient for representing and managing data. Today, the Internet is merely an access medium to text and pictures. There are no standards for intelligent search, data exchange, adaptive presentation, and personalization. The Internet must go beyond setting an information access and display standard. It must set an information understanding standard: a standard way of representing data so that software can better search, move, display, and otherwise manipulate information currently hidden in contextual obscurity. HTML cannot do this, because HTML is a format that describes how a Web page should look, and does not represent data. For example, HTML does not:
In short, while HTML provides rich facilities for display, it does not provide any standards-based way to manage data as data.
A standard for data representation will expand the Internet in much the same way that the HTML standard for display did a few years ago. The data standard will be the vehicle for business transactions, publication of personal preference profiles, automated collaboration, and database sharing. Payments, medical histories, pharmaceutical research data, semiconductor part sheets, and purchase orders will all be written in this format. It will open up a wide variety of new uses, all based on a standard representation for moving structured data around the Web as easily as we move HTML pages today. The data standard is XML. This paper shows how XML can be used as a standard format for Web information, based on proposals currently before the W3C standards organization, including recent proposals from Microsoft Corporation. To help developers get started using XML as a data format, Microsoft has posted a free pure-Java™ parser for the XML language as well as additional XML resources at .
XML: A Standard Format for Data
XML provides a data standard that can encode the content, semantics, and schemata for a gamut of cases, from simple to complex. XML can encode the representation for:
The flexibility of a single data representation format allows any software to determine the semantics of a data element, without previous knowledge of the underlying meaning of the data. Information can then be reused for new purposes and in novel contexts. For example, a record from a database of restaurants and a record from a client contact database might both be reused in the context of an appointment, such as setting a lunch date with a client. The relationships between the restaurant and contact data do not reside in the schema data described by either database individually, but are extensions defined by the instance of the appointment.
XML provides a structural representation of data that has proved broadly implementable and easy to deploy. Industrial implementations in the SGML community and elsewhere, in fields such as aircraft (ATA), automotive (J2008), banking (OFX), and semiconductors (Pinnacles PCIS), demonstrate the intrinsic quality and industrial strength of XML's tree-structured data format.
Benefits of XML
As a universal standard for the expression of data, XML offers many advantages to organizations, software developers, Web sites, and ultimately end-users.
For software developers building Web applications and line-of-business Intranet software, XML provides a powerful, flexible format for expressing data -- whether as a wire format for sending data between client and server, a transfer format for sharing data between applications, or a persistent storage format on disk. Because structured data in XML can include a self-describing schema, XML promises interoperability between applications that manipulate structured data independent of the underlying semantics.
For Web sites, XML offers a mechanism for adding meta-data or meta-content to HTML. For example, the proposed Channel Definition Format (CDF) provides an application of the XML language that allows a Web site to publish existing HTML content as a channel for "push" clients from Microsoft, PointCast, AirMedia, or BackWeb. XML can also provide a means for embedding arbitrary data and annotations within HTML, extending the possibilities for Web-based applications based on HTML and scripts.
For end-users, XML promises to provide a much richer set of Web applications for browsing, communication, and collaboration. The growing use of XML will improve Web-browsing applications for viewing, filtering, and manipulating information on the Internet. For example, because XML enables publishers to supplement their Web sites with meta-data such as CDF, users can receive "pushed" content as structured channels. As collaboration on the Web spreads to more businesses, customer services will eventually migrate from phone lines and storefronts to Web sites. The majority of these Intranet and Extranet business applications will involve manipulation or transfer of data and database records, such as purchase orders, invoices, customer information, appointments, maps, and so forth. XML promises a revolution in the richness of end-user possibilities on the Web because it enables such a wide array of business applications to be implemented on the Internet.
The XML Syntax
XML stands for "eXtensible Markup Language," and is a text-based format, similar to HTML in many respects, but designed especially to store and transmit data. Like HTML, an XML document holds text annotated by tags. However, in HTML, there is a limited set of available tags and these specify how the enclosed text should look (bold, italic, and so on). XML, by contrast, allows an unlimited set of tags, and each indicates not how something should look, but what kind of data it contains. For example, a tag might hold a price, an order number, or a name. Each document's author can determine what kind of data to use and choose tag types that fit his needs, either public tag types or tags of his own invention.
Let's look at some XML:
Layman Andrew 1997 - 2018 5.95 Number, the Language of Science Dantzig, Tobias 12.95 Introduction to Objectivist Epistemology Rand, Ayn 0-452-01030-6 12.95 Tchaikovsky's First Piano Concerto Janos 1.50 small
Rather than describing the order in which data should be displayed, the tags indicate what each item of data means (whether it is a title element, an author element, and so forth). Any receiver can decode the document, and each can use it for his own purposes. For example, the bookstore might use it to fulfill the order. A market analyst might use many similar orders to discover which books are most popular, which price categories sell well, and so forth. An individual might file it as a record of his purchases.
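The different uses of one document described above can be sketched in a few lines of Python. This is only an illustration: the element names (ORDER, ITEM, PRICE, BOOK, TITLE, AUTHOR) are assumptions chosen for the sketch, not names confirmed by this paper's listing, and only two items are shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; the element names are illustrative assumptions.
ORDER_XML = """
<ORDER>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK>
      <TITLE>Number, the Language of Science</TITLE>
      <AUTHOR>Dantzig, Tobias</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <BOOK>
      <TITLE>Introduction to Objectivist Epistemology</TITLE>
      <AUTHOR>Rand, Ayn</AUTHOR>
    </BOOK>
  </ITEM>
</ORDER>
"""

order = ET.fromstring(ORDER_XML)

# A bookstore fulfilling the order might only need the titles to ship.
titles = [t.text for t in order.iter("TITLE")]

# A market analyst might only look at the prices.
prices = [float(p.text) for p in order.iter("PRICE")]
order_total = sum(prices)
```

Both readers work from the same document; each simply selects the elements that matter for its own purpose.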
XML also supports text markup, in which an element's text contains tags in the middle of the text flow. These usually indicate something special about the text's meaning. For example:
Tchaikovsky's First Piano Concerto
Here, the purpose of the element was not to separate "Tchaikovsky" from the rest of the record's title, but to indicate that, in addition to being part of the title, it is also the composer's name. Anyone looking for composers would search on such tags, while anyone processing the data to find the record's title would skip these tags to arrive at the complete title.
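A minimal Python sketch of the two readings follows. The tag names TITLE and COMPOSER are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: COMPOSER is an assumed name for the inline tag.
TITLE_XML = "<TITLE><COMPOSER>Tchaikovsky</COMPOSER>'s First Piano Concerto</TITLE>"

title = ET.fromstring(TITLE_XML)

# A reader interested in composers looks for the inline tag.
composer = title.findtext("COMPOSER")

# A reader who wants the complete title skips the tags and joins all the text.
full_title = "".join(title.itertext())
```

Here `composer` holds only the marked-up name, while `full_title` holds the whole title with the inline tag ignored.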
Some terminology: In XML, there are two types of tags, a start tag (such as ) and an end tag (such as ). The information between these tags is called the contents. The element includes the start tag, contents, and end tag, all taken together. As you can see from the example above, elements can contain other elements in their contents, such as the within each .
Because this is an order from a bookstore, the element names reflect bookstore terminology. However, if you looked at an XML document containing medical research data, you would find experiments, temperatures, dosages, results, and so forth. Each kind of document has terms specific to its needs.
Namespaces in XML
XML can provide a mechanism for authors to invent new element names and also publish those names so that a community can easily agree on standard terms for representing common data elements. The (please note that viewing this proposal requires W3C member authentication) makes every element name subordinate to a Universal Resource Identifier (URI), which ensures that even if two authors choose the same name, they remain unambiguous. In the same way that anyone can publish their own Web pages or view pages from others, the namespace facility allows anyone to define his own dictionary of terms or to use a public namespace of common terms.
Layman Andrew 1997 - 2018 1234567890
This tells any reader that if a name begins with "co:" its meaning is defined by whoever owns the "http://www.company.com" namespace.
Names used within the co:ITEM element are presumed to come from the same namespace, and if so, do not need further qualification. Namespaces make sure that element names do not conflict and clarify who defined which term. They do not give instructions on how to process the elements; readers still need to know what the elements mean and decide how to process them. Namespaces simply keep the names straight.
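The following Python sketch shows how a reader might resolve a "co:" name to its owning namespace. It makes two assumptions: it uses the xmlns attribute syntax of the later W3C Namespaces recommendation, which may differ from the proposal discussed here, and the ORDER and ORDER-NO element names are hypothetical. Under that recommendation, child names are qualified explicitly rather than inherited from the parent's prefix, so the sketch prefixes each one.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the "co" prefix and its URI come from the text.
DOC = """
<ORDER xmlns:co="http://www.company.com">
  <co:ITEM>
    <co:ORDER-NO>1234567890</co:ORDER-NO>
  </co:ITEM>
</ORDER>
"""

root = ET.fromstring(DOC)

# Map the prefix used in queries to the URI that owns the names.
ns = {"co": "http://www.company.com"}

item = root.find("co:ITEM", ns)

# ElementTree stores the fully qualified name as "{URI}local-name".
qualified_name = item.tag

order_no = item.findtext("co:ORDER-NO", namespaces=ns)
```

Two authors who both define an ITEM element would end up with different qualified names, because each name carries its own URI.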
An author can specify an element's data type; that is, whether an element's string contains a number, a date, and so on, and the format of the string's contents. One can use a lextype attribute for this purpose:
1997 - 2018
Here, "DATE-ISO8061" specifies that the SOLD-ON element's contents are a date in the format specified by the international date standard, ISO 8601. As with element names, authors can design their own data types and also use types shared publicly. Microsoft is working with the W3C to define a set of standard types, and will publish a public list that anyone can freely use.
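As a rough sketch of how software might act on a declared data type, the Python below converts an element's string into a typed value based on its lextype attribute. The date value and the small converter registry are assumptions made for the illustration; they are not part of any published type list.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical element; the date value is made up for illustration.
elem = ET.fromstring('<SOLD-ON lextype="DATE-ISO8061">1997-03-17</SOLD-ON>')

# Assumed, minimal registry mapping lextype names to converter functions.
CONVERTERS = {"DATE-ISO8061": datetime.date.fromisoformat}

def typed_value(element):
    """Convert the element's text using its lextype; fall back to a string."""
    convert = CONVERTERS.get(element.get("lextype"), str)
    return convert(element.text)

sold_on = typed_value(elem)
```

An element without a recognized lextype is simply returned as text, so unknown types degrade gracefully.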
Schemata in XML
A schema is a formal specification of element names that indicates which elements are allowed in a document, and in what combinations. Using a schema, an author can define precisely which element names are permitted in his document; and within each element, which subelements, attributes, and relations are allowed. Authors can invent their own schemata, or they can share ones created by other authors. Readers can check the schema references to verify that the document they have received is the correct type. They can also use the information in the schema to automatically validate the structure of the document.
Microsoft has proposed a "Document Type Definition" (DTD) syntax for expressing the schema for an XML document directly within XML itself, allowing XML data to describe its own structure. Expressing schemata within XML adds great power to the XML format because it makes it possible for software examining certain data to understand its structure without earlier knowledge about the data or its meaning.
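To make the idea of structural validation concrete, here is a toy check in Python. It is not a DTD validator and does not follow any proposed schema syntax: the schema is an assumed dictionary mapping each element name to the child names it may contain, and the element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy schema (an assumption for illustration): each element name maps to
# the set of child element names it is allowed to contain.
SCHEMA = {
    "ORDER": {"ITEM"},
    "ITEM": {"PRICE", "BOOK"},
    "BOOK": {"TITLE", "AUTHOR"},
    "PRICE": set(),
    "TITLE": set(),
    "AUTHOR": set(),
}

def validate(element, schema):
    """Return a list of structural problems found at or below `element`."""
    if element.tag not in schema:
        return [f"unknown element <{element.tag}>"]
    problems = []
    allowed = schema[element.tag]
    for child in element:
        if child.tag not in allowed:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(validate(child, schema))
    return problems

good = ET.fromstring("<ORDER><ITEM><PRICE>5.95</PRICE></ITEM></ORDER>")
bad = ET.fromstring("<ORDER><ITEM><COLOR>red</COLOR></ITEM></ORDER>")

good_problems = validate(good, SCHEMA)
bad_problems = validate(bad, SCHEMA)
```

A reader receiving a document could run a check of this kind before processing it, rejecting anything whose structure does not match the schema it expected.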
XML and HTML Complement Each Other
HTML is about user interface; XML is about data. Dynamic HTML describes display and user interaction; XML describes information. This leads to two natural relations between HTML and XML: First, XML can add information to an HTML document; second, HTML can display information expressed in XML format.
Extending HTML with XML
A common complaint about HTML today is that it isn't extensible. The set of tags is fixed and fairly display-centered, which makes it difficult to add information such as revision histories or to mark up displayed text (for example, defining it as a legal point, an author's name, or an appendix). It is not easy to add semantic information to HTML pages. Historically, various programs have attempted to deal with this problem by using non-standard "tricks" such as hiding data inside HTML comments. These are awkward, and do not lend themselves to widespread, standards-based data sharing. HTML is not easily extended for data representation, partly due to its nature as a display language, and partly because it was not designed for open extensibility. To solve this, Microsoft is working with the W3C to define a format for putting XML data inside HTML pages. By extending HTML to allow arbitrary XML data elements, a wide range of applications can use HTML as the primary document or display format, and also use XML embedded within these documents to hold application-specific data.
Displaying XML Data in HTML
An XML document does not (by itself) specify whether or how its information should be displayed. The XML data merely contains the facts, such as who ordered which books at which prices. HTML is an ideal display language for presenting this data to an end-user. For example, an employee of an online bookstore may visit a Web page to find a list of order entries. On the back end, the individual data records are expressed in XML, but they are presented to the employee as an HTML page. In order to construct this Web page, either the Web server or the Web browser will need to convert the XML data records into an HTML presentation, such as a table.
The mechanisms of data binding and style can be used to arrange XML data into a visual presentation and add interactivity. Data binding is an aspect of Dynamic HTML that moves individual items of data from an information source (such as an XML document) into an HTML display, allowing HTML to be used as a template for displaying XML data, much like a "mail merge" in word processing. Style sheets add greater power to this process. A style sheet is a collection of programming rules for how to pull information out of an XML document and transform it into a display format such as HTML. For example, a style sheet could specify that a bookstore order should show the name and at the top, as an element, followed by a table containing columns for , , and . Different styles applied to the same XML document can produce different displays, such as an HTML table, an HTML bulleted list, or a PostScript page. Microsoft is working with the W3C to define the exact style sheet mechanism for transforming XML data into a displayable format.
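The conversion from XML records to an HTML table can be sketched in Python as follows. This is a hand-written transformation, not the style sheet mechanism under discussion at the W3C, and the element names and the `order_to_html` function are assumptions made for the illustration.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical order records; the element names are illustrative assumptions.
ORDER_XML = """<ORDER>
  <ITEM><TITLE>Number, the Language of Science</TITLE><PRICE>5.95</PRICE></ITEM>
  <ITEM><TITLE>Introduction to Objectivist Epistemology</TITLE><PRICE>12.95</PRICE></ITEM>
</ORDER>"""

def order_to_html(order):
    """Render each ITEM as one row of an HTML table with two columns."""
    rows = ["<tr><th>Title</th><th>Price</th></tr>"]
    for item in order.iter("ITEM"):
        # Escape the data so that it cannot be mistaken for HTML markup.
        title = html.escape(item.findtext("TITLE", default=""))
        price = html.escape(item.findtext("PRICE", default=""))
        rows.append(f"<tr><td>{title}</td><td>{price}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"

table_html = order_to_html(ET.fromstring(ORDER_XML))
```

A second function producing a bulleted list from the same `ORDER_XML` would give a different display of identical data, which is the point of keeping the data and its presentation separate.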
For example, "Melons cost < $1 at the A&P" would be encoded as "Melons cost &lt; $1 at the A&amp;P".
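The escaping of the characters < and & can be reproduced with Python's standard library, shown here as a small illustration using the sentence from the text.

```python
from xml.sax.saxutils import escape, unescape

text = "Melons cost < $1 at the A&P"

# Replace the characters that XML reserves for markup with entity references.
encoded = escape(text)

# Reversing the encoding recovers the original text.
decoded = unescape(encoded)
```

After this runs, `encoded` contains `&lt;` in place of `<` and `&amp;` in place of `&`, and `decoded` equals the original string.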
Although simple, robust and extensible, XML is a verbose format compared to binary schemes. Consequently, we expect that HTTP 1.1 compression will improve the efficiency of XML data transfer. Microsoft is working to popularize standard, efficient compression systems for XML.
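The effect of compression on verbose, repetitive markup can be seen with a short Python experiment. It uses gzip, one of the content codings defined for HTTP/1.1; the sample document is an assumption built by repeating one hypothetical ITEM element, so the resulting ratio is illustrative rather than typical.

```python
import gzip

# Hypothetical, highly repetitive document built from one repeated element.
item = (
    "<ITEM><PRICE>12.95</PRICE>"
    "<TITLE>Introduction to Objectivist Epistemology</TITLE></ITEM>"
)
document = ("<ORDER>" + item * 200 + "</ORDER>").encode("utf-8")

compressed = gzip.compress(document)

# Fraction of the original size that remains after compression.
ratio = len(compressed) / len(document)

# Decompression restores the document byte for byte.
restored = gzip.decompress(compressed)
```

Because tag names repeat throughout an XML document, general-purpose compressors tend to remove much of the overhead that the text format adds.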
The high degree of structure in an XML document makes it easier to add digital signatures or encryption to individual parts of a document as well as to the whole document. Microsoft is working with the W3C Digital Signature Initiative to define standard, XML-based security and authentication for XML data.