XML White Paper

June 23, 2016
Microsoft Corporation

Contents

Introduction
XML and HTML Complement Each Other
Summary
Appendix

Note: Use of the code examples in this paper, including this document's source code, is absolutely free.

Introduction

The Web has placed in our hands the potential to communicate with anyone, anywhere. Fully realizing its potential depends on widespread use of standards, because, as with the telephone, this communication depends on numerous layers of interoperating technology. One such important layer is visual display and user interface, exemplified by standards such as HTML, GIF and JavaScript™ (previously LiveScript). These standards allow a page to be created once, yet displayed at different times by many receivers.

Although visual and user interface standards are a necessary layer, they are insufficient for representing and managing data. Today, the Internet is merely an access medium to text and pictures. There are no standards for intelligent search, data exchange, adaptive presentation, and personalization. The Internet must go beyond setting an information access and display standard. It must set an information understanding standard: a standard way of representing data so that software can better search, move, display, and otherwise manipulate information currently hidden in contextual obscurity. HTML cannot do this, because HTML is a format that describes how a Web page should look, and does not represent data. For example, HTML does not:

  • Provide a standard way for a doctor to send a prescription to a pharmacist.
  • Enable a medical laboratory to publish statistical information in a format that any receiver can analyze.
  • Describe an electronic payment in a form that any recipient can decode and process.
  • Provide a standard way to search legal libraries, for example, to find all litigation documents about a certain topic.
  • Specify how information in a company catalog can be transmitted, such that a salesman can work offline, show the catalog to clients, take orders, then upload those orders in a standard format.

In short, while HTML provides rich facilities for display, it does not provide any standards-based way to manage data as data.

A standard for data representation will expand the Internet in much the same way that the HTML standard for display did a few years ago. The data standard will be the vehicle for business transactions, publication of personal preference profiles, automated collaboration, and database sharing. Payments, medical histories, pharmaceutical research data, semiconductor part sheets, and purchase orders will all be written in this format. It will open up a wide variety of new uses, all based on a standard representation for moving structured data around the Web as easily as we move HTML pages today. The data standard is XML. This paper shows how XML can be used as a standard format for Web information, based on proposals currently before the W3C standards organization, including recent proposals from Microsoft Corporation. To help developers get started using XML as a data format, Microsoft has posted a free pure-Java™ parser for the XML language, as well as additional XML resources, on the Microsoft XML Web site.

XML: A Standard Format for Data

XML provides a data standard that can encode the content, semantics, and schemata for a gamut of cases, from simple to complex. XML can encode the representation for:

  • An ordinary document
  • A structured record, such as an appointment record or purchase order
  • An object, with data and methods (for example, the persistent form of a Java object)
  • A data record, such as the result set of a query
  • Meta-content about a Web site (such as CDF)
  • Graphical presentation (such as an application's user interface)
  • Standard schema entities and types
  • All the links between information and people on the Web

The flexibility of a single data representation format allows any software to determine the semantics of a data element, without previous knowledge of the underlying meaning of the data. Information can then be reused for new purposes and in novel contexts. For example, a record from a database of restaurants and a record from a client contact database might both be reused in the context of an appointment, for example, in setting a lunch date with a client. The relationships between the restaurant and contact data do not reside in the schema data described by either database individually, but are extensions defined by the instance of the appointment.

XML provides a structural representation of data that has proved broadly implementable and easy to deploy. Industrial implementations in the SGML community and elsewhere, in fields such as aircraft (ATA), automotive (J2008), banking (OFX), and semiconductors (Pinnacles PCIS), demonstrate the intrinsic quality and industrial strength of XML's tree-structured data format.

Benefits of XML

As a universal standard for the expression of data, XML offers many advantages to organizations, software developers, Web sites, and ultimately end-users.

For software developers building Web applications and line-of-business Intranet software, XML provides a powerful, flexible format for expressing data -- whether as a wire format for sending data between client and server, a transfer format for sharing data between applications, or a persistent storage format on disk. Because structured data in XML can include a self-describing schema, XML promises interoperability between applications that manipulate structured data independent of the underlying semantics.

For Web sites, XML offers a mechanism for adding meta-data or meta-content to HTML. For example, the proposed Channel Definition Format (CDF) provides an application of the XML language that allows a Web site to publish existing HTML content as a channel for "push" clients from Microsoft, PointCast, AirMedia, or BackWeb. XML can also provide a means for embedding arbitrary data and annotations within HTML, extending the possibilities for Web-based applications based on HTML and scripts.

For end-users, XML promises to provide a much richer set of Web applications for browsing, communication, and collaboration. The growing use of XML will improve Web-browsing applications for viewing, filtering, and manipulating information on the Internet. For example, because XML enables publishers to supplement their Web sites with meta-data such as CDF, users can receive "pushed" content as structured channels. As collaboration on the Web spreads to more businesses, customer services will eventually migrate from phone lines and storefronts to Web sites. The majority of these Intranet and Extranet business applications will involve manipulation or transfer of data and database records, such as purchase orders, invoices, customer information, appointments, maps, and so forth. XML promises a revolution in the richness of end-user possibilities on the Web because it enables such a wide array of business applications to be implemented on the Internet.

The XML Syntax

XML stands for "eXtensible Markup Language," and is a text-based format, similar to HTML in many respects, but designed especially to store and transmit data. Like HTML, an XML document holds text annotated by tags. However, in HTML, there is a limited set of available tags and these specify how the enclosed text should look (bold, italic, and so on). XML, by contrast, allows an unlimited set of tags, and each indicates not how something should look, but what kind of data it contains. For example, a tag might hold a price, an order number, or a name. Each document’s author can determine what kind of data to use and choose tag types that fit his needs; either public tag types or tags invented on his own.

Let’s look at some XML:

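      <ORDER>
        <SOLD-TO>
          <PERSON><LASTNAME>Layman</LASTNAME>
                  <FIRSTNAME>Andrew</FIRSTNAME>
          </PERSON>
        </SOLD-TO>
        <SOLD-ON>19970317</SOLD-ON>
        <ITEM>
          <PRICE>5.95</PRICE>
          <BOOK>
            <TITLE>Number, the Language of Science</TITLE>
            <AUTHOR>Dantzig, Tobias</AUTHOR>
          </BOOK>
        </ITEM>
        <ITEM>
          <PRICE>12.95</PRICE>
          <BOOK>
            <TITLE>Introduction to Objectivist Epistemology</TITLE>
            <AUTHOR>Rand, Ayn</AUTHOR>
            <ISBN>0-452-01030-6</ISBN>
          </BOOK>
        </ITEM>
        <ITEM>
          <PRICE>12.95</PRICE>
          <RECORD>
            <TITLE><COMPOSER>Tchaikovsky</COMPOSER>'s First Piano Concerto</TITLE>
            <ARTIST>Janos</ARTIST>
          </RECORD>
        </ITEM>
        <ITEM>
          <PRICE>1.50</PRICE>
          <COFFEE>
            <SIZE>small</SIZE>
          </COFFEE>
        </ITEM>
      </ORDER>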

Rather than describing the order in which data should be displayed, the tags indicate what each item of data means (whether it is a title element, an author element, and so forth). Any receiver can decode the document, and each can use it for his own purposes. For example, the bookstore might use it to fulfill the order. A market analyst might use many similar orders to discover which books are most popular, which price categories sell well, and so forth. An individual might file it as a record of his purchases.
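The idea that one document serves several receivers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (ORDER, ITEM, PRICE, BOOK, RECORD, COFFEE, and so on) and the helper functions are hypothetical, and the standard-library ElementTree parser merely stands in for whatever parser a receiver actually uses.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from decimal import Decimal

# Hypothetical order document; every element name is an assumption.
ORDER_XML = """<ORDER>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK>
      <TITLE>Number, the Language of Science</TITLE>
      <AUTHOR>Dantzig, Tobias</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <BOOK>
      <TITLE>Introduction to Objectivist Epistemology</TITLE>
      <AUTHOR>Rand, Ayn</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <RECORD>
      <TITLE>Tchaikovsky's First Piano Concerto</TITLE>
    </RECORD>
  </ITEM>
  <ITEM>
    <PRICE>1.50</PRICE>
    <COFFEE><SIZE>small</SIZE></COFFEE>
  </ITEM>
</ORDER>"""

order = ET.fromstring(ORDER_XML)


def book_titles(order):
    """Bookstore view: the book titles needed to fulfill the order."""
    return [title.text for title in order.findall("./ITEM/BOOK/TITLE")]


def items_per_price(order):
    """Analyst view: how many items were sold at each price."""
    return Counter(price.text for price in order.findall("./ITEM/PRICE"))


def order_total(order):
    """Customer view: the total amount spent, as an exact decimal."""
    return sum(Decimal(price.text) for price in order.findall("./ITEM/PRICE"))
```

Each function reads the same tree but asks a different question of it, which is the sense in which the tags describe meaning rather than display.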

XML also supports text markup, in which an element’s text contains tags in the middle of the text flow. These usually indicate something special about the text’s meaning. For example:

            <TITLE><COMPOSER>Tchaikovsky</COMPOSER>'s First Piano Concerto</TITLE>

Here, the purpose of the <COMPOSER> element was not to separate "Tchaikovsky" from the rest of the record’s title, but to indicate that, in addition to being part of the title, it is also the composer's name. Anyone looking for composers would search on such tags, while anyone processing the data -- looking for the record’s title -- would skip these tags to arrive at the complete title.
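The two ways of reading marked-up text described above can be sketched with Python's standard ElementTree module. The TITLE and COMPOSER element names are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names: a TITLE whose text contains an inline COMPOSER tag.
title = ET.fromstring(
    "<TITLE><COMPOSER>Tchaikovsky</COMPOSER>'s First Piano Concerto</TITLE>"
)

# A reader looking for composers searches on the inline tag.
composers = [composer.text for composer in title.iter("COMPOSER")]

# A reader who wants the complete title ignores the tags and joins all the text.
full_title = "".join(title.itertext())
```

Here `composers` holds the single name "Tchaikovsky", while `full_title` is the complete string "Tchaikovsky's First Piano Concerto".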

Some terminology: In XML, there are two types of tags, a start tag (such as <TITLE>) and an end tag (such as </TITLE>). The information between these tags is called the contents. The element includes the start tag, contents, and end tag, all taken together. As you can see from the example above, elements can contain other elements in their contents, such as the <PRICE> within each <ITEM>.

Because this is an order from a bookstore, the element names reflect bookstore terminology. However, if you looked at an XML document containing medical research data, you would find experiments, temperatures, dosages, results, and so forth. Each kind of document has terms specific to its needs.

Namespaces in XML

XML can provide a mechanism for authors to invent new element names and also publish those names so that a community can easily agree on standard terms for representing common data elements. The namespace proposal (please note that viewing this proposal requires W3C member authentication) makes every element name subordinate to a Universal Resource Identifier (URI), which ensures that even if two authors choose the same name, the names remain unambiguous. In the same way that anyone can publish their own Web pages or view pages from others, the namespace facility allows anyone to define his own dictionary of terms or to use a public namespace of common terms.

      
        
          
          
        
        
          
            
              Layman
                      Andrew
              
            
             1997 - 2018 
            1234567890
        
      

This tells any reader that if a name begins with "co:" its meaning is defined by whoever owns the "http://www.company.com" namespace.

Names used within the co:ITEM element are presumed to come from the same namespace, and if so, do not need further qualification. Namespaces make sure that element names do not conflict and clarify who defined which term. They do not give instructions on how to process the elements—readers still need to know what the elements mean and decide how to process them. Namespaces simply keep the names straight.
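The following Python sketch shows how a URI keeps two identical names apart. It makes two assumptions that should be stated plainly: it uses the xmlns attribute syntax of the later W3C "Namespaces in XML" recommendation, which may differ from the proposal syntax this paper describes, and the second namespace (http://hr.example.org) and all element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two authors both chose the name TITLE; each is tied to its own namespace URI.
DOC = """<CATALOG xmlns:co="http://www.company.com"
         xmlns:hr="http://hr.example.org">
  <co:TITLE>Number, the Language of Science</co:TITLE>
  <hr:TITLE>Sales Manager</hr:TITLE>
</CATALOG>"""

root = ET.fromstring(DOC)

# Map the prefixes used in queries to the namespace URIs they stand for.
prefixes = {"co": "http://www.company.com", "hr": "http://hr.example.org"}

book_title = root.findtext("co:TITLE", namespaces=prefixes)
job_title = root.findtext("hr:TITLE", namespaces=prefixes)

# The parser stores each name together with its URI, so the two never collide.
expanded_names = [child.tag for child in root]
```

Note that the prefix itself carries no meaning; only the URI it maps to identifies who defined the term.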

An author can specify an element's data type; that is, whether an element's string contains a number, a date, and so on, and the format of the string's contents. One can use a lextype attribute for this purpose:

       <SOLD-ON lextype="DATE-ISO8061">19970317</SOLD-ON>

Here, "DATE-ISO8061" specifies that the SOLD-ON element’s contents are a date in the format specified by the international standard ISO 8601. As with element names, authors can design their own data types and also use types shared publicly. Microsoft is working with the W3C to define a set of standard types, and will publish a public list that anyone can freely use.
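A receiver could use such a type annotation to convert an element's string into a native value. The sketch below is a minimal illustration, assuming a hypothetical converter table and a sample date value; only the lextype attribute name and the "DATE-ISO8061" type name come from the text above.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

# Hypothetical registry mapping lextype names to converter functions.
CONVERTERS = {
    "DATE-ISO8061": lambda text: datetime.strptime(text.strip(), "%Y%m%d").date(),
}


def typed_value(element):
    """Convert an element's text according to its lextype attribute.

    Elements with no lextype, or with one this receiver does not know,
    are returned as plain strings.
    """
    convert = CONVERTERS.get(element.get("lextype"), str)
    return convert(element.text or "")


# Sample value assumed for illustration: 17 March 1997 in basic ISO date form.
sold_on = ET.fromstring('<SOLD-ON lextype="DATE-ISO8061">19970317</SOLD-ON>')
sold_on_date = typed_value(sold_on)
```

Because the type travels with the data, a receiver that has never seen a SOLD-ON element can still treat its contents as a date.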

Schemata in XML

A schema is a formal specification of element names that indicates which elements are allowed in a document, and in what combinations. Using a schema, an author can define precisely which element names are permitted in his document; and within each element, which subelements, attributes, and relations are allowed. Authors can invent their own schemata, or they can share ones created by other authors. Readers can check the schema references to verify that the document they have received is the correct type. They can also use the information in the schema to automatically validate the structure of the document.

Microsoft has proposed a schema syntax (XML-Data) for expressing the schema for an XML document directly within XML itself, allowing XML data to describe its own structure. Expressing schemata within XML adds great power to the XML format because it makes it possible for software examining certain data to understand its structure without earlier knowledge about the data or its meaning.
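To make the idea of a schema written in XML concrete, here is a small Python sketch. The schema vocabulary it uses (SCHEMA, ELEMENT, CHILD) is invented for this example and is not the syntax of any actual proposal; it records only which child elements each element may contain.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema vocabulary, itself written in XML.
SCHEMA_XML = """<SCHEMA>
  <ELEMENT name="ORDER"><CHILD name="ITEM"/></ELEMENT>
  <ELEMENT name="ITEM"><CHILD name="PRICE"/><CHILD name="BOOK"/></ELEMENT>
  <ELEMENT name="BOOK"><CHILD name="TITLE"/><CHILD name="AUTHOR"/></ELEMENT>
  <ELEMENT name="PRICE"/>
  <ELEMENT name="TITLE"/>
  <ELEMENT name="AUTHOR"/>
</SCHEMA>"""


def load_schema(schema_xml):
    """Read the schema document into {element name: set of allowed child names}."""
    root = ET.fromstring(schema_xml)
    return {
        decl.get("name"): {child.get("name") for child in decl.findall("CHILD")}
        for decl in root.findall("ELEMENT")
    }


def violations(element, schema):
    """Return a list of structural problems; an empty list means the tree conforms."""
    if element.tag not in schema:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in schema[element.tag]:
            problems.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            problems.extend(violations(child, schema))
    return problems


schema = load_schema(SCHEMA_XML)
good = ET.fromstring(
    "<ORDER><ITEM><PRICE>5.95</PRICE>"
    "<BOOK><TITLE>Number, the Language of Science</TITLE></BOOK></ITEM></ORDER>"
)
bad = ET.fromstring("<ORDER><ITEM><PRICE>1.50</PRICE><COFFEE/></ITEM></ORDER>")
```

The same parser that reads the data also reads the schema, which is the practical benefit of expressing the schema in XML. A real schema language would also cover attributes, ordering, and data types, which this sketch omits.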

XML and HTML Complement Each Other

HTML is about user interface; XML is about data. Dynamic HTML describes display and user interaction; XML describes information. This leads to two natural relations between HTML and XML: First, XML can add information to an HTML document; second, HTML can display information expressed in XML format.

Extending HTML with XML

A common complaint about HTML today is that it isn’t extensible. The set of tags is fixed and fairly display-centered, which makes it difficult to add information such as revision histories or to mark-up displayed text (for example defining it as a legal point, an author’s name, or an appendix). It is not easy to add semantic information to HTML pages. Historically, various programs have attempted to deal with this problem by using non-standard "tricks" such as hiding data inside HTML comments. These are awkward, and do not lend themselves to widespread, standards-based data sharing. HTML is not easily extended for data representation, partly due to its nature as a display language, and partly because it was not designed for open extensibility. To solve this, Microsoft is working with the W3C to define a format for putting XML data inside HTML pages. By extending HTML to allow arbitrary XML data elements, a wide range of applications can use HTML as the primary document or display format, and also use XML embedded within these documents to hold application-specific data.

Displaying XML Data in HTML

An XML document does not (by itself) specify whether or how its information should be displayed. The XML data merely contains the facts, such as who ordered which books at which prices. HTML is an ideal display language for presenting this data to an end-user. For example, an employee of an online bookstore may visit a Web page to find a list of order entries. On the back end, the individual data records are expressed in XML, but they are presented to the employee as an HTML page. In order to construct this Web page, either the Web server or the Web browser will need to convert the XML data records into an HTML presentation, such as a table.

The mechanisms of data binding and style can be used to arrange XML data into a visual presentation and add interactivity. Data binding is an aspect of Dynamic HTML that moves individual items of data from an information source (such as an XML document) into an HTML display, allowing HTML to be used as a template for displaying XML data, much like a "mail merge" in word processing. Style sheets add greater power to this process. A style sheet is a collection of programming rules for how to pull information out of an XML document and transform it into a display format such as HTML. For example, a style sheet could specify that a bookstore order should show the customer's name at the top, as an HTML heading element, followed by a table containing columns for item fields such as title, author, and price. Different styles applied to the same XML document can produce different displays, such as an HTML table, an HTML bulleted list, or a PostScript page. Microsoft is working with the W3C to define the exact style sheet mechanism for transforming XML data into a displayable format.
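The kind of transformation a style sheet performs can be approximated by hand. The Python function below is a sketch only: it hard-codes one presentation (a heading plus a table) rather than reading rules from a style sheet, and the element names and the choice of columns are assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical order document; element names are assumptions.
ORDER_XML = """<ORDER>
  <SOLD-TO>
    <PERSON><LASTNAME>Layman</LASTNAME><FIRSTNAME>Andrew</FIRSTNAME></PERSON>
  </SOLD-TO>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK><TITLE>Number, the Language of Science</TITLE><AUTHOR>Dantzig, Tobias</AUTHOR></BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <BOOK><TITLE>Introduction to Objectivist Epistemology</TITLE><AUTHOR>Rand, Ayn</AUTHOR></BOOK>
  </ITEM>
</ORDER>"""


def order_to_html(order):
    """Render an order as an HTML heading followed by a table of books."""
    person = order.find("./SOLD-TO/PERSON")
    name = f"{person.findtext('FIRSTNAME')} {person.findtext('LASTNAME')}"
    lines = [
        f"<h1>{escape(name)}</h1>",
        "<table>",
        "<tr><th>Title</th><th>Author</th><th>Price</th></tr>",
    ]
    for item in order.findall("ITEM"):
        book = item.find("BOOK")
        if book is None:
            continue  # this presentation lists books only
        cells = [
            book.findtext("TITLE", ""),
            book.findtext("AUTHOR", ""),
            item.findtext("PRICE", ""),
        ]
        # Escape each value so reserved characters display correctly in HTML.
        lines.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


html_page = order_to_html(ET.fromstring(ORDER_XML))
```

A second function producing a bulleted list from the same tree would leave the XML untouched, which is the separation of data from display that the style mechanism aims for.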

Summary

XML is a standard, extensible, universal data format for HTML and the Internet. It is flexible enough to allow representation of an incredibly wide range of information, and it also allows this information to be self-describing, so that structured data expressed in XML may be manipulated by software that doesn't have previous knowledge of the underlying meaning behind the data. XML provides a file format for representing data, a schema for data to include a description of its own structure, and a mechanism for extending and annotating standard HTML with additional semantic information. With its powerful expressiveness and flexibility, XML promises to add structure to data on the Internet, bringing the Web one step closer to realizing the potential for universal communication with anyone, anywhere.

Appendix: Various Technical Details

Programmatic Access to XML -- the Document Object Model

In addition to providing a file format for representing data, XML needs a standard API for programmatic manipulation of data. Microsoft is working with the W3C to define a standard set of properties, methods and events for programmers and script authors to use. This "object model" provides a simple means of reading and writing data to and from an XML tree structure. These methods enable programmers everywhere to treat XML as a universal data type for encapsulating and transferring data. Because the object model for XML matches the Document Object Model for HTML, script writers can easily master XML programming.

Up-to-date object model information is available from the W3C and also from the Microsoft XML Web site.
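As an illustration of reading and writing an XML tree through an object model, the sketch below uses Python's xml.dom.minidom module, which implements the Document Object Model as the W3C later standardized it. That interface may differ in detail from the proposal described here, and the element names are assumptions.

```python
from xml.dom.minidom import parseString

doc = parseString("<ORDER><ITEM><PRICE>5.95</PRICE></ITEM></ORDER>")
order = doc.documentElement

# Reading: collect the text of every PRICE element in the tree.
prices_before = [node.firstChild.data for node in doc.getElementsByTagName("PRICE")]

# Writing: build a second ITEM and attach it to the ORDER element.
item = doc.createElement("ITEM")
price = doc.createElement("PRICE")
price.appendChild(doc.createTextNode("12.95"))
item.appendChild(price)
order.appendChild(item)

# Serialize the modified tree back to XML text.
serialized = order.toxml()
```

The same calls (getElementsByTagName, createElement, appendChild) are available to script authors working with HTML documents, which is why one object model can serve both.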

Character Set and Encoding

All information in XML is Unicode text. This includes the contents of elements and element names themselves. As a result, XML supports representation of all international character sets.

Unicode can be transmitted directly as 16-bit characters, but more commonly is transferred using an encoding that is more convenient or compact for certain languages. XML supports a range of encodings (the default is UTF-8), subject only to the restriction that an entire document must share the same encoding.
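The point that the encoding affects only the bytes on the wire, not the information, can be checked with a short Python sketch. The TITLE element and the sample text are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Element content containing non-ASCII characters (Cyrillic, here).
title = ET.Element("TITLE")
title.text = "Чайковский: Концерт № 1"

# Serialize the same element in two different encodings.
utf8_bytes = ET.tostring(title, encoding="utf-8")
utf16_bytes = ET.tostring(title, encoding="utf-16")

# The parser recovers identical Unicode text from either byte stream.
from_utf8 = ET.fromstring(utf8_bytes).text
from_utf16 = ET.fromstring(utf16_bytes).text
```

The two byte strings differ, yet both parse back to the same text, so a receiver does not need to know in advance which encoding a sender chose.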

White space

Unlike HTML, which ignores white space (spaces, tabs, new lines, and so on), XML is for data and thus retains all white space. For example, the following are not equivalent:

        <TITLE>Tchaikovsky's First Piano Concerto</TITLE>

        <TITLE>Tchaikovsky's
            First
            Piano Concerto</TITLE>
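A parser makes the difference visible. In this Python sketch (the TITLE element name is an assumption), the two forms yield different strings, and an application that wants HTML-like behavior has to collapse the white space itself.

```python
import xml.etree.ElementTree as ET

one_line = ET.fromstring("<TITLE>Tchaikovsky's First Piano Concerto</TITLE>")
multi_line = ET.fromstring(
    "<TITLE>Tchaikovsky's\n    First\n    Piano Concerto</TITLE>"
)

# The parser hands back the line breaks and indentation exactly as written.
texts_differ = one_line.text != multi_line.text

# Collapsing runs of white space is an explicit application-level choice.
normalized = " ".join(multi_line.text.split())
```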

Strictly a Tree

XML elements can contain text and other elements, with the exact rules for a specific document type given in its schema. However, elements must be strictly nested: Each start tag must have a corresponding end tag, and elements cannot overlap partially. The examples shown so far have all been legal. The following is illegal:

      <TITLE>Evolution of Culture <SUB>in Animals
      </TITLE> by John T. Bonner</SUB>
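A conforming parser rejects overlapping elements outright. The Python sketch below, with assumed element names TITLE and SUB, wraps the standard ElementTree parser in a small helper that reports whether a string is well formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# SUB starts inside TITLE but ends outside it: the elements overlap.
overlapping = "<TITLE>Evolution of Culture <SUB>in Animals</TITLE> by John T. Bonner</SUB>"

# SUB starts and ends inside TITLE: the elements are strictly nested.
nested = "<TITLE>Evolution of Culture <SUB>in Animals</SUB> by John T. Bonner</TITLE>"

overlapping_ok = is_well_formed(overlapping)
nested_ok = is_well_formed(nested)
```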

Empty Tags

XML has a shorthand for an empty element: ending a tag with a "/>" signals that the element has no contents and does not have an end tag. For example, the following two lines are equivalent:

     <TITLE/>

     <TITLE></TITLE>
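The equivalence of the two forms can be confirmed with a parser. In this Python sketch the element name TITLE is an assumption; any name would behave the same way.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<TITLE/>")
long_form = ET.fromstring("<TITLE></TITLE>")

# Both parse to an element with the same name, no text, and no children.
same_tree = (
    short_form.tag == long_form.tag
    and short_form.text == long_form.text
    and len(short_form) == len(long_form) == 0
)

# Serializing either one produces identical output.
same_output = ET.tostring(short_form) == ET.tostring(long_form)
```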

Reserved Characters

Several characters are part of the syntactic structure of XML, and will not be interpreted as themselves if simply entered. You need to substitute a special character sequence (called by XML an "entity"). Note that case matters.

<    &lt;
&    &amp;
>    &gt;

For example, "Melons cost < $1 at the A&P" would be encoded as "Melons cost &lt; $1 at the A&amp;P".
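Most XML libraries perform this substitution automatically. As a brief sketch, Python's standard xml.sax.saxutils.escape function encodes the three reserved characters, and a parser reverses the encoding on the way back in. The NOTE element name is an assumption.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Melons cost < $1 at the A&P"

# escape() replaces &, < and > with their entity sequences.
encoded = escape(raw)

# Once encoded, the text can be placed safely inside an element.
note = ET.fromstring(f"<NOTE>{encoded}</NOTE>")
decoded = note.text
```

Here `encoded` is the string "Melons cost &lt; $1 at the A&amp;P", and `decoded` is the original text again.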

Compression

Although simple, robust and extensible, XML is a verbose format compared to binary schemes. Consequently, we expect that HTTP 1.1 compression will improve the efficiency of XML data transfer. Microsoft is working to popularize standard, efficient compression systems for XML.
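Because XML repeats the same tag names many times, general-purpose compression tends to work well on it. The Python sketch below compresses a synthetic order with gzip, one of the content codings defined for HTTP 1.1. The document is generated purely for illustration, and the sketch makes no claim about typical compression ratios.

```python
import gzip

# Build a synthetic, repetitive XML document (hypothetical element names).
items = "".join(
    f"<ITEM><PRICE>{n}.95</PRICE><BOOK><TITLE>Title {n}</TITLE></BOOK></ITEM>"
    for n in range(200)
)
document = f"<ORDER>{items}</ORDER>".encode("utf-8")

# Compress for transfer, then restore on the receiving side.
compressed = gzip.compress(document)
restored = gzip.decompress(compressed)

sizes = (len(document), len(compressed))
```

The round trip is lossless, so compression can be applied at the transport layer without the sender or receiver changing how they read or write XML.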

Security

The high degree of structure in an XML document makes it easier to add digital signatures or encryption, to individual parts of a document as well as a whole document. Microsoft is working with the W3C Digital Signature Initiative to define standard, XML-based security and authentication for XML data.
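The idea of protecting one part of a document independently of the rest can be sketched as follows. Several assumptions apply: a keyed hash (HMAC-SHA256) stands in for a true digital signature scheme, the key and element names are hypothetical, and the canonicalization step relies on ElementTree's canonicalize function, available in Python 3.8 and later. This is not the mechanism under discussion at the W3C.

```python
import hashlib
import hmac
import xml.etree.ElementTree as ET


def sign_element(element, key):
    """Return an HMAC-SHA256 tag computed over the canonical form of one element."""
    serialized = ET.tostring(element, encoding="unicode")
    canonical = ET.canonicalize(serialized)  # normalizes equivalent serializations
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


KEY = b"hypothetical shared secret"
order = ET.fromstring(
    "<ORDER><SOLD-ON>19970317</SOLD-ON><ITEM><PRICE>5.95</PRICE></ITEM></ORDER>"
)
item = order.find("ITEM")
signature = sign_element(item, KEY)

# Editing a different part of the document leaves the item's tag valid.
order.find("SOLD-ON").text = "19970318"
still_valid = hmac.compare_digest(signature, sign_element(item, KEY))

# Tampering with the signed part itself is detected.
item.find("PRICE").text = "0.95"
tamper_detected = not hmac.compare_digest(signature, sign_element(item, KEY))
```

The tree structure is what makes this possible: a subtree has clear boundaries, so it can be serialized and checked on its own.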

Further Information and Sample Code

  • The W3C page introducing the XML activity
  • The current W3C working draft for XML
  • The Microsoft XML Web site. This site provides Microsoft's open proposal for representing data in XML, as well as a sample parser for XML that is written in Java and demonstrates the proposed XML Document Object Model.
  • Robin Cover's XML resources page, a very well documented page with XML resources and tools

© 1997 - 2018 Microsoft Corporation. All rights reserved.

Other product and company names mentioned herein may be the trademarks of their respective owners.