SOA with Document-Centric XML Processing

This article introduces the concept of document-centric XML processing and a set of emerging document-centric capabilities.

It is also worth noting that there is nothing small about this problem. It is, in my view, the biggest and toughest technical issue in enterprise IT today. Consider the ESB (Enterprise Service Bus) example I used in the first part of this series. Right now, the situation is bad beyond belief. Because of inefficient DOM parsing, ESBs are already considered slow in read-only situations, especially for large XML messages. When the desired operations (such as policy enforcement) require both reading and writing, it adds insult to injury: no matter how trivial the change, the entire XML message must be re-serialized, and overall performance quickly degrades to the point of being unbearable.

My question becomes: Am I the only one seeing the elephant in the room?

How VTD-XML Changes the Picture
Simply put, VTD-XML provides a solution so spectacular that the problem is completely gone.

The first part of this article series introduced VTD-XML as a memory-efficient, high-performance XML parser with integrated indexing and XPath. Virtually every technical benefit of VTD-XML is, one way or another, the result of non-extractive parsing, meaning the original XML text is loaded in memory and fully preserved. However, the most important benefits of VTD-XML - the ones that truly set it apart from other XML processing models - lie in its unique ability to manipulate XML document content at the byte level. Below are three distinct, yet related, sets of capabilities available in the latest version of VTD-XML.

  • Incremental XML modifier: You can modify an XML document incrementally through the XMLModifier, which defines three types of "modify" operations: inserting new content at any location (i.e., offset) in the document, deleting content (by specifying the offset and length), and replacing old content with new content, which is effectively a deletion and an insertion at the same location. To compose a new document containing all the changes, call XMLModifier's output(...) method.
  • XML slicer and splicer: You can use a pair of integers (offset and length) to address a segment of XML text, so your application can slice the segment from the original document and move it to another location in the same or a different document. The VTDNav class exposes two methods that let you address an element fragment: getElementFragment(), which returns a 64-bit integer encoding the offset and length of the current element, and getElementFragmentNs() (in the latest version), which returns an ElementFragmentNs object representing a "namespace-compensated" element fragment. The latest version also transparently supports transcoding, so you can cut and paste across documents with different encodings.
  • XML editor: You can directly edit the in-memory copy of the XML text using VTDNav's overWrite(...) method, provided that the original token you're overwriting is wide enough to hold the new byte content.
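
The offset-and-length addressing described in the list above can be illustrated without any XML library at all. What follows is a minimal stand-alone Java sketch, not VTD-XML's implementation: the class FragmentSketch and all of its methods are hypothetical names, the bit layout (length in the upper 32 bits, offset in the lower 32) is an assumption made for illustration, and padding with spaces in overwriteInPlace is just one possible policy. The byte offsets are hard-coded here; in practice a parser would supply them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration only; not part of VTD-XML.
final class FragmentSketch {
    // Assumed layout for this sketch: length in the upper 32 bits, offset in the lower 32.
    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    static int offsetOf(long fragment) { return (int) fragment; }

    static int lengthOf(long fragment) { return (int) (fragment >>> 32); }

    // Slice: copy the addressed bytes out of the document without building any objects from them.
    static byte[] slice(byte[] doc, long fragment) {
        int off = offsetOf(fragment);
        return Arrays.copyOfRange(doc, off, off + lengthOf(fragment));
    }

    // Splice: build a new document with the fragment inserted at the given byte offset.
    static byte[] splice(byte[] target, int at, byte[] fragment) {
        byte[] out = new byte[target.length + fragment.length];
        System.arraycopy(target, 0, out, 0, at);
        System.arraycopy(fragment, 0, out, at, fragment.length);
        System.arraycopy(target, at, out, at + fragment.length, target.length - at);
        return out;
    }

    // Edit in place: succeeds only if the new bytes fit in the original token.
    // Unused width is padded with spaces (an assumption of this sketch).
    static boolean overwriteInPlace(byte[] doc, int offset, int width, byte[] replacement) {
        if (replacement.length > width) {
            return false;
        }
        System.arraycopy(replacement, 0, doc, offset, replacement.length);
        Arrays.fill(doc, offset + replacement.length, offset + width, (byte) ' ');
        return true;
    }

    public static void main(String[] args) {
        byte[] src = "<a><b>hello</b></a>".getBytes(StandardCharsets.UTF_8);
        byte[] dst = "<x></x>".getBytes(StandardCharsets.UTF_8);

        long frag = pack(3, 12); // addresses "<b>hello</b>" inside src
        byte[] moved = splice(dst, 3, slice(src, frag));
        System.out.println(new String(moved, StandardCharsets.UTF_8)); // <x><b>hello</b></x>

        overwriteInPlace(src, 6, 5, "hi".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(src, StandardCharsets.UTF_8));   // <a><b>hi   </b></a>
    }
}
```

Note that the in-place edit never changes the document's length, which is why it only works when the replacement is no wider than the token it overwrites. The sketch also ignores namespace compensation and transcoding, both of which matter when a fragment moves between real documents.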

When you use VTD-XML as an incremental modifier to update a text node, you navigate the VTD records to the right location, stick in the change, and generate a new document - exactly as you would in Notepad. Listing 1 shows a simple application that updates a text node using VTD-XML.
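
To make the "stick in the change, generate a new document" idea concrete, here is a minimal stand-alone Java sketch of an offset-based edit list. It is not VTD-XML's XMLModifier and does not use its API: the class IncrementalEditSketch and its methods are hypothetical, the offsets are hard-coded rather than found by navigation, and the inserted text is assumed to be UTF-8 like the document. The point it tries to show is that the original bytes are never touched and that the output is composed by copying the untouched spans around each edit.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical edit list for illustration only; not VTD-XML's XMLModifier.
final class IncrementalEditSketch {
    private static final class Edit {
        final int deleteLength;
        final byte[] insert;

        Edit(int deleteLength, byte[] insert) {
            this.deleteLength = deleteLength;
            this.insert = insert;
        }
    }

    private final byte[] original;
    // Pending edits sorted by byte offset into the original document.
    private final TreeMap<Integer, Edit> edits = new TreeMap<>();

    IncrementalEditSketch(byte[] original) {
        this.original = original;
    }

    void insert(int offset, String text) { replace(offset, 0, text); }

    void delete(int offset, int length) { replace(offset, length, ""); }

    // A replacement is a deletion and an insertion at the same offset.
    void replace(int offset, int length, String text) {
        if (offset < 0 || length < 0 || offset + length > original.length) {
            throw new IndexOutOfBoundsException("edit outside document");
        }
        Edit edit = new Edit(length, text.getBytes(StandardCharsets.UTF_8));
        if (edits.putIfAbsent(offset, edit) != null) {
            throw new IllegalStateException("offset already edited: " + offset);
        }
    }

    // Compose the new document: copy untouched spans, splice in each edit.
    byte[] output() {
        ByteArrayOutputStream out = new ByteArrayOutputStream(original.length);
        int pos = 0;
        for (Map.Entry<Integer, Edit> e : edits.entrySet()) {
            int offset = e.getKey();
            if (offset < pos) {
                throw new IllegalStateException("overlapping edits at " + offset);
            }
            out.write(original, pos, offset - pos);
            byte[] ins = e.getValue().insert;
            out.write(ins, 0, ins.length);
            pos = offset + e.getValue().deleteLength;
        }
        out.write(original, pos, original.length - pos);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] doc = "<order><qty>10</qty></order>".getBytes(StandardCharsets.UTF_8);
        IncrementalEditSketch modifier = new IncrementalEditSketch(doc);
        modifier.replace(12, 2, "250");   // update the text node "10"
        modifier.insert(20, "<note/>");   // add a sibling element after </qty>
        // prints <order><qty>250</qty><note/></order>
        System.out.println(new String(modifier.output(), StandardCharsets.UTF_8));
    }
}
```

The cost of output() is a handful of bulk copies proportional to the document size, regardless of how the document is structured, and nothing is re-serialized from an object tree.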

XML Processing: Object Oriented vs Document Centric
Traditional XML processing models, such as DOM, SAX, and various object data binding tools, are designed around the notion of objects. The XML text - merely the output of object serialization - is relegated to the status of a second-class citizen. You base your applications on DOM nodes, strings, and various business objects, but rarely on the physical documents. If you have followed my analysis so far, it should be obvious that this object-oriented approach to XML processing makes little sense, as it causes performance hits from virtually all directions (an in-depth discussion of the topic can be found in "the performance woe of binary XML"). Not only are object creation and garbage collection inherently memory- and CPU-intensive, but applications also incur the cost of re-serialization with even the smallest change to the original text.

What is "document-centric" XML processing? In non-extractive parsing, the XML text - the persistent data format - is the starting point from which everything else comes about. Whether you are parsing, evaluating XPath, modifying content, or slicing element fragments, you no longer work with objects by default. You only do that when it makes sense. More often than not, you treat documents purely as syntax, and think in bytes, byte arrays, integers, offsets, lengths, fragments, and namespace-compensated fragments. The first-class citizen in this paradigm is the XML text. The object-centric notions of XML processing, such as serialization and de-serialization (or marshaling and un-marshaling), as shown in Figure 1, are often displaced, if not replaced, by the more document-centric notions of parsing and composition. Increasingly, you will find that your XML programming experience is getting simpler. Not surprisingly, the simpler, more intuitive way to think about XML processing is also the most efficient and powerful (see Table 1 for the technical comparison of DOM and VTD-XML).
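
As a small illustration of "thinking in bytes, offsets, and lengths", the stand-alone Java sketch below reads values straight out of the document's byte array instead of first extracting them into String objects. It is not taken from VTD-XML: TokenViewSketch and its methods are hypothetical, the offsets are hard-coded where a parser would normally supply them, and the code assumes ASCII-only tokens with no entity references and no integer overflow.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of VTD-XML.
final class TokenViewSketch {
    // Compare the bytes at [offset, offset + length) with an expected ASCII value
    // without allocating a String for the token.
    static boolean matches(byte[] doc, int offset, int length, String expected) {
        if (expected.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (doc[offset + i] != (byte) expected.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Parse a non-negative decimal integer directly from the document bytes.
    static int parseInt(byte[] doc, int offset, int length) {
        if (length <= 0) {
            throw new NumberFormatException("empty token");
        }
        int value = 0;
        for (int i = offset; i < offset + length; i++) {
            int digit = doc[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at byte " + i);
            }
            value = value * 10 + digit;
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] doc = "<order><qty>10</qty></order>".getBytes(StandardCharsets.UTF_8);
        System.out.println(matches(doc, 8, 3, "qty")); // true
        System.out.println(parseInt(doc, 12, 2));      // 10
    }
}
```

Here the document itself is the only data structure; a token is just two integers pointing into it, and an object is created only when the application actually needs one.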

More Stories By Jimmy Zhang

Jimmy Zhang is a cofounder of XimpleWare, a provider of high-performance XML processing solutions. He has worked in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and an MS from the department of EECS at U.C. Berkeley.
