Code from an English Coffee Drinker: Savings In Time And Space

In the previous post I talked about a new feature that fixes my main annoyance with the GATE Developer interface. In this post I'm going to address a problem that can affect any use of GATE; the size of GATE XML files.

I'll explain what I mean with this very simple XML document:

<doc>Mark is running a test.</doc>

This document is just 35 bytes in size, and when loaded into GATE contains just one doc annotation in the Original markups set. If we now save this document as GATE XML we end up with this 831 byte file:

<?xml version='1.0' encoding='UTF-8'?>
<GateDocument version="2">
<!-- The document's features-->

<GateDocumentFeatures>
<Feature>
  <Name className="java.lang.String">gate.SourceURL</Name>
  <Value className="java.lang.String">file:/home/mark/Desktop/simple-test.xml</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">MimeType</Name>
  <Value className="java.lang.String">text/xml</Value>
</Feature>
</GateDocumentFeatures>
<!-- The document content area with serialized nodes -->

<TextWithNodes><Node id="0"/>Mark is running a test.<Node id="23"/></TextWithNodes>
<!-- The default annotation set -->

<AnnotationSet>
</AnnotationSet>

<!-- Named annotation set -->

<AnnotationSet Name="Original markups">
<Annotation Id="0" Type="doc" StartNode="0" EndNode="23">
</Annotation>
</AnnotationSet>

</GateDocument>

Clearly we could save a few bytes by removing the comments and line separators but there is nothing much we can do to get the size down anywhere near the size of the original document. Now of course comparing the two documents isn't really fair as they are encoding very different things.

The original document uses inline tags to annotate the document, whereas GATE uses a standoff markup for added flexibility (e.g. inline XML can't handle crossing annotations).
GATE XML explicitly encodes information only implicit in the original document; in this case it's MIME type and filename.

Now clearly this is a slightly contrived example where I've tried to make GATE XML look as bad as possible, but this can quickly become a real problem. While it is true that "disk space is cheap" it only holds when you want reasonable sized disks. If you are looking at processing terabytes of text and wanting to store the output then you really don't want to find that your processed documents are 24 times the size of the original! This is a problem we often run into with even reasonably sized corpora, as it becomes difficult to move the processed document around just due to their size. The solution we usually adopt is to simply zip up a folder of GATE XML documents. While this helps with moving them around you still need to unpack them if you want to reload them into GATE. One solution would be to look at creating a new GATE document format that would require less diskspace but be as flexible. Given the amount of effort that would require I'm ruling it out without even thinking about how I would do it. A better solution would be file level compression that would leave the documents in a format that GATE could still load easily.

My first thought was to simply alter GATE to wrap the input/output streams used for writing GATE XML so that they did gzip compression. This would probably work and would give a reasonable space saving (using gzip to compress the above example brings the file size down to 403 bytes), but it would increase the time taken to load/save documents in GATE due to the compression overhead. Now given that we often want to process thousands or millions of documents time is important so I'd prefer not to introduce code that would slow down large scale processing.

Fortunately I'm not the only person who has ever wanted to compress an XML document and there are a number of libraries out there that claim to be able to efficiently compress XML documents. I settled on looking at Fast Infoset because a) it follows a recognised standard and b) there is an easy to use Java library available.

Fast Infoset is a binary encoding (I was going to say lossless but strictly it isn't as the formatting information of the the original XML file is lost) of XML. I won't go into the technical details other than to say that it supports both SAX and StAX access and claims to be more efficient than equivalent reading of text based XML documents (see the spec or the Wikipedia page for more details). What this means in GATE is that we can continue to use the existing methods for reading/writing GATE XML documents by just swapping out one SAX/StAX parser/writer for another. So let's have a look at a more "real world" example; the GATE home page.

I saved a copy of the GATE home page to my local disk (to remove network effects) and then ran a few experiments, which are summarised in the following table.

File	File Size	Time (ms)
Format	(bytes)	Save	Load
HTML	17,853	N/A	45
XML	70,044	284	140
FI	19,774	190	73

It should be clear from this table that the original HTML file is clearly the smallest representation and the quickest to load, which given the previous discussion shouldn't be surprising. We can also see that the XML encoding sees a huge increase in file size as well as loading time; the XML file is almost 4 times the size of the HTML file. The result of using Fast Infoset, however, is really promising. The file size grew by just 1,921 bytes! Loading time is slower (it takes almost twice as long) but is around twice as fast as loading from XML. The time to save the XML and Fast Infoset versions show little difference which is kind of to be expected (the code is mostly the same so the difference in speed will be down to the number of bytes written, i.e. the speed of my hard drive). It's also worth noting that the speed results were gathered through the GATE GUI and as such are probably not the most rigorous set of data ever collected, although they seemed to be fairly stable when repeated.

From these numbers we would rightly assume that if we just want to store a HTML document efficiently then we should simply store it as HTML; converting it to a GATE document doesn't actually give us any real benefit, and no matter how we store the document on disk we pay a penalty in space and time. Of course storing HTML documents isn't really what GATE is designed for, so let's look at a slightly different scenario; load the same HTML document, run ANNIE over it to produce lots of annotations and then save the processed document. The performance numbers in this scenario are then as follows:

File	File Size	Time (ms)
Format	(bytes)	Save	Load
XML	591,763	191	162
FI	112,919	166	87

In this scenario (where we can't use HTML to save the document) we can see that using Fast Infoset is a lot better than storing the documents as raw XML. Not only can we re-load the document in half the time it would take to load the XML we also make a space saving of 81%! In case you still need convincing that you really do save a lot of disk space using Fast Infoset I'll give a final large scale, real world(ish) example.

As part of the Khresmoi project I'd previously processed 10,284 medical Wikipedia articles. Due to disk space requirements I hadn't kept the processed documents and as the Khresmoi application can take a while to run (it's a lot more complex than ANNIE) I've simply run ANNIE over these documents to generate some annotations (it will certainly produce lots of tokens and sentences even if it doesn't find too many people or organizations in the medical documents). The results are as follows (I haven't shown save time as the GUI doesn't give it when saving a corpus, and the load time represents populating the corpus into a datastore):

File	File Size	Time (s)
Format	(GB)	Load
XML	13.7	1,145.387
FI	2.6	914.741

Now if a space saving of 81% (or 11GB) and a time saving of 4 minutes doesn't convince you then I don't know what will!

So assuming you are convinced that storing GATE documents using Fast Infoset instead of standard XML is the way to go, then you'll want to have a look at the new Format_FastInfoset plugin (you'll need either a recent nightly build or a recent SVN checkout). Documents with a .finf extension will automatically load correctly once the plugin is loaded, if you want to use a different extension then you will have to specify the application/fastinfoset MIME type. You can save documents/corpora using the new "Save as Fast Infoset XML..." menu option (courtesy of the new resource helper feature), which will also appear when you load the plugin.

Code from an English Coffee Drinker

Savings In Time And Space

0 comments:

Post a Comment