I'll explain what I mean with this very simple XML document:
```xml
<doc>Mark is running a test.</doc>
```

This document is just 35 bytes in size, and when loaded into GATE contains just one `doc` annotation in the Original markups set. If we now save this document as GATE XML we end up with this 831 byte file:

```xml
<?xml version='1.0' encoding='UTF-8'?>
<GateDocument version="2">
  <!-- The document's features -->
  <GateDocumentFeatures>
    <Feature>
      <Name className="java.lang.String">gate.SourceURL</Name>
      <Value className="java.lang.String">file:/home/mark/Desktop/simple-test.xml</Value>
    </Feature>
    <Feature>
      <Name className="java.lang.String">MimeType</Name>
      <Value className="java.lang.String">text/xml</Value>
    </Feature>
  </GateDocumentFeatures>
  <!-- The document content area with serialized nodes -->
  <TextWithNodes><Node id="0"/>Mark is running a test.<Node id="23"/></TextWithNodes>
  <!-- The default annotation set -->
  <AnnotationSet>
  </AnnotationSet>
  <!-- Named annotation set -->
  <AnnotationSet Name="Original markups">
    <Annotation Id="0" Type="doc" StartNode="0" EndNode="23">
    </Annotation>
  </AnnotationSet>
</GateDocument>
```

Clearly we could save a few bytes by removing the comments and line separators, but there is nothing much we can do to get the size down anywhere near that of the original document. Of course, comparing the two documents isn't really fair as they are encoding very different things.
- The original document uses inline tags to annotate the document, whereas GATE uses a standoff markup for added flexibility (e.g. inline XML can't handle crossing annotations).
- GATE XML explicitly encodes information that is only implicit in the original document; in this case its MIME type and filename.
My first thought was to simply alter GATE to wrap the input/output streams used for writing GATE XML so that they did gzip compression. This would probably work and would give a reasonable space saving (using gzip to compress the above example brings the file size down to 403 bytes), but it would increase the time taken to load/save documents in GATE due to the compression overhead. Given that we often want to process thousands or millions of documents, time is important, so I'd prefer not to introduce code that would slow down large-scale processing.
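To be concrete, that idea amounts to little more than wrapping the streams in the standard `java.util.zip` classes. Here's a minimal sketch of the general approach (this isn't actual GATE code, and the file name is made up):

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipXmlExample {
  public static void main(String[] args) throws IOException {
    // Stand-in for the real GATE XML serialization of a document
    String gateXml = "<GateDocument version=\"2\">...</GateDocument>";
    File file = new File("simple-test.xml.gz"); // hypothetical file name

    // Writing: wrap the file stream in a GZIPOutputStream so the XML is
    // compressed on the way out
    try (Writer out = new OutputStreamWriter(
        new GZIPOutputStream(new FileOutputStream(file)), "UTF-8")) {
      out.write(gateXml);
    }

    // Reading: wrap the file stream in a GZIPInputStream to decompress
    // transparently on the way back in
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream(file)), "UTF-8"))) {
      System.out.println(in.readLine());
    }
  }
}
```

The space saving comes for free, but every byte still passes through the deflate algorithm on both save and load, which is exactly the overhead I wanted to avoid.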
Fortunately I'm not the only person who has ever wanted to compress an XML document and there are a number of libraries out there that claim to be able to efficiently compress XML documents. I settled on looking at Fast Infoset because a) it follows a recognised standard and b) there is an easy to use Java library available.
Fast Infoset is a binary encoding of XML (I was going to say lossless, but strictly it isn't, as the formatting of the original XML file is lost). I won't go into the technical details other than to say that it supports both SAX and StAX access and claims to be more efficient than equivalent reading of text-based XML documents (see the spec or the Wikipedia page for more details). What this means in GATE is that we can continue to use the existing methods for reading/writing GATE XML documents by just swapping out one SAX/StAX parser/writer for another. So let's have a look at a more "real world" example; the GATE home page.
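To give a feel for what "swapping out the parser/writer" means in practice, here's a minimal sketch using the open-source Fast Infoset Java library (the `com.sun.xml.fastinfoset.stax` classes). It illustrates the general idea rather than the actual GATE code, and the file name is made up; the key point is that the serializer and parser are drop-in implementations of the standard `XMLStreamWriter` and `XMLStreamReader` interfaces:

```java
import java.io.*;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;
import com.sun.xml.fastinfoset.stax.StAXDocumentParser;
import com.sun.xml.fastinfoset.stax.StAXDocumentSerializer;

public class FastInfosetStaxExample {
  public static void main(String[] args) throws Exception {
    // Writing: StAXDocumentSerializer emits Fast Infoset instead of textual
    // XML, but is driven through the normal XMLStreamWriter interface
    try (OutputStream out = new FileOutputStream("simple-test.finf")) { // hypothetical file name
      XMLStreamWriter writer = new StAXDocumentSerializer(out);
      writer.writeStartDocument();
      writer.writeStartElement("doc");
      writer.writeCharacters("Mark is running a test.");
      writer.writeEndElement();
      writer.writeEndDocument();
      writer.close();
    }

    // Reading: StAXDocumentParser exposes the binary stream through the
    // normal XMLStreamReader interface
    try (InputStream in = new FileInputStream("simple-test.finf")) {
      XMLStreamReader reader = new StAXDocumentParser(in);
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamReader.CHARACTERS) {
          System.out.println(reader.getText());
        }
      }
      reader.close();
    }
  }
}
```

Because the code that walks the document structure never sees anything but the standard StAX interfaces, the existing GATE XML reading/writing logic doesn't need to change.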
I saved a copy of the GATE home page to my local disk (to remove network effects) and then ran a few experiments, which are summarised in the following table.
| Format | File Size (bytes) | Save Time (ms) | Load Time (ms) |
| --- | --- | --- | --- |
| HTML | 17,853 | N/A | 45 |
| XML | 70,044 | 284 | 140 |
| FI | 19,774 | 190 | 73 |
It should be clear from this table that the original HTML file is both the smallest representation and the quickest to load, which, given the previous discussion, shouldn't be surprising. We can also see that the XML encoding incurs a huge increase in file size as well as loading time; the XML file is almost 4 times the size of the HTML file. The result of using Fast Infoset, however, is really promising. The file size grew by just 1,921 bytes! Loading is slower than loading the original HTML (it takes almost twice as long) but around twice as fast as loading from XML. The save times for the XML and Fast Infoset versions show little difference, which is to be expected (the code is mostly the same, so the difference in speed comes down to the number of bytes written, i.e. the speed of my hard drive). It's also worth noting that the timings were gathered through the GATE GUI and as such are probably not the most rigorous set of data ever collected, although they seemed fairly stable when repeated.
From these numbers we would rightly assume that if we just want to store an HTML document efficiently then we should simply store it as HTML; converting it to a GATE document doesn't actually give us any real benefit, and no matter how we store the document on disk we pay a penalty in space and time. Of course, storing HTML documents isn't really what GATE is designed for, so let's look at a slightly different scenario: load the same HTML document, run ANNIE over it to produce lots of annotations and then save the processed document (a code sketch of this scenario follows the table). The performance numbers in this scenario are as follows:
| Format | File Size (bytes) | Save Time (ms) | Load Time (ms) |
| --- | --- | --- | --- |
| XML | 591,763 | 191 | 162 |
| FI | 112,919 | 166 | 87 |
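For completeness, here's roughly what the load, annotate with ANNIE, and save scenario looks like in GATE Embedded code rather than the GUI. This is a minimal sketch based on the standard embedded-ANNIE recipe, with made-up file paths and the timing code omitted:

```java
import java.io.File;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import gate.Corpus;
import gate.CorpusController;
import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.util.persistence.PersistenceManager;

public class AnnieThenSaveExample {
  public static void main(String[] args) throws Exception {
    Gate.init();

    // Load the saved ANNIE application shipped with GATE
    File annieHome = new File(Gate.getPluginsHome(), "ANNIE");
    Gate.getCreoleRegister().registerDirectories(annieHome.toURI().toURL());
    CorpusController annie = (CorpusController) PersistenceManager
        .loadObjectFromFile(new File(annieHome, "ANNIE_with_defaults.gapp"));

    // Load the locally saved copy of the GATE home page (path is made up)
    Document doc = Factory.newDocument(new URL("file:/home/mark/gate-home-page.html"));
    Corpus corpus = Factory.newCorpus("test");
    corpus.add(doc);

    // Run ANNIE over the document to produce lots of annotations
    annie.setCorpus(corpus);
    annie.execute();

    // Save the processed document as standard GATE XML
    Files.write(Paths.get("/home/mark/gate-home-page.xml"),
        doc.toXml().getBytes(StandardCharsets.UTF_8));
  }
}
```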
In this scenario (where we can't use HTML to save the document) we can see that using Fast Infoset is a lot better than storing the documents as raw XML. Not only can we re-load the document in around half the time it takes to load the XML, but we also make a space saving of 81%! In case you still need convincing that you really do save a lot of disk space using Fast Infoset, I'll give a final large-scale, real-world(ish) example.
As part of the Khresmoi project I'd previously processed 10,284 medical Wikipedia articles. Due to disk space requirements I hadn't kept the processed documents, and as the Khresmoi application can take a while to run (it's a lot more complex than ANNIE) I've simply run ANNIE over these documents to generate some annotations (it will certainly produce lots of tokens and sentences, even if it doesn't find too many people or organizations in the medical documents). The results are as follows (I haven't shown the save time as the GUI doesn't report it when saving a corpus, and the load time represents populating the corpus into a datastore):
| Format | File Size (GB) | Load Time (s) |
| --- | --- | --- |
| XML | 13.7 | 1,145.387 |
| FI | 2.6 | 914.741 |
Now if a space saving of 81% (or 11GB) and a time saving of 4 minutes doesn't convince you then I don't know what will!
So, assuming you are convinced that storing GATE documents using Fast Infoset instead of standard XML is the way to go, you'll want to have a look at the new Format_FastInfoset plugin (you'll need either a recent nightly build or a recent SVN checkout). Documents with a `.finf` extension will automatically load correctly once the plugin is loaded; if you want to use a different extension then you will have to specify the `application/fastinfoset` MIME type. You can save documents/corpora using the new "Save as Fast Infoset XML..." menu option (courtesy of the new resource helper feature), which also appears once you load the plugin.
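If you are working with GATE Embedded rather than the GUI, loading Fast Infoset documents looks roughly like this. This is a minimal sketch: it assumes the plugin can be loaded by registering its directory under the GATE plugins home, and the document URLs are made up:

```java
import java.io.File;
import java.net.URL;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;

public class LoadFastInfosetExample {
  public static void main(String[] args) throws Exception {
    Gate.init();

    // Register the Format_FastInfoset plugin so the .finf extension and the
    // application/fastinfoset MIME type are known to GATE
    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "Format_FastInfoset").toURI().toURL());

    // A document with a .finf extension is recognised automatically
    Document doc1 = Factory.newDocument(new URL("file:/home/mark/gate-home-page.finf"));

    // For any other extension, pass the MIME type explicitly
    FeatureMap params = Factory.newFeatureMap();
    params.put(Document.DOCUMENT_URL_PARAMETER_NAME,
        new URL("file:/home/mark/gate-home-page.dat"));
    params.put(Document.DOCUMENT_MIME_TYPE_PARAMETER_NAME, "application/fastinfoset");
    Document doc2 = (Document) Factory.createResource("gate.corpora.DocumentImpl", params);
  }
}
```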