Savings In Time And Space

In the previous post I talked about a new feature that fixes my main annoyance with the GATE Developer interface. In this post I'm going to address a problem that can affect any use of GATE; the size of GATE XML files.

I'll explain what I mean with this very simple XML document:
<doc>Mark is running a test.</doc>
This document is just 35 bytes in size, and when loaded into GATE contains just one doc annotation in the Original markups set. If we now save this document as GATE XML we end up with this 831 byte file:
<?xml version='1.0' encoding='UTF-8'?>
<GateDocument version="2">
<!-- The document's features-->

<GateDocumentFeatures>
<Feature>
  <Name className="java.lang.String">gate.SourceURL</Name>
  <Value className="java.lang.String">file:/home/mark/Desktop/simple-test.xml</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">MimeType</Name>
  <Value className="java.lang.String">text/xml</Value>
</Feature>
</GateDocumentFeatures>
<!-- The document content area with serialized nodes -->

<TextWithNodes><Node id="0"/>Mark is running a test.<Node id="23"/></TextWithNodes>
<!-- The default annotation set -->

<AnnotationSet>
</AnnotationSet>

<!-- Named annotation set -->

<AnnotationSet Name="Original markups">
<Annotation Id="0" Type="doc" StartNode="0" EndNode="23">
</Annotation>
</AnnotationSet>

</GateDocument>
Clearly we could save a few bytes by removing the comments and line separators but there is nothing much we can do to get the size down anywhere near the size of the original document. Now of course comparing the two documents isn't really fair as they are encoding very different things.
  • The original document uses inline tags to annotate the document, whereas GATE uses a standoff markup for added flexibility (e.g. inline XML can't handle crossing annotations).
  • GATE XML explicitly encodes information only implicit in the original document; in this case it's MIME type and filename.
Now clearly this is a slightly contrived example where I've tried to make GATE XML look as bad as possible, but this can quickly become a real problem. While it is true that "disk space is cheap" it only holds when you want reasonable sized disks. If you are looking at processing terabytes of text and wanting to store the output then you really don't want to find that your processed documents are 24 times the size of the original! This is a problem we often run into with even reasonably sized corpora, as it becomes difficult to move the processed document around just due to their size. The solution we usually adopt is to simply zip up a folder of GATE XML documents. While this helps with moving them around you still need to unpack them if you want to reload them into GATE. One solution would be to look at creating a new GATE document format that would require less diskspace but be as flexible. Given the amount of effort that would require I'm ruling it out without even thinking about how I would do it. A better solution would be file level compression that would leave the documents in a format that GATE could still load easily.

My first thought was to simply alter GATE to wrap the input/output streams used for writing GATE XML so that they did gzip compression. This would probably work and would give a reasonable space saving (using gzip to compress the above example brings the file size down to 403 bytes), but it would increase the time taken to load/save documents in GATE due to the compression overhead. Now given that we often want to process thousands or millions of documents time is important so I'd prefer not to introduce code that would slow down large scale processing.

Fortunately I'm not the only person who has ever wanted to compress an XML document and there are a number of libraries out there that claim to be able to efficiently compress XML documents. I settled on looking at Fast Infoset because a) it follows a recognised standard and b) there is an easy to use Java library available.

Fast Infoset is a binary encoding (I was going to say lossless but strictly it isn't as the formatting information of the the original XML file is lost) of XML. I won't go into the technical details other than to say that it supports both SAX and StAX access and claims to be more efficient than equivalent reading of text based XML documents (see the spec or the Wikipedia page for more details). What this means in GATE is that we can continue to use the existing methods for reading/writing GATE XML documents by just swapping out one SAX/StAX parser/writer for another. So let's have a look at a more "real world" example; the GATE home page.

I saved a copy of the GATE home page to my local disk (to remove network effects) and then ran a few experiments, which are summarised in the following table.

FileFile SizeTime (ms)
Format(bytes)SaveLoad
HTML17,853N/A45
XML70,044284140
FI19,77419073

It should be clear from this table that the original HTML file is clearly the smallest representation and the quickest to load, which given the previous discussion shouldn't be surprising. We can also see that the XML encoding sees a huge increase in file size as well as loading time; the XML file is almost 4 times the size of the HTML file. The result of using Fast Infoset, however, is really promising. The file size grew by just 1,921 bytes! Loading time is slower (it takes almost twice as long) but is around twice as fast as loading from XML. The time to save the XML and Fast Infoset versions show little difference which is kind of to be expected (the code is mostly the same so the difference in speed will be down to the number of bytes written, i.e. the speed of my hard drive). It's also worth noting that the speed results were gathered through the GATE GUI and as such are probably not the most rigorous set of data ever collected, although they seemed to be fairly stable when repeated.

From these numbers we would rightly assume that if we just want to store a HTML document efficiently then we should simply store it as HTML; converting it to a GATE document doesn't actually give us any real benefit, and no matter how we store the document on disk we pay a penalty in space and time. Of course storing HTML documents isn't really what GATE is designed for, so let's look at a slightly different scenario; load the same HTML document, run ANNIE over it to produce lots of annotations and then save the processed document. The performance numbers in this scenario are then as follows:

FileFile SizeTime (ms)
Format(bytes)SaveLoad
XML591,763191162
FI112,91916687

In this scenario (where we can't use HTML to save the document) we can see that using Fast Infoset is a lot better than storing the documents as raw XML. Not only can we re-load the document in half the time it would take to load the XML we also make a space saving of 81%! In case you still need convincing that you really do save a lot of disk space using Fast Infoset I'll give a final large scale, real world(ish) example.

As part of the Khresmoi project I'd previously processed 10,284 medical Wikipedia articles. Due to disk space requirements I hadn't kept the processed documents and as the Khresmoi application can take a while to run (it's a lot more complex than ANNIE) I've simply run ANNIE over these documents to generate some annotations (it will certainly produce lots of tokens and sentences even if it doesn't find too many people or organizations in the medical documents). The results are as follows (I haven't shown save time as the GUI doesn't give it when saving a corpus, and the load time represents populating the corpus into a datastore):

FileFile SizeTime (s)
Format(GB)Load
XML13.71,145.387
FI2.6914.741

Now if a space saving of 81% (or 11GB) and a time saving of 4 minutes doesn't convince you then I don't know what will!

So assuming you are convinced that storing GATE documents using Fast Infoset instead of standard XML is the way to go, then you'll want to have a look at the new Format_FastInfoset plugin (you'll need either a recent nightly build or a recent SVN checkout). Documents with a .finf extension will automatically load correctly once the plugin is loaded, if you want to use a different extension then you will have to specify the application/fastinfoset MIME type. You can save documents/corpora using the new "Save as Fast Infoset XML..." menu option (courtesy of the new resource helper feature), which will also appear when you load the plugin.

Resource Helpers

For a long time there has been one issue with the GATE Developer interface that has really annoyed me; the inability to add new items to the right click menu of a resource (that you didn't develop) without creating a new visual resource and opening that viewer. This is the reason that you can't run an application in GATE without having the application editor open (the editor adds the run option to the menu). I know from a few conversations that this has annoyed other people in the past as well.

Last year I added support for opening MediaWiki documents to GATE. The MediaWiki markup is used on a lot of wikis, including Wikipedia, so having support for it in GATE is really useful as we often end up processing some or all of Wikipedia. There are two main ways you might encounter MediaWiki markup. Firstly you might just have a piece of text with the markup embedded in it (e.g. the text copied out of the page editor on Wikipedia), but you are more likely to have an XML file containing the content of multiple articles in a single file. I added support for both options to GATE, but the support for multiple articles in a single XML file doesn't work very well; you can get the article content, but you lose the title and any other associated metadata. What I really wanted was to add a new option to all corpus resources that would allow you to properly populate the corpus from a MediaWiki XML dump file, but of course I couldn't (at least not without creating a pointless visual resource, and then the option would only be visible if the corpus viewer was open).

A few days ago I was thinking about this problem again when I had a burst of inspiration and realised that there was a fairly easy way to add support, for what I'm calling Resource Helpers, to GATE. Essentially I'm re-using the idea of a GATE resource being a tool (usually used to add items to the Tools menu) to allow any resource to provide items for the right-click menu of any other resource. This did require some minor changes to the core GATE code, so if you want to try this yourself you will need to use a nightly build (or a recent SVN checkout). There will eventually be some details in the userguide, but in essence all you need to know can be gleaned from the following simple example:

package gate.test;

import gate.Document;
import gate.Resource;
import gate.creole.metadata.AutoInstance;
import gate.creole.metadata.CreoleResource;
import gate.gui.MainFrame;
import gate.gui.NameBearerHandle;
import gate.gui.ResourceHelper;

import java.awt.event.ActionEvent;
import java.util.ArrayList;
import java.util.List;

import javax.swing.AbstractAction;
import javax.swing.Action;
import javax.swing.JOptionPane;

@CreoleResource(name = "Resource Helper Test Tool",
  tool = true,
  autoinstances = @AutoInstance)
public class HelperTest extends ResourceHelper {

  @Override
  protected List<Action> buildActions(final NameBearerHandle handle) {
    // a list to hold any actions we want to add
    List<Action> actions = new ArrayList<Action>();

    // if the resource isn't a document then we are finished
    if(!(handle.getTarget() instanceof Document)) return actions;

    // create a new action that will add a menu item labelled "Helper Test"
    // which will show a simple message dialog box
    actions.add(new AbstractAction("Helper Test") {
      @Override
      public void actionPerformed(ActionEvent arg0) {
        JOptionPane.showMessageDialog(MainFrame.getInstance(),
          "Testing the resource helper for " +
            ((Document)handle.getTarget()).getName());
      }
    });

    // return the list of actions
    return actions;
  }
}

If you already understand how tool support in GATE works, then this should be easy to follow, but basically we are creating a new GATE resource that is marked as a tool, and a single instance of which is automatically created. The resource extends ResourceHelper which forces us to implement the single buildActions method. The example implementation of buildActions shown here, simply checks to see if we are being asked to help a Document instance, and if so adds a single new menu item, labelled "Helper Test", which when clicked pops up a dialog box showing the name of the document.

The one thing to note is that the buildActions method is only ever called once per resource as the return value is cached for performance reasons. If you want to have a truly dynamic menu then you will need to also look at overriding the the getActions method of ResourceHelper, but that is beyond the scope of this post.

Having added this new functionality, as you will have already guessed, I've now added a new "Populate from MediaWiki XML Dump" option to every corpus instance, you just need to load the MediaWiki document format plugin for it to appear.