Gordon's Alive!

A few years ago I wrote a small servlet to allow QuickTime movies to be converted into Flash video on the fly, specifically to support playback on the Wii -- I gave it the rather unimaginative name Quick As A Flash. I've had little need to update the code until recently when I upgraded my main PC from Windows to Ubuntu.

I have a web app that I wrote and use to index/search all the DVDs I own. It interfaces with Amazon to get artwork and reviews and allows for linking trailers to each film. I had been using Apple's QuickTime for Java to get the dimensions and duration of the QuickTime trailers I was adding to the index. Unfortunately this a) has never been available under Linux and b) has been deprecated by Apple. So I decided to revisit Quick As A Flash and add support for extracting this information to the servlet.

Quick As A Flash uses FFmpeg to do the transcoding to Flash and it is trivial to read the dimensions and duration of the movie from the FFmpeg output. In a simple case of I-could-so-I-did, I've also added support for generating a thumbnail image from the QuickTime movie.

I've no idea if anyone else is using this code or would ever find it useful but if you are interested then you can grab a binary copy or track progress on the Jenkins build page.

About The Size Of It

After quite a lot of work I've now managed to bring some semblance of order (and documentation) to the last of the GATE plugins that I've been trying to clean up for general release. So as of the most recent nightly build of GATE there is now a Measurements Tagger which you can load from the Tagger_Measurements plugin. I'm not going to attempt to give a full description of the PR here, so if you want the full details have a look at the user guide where there are three whole pages you can read.

In essence the PR annotates measurements appearing in text and normalizes the extracted information to allow for the easy comparison of measurements defined using different units. Now while that description is accurate it probably doesn't make much sense so here are a few examples.

Imagine that you wanted to find all distance measurements less than 3 metres appearing in a document. The Measurements Tagger makes this really simple. You could annotate your documents and then look at the unit and value features of all the Measurement annotations to find those where the unit is "metre" and the value is less than 3, but this would miss lots of valid measurements. For example, 3cm is less than three metres but uses a prefix to make writing the measurement easier. Or how about 4.5 inches? This is clearly less than 3 metres but is specified in an entirely different system of units. Fortunately, as well as annotating measurements with the unit and value specified in the document, this new PR also normalizes (where possible) the measurement to its base form.

The base form of a unit usually consists solely of SI units. This means, for example, that all lengths are normalized to metres, times to seconds, and speeds to metres per second (which is classed as a derived unit but is made up only of SI units).

In our example this means that 3cm is normalized to 0.03m and 4.5 inches to 0.1143m, which allows both to be recognized as being less than 3 metres. Under the hood the PR uses a modified version of the Java port of the GNU Units package to recognize and normalize the measurements. This approach makes it easy to add new units or to customize the parser for a specific domain, providing a very flexible solution.
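
The arithmetic behind the normalization is simple scaling by a per-unit conversion factor. Here's a toy sketch of that idea (this is not the GNU Units code the PR actually uses, and the conversion table is deliberately tiny):

```java
import java.util.Map;

public class NormalizeSketch {

    // factor that converts one of the named units into metres
    private static final Map<String, Double> TO_METRES = Map.of(
            "m", 1.0,
            "cm", 0.01,      // SI prefix centi-
            "mm", 0.001,     // SI prefix milli-
            "inch", 0.0254,  // defined as exactly 2.54 cm
            "mile", 1609.344);

    /** Returns the value of the measurement expressed in the base unit, metres. */
    public static double toMetres(double value, String unit) {
        Double factor = TO_METRES.get(unit);
        if (factor == null) {
            throw new IllegalArgumentException("unknown unit: " + unit);
        }
        return value * factor;
    }

    public static void main(String[] args) {
        // once normalized, both can be compared against the same 3 metre limit
        System.out.println(toMetres(3, "cm") < 3 && toMetres(4.5, "inch") < 3); // true
    }
}
```

With both measurements reduced to metres, the "less than 3 metres" query becomes a single numeric comparison on the normalized value feature.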

The PR doesn't actually contain code for recognizing the value of a measurement, rather it relies on the annotations produced by the Numbers Tagger I cleaned up and released back in February. This means that this new PR can also recognize numbers written in many different ways allowing for measurements such as "forty-five miles per hour", "three thousand nanometres" and "2 1/2 pints".
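
As a toy illustration of why recognizing values is a job in itself, here's a sketch that turns a mixed fraction like the one in "2 1/2 pints" into a number (this is nothing like the Numbers Tagger's implementation, just the flavour of the problem):

```java
public class MixedFraction {

    /** Parses strings like "2 1/2", "1/2" or "3" into a double (toy sketch). */
    public static double parse(String s) {
        double total = 0;
        for (String part : s.trim().split("\\s+")) {
            if (part.contains("/")) {
                // a fractional component such as "1/2"
                String[] f = part.split("/");
                total += Double.parseDouble(f[0]) / Double.parseDouble(f[1]);
            } else {
                // a whole-number component such as "2"
                total += Double.parseDouble(part);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(parse("2 1/2")); // 2.5
    }
}
```

The real tagger of course also has to cope with numbers written out as words, which is where "forty-five" and "three thousand" come in.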

Both the Numbers and Measurement taggers were originally developed for annotating a large corpus of patent documents. Once annotated, the corpus could then be searched via another GATE technology called Mímir. Mímir is a multi-paradigm IR system which allows searching over text, annotations, and knowledge base data. There are a couple of demo indexes (including a subset of the patent corpus) that you can try, and this video does a good job of explaining how the measurement annotations can be really useful.


If you find the whole topic of measurements interesting then I'd recommend reading "About The Size Of It" by Warwick Cairns. It's only a short book but it explains why we use the measurements we do and how they have evolved over time. I found it interesting, but then I quite like reading non-fiction.

Hopefully the new measurement PR will turn out to be really useful for a lot of people/projects. If you benefit from using GATE in general, or these new PRs in particular, then why not consider making a donation to help support future development?

Hudson Becomes Jenkins

I've upgraded the Hudson instance I use to compile most of my software to the newest version which, after a dispute with Oracle, is now called Jenkins. As well as upgrading the software I've changed the URL to match. I'm using J2EP in order to rewrite the old URLs to their new forms so hopefully all existing links will continue to work as before, but if you spot anything that doesn't seem to work properly please leave a comment so I can fix things.
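
The rewriting itself is just pattern substitution from the old URL shape to the new one. Here's a toy illustration in plain Java of the kind of mapping involved (the paths are hypothetical and this is not my actual J2EP configuration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlRewrite {

    // hypothetical old URL shape -- anything under /hudson/ moves to /jenkins/
    private static final Pattern OLD = Pattern.compile("^/hudson/(.*)$");

    /** Rewrites an old-style path to its new form, or returns it unchanged. */
    public static String rewrite(String path) {
        Matcher m = OLD.matcher(path);
        return m.matches() ? "/jenkins/" + m.group(1) : path;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/hudson/job/dateparser/")); // /jenkins/job/dateparser/
        System.out.println(rewrite("/other/page"));             // /other/page
    }
}
```

A proxy like J2EP applies this sort of rule before the request reaches the servlet container, which is why old bookmarks keep working without any client-side changes.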

When Was Yesterday?

Today sees the release of another of the GATE plugins I've been working on cleaning up over the last few months. Unlike the other plugins I've talked about recently this one has a much longer history as I wrote the core code back when I was a PhD student.

Many information extraction (IE) tasks benefit from or require the extraction of accurate date information. While ANNIE (the IE system that comes with GATE) does produce Date annotations, no attempt is made to normalize these dates, i.e. to firmly fix all dates, even partial or relative ones, to a timeline using a common date representation. My PhD focused on open-domain question answering, an IE task in which dates can play an important role; any "when" question, and many questions starting "who is...", benefit from accurate date information. The problem was that I couldn't find a good Java library for parsing dates into a common format, so of course I set about writing one.

The library I wrote is unimaginatively called Date Parser and has been freely available since around 2005. You can currently find the parser being built by my Hudson server. Without going into too many technical details (the Javadoc is available for those who like that kind of thing), the parser takes a string and attempts to parse it as a date starting from a given offset. Unlike the built-in DateFormat class, which is limited to parsing one date format at a time, my parser attempts to handle as many date formats as possible. Of course there are only so many ways you can re-arrange three pieces of information, but the parser also handles relative dates and dates which are not fully specified. For example, "April 2011" would be parsed into a Date object representing the 1st of April 2011. Possibly more interesting, though, is the fact that words/phrases such as yesterday, today, next Wednesday, and 3 days ago are all also parsed and recognized. In these instances the actual date being mentioned is calculated based upon a context date supplied to the parser. So if the word yesterday appears in the context of the 3rd of March 2011, the string will be recognized as referring to the 2nd of March 2011.
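
To give a feel for the context-based resolution, here's a minimal sketch of the idea using the modern java.time API (this is not the Date Parser's actual API, which predates java.time; it just illustrates how a context date anchors relative and partial expressions):

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class RelativeDates {

    private static final DateTimeFormatter MONTH_YEAR =
            DateTimeFormatter.ofPattern("MMMM yyyy", Locale.ENGLISH);

    /** Resolves a handful of date expressions against a context date (toy sketch). */
    public static LocalDate resolve(String text, LocalDate context) {
        switch (text) {
            case "today":     return context;
            case "yesterday": return context.minusDays(1);
            case "tomorrow":  return context.plusDays(1);
            default:
                // a partially specified date defaults to the start of the period,
                // e.g. "April 2011" becomes the 1st of April 2011
                return YearMonth.parse(text, MONTH_YEAR).atDay(1);
        }
    }

    public static void main(String[] args) {
        LocalDate context = LocalDate.of(2011, 3, 3);
        System.out.println(resolve("yesterday", context));  // 2011-03-02
        System.out.println(resolve("April 2011", context)); // 2011-04-01
    }
}
```

The real parser handles far more expressions ("next Wednesday", "3 days ago", dozens of explicit formats), but the principle is the same: without a context date, relative expressions simply cannot be pinned to the timeline.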

The parser worked really well during my PhD work and has seen numerous improvements since then as well. It started to be used in GATE projects a year or so ago and was initially used in conjunction with ANNIE. ANNIE adds Date annotations to documents, and I wrote a JAPE grammar that would find these annotations and then run the parser over the underlying text, adding the normalized date value (if found) as a new feature. The code eventually moved to being a PR (rather than JAPE) for performance reasons and to support some new features. The problem, however, was that the dates the parser could handle and the dates that ANNIE finds don't always align. This meant that adding a new date format required changes to both ANNIE and the Date Parser. So when I started to clean up the code for release I made the decision to re-write the PR as a standalone component that no longer relies on ANNIE.

Surprisingly, it was very easy to convert the existing code to remove the reliance on ANNIE, and I think the performance (both speed and accuracy) has been improved as a result. This isn't to say that ANNIE is bad at finding dates, just that it does some things differently and it also annotates times with Date annotations, which for this task can confuse the issue.

Full documentation is available in the user guide and the PR is already available in the nightly builds of GATE (you need to load the Tagger_DateNormalizer plugin) so feel free to have a play and let me know what you think.

More Ice In Your Tea?

I really shouldn't blog when I'm angry or annoyed as I tend to rant a little more than I intend! In retrospect I was a little harsh in my last post -- anyone who freely gives their time to developing free software shouldn't have to put up with me disparaging their work.

So as penance I've now tracked down the source of the weird class loading bug I highlighted and have submitted a detailed bug report, including a proposed fix, to the IcedTea netx project (netx is the name of the open-source Web Start replacement). You can monitor the progress of the bug through their public bug tracker. It's such a simple fix that I'd happily commit it myself if I had the right permissions, but you have to earn the respect of project maintainers before getting the right to commit code.

Update, 23rd February: it's now been fixed in the main code tree although it will take a while before it makes it into an Ubuntu update.

Why You Shouldn't Drink The IcedTea

I'm all for supporting open-source software but there are limits. I've recently switched to using Ubuntu on my main machine at home and have run into two bugs in the same piece of open-source software.

If you are a regular reader of this blog then you are probably aware that I do most of my software development using Java. A default install of Ubuntu (10.10) includes the OpenJDK-based IcedTea version of Java 6. This is a version of Java covered by an open-source licence, unlike the Sun/Oracle version, whose source you could read but which was not covered by an open-source licence (it's now "mostly" covered by GPL v2 with the classpath exception). I've never really understood the philosophical argument behind IcedTea and the need for a clean room implementation of Java, although Oracle's recent attack on Android provides some explanation. Anyway, given that it was the default installation of Java I was willing to give it a try. Within minutes, though, I'd found two show-stopping bugs and so have switched back to using the reliable Sun/Oracle release of Java 6.

The first bug is visual and one that I knew existed in earlier versions of IcedTea but which I hoped had been fixed by now. In essence the ImageIO JPEG reader in IcedTea doesn't properly handle JPEG images with embedded colour profiles. What you end up with is an image that looks like a photographic negative rather than the image you tried to load. This bug basically means that you can't use IcedTea for any application that allows users to load arbitrary JPEG files. For me this means I can't recommend it for running Convert4Frame, TagME, PhotoGrid or 3DAssembler. Also I can't use IcedTea to run the tomcat server in which I host my cookbook web app. What is really annoying about this bug is that it was originally in the main Sun/Oracle distribution, reported all the way back in 2003, but was fixed in Java 5 update 4; you can read all about it in the bug report. If the open-source version can't fix a bug that is around eight years old then why do they even bother!

The second bug is a little stranger but no less annoying. The documentation for the method ClassLoader.loadClass(String name) states that it either returns the resulting Class object or throws a ClassNotFoundException if (wait for it) the class was not found. That all seems perfectly logical to me. Unfortunately there appears to be at least one situation in which IcedTea returns null instead of throwing an exception when the class cannot be found.

I distribute a lot of the open-source Java software that I develop in my spare time via Web Start and once I had Ubuntu up and running I thought I'd check Java by launching 3DAssembler. Unfortunately it failed to load and gave me a rather strange NullPointerException. After a bit of digging around (the version of the app on my website doesn't match my development version and hence the line numbers were out) I eventually tracked the problem back to this try/catch block.
try {
  Class<?> rmClass = Assemble3D.class.getClassLoader().loadClass("org.jdesktop.swinghelper.debug.CheckThreadViolationRepaintManager");
  RepaintManager.setCurrentManager((RepaintManager)rmClass.getConstructor().newInstance());
  System.err.println("EDT Debug Mode Is Active");
}
catch (ClassNotFoundException e) {
  // the debug classes from SwingHelper are not available
}
This code tries to load a class, via reflection, that catches EDT violations (painting Swing components from the wrong thread) and that I only use during development to aid in debugging. I load the class via reflection so that when I distribute the application I can simply leave out the JAR file containing the debug class and everything will continue to work -- the class isn't found, so an exception is thrown, caught and ignored, and the application continues on. The problem with IcedTea is that when running as a Web Start application the call to loadClass in line 2 returns null instead of throwing a ClassNotFoundException. This means that the catch block isn't triggered and the exception is thrown all the way out of the main method, killing the application. It seems only to be a Web Start issue, as running my development copy locally under IcedTea doesn't cause loadClass to return null. Of course I could fix this problem by changing the catch block to trap all exceptions, but the point is that I shouldn't have to!
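
Another workaround is a small defensive wrapper that re-imposes the documented contract, turning a rogue null back into the exception callers expect. A sketch (the SafeLoad class and the class name in main are mine, not from the real application):

```java
public class SafeLoad {

    /**
     * Loads a class and enforces the documented ClassLoader contract:
     * if a buggy loader returns null instead of throwing, we throw the
     * ClassNotFoundException the caller is expecting.
     */
    public static Class<?> loadOrThrow(ClassLoader loader, String name)
            throws ClassNotFoundException {
        Class<?> c = loader.loadClass(name);
        if (c == null) {
            // should be unreachable on a spec-compliant JVM
            throw new ClassNotFoundException(name);
        }
        return c;
    }

    public static void main(String[] args) {
        try {
            loadOrThrow(SafeLoad.class.getClassLoader(), "no.such.DebugManager");
            System.out.println("loaded");
        } catch (ClassNotFoundException e) {
            System.out.println("not found, as expected"); // a compliant JVM lands here
        }
    }
}
```

On a spec-compliant JVM the null check is dead code, which is precisely the point: callers shouldn't need it, but it keeps the catch block working even on a loader that misbehaves.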

As I said at the beginning of this post I'm all for open-source software, but I believe there are cases where developers who give their time freely to projects should think more about the merits of the project and whether it is really needed. The "official" Oracle release of Java is now, for all intents and purposes, under an open-source licence for the development of desktop applications (mobile and embedded uses are a different kettle of fish). Given this, is there really any need for a clean room implementation, especially when that implementation is so buggy as to render it useless in many situations?

What's Actually Worth Reading?

Another day, another GATE processing resource -- as you can tell I've been busy tidying up the PRs that I've developed recently. One of the reasons for this spurt of cleaning and documenting code is that a project I'm currently working on is ending soon and the information extraction pipeline we have developed needs to be fully documented. Being able to just point to multiple sections of the GATE user guide for more details on each PR in the application makes the documentation much easier to write. Of course that means that the PRs have to actually have documentation in the user guide!

I won't go into details about the project I'm currently working on with The National Archives (if you want the details then there was a press release and the head of the GATE group, i.e. my boss, has blogged about it) suffice it to say that it involves processing millions of web pages drawn from hundreds of different web sites.

We can extract an awful lot of information from the web pages we are processing, so much so in fact that it can be difficult to search through the information. We have multiple tools to help with searching but one thing we quickly realised is that it would be nice to ignore information extracted from boilerplate content. Most web pages contain text that isn't really part of the content: headers, menus, navigation links, etc. These sections can contain entities that we might extract but it is highly unlikely that they will be relevant to the main content of the page. For this reason it would be nice to be able to exclude these in some way when searching through the extracted information.

The approach we chose was to keep everything extracted by the IE pipeline but also to determine the sections of the document that were actually content. This allows us to search for entities within content. It also means that if our ability to determine what is useful content and what isn't is flawed in any way, we have still extracted the entities appearing in other parts of the document.

Rather than implementing a content detection system from scratch I decided to base the PR on an existing Java library called boilerpipe. The boilerpipe library contains a number of different algorithms for detecting content, most of which are available through the new GATE PR. A few features are not available because it is not currently possible to map them directly onto a GATE document.
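
Boilerpipe's classifiers work from shallow features of each text block, such as its word count and link density (the fraction of words that sit inside links). Here's a toy illustration of that idea (nothing like the library's real decision rules, whose thresholds are learned from training data -- the numbers below are made up for illustration):

```java
public class ContentHeuristic {

    /**
     * Toy boilerplate detector: a block counts as "content" if it is
     * reasonably long and few of its words are inside links. Navigation
     * menus tend to be short and almost entirely made of links.
     */
    public static boolean looksLikeContent(int words, int linkedWords) {
        double linkDensity = words == 0 ? 1.0 : (double) linkedWords / words;
        return words >= 15 && linkDensity < 0.33;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeContent(120, 4)); // long paragraph, few links -> true
        System.out.println(looksLikeContent(8, 8));   // short, all links (a menu) -> false
    }
}
```

In the GATE PR the equivalent decision is surfaced as annotations over the spans judged to be content, so downstream searches can be restricted to entities falling inside those spans.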

To give you a better idea of what the new PR does here is a screen shot of a web page loaded into both a browser and GATE. In the GATE window you can see the pink sections that have been marked as content (click on the image for a larger, easier-to-read version).


Whilst this kind of approach is never going to be perfect it seems, from initial testing, that it does indeed help to filter out erroneous results when searching through information extracted from large web based corpora.

If you want to try it out yourself then it's already in the main GATE svn repository and the nightly builds. Details of how to configure the PR can be found in the relevant section of the GATE user guide.