Code from an English Coffee Drinker: March 2011

About The Size Of It

After quite a lot of work I've now managed to bring some semblance of order (and documentation) to the last of the GATE plugins that I've been trying to clean up for general release. So as of the most recent nightly build of GATE there is now a Measurements Tagger which you can load from the Tagger_Measurements plugin. I'm not going to attempt to give a full description of the PR here, so if you want the full details have a look at the user guide where there are three whole pages you can read.

In essence the PR annotates measurements appearing in text and normalizes the extracted information to allow for the easy comparison of measurements defined using different units. Now while that description is accurate it probably doesn't make much sense so here are a few examples.

Imagine that you wanted to find all distance measurements less than 3 metres appearing in a document. The Measurements Tagger makes this really simple. You could annotate your documents and then look at the unit and value features of all the Measurement annotations to find those where the unit is "metre" and the value is less than 3, but this would miss lots of valid measurements. For example, 3cm is less than three metres but uses a prefix to make writing the measurement easier. Or how about 4.5 inches? This is clearly less than 3 metres but is specified in an entirely different system of units. Fortunately as well as annotating measurements with the unit and value specified in the document, this new PR also normalizes (where possible) the measurement to it's base form.

The base form of a unit usually consists solely of SI units. This means, for example, that all lengths are normalized to metres, times to seconds, and speeds to metres per second (which is classed as a derived unit but is made up only of SI units).

In our example this means that 3cm is normalized to 0.03m and 4.5 inches to 0.1143m which allows them to both be recognized as being less than 3 metres. Under the hood the PR uses a modified version of the Java port of the GNU Units package to recognize and normalize the measurements. This approach makes it easy to add new units or to customize the parser for a specific domain, providing a very flexible solution.

The PR doesn't actually contain code for recognizing the value of a measurement, rather it relies on the annotations produced by the Numbers Tagger I cleaned up and released back in February. This means that this new PR can also recognize numbers written in many different ways allowing for measurements such as "forty-five miles per hour", "three thousand nanometres" and "2 1/2 pints".

Both the Numbers and Measurement taggers were originally developed for annotating a large corpus of patent documents. Once annotated the corpus could then be searched via another GATE technology called Mímir. Mímir, is a multiparadim IR system which allows searching over text, annotations, and knowledge base data. There are a couple of demo indexes (including a subset of the patent corpus) that you can try, and this video does a good job of explaining how the measurement annotations can be really useful.

If you find the whole topic of measurements interesting then I'd recommend reading "About The Size Of It" by Warwick Cairns. It's only a short book but it explains why we use the measurements we do and how they have evolved over time. I found it interesting, but then I quite like reading non-fiction.

Hopefully the new measurement PR will turn out to be really useful for a lot of people/projects. If you benefit from using GATE in general, or these new PRs in particular, then why not consider making a donation to help support future development.

Hudson Becomes Jenkins

I've upgraded the Hudson instance I use to compile most of my software to the newest version which, after a dispute with Oracle, is now called Jenkins. As well as upgrading the software I've changed the URL to match. I'm using J2EP in order to rewrite the old URLs to their new forms so hopefully all existing links will continue to work as before, but if you spot anything that doesn't seem to work properly please leave a comment so I can fix things.

When Was Yesterday?

Today sees the release of another of the GATE plugins I've been working on cleaning up over the last few months. Unlike the other plugins I've talked about recently this one has a much longer history as I wrote the core code back when I was a PhD student.

Many information extraction (IE) tasks benefit from or require the extraction of accurate date information. While ANNIE (the IE system that comes with GATE) does produce Date annotations no attempt is made to normalize these dates, i.e. to firmly fix all dates, even partial or relative ones, to a timeline using a common date representation. My PhD focused on open-domain question answering, an IE task in which dates can play an import role; any "when" question, or questions starting "who is..." benefit from accurate date information. The problem was that I couldn't find a good Java library for parsing dates into a common format, so of course I set about writing one.

The library I wrote is unimaginatively called Date Parser and has been freely available since around 2005. You can currently find the parser being built by my Hudson server. Without going into too many technical details (the Javadoc is available for those who like that kind of thing) the parser takes a string and attempts to parse it as a date starting from a given offset. Unlike the built in DateFromat class which is limited to parsing one date format at a time my parser attempts to handle as many date formats as possible. Of course there are only so many ways you can re-arrange three pieces of information, but the parser also handles relative dates and dates which are not fully specified. For example, "April 2011" would be parsed into a Date object representing the 1st of April 2011. Possibly more interesting though is that fact that words/phrases such as yesterday, today, next Wednesday, and 3 days ago are all also parsed and recognized. In these instances the actual date being mentioned is calculated based upon a context date supplied to the parser. So if the word yesterday appears in the context of the 3rd of March 2011 the string will be recognized as referring to the 2nd of March 2011.

The parser worked really well during my PhD work and has seen numerous improvements since then as well. It started to be used in GATE projects a year or so ago and was initially used in conjunction with ANNIE. ANNIE adds Date annotations to document and I wrote a JAPE grammar that would find these annotations and then run the parser over the underlying text adding the normalized date value (if found) as a new feature. The code eventually moved to being a PR (rather than JAPE) for performance reasons and to support some new features. The problem, however, was that the dates the parser could handle and the dates that ANNIE finds don't always align. This meant that adding a new date format required changes to both ANNIE and the Date Parser. So when I started to clean up the code for release I made the decision to re-write the PR as a standalone component that no longer relies on ANNIE.

Surprisingly it was very easy to convert the existing code to remove the reliance on ANNIE and I think the performance (both time and accuracy) have been improved as a result. This isn't to say that ANNIE is bad at finding dates, just that it does some things differently and it also annotates times with Date annotations which for this task can confuse the issue.

Full documentation is available in the user guide and the PR is already available in the nightly builds of GATE (you need to load the Tagger_DateNormalizer plugin) so feel free to have a play and let me know what you think.