Trawling The Heap

I've spent a good few hours over the last week trying to track down a memory leak in a web application I've been working on. As far as I could tell from the code, all the relevant resources were being freed when I'd finished with them, yet after a few hours the Tomcat instance in which the app was running would grind to a halt as the available free memory inched ever closer to zero. In the end I decided that the only solution was to trawl through a heap dump to find out exactly what was being leaked and what was holding a reference to it.

It used to be that exploring the Java heap was a tedious and horrid process. Fortunately, the JDK now comes with VisualVM, which makes working with the heap really easy.

VisualVM can attach to any running Java process and monitor its memory usage, which in itself can be useful, but it can also take a heap dump and then provides an easy tool for navigating through the often vast amount of information produced. In theory you should be able to use VisualVM to examine the heap of the Tomcat server running a troublesome web app. Try as I might, though, I couldn't get this to work. The problem stems from the fact that I'm running Tomcat under a different user account than my own, an account that you can't actually log in to (for the curious, I installed Tomcat under Ubuntu using the default package, which runs Tomcat under the tomcat6 user). I could monitor the memory usage, but no matter what I tried (and believe me I tried all sorts of things) I couldn't manage to get a heap dump.

In the end I resorted to manually creating a core dump using the Unix gcore utility and then loading this into VisualVM, which could then generate a heap dump from it. This actually works quite nicely. The only downside is that it requires you to know the process ID of the Tomcat server, and this changes every time the server is restarted, which, if you are debugging a problem, can be quite often. So to make my life a little easier I've written a small bash script that makes Tomcat dump its heap, which I've cleverly called tomscat!

#!/bin/bash

# find the pid of the Java process owned by the tomcat6 user
# (strip the leading whitespace that ps pads short pids with)
pid=`ps -u tomcat6 | grep java | sed 's/^ *//;s/ .*$//'`
count=0

# count any core dumps already taken for this instance of tomcat
ls -1 tomcat*.$pid > /dev/null 2>&1
[ $? -eq 0 ] && count=`ls -1 tomcat*.$pid | wc -l`

# dump the core into a numbered file: tomcat0.<pid>, tomcat1.<pid>, ...
gcore -o tomcat$count $pid

This script first finds the pid of the Tomcat process, then works out whether there are already any core dumps for this instance of Tomcat, and finally generates a core dump into a nicely named file. Currently there is little in the way of error handling, so if it doesn't work any errors may be cryptic! Hopefully other people will find this script useful; I know it made the process of creating a bunch of heap dumps quite easy, and once I had the heap dumps tracking down the leak was fairly straightforward (it turns out the leak was due to large cache objects associated with database connections not being made available for garbage collection).

People in the News

Back in May I was involved in producing a demo for a show-and-tell session at the GATE training course. The idea was to try and demonstrate the process of defining an application, developing an annotation pipeline, annotating a large corpus, and then providing search over the documents, annotations and associated semantic information.

The idea we settled upon was to extend the basic ANNIE application, which is bundled with GATE, to annotate BBC News articles and to link the entities within them to DBpedia. This would then allow us to search the documents for textual information, just like any other search engine, but also to restrict the search based on information that might not be present in the documents themselves but which is encoded in DBpedia. This worked well and allowed us to demonstrate the use of GATE Developer, the GATE Cloud Parallelizer (GCP) and GATE Mímir.

The combination of text, annotations and semantic information allows us to search the documents in interesting ways. You can play with the basic Mímir interface (referred to as GUS) over the demo index to see for yourself how useful the combination can be. Given that not many people reading this will already know the Mímir query syntax, and those that do probably won't know what annotations etc. are in the index, here are a few example queries to get you started:
People Born in Sheffield:
    {Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace <http://dbpedia.org/resource/Sheffield>}"}
The Location of Steel Industries:
    {Organization sparql = "SELECT ?inst WHERE { ?inst :industry <http://dbpedia.org/resource/Steel>}"} [0..4] in {Location}
A BBC Scotland document, written after the start of 2011, in which a Labour Party member is being quoted:
    ({Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29>}"} root:say) IN ({Document date > 20110000} OVER {DocumentClassification sparql = "SELECT ?inst WHERE { ?inst a bbc:Classification . FILTER (?inst = bbc:Scotland)}"})
As you can see from these examples, as the queries get more complex they quickly become unwieldy. The problem is that Mímir provides a very rich query syntax and the basic GUS interface does nothing to hide the syntax from the user. Whenever we demo Mímir people love it but we always have to stress that GUS is not an end user search tool -- it is a development tool to enable you to check the contents of an index and to develop complex queries. In other words...
GUS is not the interface you are looking for!

Now I really like the demo we put together but trying to teach people the Mímir query syntax is difficult, especially if they don't already know any SPARQL. Also it is difficult to explain to potential partners/customers how they could take a Mímir index and produce their own custom interfaces. Whilst these thoughts have been festering in the back of my mind for a while I've only just found the time to go back to the demo and to build a custom interface (partly because next week I'm going to be teaching some people how to build custom Mímir interfaces, so I thought it best to have built at least one).

Given how rich the query syntax is, it is unlikely that a custom interface will be able to expose all the information within the index. Instead a number of interfaces may be developed, for the same index, in order to provide different types of search. Given this I decided to focus on searching for people within the BBC News articles. I used GUS to explore the index (which is what GUS is really for) and built up a number of complex person related queries. I then set about breaking these queries down into sections that could be easily represented in a form based fashion.

Once the form was complete it was trivial to reconstruct the complex queries from the form elements. All that was left to do was to interface with the Mímir index. Fortunately, as well as GUS, Mímir comes with an XML-based RESTful interface. So the demo now builds complex queries from the form elements, submits the query to Mímir via its RESTful interface, and then displays the results, all without the user having to know anything about Mímir's query syntax.
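For the curious, the glue between the form and the index is really just an HTTP request. The snippet below sketches the general shape of that interaction in plain Java; note that the host, path and queryString parameter are placeholders invented purely for illustration and are not Mímir's documented service API, so consult the Mímir documentation for the real endpoints.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

// A sketch of submitting a query to a remote index over HTTP; the URL and
// parameter name below are assumptions for illustration, not Mímir's real API.
public class MimirQueryExample {
    public static void main(String[] args) throws Exception {
        // the query assembled from the form elements
        String query = "{Person sparql = \"SELECT ?inst WHERE { ?inst :birthPlace <http://dbpedia.org/resource/Sheffield>}\"}";

        // hypothetical endpoint -- check the Mímir documentation for the real one
        URL url = new URL("http://example.com/mimir/demo-index/search?queryString="
                + URLEncoder.encode(query, "UTF-8"));

        // read back the XML response, which would then be parsed and rendered
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}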

The completed demo is unimaginatively called People in the News and you should feel free to play around with it. Some example queries include: criminals called Jonathan, Russian astronauts, and (my favourite complex example) politicians born in Sheffield mentioned in BBC Scotland documents from April 2011. The nice thing about the new interface is how easy it is to fill in the form to run these queries. That last example would otherwise entail entering the following into GUS:
(({Person sparql="SELECT DISTINCT ?inst WHERE { ?inst :birthPlace <http://dbpedia.org/resource/Sheffield> . { ?inst a :Politician } UNION { ?inst a :OfficeHolder . ?inst a <http://xmlns.com/foaf/0.1/Person> } }"}) IN {Content}) IN ({Document date >= 20110401 date <= 20110430} OVER {DocumentClassification sparql = "SELECT ?inst WHERE { ?inst a bbc:Classification . FILTER (?inst = bbc:Scotland)}"})
It is still something of a work-in-progress so if you have any ideas for improvements, or you find any bugs/oddities please do let me know.

Cranium

Over the last few weeks I've been trying to hunt down a memory leak in a servlet-based web application. Periodically the Java virtual machine in which Tomcat was running would inexplicably run out of PermGen space and become so unresponsive that the only solution was to kill and restart the server process. After a lot of hunting through logs and trawling the Internet for pointers, I found that the problem actually occurs when a web application is redeployed, although the out-of-memory error may not appear until later (which is why it was difficult to spot in the logs).

It turns out that when an application is redeployed the old classloader should be garbage collected, which should free up both heap and PermGen memory by removing all the information related to the discarded web application. Unfortunately, if something outside your web application holds a reference to even one class that was loaded via the application's classloader, then the classloader itself, and hence all the class information it has loaded, will never become eligible for garbage collection, and this eventually results in exhaustion of the PermGen memory pool. If that isn't immediately clear, never fear, as Frank Kieviet wrote a brilliant article (with diagrams) which explains the problem in more detail.
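To make the failure mode concrete, here is a minimal sketch of the kind of code that causes it: a background Timer started by a web application but never cancelled. The class names are invented for the example, and this is not code from the application I was debugging.

import java.util.Timer;
import java.util.TimerTask;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

// Hypothetical illustration of the classloader leak pattern described above.
public class LeakyContextListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent event) {
        // The Timer starts a background thread that keeps running after the web
        // application is undeployed. The anonymous TimerTask below was loaded by
        // the web application's classloader, so the live thread keeps a strong
        // reference to that classloader, and hence to every class it loaded,
        // preventing the PermGen space they occupy from ever being reclaimed.
        new Timer(true).scheduleAtFixedRate(new TimerTask() {
            public void run() {
                // periodic housekeeping work
            }
        }, 0, 60000);
    }

    public void contextDestroyed(ServletContextEvent event) {
        // The fix is to keep hold of the Timer and call cancel() here.
    }
}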

Looking back through the Tomcat logs it seems as if something within one of the libraries I was using is leaking a Timer instance which stops the classloader being garbage collected. I haven't actually managed to fix the problem yet but I did learn quite a few things along the way which I've collected together and turned into....
Cranium is a web application (distributed as a WAR file) that provides information on the memory usage of the servlet container in which it is being hosted. This includes information on all the memory pools (both heap and non-heap) as well as class loading and garbage collection. It also incorporates two different ways of triggering garbage collection to help monitor for memory leaks etc. Rather than trying to explain in detail what Cranium allows you to monitor I'm hosting it as a demo for you to look at (although I've disabled the garbage collection tools so that they cannot be used to make the server unstable).
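If you are wondering where numbers like these come from, this kind of information is all exposed by the standard java.lang.management API. The sketch below is a minimal illustration of the JMX beans involved; it is not Cranium's actual code, just the general idea.

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// A minimal illustration of the java.lang.management beans that expose memory
// pool, class loading and garbage collection statistics; not Cranium's code.
public class MemoryReport {
    public static void main(String[] args) {
        // one MemoryPoolMXBean per pool, e.g. Eden Space, Old Gen, PermGen
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            System.out.println(pool.getName() + " (" + pool.getType() + "): "
                    + usage.getUsed() + " of " + usage.getMax() + " bytes used");
        }

        // class loading statistics
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        System.out.println("Loaded classes: " + classes.getLoadedClassCount()
                + " (unloaded: " + classes.getUnloadedClassCount() + ")");

        // garbage collection statistics
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + "ms");
        }
    }
}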

As with most of my software Cranium is open-source and you can grab the code from my SVN repository or you can simply grab a pre-built WAR file. If you want to track development of Cranium then you can monitor it via my Jenkins server which also produces a bleeding edge WAR file on each build.

I know a lot of the information Cranium displays is available through other tools but I'm already finding it really useful and I hope that at least one other person does too!

Gordon's Alive!

A few years ago I wrote a small servlet to allow QuickTime movies to be converted into Flash video on the fly, specifically to support playback on the Wii -- I gave it the rather unimaginative name Quick As A Flash. I've had little need to update the code until recently when I upgraded my main PC from Windows to Ubuntu.

I have a web app that I wrote and use to index/search all the DVDs I own. It interfaces with Amazon to get artwork and reviews and allows trailers to be linked to each film. I had been using Apple's QuickTime for Java to get the dimensions and duration of the QuickTime trailers I was adding to the index. Unfortunately this has a) never been available under Linux and b) now been deprecated by Apple. So I decided to revisit Quick As A Flash and add support for extracting this information to the servlet.

Quick As A Flash uses FFmpeg to do the transcoding to Flash, and it is trivial to read the dimensions and duration of the movie from the FFmpeg output. In a simple case of I-could-so-I-did, I've also added support for generating a thumbnail image from the QuickTime movie.
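If you've never looked at FFmpeg's output, the metadata it prints when probing a file includes a "Duration: 00:02:05.40" style line and a video stream description containing the frame size, so pulling the values out is just a couple of regular expressions. The sketch below shows the rough idea; the exact output format varies between FFmpeg versions, so treat the patterns as illustrative rather than definitive, and it certainly isn't the code from Quick As A Flash.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A rough illustration of pulling the duration and frame size out of FFmpeg's
// metadata output; the patterns are illustrative, not from Quick As A Flash.
public class FFmpegInfo {

    private static final Pattern DURATION =
            Pattern.compile("Duration: (\\d+):(\\d+):(\\d+(?:\\.\\d+)?)");
    private static final Pattern DIMENSIONS =
            Pattern.compile("Stream.*Video:.* (\\d{2,})x(\\d{2,})");

    public static void main(String[] args) {
        // in reality this text would be read from the stderr of an ffmpeg process
        String output = "  Duration: 00:02:05.40, start: 0.000000, bitrate: 1488 kb/s\n"
                + "    Stream #0.0(eng): Video: h264, yuv420p, 1280x720, 25 fps";

        Matcher m = DURATION.matcher(output);
        if (m.find()) {
            double seconds = Integer.parseInt(m.group(1)) * 3600
                    + Integer.parseInt(m.group(2)) * 60
                    + Double.parseDouble(m.group(3));
            System.out.println("duration: " + seconds + "s");
        }

        m = DIMENSIONS.matcher(output);
        if (m.find()) {
            System.out.println("size: " + m.group(1) + "x" + m.group(2));
        }
    }
}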

I've no idea if anyone else is using this code or would ever find it useful but if you are interested then you can grab a binary copy or track progress on the Jenkins build page.

About The Size Of It

After quite a lot of work I've now managed to bring some semblance of order (and documentation) to the last of the GATE plugins that I've been trying to clean up for general release. So as of the most recent nightly build of GATE there is now a Measurements Tagger which you can load from the Tagger_Measurements plugin. I'm not going to attempt to give a full description of the PR here, so if you want the full details have a look at the user guide where there are three whole pages you can read.

In essence the PR annotates measurements appearing in text and normalizes the extracted information to allow for the easy comparison of measurements defined using different units. Now while that description is accurate it probably doesn't make much sense so here are a few examples.

Imagine that you wanted to find all distance measurements less than 3 metres appearing in a document. The Measurements Tagger makes this really simple. You could annotate your documents and then look at the unit and value features of all the Measurement annotations to find those where the unit is "metre" and the value is less than 3, but this would miss lots of valid measurements. For example, 3cm is less than three metres but uses a prefix to make writing the measurement easier. Or how about 4.5 inches? This is clearly less than 3 metres but is specified in an entirely different system of units. Fortunately, as well as annotating measurements with the unit and value specified in the document, this new PR also normalizes (where possible) the measurement to its base form.

The base form of a unit usually consists solely of SI units. This means, for example, that all lengths are normalized to metres, times to seconds, and speeds to metres per second (which is classed as a derived unit but is made up only of SI units).

In our example this means that 3cm is normalized to 0.03m and 4.5 inches to 0.1143m, which allows them both to be recognized as being less than 3 metres. Under the hood the PR uses a modified version of the Java port of the GNU Units package to recognize and normalize the measurements. This approach makes it easy to add new units or to customize the parser for a specific domain, providing a very flexible solution.
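Conceptually the normalization step is nothing more exotic than multiplying the value by the unit's conversion factor into its SI base form. The toy sketch below shows the idea with a hard-coded factor table; the real PR of course gets its conversions from the GNU Units data rather than a map like this.

import java.util.HashMap;
import java.util.Map;

// A toy illustration of normalizing measurements to SI base units; the real
// Measurements Tagger derives its conversion factors from the GNU Units data.
public class UnitNormalizer {

    private static final Map<String, Double> TO_METRES = new HashMap<String, Double>();
    static {
        TO_METRES.put("m", 1.0);
        TO_METRES.put("cm", 0.01);
        TO_METRES.put("inch", 0.0254);
        TO_METRES.put("mile", 1609.344);
    }

    // assumes the unit is one we know about
    public static double toMetres(double value, String unit) {
        return value * TO_METRES.get(unit);
    }

    public static void main(String[] args) {
        System.out.println(toMetres(3, "cm"));     // roughly 0.03
        System.out.println(toMetres(4.5, "inch")); // roughly 0.1143
    }
}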

The PR doesn't actually contain code for recognizing the value of a measurement, rather it relies on the annotations produced by the Numbers Tagger I cleaned up and released back in February. This means that this new PR can also recognize numbers written in many different ways allowing for measurements such as "forty-five miles per hour", "three thousand nanometres" and "2 1/2 pints".

Both the Numbers and Measurement taggers were originally developed for annotating a large corpus of patent documents. Once annotated, the corpus could then be searched via another GATE technology called Mímir. Mímir is a multi-paradigm IR system which allows searching over text, annotations, and knowledge base data. There are a couple of demo indexes (including a subset of the patent corpus) that you can try, and this video does a good job of explaining how the measurement annotations can be really useful.


If you find the whole topic of measurements interesting then I'd recommend reading "About The Size Of It" by Warwick Cairns. It's only a short book but it explains why we use the measurements we do and how they have evolved over time. I found it interesting, but then I quite like reading non-fiction.

Hopefully the new measurement PR will turn out to be really useful for a lot of people and projects. If you benefit from using GATE in general, or these new PRs in particular, then why not consider making a donation to help support future development?

Hudson Becomes Jenkins

I've upgraded the Hudson instance I use to compile most of my software to the newest version which, after a dispute with Oracle, is now called Jenkins. As well as upgrading the software I've changed the URL to match. I'm using J2EP in order to rewrite the old URLs to their new forms so hopefully all existing links will continue to work as before, but if you spot anything that doesn't seem to work properly please leave a comment so I can fix things.

When Was Yesterday?

Today sees the release of another of the GATE plugins I've been working on cleaning up over the last few months. Unlike the other plugins I've talked about recently this one has a much longer history as I wrote the core code back when I was a PhD student.

Many information extraction (IE) tasks benefit from or require the extraction of accurate date information. While ANNIE (the IE system that comes with GATE) does produce Date annotations, no attempt is made to normalize these dates, i.e. to firmly fix all dates, even partial or relative ones, to a timeline using a common date representation. My PhD focused on open-domain question answering, an IE task in which dates can play an important role; any "when" question, or question starting "who is...", benefits from accurate date information. The problem was that I couldn't find a good Java library for parsing dates into a common format, so of course I set about writing one.

The library I wrote is unimaginatively called Date Parser and has been freely available since around 2005. You can currently find the parser being built by my Hudson server. Without going into too many technical details (the Javadoc is available for those who like that kind of thing), the parser takes a string and attempts to parse it as a date starting from a given offset. Unlike the built-in DateFormat class, which is limited to parsing one date format at a time, my parser attempts to handle as many date formats as possible. Of course there are only so many ways you can re-arrange three pieces of information, but the parser also handles relative dates and dates which are not fully specified. For example, "April 2011" would be parsed into a Date object representing the 1st of April 2011. Possibly more interesting, though, is the fact that words/phrases such as yesterday, today, next Wednesday, and 3 days ago are all also recognized and parsed. In these instances the actual date being mentioned is calculated based upon a context date supplied to the parser. So if the word yesterday appears in the context of the 3rd of March 2011, the string will be recognized as referring to the 2nd of March 2011.
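To make the context-date idea concrete, here is a tiny self-contained sketch of how a relative expression can be resolved against a supplied context date using nothing but the standard Calendar class. The class and method names are invented for the example and bear no relation to the actual Date Parser API, which is documented in the Javadoc mentioned above.

import java.util.Calendar;
import java.util.Date;

// A toy illustration of resolving relative date expressions against a context
// date; the names here are invented and are not the Date Parser's actual API.
public class RelativeDates {

    public static Date resolve(String text, Date context) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(context);
        if ("yesterday".equalsIgnoreCase(text)) {
            cal.add(Calendar.DAY_OF_MONTH, -1);
        } else if ("tomorrow".equalsIgnoreCase(text)) {
            cal.add(Calendar.DAY_OF_MONTH, 1);
        } else if (text.toLowerCase().matches("\\d+ days ago")) {
            int days = Integer.parseInt(text.split(" ")[0]);
            cal.add(Calendar.DAY_OF_MONTH, -days);
        }
        return cal.getTime();
    }

    public static void main(String[] args) {
        // context: the 3rd of March 2011 (months are zero-based in Calendar)
        Calendar context = Calendar.getInstance();
        context.set(2011, Calendar.MARCH, 3);

        // prints a date on the 2nd of March 2011
        System.out.println(resolve("yesterday", context.getTime()));
    }
}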

The parser worked really well during my PhD work and has seen numerous improvements since then. It started to be used in GATE projects a year or so ago, initially in conjunction with ANNIE. ANNIE adds Date annotations to documents, and I wrote a JAPE grammar that would find these annotations and then run the parser over the underlying text, adding the normalized date value (if found) as a new feature. The code eventually moved to being a PR (rather than JAPE) for performance reasons and to support some new features. The problem, however, was that the dates the parser could handle and the dates that ANNIE finds don't always align. This meant that adding a new date format required changes to both ANNIE and the Date Parser. So when I started to clean up the code for release I made the decision to re-write the PR as a standalone component that no longer relies on ANNIE.

Surprisingly, it was very easy to convert the existing code to remove the reliance on ANNIE, and I think the performance (in terms of both speed and accuracy) has improved as a result. This isn't to say that ANNIE is bad at finding dates, just that it does some things differently and it also annotates times with Date annotations, which for this task can confuse the issue.

Full documentation is available in the user guide and the PR is already available in the nightly builds of GATE (you need to load the Tagger_DateNormalizer plugin) so feel free to have a play and let me know what you think.