Code from an English Coffee Drinker: 2011

I Do Exist, Honest!

After a stressful day I appear to exist once again. If you tried to visit my blog yesterday you would have seen that it had been removed. Worse than that, Google had decided to suspend my entire account. That meant I lost access not only to all my blogs, but also to my photos in Picasa, my calendars, my Gmail account, and my Google+ profile. It was almost as if I no longer existed (online). I filled in the contact us form to request my account back and then almost a day later I got this e-mail from Google:

We apologize for any inconvenience you may have experienced. The issue you described should now be resolved.

I don't think I've ever been quite so relieved. While the whole experience was a bit of a nightmare it did get me thinking about how I could keep backups of all my Google data in case something similar happened again in the future. After a little bit of hunting around the web (via Google of course) I came across the Data Liberation Front.

The Data Liberation Front is a Google engineering team who are trying to make it easy to liberate your data from Google products. This is useful for either moving your data to a competing service or for simply backing it up locally.

They have step by step instructions on exporting your data from most of Google's products, but they are also developing Google Takeout. Google Takeout brings together a number of the export tools into a simple interface allowing you to select which data you want to download. It then collects the data together into a single zip file that you can easily download and archive. Currently it supports: +1’d sites, Buzz, Contacts and Circles, Picasa Web Albums, Profile, and Google+ Stream. As far as I can tell the aim is to add more products to this list and I'd certainly appreciate Blogger being included.

So if you don't have a recent backup of your Google hosted data maybe now is the time to do something about it.

Striped Clouds

I've spent quite a bit of my spare time over the last week or two doing some Java GUI programming, the reasons for which will become clear in a later post. Quite a lot of the GUI is table based and so I've spent quite a bit of time playing with custom rendering code for different data types to make things easier to visualize and edit. There are plenty of tutorials on how to do this spread over the web but one thing I found quite difficult was writing a renderer that worked reliably across different Java Look and Feels (L&F). The one renderer I wrote that highlights most of the problems I had was for displaying a checkbox in a table.

By default Java will display a checkbox for boolean data types, but unfortunately it doesn't disable the checkbox when it can't be edited. This leads to a situation where you can't change the state of the checkbox but there is no visual feedback to tell you this. So I wrote a simple renderer that would disable the checkbox if it wasn't editable. The first problem I found was that under the GTK+ L&F the background of the cell didn't change when the row was selected. It did under the default Metal L&F and after a little bit of debugging I discovered the problem. Every Swing component has an opaque property which determines if its background is drawn or not. It turns out that the default value is dependent on the L&F. So under Metal checkboxes have an opaque background while under GTK+ they don't. Fortunately this is easy to fix simply by calling setOpaque(true). A similar problem occurs with the focus rectangle around the outside of each table cell, but again it's easily fixed by calling setBorderPainted(true). These tweaks gave me a cell renderer which seemed to work well until, that is, I tried it under the Nimbus L&F.

The Nimbus L&F was introduced in Java SE 6 Update 10, and was meant to be the new cross platform L&F that would replace the aging Metal which has been the default since Swing was first developed. One of the nice things about Nimbus is that it is resolution independent and uses vector graphics rather than bitmaps. This should, in theory, lead to a crisper interface. Personally I'm not a fan, and as yet it hasn't replaced Metal as the default L&F. It seemed, however, sensible to make sure my rendering code worked correctly under Nimbus as well as Metal, GTK+ and CDE/Motif (these are the four L&Fs available by default when running Java under Ubuntu). Unfortunately it didn't.

Nimbus, in an attempt to be different, colours the background of table rows alternating colours -- by default white and a light gray. This is instead of drawing a border around the cells. The problem is that my renderer (and almost every example I've seen) gets the background colour for the cell from the table by calling either table.getBackground() or table.getSelectionBackground(). The selected background colour works correctly but the unselected cells get drawn with a dark gray background. There are three tricks to work around this while leaving the code working under the other L&Fs. The first is to get the alternative background colour from the UIManager class. The second is to recreate the unselected background colour to make it display correctly. Finally we use the modulus operator to determine which row colour we should be using. Adding these workarounds gives me the following cell renderer which seems to work under the four L&Fs available by default under Ubuntu as well as the Windows L&F.

import java.awt.Color;
import java.awt.Component;

import javax.swing.BorderFactory;
import javax.swing.JCheckBox;
import javax.swing.JTable;
import javax.swing.UIManager;
import javax.swing.border.Border;
import javax.swing.table.TableCellRenderer;

/**
 * A TableCellRenderer for JCheckBox that disables the checkbox when the
 * cell isn't editable to make it clear that you can't click on it
 * 
 * @author Mark A. Greenwood
 */
@SuppressWarnings("serial")
public class CheckBoxTableCellRenderer extends JCheckBox implements
                                                    TableCellRenderer {

  private static final Border NO_FOCUS =
    BorderFactory.createEmptyBorder(1, 1, 1, 1);

  public CheckBoxTableCellRenderer() {
    super();
    setHorizontalAlignment(JCheckBox.CENTER);
    setBorderPainted(true);
    setOpaque(true);
  }

  public Component getTableCellRendererComponent(JTable table,
    Object value, boolean isSelected, boolean hasFocus,
    int row, int column) {

    // this is needed for Nimbus, which draws alternate rows in different
    // colors; hopefully other L&Fs that also do this use the same key
    Color alternate = UIManager.getColor("Table.alternateRowColor");

    // strangely the background color from nimbus doesn't render properly
    // unless we convert it in this way. I'm guessing the problem is to do
    // with the DerivedColor class that Nimbus uses
    Color normal = new Color(table.getBackground().getRGB());

    if(isSelected) {
      setForeground(table.getSelectionForeground());
      setBackground(table.getSelectionBackground());
    } else {
      setForeground(table.getForeground());
      setBackground(alternate != null && row % 2 == 0 ? alternate : normal);
    }

    setEnabled(table.isCellEditable(row, column));
    setSelected(value != null && (Boolean)value);

    if(hasFocus) {
      setBorder(UIManager.getBorder("Table.focusCellHighlightBorder"));
    } else {
      setBorder(NO_FOCUS);
    }

    return this;
  }
}

One thing to note is that I've used the alternative colour for the even numbered rows (i.e. when row % 2 == 0). I've seen some web pages suggesting that you need to use the alternative colour on the odd rows. I'm not sure how Nimbus decides which colour to use for which rows so if you see them switched around for some reason you'll need to tweak the code slightly (i.e. use row % 2 == 1).
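If the rows do come out switched around, the parity decision can be pulled into a small helper so that flipping it is a one-argument change. This is only a sketch: the class and method names (RowColours, backgroundFor) and the stripeEvenRows flag are made up for illustration and are not part of the renderer above.

```java
import java.awt.Color;

/** Hypothetical helper (not part of the renderer above) that picks a row background. */
final class RowColours {

  private RowColours() {}

  /**
   * @param alternate      the striping colour, or null if the L&F doesn't stripe rows
   * @param normal         the table's ordinary background colour
   * @param row            the zero-based row index
   * @param stripeEvenRows true to stripe rows 0, 2, 4...; false to stripe rows 1, 3, 5...
   */
  static Color backgroundFor(Color alternate, Color normal, int row, boolean stripeEvenRows) {
    if (alternate == null) {
      // no striping colour defined, so every row uses the normal background
      return normal;
    }
    boolean even = row % 2 == 0;
    return even == stripeEvenRows ? alternate : normal;
  }

  public static void main(String[] args) {
    Color stripe = new Color(242, 242, 242); // an arbitrary light gray for the demo
    Color plain = Color.WHITE;
    for (int row = 0; row < 4; row++) {
      System.out.println(row + " -> " + backgroundFor(stripe, plain, row, true));
    }
  }
}
```

Inside getTableCellRendererComponent the call would look something like setBackground(RowColours.backgroundFor(alternate, normal, row, true)), with the last argument changed to false if the stripes appear on the wrong rows.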

Quick As A Fox

I like watching films and I buy a lot of DVDs (there are almost 500 movies in the house). I tend to find though, that often I have a completely different view to most movie critics so I don't tend to read movie reviews or magazines. I do, however, find watching movie trailers useful (the one exception being The Break Up, which I thought looked great but definitely wasn't) so I can easily waste an hour or so on the iTunes Movie Trailers website. Yesterday, having read an article about possible Oscar contenders I went to hunt down the trailer for The Descendants only to find that the trailer wouldn't play.

My main PC runs Ubuntu and in the past Apple have gone out of their way to make it difficult to watch the trailers on Linux so I just assumed they had deliberately broken something. I had a quick hunt around the web and couldn't see anyone else complaining that things had changed recently so I decided it had to be a problem on my machine. It took me a while to track down the problem but I thought it worth mentioning here just in case it trips up anyone else. Essentially VLC was the cause of the problem.

I use VLC as the main video player on my computer as it works well with DVDs and almost any file format/codec combination you throw at it. However, I'd recently had a problem with the audio and video going out of sync. While trying to fix that problem I had done a complete reinstall of VLC. As well as reinstalling the main application this had also reinstalled two Firefox plugins. It turns out that one of these plugins seems to interfere with the Totem based QuickTime plugin. I'm guessing that the VLC plugin takes precedence and then fails for some reason. The easy solution is to simply disable one of the VLC based Firefox plugins. You can do this from the addons page in Firefox (accessed by entering about:addons in the address bar). You should find two plugins named 'VLC Multimedia Plug-in'. The difference between them is that one states it is compatible with Totem. Leave this one alone and disable the other one and hey presto! QuickTime movies should start playing again.

Given that before the reinstall of VLC trailers had been playing properly, I'm guessing that I'd worked this problem out before and then forgotten all about it (memory like swiss cheese and all that), so hopefully this post should at least remind me of the solution next time I have the same problem even if it doesn't help anyone else.

Off The Tracks

Error handling is always important. Nobody likes it when an application they are using crashes so badly that it stops working. Of course some application crashes are more embarrassing or public than others. On my way back from a work meeting in Prague last week I saw a perfect example in Huddersfield train station.

Fortunately a simple ArrayIndexOutOfBoundsException didn't seem to stop the trains running on time.

Browser Detection: How Not To Do It

Web browsers used to behave so differently from one another that it was common to write code to detect which browser type and version was being used to display the page so that appropriate code could be run. Fortunately, modern browsers tend to support web standards better and so apart from a few CSS tweaks it is unusual to come across browser detection code (the one exception to this rule being the expense system I have to use at work which doesn't seem to like Firefox at all). So I was a little surprised the other day when I suddenly found myself bounced to an unsupported browser page on a shopping site that I'd been browsing around a few days before. So being the inquisitive kind I had a dig around and came across this gem.

if (navigator.userAgent.match(/Firefox\/[12]/)) {
   window.location.href = "unsupported.html";
}

For those of you who don't speak JavaScript, this essentially checks (using a regular expression) for the presence of either "Firefox/1" or "Firefox/2" in the useragent string, which identifies the make and version of the web browser requesting the page.

Now I like living on the bleeding edge of browser development and so I run the nightly builds of Firefox. Given the recent change in the release cycle of Firefox, the version number has climbed quite rapidly and the nightly build now sends the following useragent string (you can find out what your browser sends using this website):

Mozilla/5.0 (X11; Linux x86_64; rv:10.0a1) Gecko/20111012 Firefox/10.0a1

So from reading this you should be able to see that I'm actually running version 10.0a1 of Firefox. It should also be clear why the check of my browser resulted in me being bounced to the unsupported browser page: Firefox/10.0a1 contains the string "Firefox/1". This is a good example of why you really shouldn't write your own browser detection code, especially as there are a number of well written and up to date scripts out there that correctly extract the make and version number and which are easy to use. And if you subscribe to the not-invented-here school of thought, then at least make sure you actually implement a sensible solution!
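To make the failure concrete, here is a minimal sketch of the same check in Java (used here only to keep the examples on this blog in one language), next to a version that parses the major version number before comparing it. The class and method names are hypothetical, and the "fixed" check is just one reasonable way to do it, not what the shopping site should necessarily have used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo of the substring-style check versus parsing the major version. */
final class FirefoxVersionCheck {

  // the same pattern as the JavaScript snippet: matches "Firefox/1" or "Firefox/2" anywhere
  private static final Pattern NAIVE = Pattern.compile("Firefox/[12]");

  // captures the whole run of digits after "Firefox/" (capped at 9 so it always fits in an int)
  private static final Pattern MAJOR = Pattern.compile("Firefox/(\\d{1,9})");

  private FirefoxVersionCheck() {}

  /** Mirrors the original check, including its bug. */
  static boolean naiveIsUnsupported(String userAgent) {
    return NAIVE.matcher(userAgent).find();
  }

  /** Treats only Firefox 1.x and 2.x (major version of 2 or lower) as unsupported. */
  static boolean isUnsupported(String userAgent) {
    Matcher m = MAJOR.matcher(userAgent);
    return m.find() && Integer.parseInt(m.group(1)) <= 2;
  }

  public static void main(String[] args) {
    String nightly =
        "Mozilla/5.0 (X11; Linux x86_64; rv:10.0a1) Gecko/20111012 Firefox/10.0a1";
    System.out.println("naive: " + naiveIsUnsupported(nightly));
    System.out.println("fixed: " + isUnsupported(nightly));
  }
}
```

The only real difference is that the second pattern consumes all of the digits, so "10" is compared as the number ten rather than being matched on its leading "1".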

Postvorta: Providing Intelligent Blog Search

The eagle-eyed amongst you may have noticed that about a month ago the search box in the sidebar of this blog changed. I used to use the standard Google search gadget but I now use a gadget powered by Postvorta.

Postvorta was built specifically to enable intelligent searching of blogs. How do I know this, you ask? Well I spent the past year building Postvorta in my spare time. The initial motivation was a number of conversations with fellow bloggers about the inadequacies of the Google search gadget. Coupled with the fact that my job involves processing natural language documents (I work as part of the GATE group at the University of Sheffield), I thought I was in a position to provide something better.

It is difficult to know exactly how the standard Google search gadget works, but as far as I can tell (both from personal experimentation and from talking to others) it appears to only index the main content of each post. For example, it certainly doesn't index the labels associated with posts. This means that while you can view all posts with a given label you can't search for them using the search gadget. Postvorta, however, indexes all the important content from your blog posts: title, article, labels, and comments. Importantly it does not index the pages you see when you view the blog in a web browser; instead it accesses the underlying data (via the Google Data APIs) which means that it can ignore the repeated information in the blog template. For example, many blogs contain a gadget which lists recent post titles; these shouldn't be indexed with each post as that makes it much more difficult to search for the actual post. A search can also be restricted by date and/or by the people who commented on a post. I've tried to provide as much flexibility as possible while keeping the full interface relatively simple.

Fortunately when building Postvorta I didn't have to start from scratch. One advantage of working in a research group that makes their software available under an open-source license is that I can make use of software I use at work in my own projects. In this case the main indexing and search facilities behind Postvorta are built upon GATE Mímir. I've talked about Mímir before on this blog and if you've read that post then you shouldn't be surprised that as well as searching for words, like Google and the other search engines, you can use Postvorta to search your blogs semantically, i.e. for things. So you can search for any posts containing, for example, the name of a person without knowing what the name was in advance. If you are new to Mímir then Postvorta provides a comprehensive description of the query syntax which becomes available when you choose to use it through the search interface (by default searches are treated as a simple bag-of-words just as with other search engines).

Feel free to have a play with Postvorta through the search gadget on this blog. I'm also using it on my main blog where there are a lot more posts to search through. Postvorta is currently being run as a closed beta (while I evaluate performance, reliability etc.) but if you like what you see then you can register your interest and I'll try and index your blog as soon as possible -- note that currently Postvorta only supports Blogger blogs, although WordPress support should be coming soon.

Let me know what you think.

Blogger's Lightbox Returns To Haunt Our Blogs!

Blogger have now reintroduced lightbox to all blogs. While they claim to have fixed a lot of the bugs/issues that were reported before, they have still turned lightbox on by default. Fortunately we don't need to apply a hacky workaround this time.

If you don't want to use lightbox then you can turn it off for each blog you write through your dashboard. Simply select “No” next to Lightbox in the Settings > Posts and Comments section (new interface) or the Settings > Formatting section (old interface). I still think this feature should be opt-in rather than opt-out but at least they are allowing us to opt-out, so I guess we should be grateful.

Don't Look Down

And now for something completely different -- a posting not at all related to Blogger!

I've recently been spending quite a bit of my free time playing around with GATE Mímir (the reasons for which will become clear in a later post). As I've mentioned before, Mímir is a multi-paradigm indexing and retrieval system which allows us to combine text, annotations and knowledge base data in a single index. Text within Mímir is indexed using MG4J and by default is processed (at both indexing and search time) by a DowncaseTermProcessor which ensures that searches are case insensitive. Unfortunately while case insensitive searching is great there are other common problems when searching text collections, one of which can be nicely illustrated just from the name Mímir.

Whilst the name of the system, Mímir, contains an accented character, most people when searching would probably not go to the bother of figuring out how to enter the accented i and would instead try searching for Mimir. But just as Mímir and Mimir are visually different, so are they different when stored in an MG4J index. In other words if we search using the unaccented version we won't get any results! Whilst Mímir is a slightly unusual case I'm sure that we can all agree that a search for cafe should also bring back documents which mention café.

Now for Latin alphabets I could come up with a mapping that would reduce most accented characters down to an unaccented version, but it would be time consuming to build and wouldn't handle the different ways in which accented characters can be encoded using Unicode. So I had a bit of a hunt around and discovered a simple and, I think, elegant way of converting accented characters to their unaccented forms courtesy of a blog posting by Guillaume Laforge. Creating a custom MG4J term processor using this code was trivial and so I now have a way of ensuring that accented characters don't cause me any problems. The one issue was getting Mímir to use the new term processor.
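For anyone who hasn't seen this kind of trick, the usual approach in plain Java is to decompose each character with java.text.Normalizer and then strip the combining marks. The sketch below assumes that is roughly the technique in question; the class and method names are hypothetical, and it deliberately leaves out the MG4J term processor wrapper.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical sketch of accent folding; not the actual term processor. */
final class AccentFolding {

  private AccentFolding() {}

  /** Removes accents by decomposing to NFD and dropping the combining marks. */
  static String stripAccents(String text) {
    // NFD splits e.g. U+00ED (i with acute) into a plain 'i' followed by U+0301
    String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
    // the Combining Diacritical Marks block covers U+0300 to U+036F
    return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
  }

  /** Accent folding plus lowercasing, i.e. what a search term would be reduced to. */
  static String normaliseTerm(String term) {
    return stripAccents(term).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    // unicode escapes are used so the file's encoding doesn't matter
    System.out.println(stripAccents("M\u00edmir"));
    System.out.println(stripAccents("caf\u00e9"));
    System.out.println(normaliseTerm("M\u00cdMIR"));
  }
}
```

Because the text is normalized first, it doesn't matter whether an accented character arrives precomposed (a single code point) or already decomposed (a base letter followed by a combining mark); both end up as the same unaccented string.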

I deploy Mímir in a Tomcat instance by building a WAR file and while I could simply add a JAR file containing my custom term processor to the WEB-INF/lib folder before creating the WAR I'd prefer not to have to. If I included my code within the Mímir WAR then any change would require rebuilding the WAR and redeploying, which seems to be more work than necessary. Fortunately the Mímir config file allows you to specify GATE plugins that should be loaded when the web app is started. So it is trivial to create a GATE plugin which references a JAR containing my custom term processor. Unfortunately when I tried this, MG4J threw a ClassNotFoundException. The problem is that Java never looks down.

On the right you can see the ClassLoader hierarchy that is created when Mímir is deployed in Tomcat -- I've added little icons to show which are created by the Java runtime environment, which by Tomcat and which by the GATE embedded within Mímir. As you can see the GATE classloader, which is responsible for loading the plugin containing my custom term processor, is right at the bottom of the hierarchy. The MG4J libraries live in the Mímir WEB-INF/lib folder, which is the responsibility of the Web App classloader. Each classloader only knows about its parent and not about any children, and when asked to load a class it first asks its parent classloader; only if the parent cannot load the class does it then try loading it itself. The problem I was facing was that when MG4J tried to load my custom term processor it did so by asking the Web App classloader, and as the term processor is loaded by a child classloader the class couldn't be found and hence a ClassNotFoundException was thrown. Rather than giving up and simply adding the term processor to the WEB-INF/lib folder I decided to see if I could find a way of injecting the term processor into the right classloader.
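The "never looks down" behaviour is easy to see even outside Tomcat. The following sketch (all names hypothetical) walks up the parent chain from the loader of a class and asks each loader whether it can see that class. In a plain command-line run I'd expect only the first loader in the chain to say yes, although the exact loaders printed depend on the Java version and how the program is launched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: classloaders can ask their parents, but never their children. */
final class LoaderChain {

  private LoaderChain() {}

  /** Returns the loader of the given class followed by each ancestor, nearest first. */
  static List<ClassLoader> chainFor(Class<?> c) {
    List<ClassLoader> chain = new ArrayList<>();
    // getParent() returns null once we reach the bootstrap loader
    for (ClassLoader cl = c.getClassLoader(); cl != null; cl = cl.getParent()) {
      chain.add(cl);
    }
    return chain;
  }

  /** True if the given loader (or one of its parents) can load the named class. */
  static boolean canSee(ClassLoader loader, String className) {
    try {
      // false = don't run static initializers, we only care about visibility
      Class.forName(className, false, loader);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String name = LoaderChain.class.getName();
    for (ClassLoader cl : chainFor(LoaderChain.class)) {
      System.out.println(cl + " can see " + name + ": " + canSee(cl, name));
    }
  }
}
```

This is the same situation as MG4J asking the Web App classloader for a class that only the GATE classloader below it knows about.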

Now before we go any further I should point out that one of my colleagues has described what follows as evil, and I have to say I agree with him. That said it works and given the way Mímir works I can't see any problems arising, but I wouldn't suggest this as a general solution to the class loading problem described above for reasons I'll detail later. However....

Each class in Java knows which ClassLoader instance was responsible for creating it and we can use this information to forcibly inject code into the right place, using the following method.

private static void codeInjector(Class<?> c1, Class<?> c2)
{
  try
  {
    // Get the class loader which loaded MG4J
    ClassLoader loader = c2.getClassLoader();

    if (!loader.equals(c1.getClassLoader()))
    {
      //Assuming we aren't running inside the MG4J class loader...

      //Get an input stream we can use to read the byte definition of this class
      InputStream inp = c1.getClassLoader().getResourceAsStream(c1.getName().replace('.', '/') + ".class");

      if (inp != null)
      {
        //If we could get an input stream then...

        //read the class definition into a byte array
        byte[] buf = new byte[1024 * 100]; //assume that the class is no larger than 100KB, this one is only 3.5KB
        int n = inp.read(buf);
        inp.close();

        //get the defineClass method
        Method method = ClassLoader.class.getDeclaredMethod("defineClass", String.class, byte[].class, int.class, int.class);

        //defineClass is protected so we have to make it public before we can call it
        method.setAccessible(true);

        try
        {
          //call defineClass to inject ourselves into the MG4J class loader
          method.invoke(loader, null, buf, 0, n);
        }
        finally
        {
          //set the defineClass method back to being protected
          method.setAccessible(false);
        }
      }
    }
  }
  catch (Exception e)
  {
    //hmm, something has gone badly wrong so throw the exception
    throw new UndeclaredThrowableException(e, "Unable to inject " + c1.getName() + " into the same class loader as " + c2.getName());
  }
}

In essence this method injects the definition of class c1 into the classloader responsible for class c2. So in my term processor I call it from a static initializer as follows:

static
{
  codeInjector(NormalizingTermProcessor.class,
    it.unimi.dsi.mg4j.index.Index.class);
}

So how does this all work? Well hopefully the code comments will help but...

Firstly we check that the classes are defined by different classloaders (lines 6-8).
Then we convert the class name into the path to the class file and try to open a stream to read from that file (line 13). If we can't read the class file then it means we have already injected the class, which is why the classloader can't find the class file.
We then read the class file into a byte array (lines 20-22).
To inject code into a classloader we need to use the defineClass method, which unfortunately is protected. So we retrieve a handle to the method and remove the protected restriction (lines 25-29).
We now call defineClass on the classloader that we want to know about the class, passing in the bytes we read from the original class file (line 33).
Finally we put the protected restriction back so we leave things as they were when we found them (line 38).
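One caveat with the reading step: a single call to InputStream.read isn't guaranteed to fill the buffer, so for larger classes, or streams that deliver data in chunks, it is safer to loop until the end of the stream. The sketch below shows one way to do that, along with the name-to-path conversion used above. The class and method names are hypothetical, and it is a possible refinement rather than the code I actually run.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helpers for reading a class definition without a fixed-size buffer. */
final class ClassBytes {

  private ClassBytes() {}

  /** Converts a class into the resource path of its class file, e.g. java/lang/String.class */
  static String resourcePathFor(Class<?> c) {
    return c.getName().replace('.', '/') + ".class";
  }

  /** Reads the stream until end-of-stream, however many read calls that takes. */
  static byte[] readFully(InputStream in) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      // read returns -1 only at end-of-stream; any other value is a partial read
      while ((n = in.read(chunk)) != -1) {
        out.write(chunk, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    ClassLoader loader = ClassBytes.class.getClassLoader();
    if (loader == null) {
      System.out.println("loaded by the bootstrap loader, nothing to read");
      return;
    }
    try (InputStream in = loader.getResourceAsStream(resourcePathFor(ClassBytes.class))) {
      if (in == null) {
        System.out.println("class file not available as a resource");
      } else {
        System.out.println("read " + readFully(in).length + " bytes");
      }
    }
  }
}
```

The resulting array can be passed to defineClass with an offset of 0 and a length of the array's length, which also removes the assumption about the class being smaller than 100KB.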

Now there are a couple of things to be aware of which could trip you up if you try and do something similar:

If there is a security manager in place then you may find that you can't call the defineClass method even when the protected restriction is removed.
This code will result in the same class being defined in two classloaders (which after all was the whole point) but instances of the class cannot be shared between the classloaders. If you try to you will get an exception (can't actually remember which one).

Neither of these seems to be an issue with loading custom MG4J term processors into Mímir, so this seems to be a nice, albeit evil, way of allowing me to add functionality without having to add to the Mímir WAR file. Success!

Followers, What Followers?

Now I don't want people to think that this blog is just about code for fixing problems with Blogger but... for this post I'll be fixing a problem with Blogger!

I don't tend to read my own blog that often, which means I don't spend much time looking at it. So it wasn't until I started investigating how to disable Blogger's forced lightbox viewer that I spotted I had a problem with the followers gadget on some of my blogs. Developing my original fix to that problem involved reloading my blog over and over again as I tried different things. Sometimes the followers gadget appeared and sometimes it didn't. There didn't seem to be any pattern but there was definitely a problem.

Given that a lot of the gadgets you find on blogs are JavaScript based the first thing I checked was the JavaScript console (I use Firebug in Firefox for most of my web development work) which showed the following two errors whenever the followers gadget failed to appear: window.googleapisv0 is undefined and google.friendconnect.container is undefined.

Given that the problem was intermittent I made a guess that this was some form of race condition. Most browsers will try and load at least two files required to render a web page in parallel. This means that sometimes files download and become available to the browser in a different order. Usually this doesn't matter but my gut feeling was that this was causing the problem. A quick look at the list of JavaScript files associated with my blog showed that there were quite a few related to the followers widget and that they were all loaded at roughly the same time. I experimented by manually adding each of these scripts to the head section of the HTML template until I eventually found a solution.

Strangely it wasn't a script associated directly with the followers gadget that solved the problem, but rather a script for the +1 button. This did, however, explain why I was only seeing the problem on my blogs which included the sharing buttons below each post. So if you find that your followers gadget sometimes doesn't appear then it might be worth trying the following fix.

You need to edit the HTML version of your template (you might be able to do this via a HTML/JavaScript gadget as well but I've had less success) so in the old style Blogger dashboard go to the design page or in the new style go to the template page and then click to edit the HTML version of your template. Now directly after <head> insert the following:

<script src='https://apis.google.com/js/plusone.js' type='text/javascript'/>

This fixed the problem for me, and I've suggested the fix in a couple of threads on the Blogger forum and it seems to have worked there as well, so hopefully it should stop your followers gadget disappearing for good!

Fixing Blogger's Mistakes

UPDATE: Blogger have finally stopped forcing the lightbox viewer upon us, which means the fix detailed in this post is no longer required! I'll be watching though and if it reappears as the default option, with no ability to turn it off, then I'll update the fix so that we can continue to choose how our images are displayed.

Yesterday Blogger introduced a new 'feature' to some blogs. Now images appear in a Lightbox powered overlay. Unfortunately a lot of people think that this feature is actually a bug. On one of my other blogs, it is a real problem due to the fact that I was already using a different script to achieve a similar effect. With the new feature I now get two popup copies of each image which really is horrid. So I spent a good hour trying to find a hack or workaround, until Blogger sees fit to allow us to disable the bug.

The main part of the fix is a simple piece of Javascript.

<script type='text/javascript'>
//<![CDATA[
 var images = document.getElementsByTagName('img');
  for (var i = 0 ; i < images.length ; ++i) {
    images[i].parentNode.innerHTML = images[i].parentNode.innerHTML;
  }
//]]>
</script>

The fix works because the new Blogger code adds an onClick function to the actual image, whereas most people wrap the images in a link. What I wanted to do was simply remove the onClick function but I couldn't figure out how (and believe me I tried), but simply recreating the image removes any registered events. The problem is ensuring that this code runs after the code Blogger used to add the lightbox viewer.

The trick to getting this code in the right place (thanks to Bonjour Tristesse for this bit) involves editing the HTML version of your template. From the Design page in the old Blogger dashboard or from the Template page in the new version bring up the HTML version of your template and then place the code almost at the very end, right between </body> and </html>

If you aren't happy editing the HTML version of your template then you can also add the fix via a gadget. Simply go to the layout editor and add a new HTML/Javascript gadget (it doesn't matter where). Leave the title of the gadget blank and paste in the following code.

<script type="text/javascript">
//<![CDATA[
var lightboxIsDead = false;
function killLightbox() {
  if (lightboxIsDead) return;
  lightboxIsDead = true;
  var images = document.getElementsByTagName('img');
  for (var i = 0 ; i < images.length ; ++i) {
     images[i].parentNode.innerHTML = images[i].parentNode.innerHTML;
  }
}
 
if (document.addEventListener) {
  document.addEventListener('DOMContentLoaded', killLightbox, false);
} else {
  document.attachEvent('onDOMContentLoaded', killLightbox);
  window.attachEvent('onload', killLightbox);
}
//]]>
</script>

Save the gadget and you are done. The fix will have been applied and things should be back to how they were before Blogger introduced this bug/feature. If/when Blogger see sense and allow us to disable this feature then you can easily delete my workaround simply by deleting the gadget from your layout. Note that applying the fix by editing the HTML version of your template is slightly more reliable, but in most cases you won't see any difference between the two.

Now I'm quite happy to let each individual blog owner choose how to display their photos, and some might even like the new photo viewer. From reading the forums, however, it is clear that some people just really hate the new viewer and would prefer not to see it even on other people's blogs. Well it turns out that the above fix also works when used as a Greasemonkey script. If you already have Greasemonkey installed in your browser then you can simply install the script to kill Blogger's lightbox on all Blogspot hosted blogs. If you don't have Greasemonkey installed then the Wikipedia page should point you to a version for your favorite browser.

UPDATED 17th September: I've simplified the script slightly and added a fix so that if the mouse was already within an image when the page loaded the fix will still apply if you click the image, assuming you move the mouse at least one pixel in any direction.

UPDATED 17th September: I've edited the post to suggest that the fix is used via a HTML/Javascript gadget so that new readers don't have to wade through the comments to find this out.

UPDATED 17th September: Now we specify false in the addEventListener call to ensure better backwards compatibility with older versions of Firefox.

UPDATED 20th September: Added Bonjour Tristesse's much better fix as the main suggested workaround.

UPDATED 21st September: Added the section on using the newest fix as a Greasemonkey script to kill Lightbox on all Blogspot hosted blogs.

UPDATED 21st September: Simplified the new fix slightly to do the replace inside body instead of the main div. This means that it will work even if you have heavily modified a template to no longer have the named div assumed by the previous version.

UPDATED 21st September: The old method now registers the function so it is fired when the DOM is loaded not the page. This should mean it works even before the page has fully loaded.

UPDATED 21st September: Simplified the short fix, as the replacement isn't actually required to make it work. This cuts down on the number of bytes served and should run quicker as well!

UPDATED 21st September: Switched back to recommending the gadget based fix (albeit a simpler version) because Bonjour Tristesse's version actually breaks other widgets within the posts, such as the Google +1 button in the post sharing widget. Fortunately the new and improved gadget version is applied much quicker and so seeing the viewer is much less likely than before.

UPDATED 22nd September: Only replace the actual image, not the entire content of the parent element. This should reduce the number of situations in which there is a chance of breaking any other scripts or gadgets.

UPDATED 22nd September: Attach to both onDOMContentLoaded and onLoad when running under IE to ensure the code gets run regardless of which version of IE we are using, but make sure we don't try and run twice as that is pointless.

UPDATED 22nd September: Rewrote the post to show that the same fix can be applied both by editing the HTML template or by adding a gadget. The difference from before is that now the HTML template based fix won't break the sharing buttons etc.

UPDATED 22nd September: No longer use cloneNode as IE actually clones the event handlers so the viewer still appears.

Trawling The Heap

I've spent a good few hours over the last week trying to track down a memory leak in a web application I've been working on. As far as I could tell from the code all the relevant resources were being freed when finished with, but still after a few hours the tomcat instance in which the app was running would grind to a halt as the available free memory inched ever closer to zero. In the end I decided that the only solution was to trawl through a heap dump to find out exactly what was being leaked and what was holding a reference to it.

Now it used to be that exploring the Java heap was a tedious and horrid process. Fortunately, the JDK now comes with VisualVM, which makes working with the heap really easy.

VisualVM can attach to any running Java process and monitor its memory usage, which in itself can be useful, but it can also take a heap dump and then provides an easy tool for navigating through the often vast amounts of information provided. In theory you should be able to use VisualVM to examine the heap of the tomcat server running a troublesome web app, but try as I might I couldn't get this to work. The problem stems from the fact that I'm running tomcat under a different user account than my own, an account that you can't actually log in to (for the curious I installed tomcat under Ubuntu using the default package which runs tomcat under the tomcat6 user). I could monitor the memory usage but no matter what I tried (and believe me I tried all sorts of things) I couldn't manage to get a heap dump.

In the end I resorted to manually creating a core dump using the unix gcore utility and then loading this into VisualVM which could then generate a heap dump. This actually works quite nicely. The only downside is that it requires you to know the process ID of the tomcat web server and this changes every time the server is restarted, which, if you are debugging a problem, can be quite often. So to make my life a little easier I've written a small bash script that makes tomcat dump its heap, which I've cleverly called tomscat!

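#!/bin/bash

# Find the pid of the JVM running tomcat. pgrep prints bare pids, so there is
# no leading whitespace to trip up the parsing (assumes a single java process).
pid=$(pgrep -u tomcat6 java)

# Count any existing core dumps for this pid so each new dump gets its own name.
count=$(ls -1 tomcat*."$pid" 2>/dev/null | wc -l)

# gcore appends the pid, so the dump ends up in tomcat<count>.<pid>
gcore -o "tomcat$count" "$pid"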

This script first finds the pid for the tomcat process, then works out if there are already any core dumps for this instance of tomcat, and then generates a core dump into a nicely named file. Currently there is little in the way of error handling so if it doesn't work any errors may be cryptic! Anyway, hopefully other people might find this script useful. I know it made the process of creating a bunch of heap dumps quite easy, and once I had the heap dumps tracking down the leak was fairly easy (it turns out that the leak was due to large cache objects associated with database connections not being made available for garbage collection).

People in the News

Back in May I was involved in producing a demo for a show-and-tell session at the GATE training course. The idea was to try and demonstrate the process of defining an application, developing an annotation pipeline, annotating a large corpus, and then providing search over the documents, annotations and associated semantic information.

The idea we settled upon was to extend the basic ANNIE application, that is bundled with GATE, to annotate BBC News articles and to link the entities within them to DBpedia. This would then allow us to search the documents both for textual information, the same as any other search engine, but to also restrict the search based on information that might not be present in the documents but which is encoded in DBpedia. This worked well and allowed us to demonstrate the use of GATE Developer, the GATE Cloud Parallelizer (GCP) and GATE Mímir.

The combination of text, annotations and semantic information allows us to search the documents in interesting ways. You can play with the basic Mímir interface (referred to as GUS) over the demo index to see for yourself how useful the combination can be. Given that not many people reading this will already know the Mímir query syntax, and those that do probably won't know what annotations etc. are in the index, here are a few example queries to get you started:

People Born in Sheffield: {Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace <http://dbpedia.org/resource/Sheffield>}"}
The Location of Steel Industries: {Organization sparql = "SELECT ?inst WHERE { ?inst :industry <http://dbpedia.org/resource/Steel>}"} [0..4] in {Location}
A BBC Scotland document, written after the start of 2011, in which a Labour Party member is being quoted: ({Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29>}"} root:say) IN ({Document date > 20110000} OVER {DocumentClassification sparql = "SELECT ?inst WHERE { ?inst a bbc:Classification . FILTER (?inst = bbc:Scotland)}"})

As you can see from these examples, as the queries get more complex they quickly become unwieldy. The problem is that Mímir provides a very rich query syntax and the basic GUS interface does nothing to hide the syntax from the user. Whenever we demo Mímir people love it but we always have to stress that GUS is not an end user search tool -- it is a development tool to enable you to check the contents of an index and to develop complex queries. In other words...

GUS is not the interface you are looking for!

Now I really like the demo we put together but trying to teach people the Mímir query syntax is difficult, especially if they don't already know any SPARQL. Also it is difficult to explain to potential partners/customers how they could take a Mímir index and produce their own custom interfaces. Whilst these thoughts have been festering in the back of my mind for a while I've only just found the time to go back to the demo and to build a custom interface (partly because next week I'm going to be teaching some people how to build custom Mímir interfaces, so I thought it best to have built at least one).

Given how rich the query syntax is, it is unlikely that a custom interface will be able to expose all the information within the index. Instead a number of interfaces may be developed, for the same index, in order to provide different types of search. Given this I decided to focus on searching for people within the BBC News articles. I used GUS to explore the index (which is what GUS is really for) and built up a number of complex person related queries. I then set about breaking these queries down into sections that could be easily represented in a form based fashion.

Once the form was complete it was trivial to reconstruct the complex queries from the form elements. All that was left to do was to interface with the Mímir index. Fortunately, as well as GUS, Mímir comes with an XML based RESTful interface. So the demo now builds complex queries from the form elements, submits the query to Mímir via its RESTful interface, and then displays the results, all without the user having to know anything about Mímir's query syntax.
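To give a rough idea of what "reconstructing queries from form elements" can look like, here is a minimal Java sketch. It is not the code behind the demo: the class and method names are made up for illustration, the query shapes are simply copied from the example queries earlier in this post, and the simplified date restriction (without the OVER clause) is an assumption about what Mímir will accept. It only builds the query string and makes no attempt to talk to the RESTful interface.

```java
class PersonQueryBuilder {
    // Prefix used by the example queries earlier in the post.
    private static final String DBPEDIA = "http://dbpedia.org/resource/";

    /** Builds a Person clause restricted to people born in the given DBpedia place. */
    static String bornIn(String place) {
        return "{Person sparql = \"SELECT ?inst WHERE { ?inst :birthPlace <"
            + DBPEDIA + place + ">}\"}";
    }

    /**
     * Wraps a clause so it only matches inside documents dated within the
     * range, using the yyyyMMdd integers seen in the example queries.
     * (Assumed simplification: the original examples also use OVER.)
     */
    static String inDocumentsBetween(String clause, int from, int to) {
        return "(" + clause + ") IN {Document date >= " + from
            + " date <= " + to + "}";
    }

    public static void main(String[] args) {
        // Hypothetical form values: birthplace "Sheffield", April 2011.
        System.out.println(
            inDocumentsBetween(bornIn("Sheffield"), 20110401, 20110430));
    }
}
```

Each form field maps to one small method, so adding a new field to the form means adding one more clause builder rather than touching the whole query.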

The completed demo is unimaginatively called People in the News and you should feel free to play around with it. Some example queries include: criminals called Jonathan, Russian astronauts, and (my favourite complex example) politicians born in Sheffield mentioned in BBC Scotland documents from April 2011. The nice thing about the new interface is how easy it is to fill in the form to run these queries. That last example would otherwise entail you entering the following into GUS:

(({Person sparql="SELECT DISTINCT ?inst WHERE { ?inst :birthPlace <http://dbpedia.org/resource/Sheffield> . { ?inst a :Politician } UNION { ?inst a :OfficeHolder . ?inst a <http://xmlns.com/foaf/0.1/Person> } }"}) IN {Content}) IN ({Document date >= 20110401 date <= 20110430} OVER {DocumentClassification sparql = "SELECT ?inst WHERE { ?inst a bbc:Classification . FILTER (?inst = bbc:Scotland)}"})

It is still something of a work-in-progress so if you have any ideas for improvements, or you find any bugs/oddities please do let me know.

Cranium

Over the last few weeks I've been trying to hunt down a memory leak in a servlet based web application. Periodically the Java virtual machine in which Tomcat was running would inexplicably run out of PermGen space and become so unresponsive that the only solution was to kill and restart the server process. After a lot of hunting through logs and trawling the Internet for pointers, I've found that the problem actually occurs when a web application is redeployed, although the out of memory error may occur later (which is why it was difficult to spot in the logs).

It turns out that when an application is redeployed the old classloader should be garbage collected, which should free up both heap and PermGen memory by removing all the information related to the discarded web application. Unfortunately, if something outside your web application holds a reference to even one class that was loaded via the application's classloader, then the classloader itself, and hence all the class information it has loaded, will not become eligible for garbage collection. Eventually this results in exhaustion of the PermGen memory pool. If that isn't initially clear, never fear, as Frank Kieviet wrote a brilliant article (with diagrams) which explains the problem in more detail.

Looking back through the Tomcat logs it seems as if something within one of the libraries I was using is leaking a Timer instance which stops the classloader being garbage collected. I haven't actually managed to fix the problem yet but I did learn quite a few things along the way which I've collected together and turned into....

Cranium is a web application (distributed as a WAR file) that provides information on the memory usage of the servlet container in which it is being hosted. This includes information on all the memory pools (both heap and non-heap) as well as class loading and garbage collection. It also incorporates two different ways of triggering garbage collection to help monitor for memory leaks etc. Rather than trying to explain in detail what Cranium allows you to monitor I'm hosting it as a demo for you to look at (although I've disabled the garbage collection tools so that they cannot be used to make the server unstable).
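For anyone curious where this kind of information comes from, the standard java.lang.management API exposes the memory pools of the running JVM. The following is only a minimal sketch of that API and not Cranium's actual code; the class and method names are invented for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class MemoryPoolReport {
    /** Returns one human readable line per memory pool in this JVM. */
    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) continue; // the pool is no longer valid
            // getType() says whether this is a heap or non-heap pool;
            // getMax() is -1 when the pool has no defined maximum.
            lines.add(pool.getName() + " (" + pool.getType() + "): "
                + usage.getUsed() + " bytes used, max " + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describePools()) {
            System.out.println(line);
        }
    }
}
```

The pool names depend on the JVM and the garbage collector in use, so the output will differ from one server to another.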

As with most of my software Cranium is open-source and you can grab the code from my SVN repository or you can simply grab a pre-built WAR file. If you want to track development of Cranium then you can monitor it via my Jenkins server which also produces a bleeding edge WAR file on each build.

I know a lot of the information Cranium displays is available through other tools but I'm already finding it really useful and I hope that at least one other person does too!

Gordon's Alive!

A few years ago I wrote a small servlet to allow QuickTime movies to be converted into Flash video on the fly, specifically to support playback on the Wii -- I gave it the rather unimaginative name Quick As A Flash. I've had little need to update the code until recently when I upgraded my main PC from Windows to Ubuntu.

I have a web app that I wrote and use to index/search all the DVDs I own. It interfaces with Amazon to get artwork and reviews and allows for linking trailers to each film. I had been using Apple's QuickTime for Java to get the dimensions and duration of the QuickTime trailers I was adding to the index. Unfortunately this has a) never been available under Linux and b) has been deprecated by Apple. So I decided to revisit Quick As A Flash and add support for extracting this information to the servlet.

Quick As A Flash uses FFmpeg to do the transcoding to Flash and it is trivial to read the dimensions and duration of the movie from the FFmpeg output. In a simple case of I-could-so-I-did I've also added support for generating a thumbnail image from the QuickTime movie.
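As an illustration of how little is needed to pull those values out, here is a small Java sketch using regular expressions. It is not the code from Quick As A Flash: the class name is hypothetical, and the sample lines are only an assumption about the general shape of the text FFmpeg prints about its input, which varies between versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FfmpegInfo {
    // Matches e.g. "Duration: 00:02:30.04, start: 0.000000, bitrate: ..."
    private static final Pattern DURATION =
        Pattern.compile("Duration: (\\d+):(\\d{2}):(\\d{2}(?:\\.\\d+)?)");

    // Matches the WIDTHxHEIGHT token on a "Video:" stream line.
    private static final Pattern SIZE =
        Pattern.compile("Video:.*?\\b(\\d{2,5})x(\\d{2,5})\\b");

    /** Returns the duration in seconds, or -1 if no Duration line was found. */
    static double durationSeconds(String ffmpegOutput) {
        Matcher m = DURATION.matcher(ffmpegOutput);
        if (!m.find()) return -1;
        return Integer.parseInt(m.group(1)) * 3600
            + Integer.parseInt(m.group(2)) * 60
            + Double.parseDouble(m.group(3));
    }

    /** Returns {width, height}, or null if no video stream line was found. */
    static int[] dimensions(String ffmpegOutput) {
        Matcher m = SIZE.matcher(ffmpegOutput);
        if (!m.find()) return null;
        return new int[] {
            Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        // Made-up sample in the style of FFmpeg's input description.
        String sample =
            "  Duration: 00:02:30.04, start: 0.000000, bitrate: 1234 kb/s\n"
            + "    Stream #0.0(eng): Video: h264, yuv420p, 640x360, 25 tbr\n";
        int[] size = dimensions(sample);
        System.out.println(
            durationSeconds(sample) + "s, " + size[0] + "x" + size[1]);
    }
}
```

In a real servlet the text would come from the error stream of the FFmpeg process rather than from a hard-coded string.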

I've no idea if anyone else is using this code or would ever find it useful but if you are interested then you can grab a binary copy or track progress on the Jenkins build page.

About The Size Of It

After quite a lot of work I've now managed to bring some semblance of order (and documentation) to the last of the GATE plugins that I've been trying to clean up for general release. So as of the most recent nightly build of GATE there is now a Measurements Tagger which you can load from the Tagger_Measurements plugin. I'm not going to attempt to give a full description of the PR here, so if you want the full details have a look at the user guide where there are three whole pages you can read.

In essence the PR annotates measurements appearing in text and normalizes the extracted information to allow for the easy comparison of measurements defined using different units. Now while that description is accurate it probably doesn't make much sense so here are a few examples.

Imagine that you wanted to find all distance measurements less than 3 metres appearing in a document. The Measurements Tagger makes this really simple. You could annotate your documents and then look at the unit and value features of all the Measurement annotations to find those where the unit is "metre" and the value is less than 3, but this would miss lots of valid measurements. For example, 3cm is less than three metres but uses a prefix to make writing the measurement easier. Or how about 4.5 inches? This is clearly less than 3 metres but is specified in an entirely different system of units. Fortunately as well as annotating measurements with the unit and value specified in the document, this new PR also normalizes (where possible) the measurement to its base form.

The base form of a unit usually consists solely of SI units. This means, for example, that all lengths are normalized to metres, times to seconds, and speeds to metres per second (which is classed as a derived unit but is made up only of SI units).

In our example this means that 3cm is normalized to 0.03m and 4.5 inches to 0.1143m which allows them to both be recognized as being less than 3 metres. Under the hood the PR uses a modified version of the Java port of the GNU Units package to recognize and normalize the measurements. This approach makes it easy to add new units or to customize the parser for a specific domain, providing a very flexible solution.
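The arithmetic behind that example is easy to sketch. The Java below is purely illustrative and has nothing to do with how the PR or the GNU Units port is implemented: the class name and the tiny hand-written table of conversion factors are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

class LengthNormalizer {
    // Hypothetical lookup table: how many metres one of each unit is worth.
    private static final Map<String, Double> METRES_PER_UNIT = new HashMap<>();
    static {
        METRES_PER_UNIT.put("m", 1.0);
        METRES_PER_UNIT.put("cm", 0.01);
        METRES_PER_UNIT.put("mm", 0.001);
        METRES_PER_UNIT.put("km", 1000.0);
        METRES_PER_UNIT.put("inch", 0.0254);
        METRES_PER_UNIT.put("mile", 1609.344);
    }

    /** Converts a length in the given unit to metres, the SI base unit. */
    static double toMetres(double value, String unit) {
        Double factor = METRES_PER_UNIT.get(unit);
        if (factor == null) {
            throw new IllegalArgumentException("unknown unit: " + unit);
        }
        return value * factor;
    }

    public static void main(String[] args) {
        // Once normalized, lengths in different units compare directly.
        System.out.println(toMetres(3, "cm") < 3.0);
        System.out.println(toMetres(4.5, "inch") < 3.0);
    }
}
```

Once every length is expressed in metres, the "less than 3 metres" search becomes a single numeric comparison regardless of how the measurement was written.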

The PR doesn't actually contain code for recognizing the value of a measurement, rather it relies on the annotations produced by the Numbers Tagger I cleaned up and released back in February. This means that this new PR can also recognize numbers written in many different ways allowing for measurements such as "forty-five miles per hour", "three thousand nanometres" and "2 1/2 pints".

Both the Numbers and Measurement taggers were originally developed for annotating a large corpus of patent documents. Once annotated the corpus could then be searched via another GATE technology called Mímir. Mímir is a multiparadigm IR system which allows searching over text, annotations, and knowledge base data. There are a couple of demo indexes (including a subset of the patent corpus) that you can try, and this video does a good job of explaining how the measurement annotations can be really useful.

If you find the whole topic of measurements interesting then I'd recommend reading "About The Size Of It" by Warwick Cairns. It's only a short book but it explains why we use the measurements we do and how they have evolved over time. I found it interesting, but then I quite like reading non-fiction.

Hopefully the new measurement PR will turn out to be really useful for a lot of people/projects. If you benefit from using GATE in general, or these new PRs in particular, then why not consider making a donation to help support future development.

Hudson Becomes Jenkins

I've upgraded the Hudson instance I use to compile most of my software to the newest version which, after a dispute with Oracle, is now called Jenkins. As well as upgrading the software I've changed the URL to match. I'm using J2EP in order to rewrite the old URLs to their new forms so hopefully all existing links will continue to work as before, but if you spot anything that doesn't seem to work properly please leave a comment so I can fix things.

When Was Yesterday?

Today sees the release of another of the GATE plugins I've been working on cleaning up over the last few months. Unlike the other plugins I've talked about recently this one has a much longer history as I wrote the core code back when I was a PhD student.

Many information extraction (IE) tasks benefit from or require the extraction of accurate date information. While ANNIE (the IE system that comes with GATE) does produce Date annotations no attempt is made to normalize these dates, i.e. to firmly fix all dates, even partial or relative ones, to a timeline using a common date representation. My PhD focused on open-domain question answering, an IE task in which dates can play an important role; any "when" question, or questions starting "who is..." benefit from accurate date information. The problem was that I couldn't find a good Java library for parsing dates into a common format, so of course I set about writing one.

The library I wrote is unimaginatively called Date Parser and has been freely available since around 2005. You can currently find the parser being built by my Hudson server. Without going into too many technical details (the Javadoc is available for those who like that kind of thing) the parser takes a string and attempts to parse it as a date starting from a given offset. Unlike the built in DateFormat class which is limited to parsing one date format at a time my parser attempts to handle as many date formats as possible. Of course there are only so many ways you can re-arrange three pieces of information, but the parser also handles relative dates and dates which are not fully specified. For example, "April 2011" would be parsed into a Date object representing the 1st of April 2011. Possibly more interesting though is the fact that words/phrases such as yesterday, today, next Wednesday, and 3 days ago are all also parsed and recognized. In these instances the actual date being mentioned is calculated based upon a context date supplied to the parser. So if the word yesterday appears in the context of the 3rd of March 2011 the string will be recognized as referring to the 2nd of March 2011.
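To make the idea of a context date concrete, here is a small sketch of resolving a handful of relative expressions. It deliberately does not use the Date Parser API: it relies on the java.time classes from much newer versions of Java, the class and method names are invented, and it only covers the few phrases mentioned above.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelativeDates {
    private static final Pattern DAYS_AGO =
        Pattern.compile("(\\d+) days? ago");

    /**
     * Resolves a relative date expression against a context date.
     * Returns null when the text is not one of the supported phrases.
     */
    static LocalDate resolve(String text, LocalDate context) {
        String t = text.trim().toLowerCase(Locale.ROOT);
        if (t.equals("today")) return context;
        if (t.equals("yesterday")) return context.minusDays(1);
        if (t.equals("next wednesday")) {
            // First Wednesday strictly after the context date.
            return context.with(TemporalAdjusters.next(DayOfWeek.WEDNESDAY));
        }
        Matcher m = DAYS_AGO.matcher(t);
        if (m.matches()) {
            return context.minusDays(Long.parseLong(m.group(1)));
        }
        return null;
    }

    public static void main(String[] args) {
        LocalDate context = LocalDate.of(2011, 3, 3);
        System.out.println(resolve("yesterday", context));
        System.out.println(resolve("3 days ago", context));
    }
}
```

The same text produces a different answer for every context date, which is why the context has to be supplied by the caller rather than guessed by the parser.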

The parser worked really well during my PhD work and has seen numerous improvements since then as well. It started to be used in GATE projects a year or so ago and was initially used in conjunction with ANNIE. ANNIE adds Date annotations to documents and I wrote a JAPE grammar that would find these annotations and then run the parser over the underlying text adding the normalized date value (if found) as a new feature. The code eventually moved to being a PR (rather than JAPE) for performance reasons and to support some new features. The problem, however, was that the dates the parser could handle and the dates that ANNIE finds don't always align. This meant that adding a new date format required changes to both ANNIE and the Date Parser. So when I started to clean up the code for release I made the decision to re-write the PR as a standalone component that no longer relies on ANNIE.

Surprisingly it was very easy to convert the existing code to remove the reliance on ANNIE and I think the performance (both time and accuracy) has been improved as a result. This isn't to say that ANNIE is bad at finding dates, just that it does some things differently and it also annotates times with Date annotations which for this task can confuse the issue.

Full documentation is available in the user guide and the PR is already available in the nightly builds of GATE (you need to load the Tagger_DateNormalizer plugin) so feel free to have a play and let me know what you think.

More Ice In Your Tea?

I really shouldn't blog when I'm angry or annoyed as I tend to rant a little more than I intend! In retrospect I was a little harsh in my last post -- anyone who freely gives their time to developing free software shouldn't have to put up with me disparaging their work.

So as penance I've now tracked down the source of the weird class loading bug I highlighted and have submitted a detailed bug report, including a proposed fix, to the IcedTea netx project (netx is the name of the open-source Web Start replacement). You can monitor the progress of the bug through their public bug tracker. If I had the right permissions it's such a simple fix that I'd be happy to do it myself, but you have to earn the respect of project maintainers before getting the right to commit code.

Update, 23rd February: it's now been fixed in the main code tree although it will take a while before it makes it into an Ubuntu update.

Why You Shouldn't Drink The IcedTea

I'm all for supporting open-source software but there are limits. I've recently switched to using Ubuntu on my main machine at home and have run into two bugs in the same piece of open-source software.

If you are a regular reader of this blog then you are probably aware that I do most of my software development using Java. A default install of Ubuntu (10.10) includes the OpenJDK based IcedTea version of Java 6. This is a version of Java that is covered by an open-source license -- which is in comparison to the Sun/Oracle version of Java for which you can read the source but which was not covered by an open-source licence (it's now "mostly" covered by GPL v2 with the classpath exception). I've never really understood the philosophical argument behind IcedTea and the need for a clean room implementation of Java, although Oracle's recent attack on Android provides some explanation. Anyway, given that it was the default installation of Java I was willing to give it a try. Within minutes though I'd found two show stopping bugs and so have switched back to using the reliable Sun/Oracle release of Java 6.

The first bug is visual and one that I knew existed in earlier versions of IcedTea but which I hoped had been fixed by now. In essence the ImageIO JPEG reader in IcedTea doesn't properly handle JPEG images with embedded colour profiles. What you end up with is an image that looks like a photographic negative rather than the image you tried to load. This bug basically means that you can't use IcedTea for any application that allows users to load arbitrary JPEG files. For me this means I can't recommend it for running Convert4Frame, TagME, PhotoGrid or 3DAssembler. Also I can't use IcedTea to run the tomcat server in which I host my cookbook web app. What is really annoying about this bug is that it was originally in the main Sun/Oracle distribution, reported all the way back in 2003, but was fixed in Java 5 update 4; you can read all about it in the bug report. If the open-source version can't fix a bug that is around eight years old then why do they even bother!

The second bug is a little stranger but no less annoying. The documentation for the method ClassLoader.loadClass(String name) states that either it returns the resulting Class object or throws a ClassNotFoundException if (wait for it) the class was not found. That all seems perfectly logical to me. Unfortunately there appears to be at least one situation in which IcedTea returns null instead of throwing an exception when the class cannot be found.

I distribute a lot of the open-source Java software that I develop in my spare time via Web Start and once I had Ubuntu up and running I thought I'd check Java by launching 3DAssembler. Unfortunately it failed to load and gave me a rather strange NullPointerException. After a bit of digging around (the version of the app on my website doesn't match my development version and hence the line numbers were out) I eventually tracked the problem back to this try/catch block.

try {
  Class rmClass = Assemble3D.class.getClassLoader().loadClass("org.jdesktop.swinghelper.debug.CheckThreadViolationRepaintManager");
  RepaintManager.setCurrentManager((RepaintManager)rmClass.getConstructor().newInstance());
  System.err.println("EDT Debug Mode Is Active");
}
catch (ClassNotFoundException e) {
  // the debug classes from SwingHelper are not available
}

This code tries to load a class, via reflection, that catches EDT violations (painting Swing components from the wrong thread) and that I only use during development to aid in debugging. I load the class via reflection so that when I distribute the application I can simply leave out the JAR file containing the debug class and everything will continue to work -- the class isn't found so an exception is thrown, caught and ignored and the application continues on. The problem with IcedTea is that when running as a Web Start application the call to loadClass in line 2 returns null instead of throwing a ClassNotFoundException. This means that the catch block isn't triggered and the resulting NullPointerException is thrown all the way out of the main method, killing the application. It seems to only be a Web Start issue as running my development copy locally under IcedTea doesn't cause loadClass to return null. Of course this problem I can fix by changing the catch block to trap all exceptions, but the point is I shouldn't have to!
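For completeness, here is one way such a workaround could look. This is a sketch of an alternative to trapping all exceptions rather than the change actually made to 3DAssembler, and the helper class and method names are hypothetical: it treats a null return from loadClass exactly the same as a ClassNotFoundException.

```java
class OptionalClassLoading {
    /**
     * Returns the named class, or null if it is unavailable. A loader that
     * wrongly returns null instead of throwing is handled the same way,
     * because the null is simply passed back to the caller.
     */
    static Class<?> loadOptional(ClassLoader loader, String name) {
        try {
            return loader.loadClass(name);
        } catch (ClassNotFoundException e) {
            return null; // the optional class is not on the classpath
        }
    }

    public static void main(String[] args) {
        Class<?> debugClass = loadOptional(
            OptionalClassLoading.class.getClassLoader(),
            "org.jdesktop.swinghelper.debug.CheckThreadViolationRepaintManager");
        if (debugClass != null) {
            System.err.println("debug classes found");
        } else {
            System.err.println("debug classes not available");
        }
    }
}
```

The caller then only has to check for null before instantiating the class, so the behaviour is the same whichever way the classloader reports a missing class.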

As I said at the beginning of this post I'm all for open-source software, but I believe there are cases where developers who give their time freely to projects should think more about the merits of the project and if it is really needed. The "official" Oracle release of Java is now, for all intents and purposes, under an open-source license for the development of desktop applications (mobile and embedded uses are a different kettle of fish). Given this, is there really any need for a clean room implementation, especially when that implementation is so buggy as to render it useless in many situations?

What's Actually Worth Reading?

Another day, another GATE processing resource -- as you can tell I've been busy tidying up the PRs that I've developed recently. One of the reasons for this spurt of cleaning and documenting code is that a project I'm currently working on is ending soon and the information extraction pipeline we have developed needs to be fully documented. Being able to just point to multiple sections of the GATE user guide for more details on each PR in the application makes the documentation much easier to write. Of course that means that the PRs have to actually have documentation in the user guide!

I won't go into details about the project I'm currently working on with The National Archives (if you want the details then there was a press release and the head of the GATE group, i.e. my boss, has blogged about it) suffice it to say that it involves processing millions of web pages drawn from hundreds of different web sites.

We can extract an awful lot of information from the web pages we are processing, so much so in fact that it can be difficult to search through the information. We have multiple tools to help with searching but one thing we quickly realised is that it would be nice to ignore information extracted from boilerplate content. Most web pages contain text that isn't really part of the content; headers, menus, navigation links etc. These sections can contain entities that we might extract but it is highly unlikely that they will be relevant to the main content of the page. For this reason it would be nice to be able to exclude these in some way when searching through the extracted information.

The approach we chose was to keep everything extracted using the IE pipeline but to also determine the sections of the document that were actually content. This allows us to search for entities within content. It also means that if our ability to determine what is useful content and what isn't is flawed in any way we have still extracted the entities appearing in other parts of the document.

Rather than implementing a content detection system from scratch I decided to base the PR on an existing Java library called boilerpipe. The boilerpipe library contains a number of different algorithms for detecting content most of which are available through the new GATE PR. There are some features that are not available due to it currently not being possible to map them directly to a GATE document.

To give you a better idea of what the new PR does here is a screen shot of a web page loaded into both a browser and GATE. In the GATE window you can see the pink sections that have been marked as content (click on the image for a larger easier to read version).

Whilst this kind of approach is never going to be perfect it seems, from initial testing, that it does indeed help to filter out erroneous results when searching through information extracted from large web based corpora.

If you want to try it out yourself then it's already in the main GATE svn repository and the nightly builds. Details of how to configure the PR can be found in the relevant section of the GATE user guide.

Numbers Have Real Value

So here is a question for you...

What do the following numbers all have in common? 3^2, 2³, 101, 3.3e3, 1/4, 9^1/2, 4x10^3, 5.5*4^5, thirty one, three hundred, four thousand one hundred and two, 3 million, and fünfundzwanzig.

The answer is that they can all be recognized, annotated and converted to a real number representation (a Java Double) by a new GATE PR that has just been released and that I've just finished documenting for the user guide. You may never have really thought about this before but it turns out that there are so many ways of writing numbers in text that recognising them is actually really quite difficult. If you also want to know the value of the number you have recognised then this adds an extra layer of complexity especially when the number is written out in words rather than digits.
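To show why numbers written in words add that extra layer of complexity, here is a toy Java sketch that works out the value of simple English number phrases. It is nothing like as capable as the PR and shares no code with it; the class name and the accumulate-then-scale approach are assumptions chosen to keep the example short.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NumberWords {
    private static final Map<String, Integer> SMALL = new HashMap<>();
    private static final Map<String, Long> SCALES = new HashMap<>();
    static {
        String[] ones = {"zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen",
            "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
            "nineteen"};
        String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty",
            "seventy", "eighty", "ninety"};
        for (int i = 0; i < ones.length; i++) SMALL.put(ones[i], i);
        for (int i = 0; i < tens.length; i++) SMALL.put(tens[i], (i + 2) * 10);
        SCALES.put("thousand", 1_000L);
        SCALES.put("million", 1_000_000L);
    }

    /** Returns the value of a phrase such as "four thousand one hundred and two". */
    static double parse(String text) {
        long total = 0;   // completed thousand/million groups
        long current = 0; // the group currently being built (0 to 999)
        String cleaned = text.toLowerCase(Locale.ROOT).replace('-', ' ').trim();
        for (String word : cleaned.split("\\s+")) {
            if (word.equals("and")) continue;
            if (SMALL.containsKey(word)) {
                current += SMALL.get(word);
            } else if (word.equals("hundred")) {
                current = Math.max(current, 1) * 100;
            } else if (SCALES.containsKey(word)) {
                total += Math.max(current, 1) * SCALES.get(word);
                current = 0;
            } else {
                throw new IllegalArgumentException("not a number word: " + word);
            }
        }
        return total + current;
    }

    public static void main(String[] args) {
        System.out.println(parse("four thousand one hundred and two"));
    }
}
```

Even this toy version needs two running totals, and it still knows nothing about digits, fractions, exponents, or any language other than English.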

The PR actually started life back in 2009 for recognising numbers in patent documents as a precursor to recognising and normalizing measurements but since then has seen lots of development to extend the range of numbers that can be recognised. This new version is being used on a number of projects both to recognise numbers simply for the sake of finding numbers but also to help find drug doses, government spending and lots of generic measurements.

Requests for code to recognise numbers and determine their value have cropped up a number of times on the GATE mailing list. Whilst we had been using this code internally for a while, we knew that there were issues with it and it had never been tidied up or documented to the extent where we would be happy to show it to other people! Having discovered yet-another-bug in the code a fortnight ago I decided to take the time to rewrite large chunks of the code in order to fix most of the outstanding issues and to increase the range of numbers we could recognise. Hopefully this has led to a more useful PR. If you'd like to try it out then you can find this PR in the Tagger_Numbers plugin within the main GATE svn repository and it's in the nightly builds as well.

The plugin actually contains two PRs; Numbers Tagger and Roman Numerals Tagger. As you can guess by the name this second PR annotates Roman numerals appearing in documents. As with the main PR this also calculates the numeric value of the Roman numerals. I'm guessing that this PR is probably less useful than the main Numbers Tagger but we have found it to be helpful in the past when trying to recognise document sections, tables, figures etc. which can often be labelled with Roman numerals instead of Arabic numbers, e.g. Section VI, Table IV, Figure IIIa. If you are interested in the Roman Numerals Tagger then you can find more details in the user guide.
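Calculating the value of a Roman numeral is the easier half of the problem (the harder half is deciding whether a lone "I" or "V" in running text is a numeral at all). As a rough sketch, assuming a well-formed numeral with no trailing letter, the conversion could look like the following. The class `RomanNumerals` is made up for this example and is not the code used by the Roman Numerals Tagger.

```java
// Illustrative only: a hypothetical helper, not part of GATE.
class RomanNumerals {

    /** Converts a well-formed Roman numeral such as "VI" or "XIV" to its integer value. */
    static int toInt(String numeral) {
        int total = 0;
        int previous = 0;
        // Walk right to left: a symbol smaller than the one to its right is subtracted.
        for (int i = numeral.length() - 1; i >= 0; i--) {
            int value = valueOf(Character.toUpperCase(numeral.charAt(i)));
            if (value < previous) {
                total -= value;
            } else {
                total += value;
            }
            previous = value;
        }
        return total;
    }

    private static int valueOf(char symbol) {
        switch (symbol) {
            case 'I': return 1;
            case 'V': return 5;
            case 'X': return 10;
            case 'L': return 50;
            case 'C': return 100;
            case 'D': return 500;
            case 'M': return 1000;
            default:
                throw new IllegalArgumentException("Not a Roman numeral symbol: " + symbol);
        }
    }
}
```

Note that this sketch does no validation, so it will happily assign a value to malformed input such as "IIII", and a label like "IIIa" would need its suffix stripped before conversion.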

Schema Enforcer

In my previous post I introduced you to GATE, the software I use and help to develop at work. Over the last ten years I've developed a number of processing resources (PRs are like plugins) for GATE. Some of these plugins have made it into the main GATE distribution (the Chemistry Tagger and the Noun Phrase Chunker being the most successful) whilst I've allowed others to slowly die. I still have quite a few that I've developed either for my own pet projects or for work that should really be made available for everyone to use. The problem tends to be that they need cleaning up and documenting before they are released. I've now made a start on cleaning up the PRs that I think are useful and in this post I'll introduce you to the first of these that I've managed to commit to the main GATE SVN repository; the Schema Enforcer.

The idea for the Schema Enforcer started to germinate in my head during a long afternoon trying to teach people how to manually annotate documents using GATE Teamware. In essence we want people who are familiar with a set of documents to mark up the entities within the documents that they believe are interesting/relevant to a given task. We then treat these manually annotated documents as a gold standard for evaluating automatic systems that create the same annotations.

It turns out that if you can pre-annotate the documents with an automatic system and have the annotators correct and add to existing annotations, they not only find the task easier to understand but also tend to annotate a document more quickly, which usually saves us money.

When processing a document in GATE you tend to find that applications create a lot of annotations that are not actually required. For example, GATE creates a SpaceToken annotation over each blank space. These can be really useful when creating other more complex annotations but no human is ever going to need to look at them. So when pre-annotating documents for Teamware what I (and most other people) do is simply create a new annotation set into which we copy any annotation types which we are asking the annotators to create or correct (we usually do this using the Annotation Set Transfer PR rather than by hand). The problem with simply copying annotations from one set to another is that this does nothing to check that the annotation features conform to any set of guidelines. Whilst odd features are less of an issue than intermediate or temporary annotations they can still be quite distracting.

In Teamware, when starting an annotation process, you specify the annotations that can be created using XML based annotation schemas. These define the type of the annotation, its features, and for some features the set of permitted values. For example here is a schema for defining a Location annotation.

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <element name="Location">
    <complexType>
      <attribute name="locType" use="required" value="other">
        <simpleType>
          <restriction base="string">
            <enumeration value="region"/>
            <enumeration value="airport"/>
            <enumeration value="city"/>
            <enumeration value="country"/>
            <enumeration value="county"/>
            <enumeration value="other"/>
          </restriction>
        </simpleType>  
      </attribute>

      <attribute name="requires-attention" use="optional" type="boolean"/>         
      <attribute name="comment"  use="optional" type="string"/>
    </complexType>
  </element>
</schema>

You should be able to see from this that a Location annotation can have three features (referred to as attributes in the schema): locType, requires-attention, and comment. The last two features are fairly self explanatory but the locType feature requires a little explanation. Basically locType is an enumerated feature, that is, it can only take on one of the six values specified in the schema. What this means is that an annotator cannot decide to create a Location annotation with a locType set to, for instance, beach as that is not one of the defined values. In this case they would probably set locType to other and use the comment feature to say that it is actually a beach. Also note that locType is a required feature which means you can't choose not to set its value.

The idea I had should now be obvious: why not use the schemas to drive the copying of annotations from one annotation set to another? After a little bit of experimenting this idea became the Schema Enforcer PR. Details of exactly how to use the PR can be found in the main GATE manual but in essence the Schema Enforcer will copy an annotation if and only if...

the type of the annotation matches one of the supplied schemas, and
all required features are present and valid (i.e. meet the requirements for being copied to the 'clean' annotation)

Each feature of an annotation is copied to the new annotation if and only if....

the feature name matches a feature in the schema describing the annotation,
the value of the feature is of the same type as specified in the schema, and
if the feature is defined, in the schema, as an enumerated type then the value must match one of the permitted values
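The rules above can be sketched in a few lines of plain Java. To be clear, this is not the Schema Enforcer source and it does not use the GATE API: the `FeatureSpec` class, the `enforce` method and the use of ordinary maps in place of GATE feature maps are assumptions made to keep the example self-contained. The caller is assumed to have already looked up the schema matching the annotation's type.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Illustrative only: hypothetical types standing in for GATE schemas and feature maps.
class SchemaEnforcerSketch {

    /** One feature as described by a schema: its type, whether it is required, permitted values. */
    static final class FeatureSpec {
        final Class<?> type;
        final boolean required;
        final Set<String> permitted; // empty when the feature is not enumerated

        FeatureSpec(Class<?> type, boolean required, String... permitted) {
            this.type = type;
            this.required = required;
            this.permitted = new HashSet<>(Arrays.asList(permitted));
        }

        /** A value is valid if it has the right type and, for enumerations, is a permitted value. */
        boolean accepts(Object value) {
            if (!type.isInstance(value)) {
                return false; // also rejects a missing (null) value
            }
            return permitted.isEmpty() || permitted.contains(value);
        }
    }

    /**
     * Returns the cleaned features to copy, or an empty Optional if the annotation
     * must not be copied because a required feature is missing or invalid.
     */
    static Optional<Map<String, Object>> enforce(Map<String, FeatureSpec> schema,
                                                 Map<String, Object> features) {
        Map<String, Object> clean = new HashMap<>();
        for (Map.Entry<String, FeatureSpec> entry : schema.entrySet()) {
            FeatureSpec spec = entry.getValue();
            Object value = features.get(entry.getKey());
            if (spec.accepts(value)) {
                clean.put(entry.getKey(), value);
            } else if (spec.required) {
                return Optional.empty();
            }
            // Invalid optional features, and features not in the schema, are simply dropped.
        }
        return Optional.of(clean);
    }
}
```

With a schema mirroring the Location example, a locType of city would be copied along with a string comment, any feature not named in the schema would be left behind, and an annotation whose locType was beach would not be copied at all.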

I've now made use of this PR in two different projects and it really does make life easier. Not only can I be sure that the annotations people get to correct in Teamware actually match the annotation guidelines, but it also provides a really easy way of producing a 'clean' annotation set as the output of a GATE application. But don't just take my word for it!

nice one, mark - very useful! i've had these problems before too, but used jape grammars instead - your approach is much nicer!

I think it would be nice if whoever gets to teach Teamware at FIG doesn't get snagged by the non-standard annotations that came up on Tuesday. ;-)

So if you already develop GATE applications and think that you'd like to add the Schema Enforcer to your pipeline you can find it in the main GATE SVN repository or just grab a recent nightly build.

GATE: General Architecture for Text Engineering

So far I've only talked about code that I've developed or played around with in my own time. In preparation for future blog posts I thought I'd spend a little time talking about the code I'm paid to work on.

As some of you may already know I work in the Department of Computer Science at the University of Sheffield. I work in the Natural Language Processing Group (NLP) where my interests have focused on information extraction -- getting useful information about entities and events from unstructured text such as newspaper articles or blog posts. The main piece of software that makes this work possible is GATE.

GATE is a General Architecture for Text Engineering. This means that it provides both the basic components required for building applications that work with natural language as well as a framework in which these components can be easily linked together and reused. The fact that I never have to worry about basic processing such as tokenization (splitting text into individual words and punctuation), sentence splitting, and part-of-speech tagging means that I'm free to concentrate on extracting information from the text. I've used GATE since 2001 when I started work on my PhD. For the last two years I've been employed as part of the core GATE team. Technically I'm not paid to develop GATE (I don't think any of us actually are) but the projects we work on all rely on GATE and so we contribute new plugins or add new features as the need arises.

One of the things I really like about working on GATE is that it is open-source software (released under the LGPL) which means not only am I free to talk about the work I do but also anyone is able to freely use and contribute to the development. This also means that GATE has been adopted by a large number of companies and universities around the world for all sorts of interesting tasks -- I'm currently involved in three projects that involve GATE being used for cancer research, mining of medical records and government transparency.

So if you are interested in text engineering and you haven't heard of GATE 1) shame on you and 2) go try it out and see just what it can do. And for those of you who don't need to process text, at least you'll know what I'm talking about when I refer to it in future posts.

SVN Paths Are Case Sensitive

Over the last couple of days I've been busy re-installing the computer that runs my SVN repository (it runs other things as well but that isn't so important). It's a Windows machine and it had finally reached the point where the only solution to the BSODs it kept suffering was a full re-install.

I've never been particularly good at making regular backups of things and while I've suffered a fair number of hardware failures over the years I've never really lost anything important. In fact the SVN repository itself has saved me from a disk crash recently. So at the same time as the re-install I thought I should set up a proper backup schedule and organize my data a little more carefully.

So I now have two disks in the machine that are not used for day-to-day stuff. One drive holds the live copy of the SVN repository (as well as the Hudson home dir, Tomcat webapps and associated MySQL data). The other drive holds backup copies of everything.

This re-organization meant that local path access to the SVN repository changed (the external svn:// URL stayed the same), which meant I had to update the Hudson configurations (which use local file access for performance) to use the new paths.

So I went through each of the 12 jobs in Hudson and changed the paths accordingly. I checked a few of the projects and they built without any problems so I assumed that was job done. Then this morning I noticed that all 12 jobs were being built every 10 minutes as polling SVN always reported that the workspace didn't contain a checkout of the correct folder. The path it was showing was correct so at first glance nothing appeared wrong. After messing around at the command line for a bit I eventually figured out the problem.

Basically I'd changed from URLs starting file:///z:/SVN to URLs starting file:///L:/SVN. For some reason I'd typed in the drive letter as a capital (the way Windows displays it) rather than in lower case. It turns out that while SVN is happy to do a checkout from the capital letter version it stores the URL in the checked out copy using a lowercase drive letter, and hence on future updates the two don't match. Fixing the jobs to access URLs starting file:///l:/SVN fixed the problem.
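If you generate or script job configurations, one way to avoid this trap is to normalise the drive letter before the URL goes anywhere near SVN. The helper below is a small sketch of that idea; the class name `SvnFileUrls` is made up, and it simply assumes (based on the behaviour described above, not on any SVN documentation) that the lowercase form is the one SVN stores in the working copy.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical helper, not part of SVN or Hudson.
class SvnFileUrls {
    // Matches file:/// URLs that begin with a single Windows drive letter, e.g. file:///L:/SVN
    private static final Pattern DRIVE_URL = Pattern.compile("^(file:///)([A-Za-z])(:.*)$");

    /** Lowercases the drive letter of a file:/// URL and leaves every other URL untouched. */
    static String lowerCaseDriveLetter(String url) {
        Matcher matcher = DRIVE_URL.matcher(url);
        if (!matcher.matches()) {
            return url;
        }
        return matcher.group(1) + matcher.group(2).toLowerCase(Locale.ROOT) + matcher.group(3);
    }
}
```

Only the drive letter is touched; the rest of the path is left exactly as typed, since SVN paths are case sensitive.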

Bizarrely, Hudson didn't complain about the problem, so the builds all succeeded; it's just that an awful lot of CPU time was wasted over the last day or so!