Schema Enforcer

In my previous post I introduced you to GATE, the software I use and help to develop at work. Over the last ten years I've developed a number of processing resources (PRs are like plugins) for GATE. Some of these plugins have made it into the main GATE distribution (the Chemistry Tagger and the Noun Phrase Chunker being the most successful) whilst I've allowed others to slowly die. I still have quite a few that I've developed either for my own pet projects or for work that should really be made available for everyone to use. The problem tends to be that they need cleaning up and documenting before they are released. I've now made a start on cleaning up the PRs that I think are useful, and in this post I'll introduce you to the first of these that I've managed to commit to the main GATE SVN repository: the Schema Enforcer.

The idea for the Schema Enforcer started to germinate in my head during a long afternoon trying to teach people how to manually annotate documents using GATE Teamware. In essence we want people who are familiar with a set of documents to mark up the entities within the documents that they believe are interesting/relevant to a given task. We then treat these manually annotated documents as a gold standard for evaluating automatic systems that create the same annotations.

It turns out that if you can pre-annotate the documents with an automatic system and have the annotators correct and add to the existing annotations, they not only find the task easier to understand but also tend to annotate a document more quickly, which usually saves us money.

When processing a document in GATE you tend to find that applications create a lot of annotations that are not actually required. For example, GATE creates a SpaceToken annotation over each blank space. These can be really useful when creating other more complex annotations, but no human is ever going to need to look at them. So when pre-annotating documents for Teamware, what I (and most other people) do is simply create a new annotation set into which we copy any annotation types which we are asking the annotators to create or correct (we usually do this using the Annotation Set Transfer PR rather than by hand). The problem with simply copying annotations from one set to another is that this does nothing to check that the annotation features conform to any set of guidelines. Whilst odd features are less of an issue than intermediate or temporary annotations, they can still be quite distracting.

In Teamware, when starting an annotation process, you specify the annotations that can be created using XML-based annotation schemas. These define the type of the annotation, its features, and, for some features, the set of permitted values. For example, here is a schema defining a Location annotation.

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <element name="Location">
    <complexType>
      <attribute name="locType" use="required" value="other">
        <simpleType>
          <restriction base="string">
            <enumeration value="region"/>
            <enumeration value="airport"/>
            <enumeration value="city"/>
            <enumeration value="country"/>
            <enumeration value="county"/>
            <enumeration value="other"/>
          </restriction>
        </simpleType>  
      </attribute>

      <attribute name="requires-attention" use="optional" type="boolean"/>         
      <attribute name="comment"  use="optional" type="string"/>
    </complexType>
  </element>
</schema>

You should be able to see from this that a Location annotation can have three features (referred to as attributes in the schema): locType, requires-attention, and comment. The last two features are fairly self-explanatory, but the locType feature requires a little explanation. Basically, locType is an enumerated feature; that is, it can only take one of the six values specified in the schema. What this means is that an annotator cannot decide to create a Location annotation with locType set to, for instance, beach, as that is not one of the defined values. In this case they would probably set locType to other and use the comment feature to note that it is actually a beach. Also note that locType is a required feature, which means you can't choose not to set its value.

The idea I had should now be obvious: why not use the schemas to drive the copying of annotations from one annotation set to another? After a little bit of experimenting this idea became the Schema Enforcer PR. Details of exactly how to use the PR can be found in the main GATE manual, but in essence the Schema Enforcer will copy an annotation if and only if:
  • the type of the annotation matches one of the supplied schemas, and
  • all required features are present and valid (i.e. meet the requirements for being copied to the 'clean' annotation)
Each feature of an annotation is copied to the new annotation if and only if:
  • the feature name matches a feature in the schema describing the annotation,
  • the value of the feature is of the same type as specified in the schema, and
  • if the schema defines the feature as an enumerated type, its value matches one of the permitted values
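To make the rules above concrete, here is a minimal sketch of that decision logic in Python. This is not GATE's actual API or the Schema Enforcer's real implementation; the names (SCHEMAS, feature_is_valid, enforce) and the dictionary-based schema representation are invented purely for illustration, standing in for the XML schemas the PR really reads.

```python
# Hypothetical sketch of the Schema Enforcer's copy rules -- not GATE's
# real code, just the logic described in the two bullet lists above.

# A "schema" here: required feature names, expected value types, and
# (for enumerated features) the set of permitted values.
SCHEMAS = {
    "Location": {
        "required": {"locType"},
        "types": {"locType": str, "requires-attention": bool,
                  "comment": str},
        "enums": {"locType": {"region", "airport", "city",
                              "country", "county", "other"}},
    }
}

def feature_is_valid(schema, name, value):
    """A feature is kept only if it is declared in the schema, its value
    has the declared type, and (for enumerated features) the value is
    one of the permitted values."""
    if name not in schema["types"]:
        return False
    if not isinstance(value, schema["types"][name]):
        return False
    allowed = schema["enums"].get(name)
    return allowed is None or value in allowed

def enforce(annotation_type, features):
    """Return the cleaned feature map if the annotation should be
    copied to the 'clean' set, or None if it fails the schema."""
    schema = SCHEMAS.get(annotation_type)
    if schema is None:
        return None  # no schema matches this type: don't copy
    clean = {name: value for name, value in features.items()
             if feature_is_valid(schema, name, value)}
    # every required feature must have survived validation
    if not schema["required"] <= clean.keys():
        return None
    return clean
```

With the Location schema above, enforce("Location", {"locType": "beach"}) returns None (beach is not a permitted value, and locType is required), a SpaceToken annotation is dropped entirely, and any undeclared features are silently stripped from annotations that do get copied.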

I've now made use of this PR in two different projects and it really does make life easier. Not only can I be sure that the annotations people correct in Teamware actually match the annotation guidelines, but it also provides a really easy way of producing a 'clean' annotation set as the output of a GATE application. Don't just take my word for it, though:
nice one, mark - very useful! i've had these problems before too, but used jape grammars instead - your approach is much nicer!
I think it would be nice if whoever gets to teach Teamware at FIG doesn't get snagged by the non-standard annotations that came up on Tuesday. ;-)
So if you already develop GATE applications and think you'd like to add the Schema Enforcer to your pipeline, you can find it in the main GATE SVN repository, or just grab a recent nightly build.

GATE: General Architecture for Text Engineering

So far I've only talked about code that I've developed or played around with in my own time. In preparation for future blog posts I thought I'd spend a little time talking about the code I'm paid to work on.

As some of you may already know I work in the Department of Computer Science at the University of Sheffield. I work in the Natural Language Processing Group (NLP) where my interests have focused on information extraction -- getting useful information about entities and events from unstructured text such as newspaper articles or blog posts. The main piece of software that makes this work possible is GATE.

GATE is a General Architecture for Text Engineering. This means that it provides both the basic components required for building applications that work with natural language and a framework in which these components can be easily linked together and reused. The fact that I never have to worry about basic processing such as tokenization (splitting text into individual words and punctuation), sentence splitting, and part-of-speech tagging means that I'm free to concentrate on extracting information from the text. I've used GATE since 2001 when I started work on my PhD. For the last two years I've been employed as part of the core GATE team. Technically I'm not paid to develop GATE (I don't think any of us actually are) but the projects we work on all rely on GATE, and so we contribute new plugins or add new features as the need arises.

One of the things I really like about working on GATE is that it is open-source software (released under the LGPL) which means not only am I free to talk about the work I do but also anyone is able to freely use and contribute to the development. This also means that GATE has been adopted by a large number of companies and universities around the world for all sorts of interesting tasks -- I'm currently involved in three projects that involve GATE being used for cancer research, mining of medical records and government transparency.

So if you are interested in text engineering and you haven't heard of GATE 1) shame on you and 2) go try it out and see just what it can do. And for those of you who don't need to process text, at least you'll know what I'm talking about when I refer to it in future posts.

SVN Paths Are Case Sensitive

Over the last couple of days I've been busy re-installing the computer that runs my SVN repository (it runs other things as well but that isn't so important). It's a Windows machine and it had finally reached the point where the only solution to the BSODs it kept suffering was a full re-install.

I've never been particularly good at making regular backups of things, and while I've suffered a fair number of hardware failures over the years I've never really lost anything important. In fact, the SVN repository itself saved me from a disk crash recently. So at the same time as the re-install I thought I should set up a proper backup schedule and organize my data a little more carefully.

So I now have two disks in the machine that are not used for day-to-day stuff. One drive holds the live copy of the SVN repository (as well as the Hudson home dir, Tomcat webapps and associated MySQL data). The other drive holds backup copies of everything.

This re-organization meant that local path access to the SVN repository changed (the external svn:// URL stayed the same), which meant I had to update the Hudson configurations (which use local file access for performance) to use the new paths.

So I went through each of the 12 jobs in Hudson and changed the paths accordingly. I checked a few of the projects and they built without any problems, so I assumed the job was done. Then this morning I noticed that all 12 jobs were being built every 10 minutes, because polling SVN always reported that the workspace didn't contain a checkout of the correct folder. The path it was showing was correct, so at first glance nothing appeared wrong. After messing around at the command line for a bit I eventually figured out the problem.

Basically, I'd changed from URLs starting file:///z:/SVN to URLs starting file:///L:/SVN. For some reason I'd typed the drive letter as a capital (the way Windows displays it) rather than in lower case. It turns out that while SVN is happy to do a checkout from the capital-letter version, it stores the URL in the checked-out copy using a lowercase drive letter, and hence on future updates the two don't match. Fixing the jobs to access URLs starting file:///l:/SVN fixed the problem.
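The mismatch boils down to a plain case-sensitive string comparison: SVN compares the URL stored in the working copy against the one the client asks for, and the two differ by a single character. A tiny illustration (the project path here is made up, not my actual repository layout):

```python
# SVN treats repository URLs as case-sensitive strings, so a working
# copy recorded as file:///l:/SVN never matches file:///L:/SVN.
configured = "file:///L:/SVN/project/trunk"  # what I typed into Hudson
stored = "file:///l:/SVN/project/trunk"      # what SVN records in the checkout

# A plain string comparison fails, so polling reports a wrong workspace...
assert configured != stored

# ...even though on Windows both spellings name the same drive.
assert configured.lower() == stored.lower()
```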

Bizarrely, Hudson didn't complain about the problem, so the builds all succeeded; it's just that there was an awful lot of wasted CPU time over the last day or so!