GATE: General Architecture for Text Engineering

So far I've only talked about code that I've developed or played around with in my own time. In preparation for future blog posts I thought I'd spend a little time talking about the code I'm paid to work on.

As some of you may already know, I work in the Department of Computer Science at the University of Sheffield. I'm part of the Natural Language Processing (NLP) Group, where my interests have focused on information extraction -- getting useful information about entities and events from unstructured text such as newspaper articles or blog posts. The main piece of software that makes this work possible is GATE.

GATE is a General Architecture for Text Engineering. This means that it provides both the basic components required for building applications that work with natural language and a framework in which these components can be easily linked together and reused. The fact that I never have to worry about basic processing such as tokenization (splitting text into individual words and punctuation), sentence splitting, and part-of-speech tagging means that I'm free to concentrate on extracting information from the text. I've used GATE since 2001, when I started work on my PhD. For the last two years I've been employed as part of the core GATE team. Technically I'm not paid to develop GATE (I don't think any of us actually are), but the projects we work on all rely on GATE, so we contribute new plugins or add new features as the need arises.
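To make the idea of small, reusable processing components a little more concrete, here is a toy sketch in plain Java. It is not GATE's API: the class `ToyPipeline` and its methods are hypothetical names invented for this illustration, and the tokenization and sentence-splitting rules are deliberately naive compared with what GATE's own components do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: hypothetical names, not part of GATE.
class ToyPipeline {
    // A token is either a run of word characters or a single punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Component 1: split raw text into word and punctuation tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Component 2: group tokens into sentences, closing a sentence at ".", "!" or "?".
    static List<List<String>> splitSentences(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            if (token.equals(".") || token.equals("!") || token.equals("?")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing text without closing punctuation
        }
        return sentences;
    }

    // The "pipeline": the output of one component feeds the next.
    static List<List<String>> process(String text) {
        return splitSentences(tokenize(text));
    }

    public static void main(String[] args) {
        System.out.println(process("GATE is free. Try it!"));
    }
}
```

Running `main` prints `[[GATE, is, free, .], [Try, it, !]]`. The point is only that each step is a separate piece that can be swapped or reused; a real framework adds shared document and annotation models on top of that.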

One of the things I really like about working on GATE is that it is open-source software (released under the LGPL), which means that not only am I free to talk about the work I do, but anyone can freely use it and contribute to its development. This also means that GATE has been adopted by a large number of companies and universities around the world for all sorts of interesting tasks -- I'm currently involved in three projects that use GATE for cancer research, mining of medical records, and government transparency.

So if you are interested in text engineering and you haven't heard of GATE: 1) shame on you, and 2) go try it out and see just what it can do. And for those of you who don't need to process text, at least you'll know what I'm talking about when I refer to it in future posts.

2 comments:

  1. Hi, good morning,
    Could you help me with a question, please?
    You're the expert in this area.

    Right now I have a problem using TreeTagger with GATE for Spanish.
    At first I was getting this error:
    gate.creole.ExecutionException: Script C:\usr\local\durmtools\TreeTagger\cmd\tree-tagger-english does not exist
    at gate.taggerframework.GenericTagger.buildCommandLine(GenericTagger.java:286)
    at gate.taggerframework.GenericTagger.execute(GenericTagger.java:242)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
    at gate.creole.SerialController.executeImpl(SerialController.java:153)
    at
    at gate.creole.AbstractController.execute(AbstractController.java:75)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
    at java.lang.Thread.run(Unknown Source)



    Then I changed the path in "TreeTagger-ES-Tokenization" and set the TaggerBinary URL to:
    file:/C:/gate/plugins/Tagger_Framework/resources/TreeTagger/tree-tagger-english-gate

    I also copied the TreeTagger folder and its files to C:\TreeTagger,
    with its subdirectories:
    C:\TreeTagger\bin
    C:\TreeTagger\cmd
    C:\TreeTagger\lib

    Finally, I created a file called "build.properties" containing this line:
    run.shell.path: C\:\\cygwin\\bin\\sh.exe
    The program now runs, but in the document I examined the "Annotation Sets" only show:
    -Split
    -Sentence
    and I need more information than that. Thanks.

    Replies
    1. You'd be much better off posting your question to the GATE users mailing list. Whilst I did help to write the TaggerFramework PR, I've never been an expert at getting it to work with the TreeTagger. There are, however, plenty of people on the mailing list who use the TreeTagger, and they should be able to help you figure out what the problem is.

      Having said that, one thing to check is that the path you specified in build.properties actually matches where you installed Cygwin; otherwise the setting won't make any difference.
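The build.properties check suggested in the reply above can be sketched with nothing but the standard Java library. This is only an illustration under stated assumptions: the class `ShellPathCheck` and its method are hypothetical names, not part of GATE, and the sample line is the one quoted in the question. The backslash is the escape character in .properties files, which is why the value is written with `\:` and `\\`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper for illustration; not part of GATE.
class ShellPathCheck {
    // Parse .properties text and return the value of run.shell.path (null if absent).
    static String shellPath(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        return props.getProperty("run.shell.path");
    }

    public static void main(String[] args) throws IOException {
        // Default to the line from the question; pass real file contents via
        // Files.readAllBytes(...) if you want to check your own build.properties.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])))
                : "run.shell.path: C\\:\\\\cygwin\\\\bin\\\\sh.exe";

        String path = shellPath(text);
        System.out.println("Parsed path: " + path);
        if (path != null) {
            // Reports whether a file really exists at the configured location.
            System.out.println("Exists: " + Files.exists(Paths.get(path)));
        }
    }
}
```

For the sample line the parsed value is `C:\cygwin\bin\sh.exe`; if the second line reports that nothing exists there, the setting points at the wrong Cygwin location.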
