Code from an English Coffee Drinker: What's Actually Worth Reading?

Another day, another GATE processing resource -- as you can tell I've been busy tidying up the PRs that I've developed recently. One of the reasons for this spurt of cleaning and documenting code is that a project I'm currently working on is ending soon and the information extraction pipeline we have developed needs to be fully documented. Being able to just point to multiple sections of the GATE user guide for more details on each PR in the application makes the documentation much easier to write. Of course that means that the PRs have to actually have documentation in the user guide!

I won't go into details about the project I'm currently working on with The National Archives (if you want the details then there was a press release and the head of the GATE group, i.e. my boss, has blogged about it) suffice it to say that it involves processing millions of web pages drawn from hundreds of different web sites.

We can extract an awful lot of information from the web pages we are processing, so much so in fact that it can be difficult to search through the information. We have multiple tools to help with searching but one thing we quickly realised is that it would be nice to ignore information extracted from boilerplate content. Most web pages contain text that isn't really part of the content; headers, menus, navigation links etc. These sections can contain entities that we might extract but it is highly unlikely that they will be relevant to the main content of the page. For this reason it would be nice to be able to exclude these in some way when searching through the extracted information.

The approach we choose was to keep everything extracted using the IE pipeline but to also determine the sections of the document that were actually content. This allows us to search for entities within content. It also means that if our ability to determine what is useful content and what isn't is flawed in any way we have still extracted the entities appearing in other parts of the document.

Rather than implementing a content detection system from scratch I decided to base the PR on an existing Java library called boilerpipe. The boilerpipe library contains a number of different algorithms for detecting content most of which are available through the new GATE PR. There are some features that are not available due to it currently not being possible to map them directly to a GATE document.

To give you a better idea of what the new PR does here is a screen shot of a web page loaded into both a browser and GATE. In the GATE window you can see the pink sections that have been marked as content (click on the image for a larger easier to read version).

Whilst this kind of approach is never going to be perfect it seems, from initial testing, that it does indeed help to filter out erroneous results when searching through information extracted from large web based corpora.

If you want to try it out yourself then it's already in the main GATE svn repository and the nightly builds. Details of how to configure the PR can be found in the relevant section of the GATE user guide.

Code from an English Coffee Drinker

What's Actually Worth Reading?

0 comments:

Post a Comment