People in the News

Back in May I was involved in producing a demo for a show-and-tell session at the GATE training course. The idea was to demonstrate the process of defining an application, developing an annotation pipeline, annotating a large corpus, and then providing search over the documents, annotations and associated semantic information.

The idea we settled upon was to extend the basic ANNIE application, which is bundled with GATE, to annotate BBC News articles and to link the entities within them to DBpedia. This would allow us not only to search the documents for textual information, the same as any other search engine, but also to restrict the search based on information that might not be present in the documents but which is encoded in DBpedia. This worked well and allowed us to demonstrate the use of GATE Developer, the GATE Cloud Parallelizer (GCP) and GATE Mímir.

The combination of text, annotations and semantic information allows us to search the documents in interesting ways. You can play with the basic Mímir interface (referred to as GUS) over the demo index to see for yourself how useful the combination can be. Given that not many people reading this will already know the Mímir query syntax, and those that do probably won't know which annotations and features are in the index, here are a few example queries to get you started:
People Born in Sheffield
{Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace <>}"}
The Location of Steel Industries
{Organization sparql = "SELECT ?inst WHERE { ?inst :industry <>}"} [0..4] in {Location}
A BBC Scotland document, written after the start of 2011, in which a Labour Party member is being quoted
({Person sparql = "SELECT ?inst WHERE { ?inst :party <>}"} root:say) IN ({Document date > 20110000} OVER {DocumentClassification sparql = "SELECT ?inst WHERE { ?inst a bbc:Classification . FILTER (?inst = bbc:Scotland)}"})
As you can see from these examples, as the queries get more complex they quickly become unwieldy. The problem is that Mímir provides a very rich query syntax and the basic GUS interface does nothing to hide that syntax from the user. Whenever we demo Mímir people love it, but we always have to stress that GUS is not an end-user search tool -- it is a development tool that lets you check the contents of an index and develop complex queries. In other words...
GUS is not the interface you are looking for!

Now I really like the demo we put together, but trying to teach people the Mímir query syntax is difficult, especially if they don't already know any SPARQL. It is also difficult to explain to potential partners/customers how they could take a Mímir index and produce their own custom interfaces. Whilst these thoughts have been festering in the back of my mind for a while, I've only just found the time to go back to the demo and build a custom interface (partly because next week I'm going to be teaching some people how to build custom Mímir interfaces, so I thought it best to have built at least one).

Given how rich the query syntax is, it is unlikely that a single custom interface will be able to expose all the information within the index. Instead, a number of interfaces may be developed for the same index, each providing a different type of search. With this in mind I decided to focus on searching for people within the BBC News articles. I used GUS to explore the index (which is what GUS is really for) and built up a number of complex person-related queries. I then set about breaking these queries down into sections that could be easily represented in a form-based fashion.

Once the form was complete it was trivial to reconstruct the complex queries from the form elements. All that was left to do was to interface with the Mímir index. Fortunately, as well as GUS, Mímir comes with an XML-based RESTful interface. So the demo now builds complex queries from the form elements, submits the query to Mímir via its RESTful interface, and then displays the results, all without the user having to know anything about Mímir's query syntax.
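
To give a flavour of the query-building step, here is a minimal sketch in Python of how a few form fields could be assembled into a query string of the same shape as the examples above. It is not the code behind the demo: the function name, its parameters and the example.org URI are all made up for illustration, and the step of submitting the query to Mímir's RESTful interface is left out entirely rather than guessed at.

```python
def build_person_query(birth_place_uri=None, date_from=None,
                       date_to=None, classification=None):
    """Assemble a Mímir-style query string from optional form fields.

    Hypothetical helper: the names and parameters are illustrative only.
    Dates are integers in the YYYYMMDD form used by the examples.
    """
    # The person part: any Person, or one restricted via SPARQL.
    person = "{Person}"
    if birth_place_uri:
        person = ('{Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace <'
                  + birth_place_uri + '>}"}')

    # The document part: optional date range constraints.
    constraints = []
    if date_from is not None:
        constraints.append("date >= " + str(date_from))
    if date_to is not None:
        constraints.append("date <= " + str(date_to))
    document = "{Document" + "".join(" " + c for c in constraints) + "}"

    # Optionally restrict to documents with a given BBC classification.
    if classification:
        document = ('(' + document + ' OVER {DocumentClassification sparql = '
                    '"SELECT ?inst WHERE { ?inst a bbc:Classification . '
                    'FILTER (?inst = bbc:' + classification + ')}"})')

    # With no document restrictions, the person part is the whole query.
    if document == "{Document}":
        return person
    return "(" + person + ") IN " + document


if __name__ == "__main__":
    # The URI below is a placeholder, not a real resource identifier.
    print(build_person_query(birth_place_uri="http://example.org/Sheffield",
                             date_from=20110401, date_to=20110430,
                             classification="Scotland"))
```

Each form field maps onto one fragment of the query, and a field left empty simply contributes nothing, which is what makes the form so much easier to use than typing the full query by hand.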

The completed demo is unimaginatively called People in the News and you should feel free to play around with it. Some example queries include: criminals called Jonathan, Russian astronauts, and (my favourite complex example) politicians born in Sheffield mentioned in BBC Scotland documents from April 2011. The nice thing about the new interface is how easy it is to fill in the form to run these queries. That last example would otherwise require you to enter the following into GUS:
(({Person sparql="SELECT DISTINCT ?inst WHERE { ?inst :birthPlace <> . { ?inst a :Politician } UNION { ?inst a :OfficeHolder . ?inst a <> } }"}) IN {Content}) IN ({Document date >= 20110401 date <= 20110430} OVER {DocumentClassification sparql = "SELECT ?inst WHERE { ?inst a bbc:Classification . FILTER (?inst = bbc:Scotland)}"})
It is still something of a work in progress, so if you have any ideas for improvements, or you find any bugs/oddities, please do let me know.