Today sees the release of another of the GATE plugins I've been working on cleaning up over the last few months. Unlike the other plugins I've talked about recently this one has a much longer history as I wrote the core code back when I was a PhD student.
Many information extraction (IE) tasks benefit from or require the extraction of accurate date information. While ANNIE (the IE system that comes with GATE) does produce Date annotations no attempt is made to normalize these dates, i.e. to firmly fix all dates, even partial or relative ones, to a timeline using a common date representation. My PhD focused on open-domain question answering, an IE task in which dates can play an import role; any "when" question, or questions starting "who is..." benefit from accurate date information. The problem was that I couldn't find a good Java library for parsing dates into a common format, so of course I set about writing one.
The library I wrote is unimaginatively called Date Parser and has been freely available since around 2005. You can currently find the parser being built by my Hudson server. Without going into too many technical details (the Javadoc is available for those who like that kind of thing) the parser takes a string and attempts to parse it as a date starting from a given offset. Unlike the built in DateFromat class which is limited to parsing one date format at a time my parser attempts to handle as many date formats as possible. Of course there are only so many ways you can re-arrange three pieces of information, but the parser also handles relative dates and dates which are not fully specified. For example, "April 2011" would be parsed into a Date object representing the 1st of April 2011. Possibly more interesting though is that fact that words/phrases such as yesterday, today, next Wednesday, and 3 days ago are all also parsed and recognized. In these instances the actual date being mentioned is calculated based upon a context date supplied to the parser. So if the word yesterday appears in the context of the 3rd of March 2011 the string will be recognized as referring to the 2nd of March 2011.
The parser worked really well during my PhD work and has seen numerous improvements since then as well. It started to be used in GATE projects a year or so ago and was initially used in conjunction with ANNIE. ANNIE adds Date annotations to document and I wrote a JAPE grammar that would find these annotations and then run the parser over the underlying text adding the normalized date value (if found) as a new feature. The code eventually moved to being a PR (rather than JAPE) for performance reasons and to support some new features. The problem, however, was that the dates the parser could handle and the dates that ANNIE finds don't always align. This meant that adding a new date format required changes to both ANNIE and the Date Parser. So when I started to clean up the code for release I made the decision to re-write the PR as a standalone component that no longer relies on ANNIE.
Surprisingly it was very easy to convert the existing code to remove the reliance on ANNIE and I think the performance (both time and accuracy) have been improved as a result. This isn't to say that ANNIE is bad at finding dates, just that it does some things differently and it also annotates times with Date annotations which for this task can confuse the issue.
Full documentation is available in the user guide and the PR is already available in the nightly builds of GATE (you need to load the Tagger_DateNormalizer plugin) so feel free to have a play and let me know what you think.
But this is still English only solution, despite possibility to set things like month or weekday names through locale. DateParser.p attribute contains regexps describing English dates. Or am I missing something?
ReplyDeleteNo this isn't just an English solution. The underlying Date Parser gets the names of the days of the week and months of the year from the locale so will work with Foreign languages. You are right that some of the regular expressions have hard coded English strings in them but I'm already working on abstracting those out into resource bundles so hopefully even this should change soon.
ReplyDeleteOf course even being limited to just having the days of week and months of the year being configurable already gives you the option of finding more dates than you would get from just using ANNIE.
ReplyDelete