So here is a question for you...
What do the following numbers all have in common? 3^2, 23, 101, 3.3e3, 1/4, 91/2, 4x10^3, 5.5*4^5, thirty one, three hundred, four thousand one hundred and two, 3 million, and fünfundzwanzig.
The answer is that they can all be recognised, annotated and converted to a real number representation (a Java Double) by a new GATE PR that has just been released and that I've just finished documenting for the user guide. You may never have really thought about this before, but it turns out that there are so many ways of writing numbers in text that recognising them is actually quite difficult. If you also want to know the value of the number you have recognised, that adds an extra layer of complexity, especially when the number is written out in words rather than digits.
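To give a feel for why this is fiddly, here is a minimal Java sketch that copes with just a handful of the digit-based forms from the list above: plain and scientific notation, simple fractions, powers, and the "x10^n" style. It is not the code from the GATE PR. The class name NumberSketch and its parse method are made up for illustration, and the sketch deliberately ignores words, multipliers such as "million", and compound expressions like 5.5*4^5.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is NOT the GATE Numbers Tagger. */
final class NumberSketch {

    // mantissa followed by x10^n or *10^n, e.g. 4x10^3
    private static final Pattern TIMES_TEN =
        Pattern.compile("(\\d+(?:\\.\\d+)?)[x*]10\\^(-?\\d{1,3})");

    // base^exponent, e.g. 3^2
    private static final Pattern POWER =
        Pattern.compile("(\\d+(?:\\.\\d+)?)\\^(-?\\d{1,3})");

    // numerator/denominator, e.g. 1/4
    private static final Pattern FRACTION = Pattern.compile("(\\d+)/(\\d+)");

    // plain digits, decimals and e-notation, e.g. 23, 101, 3.3e3
    private static final Pattern PLAIN =
        Pattern.compile("\\d+(?:\\.\\d+)?(?:[eE][+-]?\\d+)?");

    /** Returns the value of the text, or empty if the form is not supported. */
    static OptionalDouble parse(String text) {
        String s = text.trim();

        Matcher m = TIMES_TEN.matcher(s);
        if (m.matches()) {
            double mantissa = Double.parseDouble(m.group(1));
            int exponent = Integer.parseInt(m.group(2));
            return OptionalDouble.of(mantissa * Math.pow(10, exponent));
        }

        m = POWER.matcher(s);
        if (m.matches()) {
            double base = Double.parseDouble(m.group(1));
            int exponent = Integer.parseInt(m.group(2));
            return OptionalDouble.of(Math.pow(base, exponent));
        }

        m = FRACTION.matcher(s);
        if (m.matches()) {
            double denominator = Double.parseDouble(m.group(2));
            if (denominator == 0.0) {
                return OptionalDouble.empty(); // refuse to divide by zero
            }
            return OptionalDouble.of(Double.parseDouble(m.group(1)) / denominator);
        }

        if (PLAIN.matcher(s).matches()) {
            return OptionalDouble.of(Double.parseDouble(s));
        }

        return OptionalDouble.empty(); // words, "3 million", 5.5*4^5, ...
    }

    public static void main(String[] args) {
        for (String s : new String[] {"3^2", "23", "3.3e3", "1/4", "4x10^3", "thirty one"}) {
            System.out.println(s + " -> " + parse(s));
        }
    }
}
```

Even this toy version needs four separate patterns, and it still returns nothing for "thirty one", "3 million" or "fünfundzwanzig".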
The PR actually started life back in 2009 for recognising numbers in patent documents, as a precursor to recognising and normalising measurements, but since then it has seen lots of development to extend the range of numbers that can be recognised. This new version is being used on a number of projects, both to recognise numbers simply for the sake of finding numbers and to help find drug doses, government spending and lots of generic measurements.
Requests for code to recognise numbers and determine their value have cropped up a number of times on the GATE mailing list. Whilst we had been using this code internally for a while, we knew that there were issues with it, and it had never been tidied up or documented to the extent where we would be happy to show it to other people! Having discovered yet another bug in the code a fortnight ago, I decided to take the time to rewrite large chunks of it in order to fix most of the outstanding issues and to increase the range of numbers we could recognise. Hopefully this has led to a more useful PR. If you'd like to try it out then you can find this PR in the Tagger_Numbers plugin within the main GATE svn repository, and it's in the nightly builds as well.
The plugin actually contains two PRs: Numbers Tagger and Roman Numerals Tagger. As you can guess from the name, this second PR annotates Roman numerals appearing in documents. As with the main PR, it also calculates the numeric value of the Roman numerals. I'm guessing that this PR is probably less useful than the main Numbers Tagger, but we have found it helpful in the past when trying to recognise document sections, tables, figures etc., which are often labelled with Roman numerals instead of Arabic numbers, e.g. Section VI, Table IV, Figure IIIa. If you are interested in the Roman Numerals Tagger then you can find more details in the user guide.
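For anyone curious how the value of a Roman numeral can be worked out, the usual trick is to add up the symbol values from left to right, subtracting any symbol that is smaller than the one after it. The Java below is a minimal sketch of that idea and not the code in the Roman Numerals Tagger; the class name RomanNumeralSketch and its toInt method are invented for this example. It does not check that the numeral is well formed (it will happily accept IIII) and it does not deal with suffixes such as the "a" in Figure IIIa.

```java
import java.util.Locale;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical illustration only; this is NOT the GATE Roman Numerals Tagger. */
final class RomanNumeralSketch {

    private static final Map<Character, Integer> VALUES = Map.of(
        'I', 1, 'V', 5, 'X', 10, 'L', 50, 'C', 100, 'D', 500, 'M', 1000);

    /** Returns the value of the numeral, or empty if it contains other characters. */
    static OptionalInt toInt(String numeral) {
        if (numeral == null || numeral.isEmpty()) {
            return OptionalInt.empty();
        }
        String s = numeral.toUpperCase(Locale.ROOT);
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            Integer value = VALUES.get(s.charAt(i));
            if (value == null) {
                return OptionalInt.empty(); // not a Roman numeral symbol
            }
            Integer next = (i + 1 < s.length()) ? VALUES.get(s.charAt(i + 1)) : null;
            if (next != null && value < next) {
                total -= value; // subtractive pair, e.g. the I in IV
            } else {
                total += value;
            }
        }
        return OptionalInt.of(total);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"VI", "IV", "MCMXCIV", "IIIa"}) {
            System.out.println(s + " -> " + toInt(s));
        }
    }
}
```

With this sketch VI comes out as 6, IV as 4 and MCMXCIV as 1994, while IIIa is rejected because of the trailing letter.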
Hi Mark,
I will definitely give it a try. Am working on the normalisation of dates at the moment so your plugin will definitely help simplify things.
Have had a look at your Boilerpipe plugin and saw that you reimplemented the handling of the html markup on the GATE side, which is great as it means that we don't necessarily need to rely on the SAX events in Boilerpipe.
Great stuff!
Julien
Hi Julien, glad you think the PR will be useful.
You may also be interested to know that one of the other PRs currently being cleaned up is for doing date normalization. Again it's a PR that has been used internally for a number of projects but hasn't yet made it into the main distribution. It's based around my own DateParser library which is already available though.
Hi Mark,
Thanks for the pointer to the DateParser, I've finished working on mine but I'll definitely have a look at yours as well. In particular I'm wondering whether it supports small/medium/bigEndian formats and how it deals with ambiguous cases, e.g. 02/01/10, which can be the 2nd of Jan, the 1st of Feb or the 10th of Jan depending on which side of the Atlantic you happen to be. I'll probably rewrite it to use the output of the Numbers plugin so that I can handle nominal forms as well as numerical ones.
Julien
Hi Julien,
Not entirely sure what you mean by small/medium/bigEndian formats but the ambiguous cases are handled by reference to a locale. By default it uses the locale of the JVM to determine month/day ordering, but you can override this either at the document or PR level. It also gets the names for the months and days of the week from the locale and so can handle foreign languages to some extent as well. I'm actually doing some work on it today so hopefully it should be ready for "public" consumption in a day or so.
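As a rough illustration of the locale idea (and only that: this is not the DateParser API), the Java sketch below picks the month/day ordering for a string like 02/01/10 from a Locale. The class name AmbiguousDateSketch, the parse method and the rule "US means month first, everything else means day first" are all simplifying assumptions made for this example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical illustration only; this is NOT the DateParser library. */
final class AmbiguousDateSketch {

    // Two-digit years ("uu") are resolved relative to 2000 by java.time.
    private static final DateTimeFormatter MONTH_FIRST =
        DateTimeFormatter.ofPattern("MM/dd/uu");
    private static final DateTimeFormatter DAY_FIRST =
        DateTimeFormatter.ofPattern("dd/MM/uu");

    /**
     * Parses dd/MM/yy or MM/dd/yy text, using the locale to choose the ordering.
     * Assumption for this sketch: only the US locale puts the month first.
     * Throws DateTimeParseException if the text does not fit the chosen pattern.
     */
    static LocalDate parse(String text, Locale locale) {
        boolean monthFirst = "US".equals(locale.getCountry());
        return LocalDate.parse(text, monthFirst ? MONTH_FIRST : DAY_FIRST);
    }

    public static void main(String[] args) {
        String text = "02/01/10";
        System.out.println("US: " + parse(text, Locale.US));
        System.out.println("UK: " + parse(text, Locale.UK));
        System.out.println("JVM default: " + parse(text, Locale.getDefault()));
    }
}
```

Under those assumptions the same text becomes 1 February 2010 for Locale.US and 2 January 2010 for Locale.UK; passing Locale.getDefault() mimics falling back to the locale of the JVM.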
Hi Mark
See http://en.wikipedia.org/wiki/Calendar_date for the endian formats. I did exactly the same as you did i.e. JVM Locale overridden at document level. Will be interesting to compare the outputs of our PRs.
BTW the URL above should be a good way of testing the accuracy and coverage of your plugin
Ah I see now, thanks!
The underlying date parser works with most of those date formats, the problem is that it only tries to parse text underlying Date annotations and currently ANNIE doesn't seem to handle some of those formats very well :(