So here is a question for you...
What do the following numbers all have in common? 3^2, 23, 101, 3.3e3, 1/4, 91/2, 4x10^3, 5.5*4^5, thirty one, three hundred, four thousand one hundred and two, 3 million, and fünfundzwanzig.
The answer is that they can all be recognized, annotated and converted to a real number representation (a Java Double) by a new GATE PR that has just been released and that I've just finished documenting for the user guide. You may never have really thought about this before but it turns out that there are so many ways of writing numbers in text that recognising them is actually really quite difficult. If you also want to know the value of the number you have recognised then this adds an extra layer of complexity especially when the number is written out in words rather than digits.
The PR actually started life back in 2009 for recognising numbers in patent documents as a precursor to recognising and normalizing measurements but since then has seen lots of development to extend the range of numbers that can be recognised. This new version is being used on a number of projects both to recognise numbers simply for the sake of finding numbers but also to help find drug doses, government spending and lots of generic measurements.
Requests for code to recognising numbers and determine their value has cropped up a number of times on the GATE mailing list and whilst we had been using this code internally for a while we knew that there were issues with it and it had never been tidied up or documented to the extent where we would be happy to show it to other people! Having discovered yet-another-bug in the code a fortnight ago I decided to take the time to rewrite large chunks of the code in order to fix most of the outstanding issues and to increase the range of numbers we could recognise. Hopefully this has led to a more useful PR. If you'd like to try it out then you can find this PR in the Tagger_Numbers plugin within the main GATE svn repository and it's in the nightly builds as well.
The plugin actually contains two PRs; Numbers Tagger and Roman Numerals Tagger. As you can guess by the name this second PR annotates Roman numerals appearing in documents. As with the main PR this also calculates the numeric value of the Roman numerals. I'm guessing that this PR is probably less useful than the main Numbers Tagger but we have found it to be helpful in the past when trying to recognise document sections, tables, figures etc. which can often be labelled with Roman numerals instead of Arabic numbers, e.g. Section VI, Table IV, Figure IIIa. If you are interested in the Roman Numerals Tagger then you can find more details in the user guide.