Automatically Generating HTML5 Microdata

The majority of the code I write as part of my day job revolves around trying to extract useful semantic information from text. Typical examples of what is referred to as "semantic annotation" include spotting that a sequence of characters represents the name of a person, organization or location, and probably then linking this to an ontology or some other knowledge source. While extracting the information can be a task in itself, usually you want to do something with it, often to enrich the original text document or make it more accessible in some way. In the applications we develop this usually entails indexing the document along with the semantic annotations to allow for a richer search experience, and I've blogged about this approach before. Such an approach assumes, however, that the consumer of the semantic annotations will be a human, but what if another computer programme wants to make use of the information we have just extracted? The answer is to use some form of common machine-readable encoding.

While there is already an awful lot of text in the world, more and more is being produced every day, usually in electronic form, and usually published on the internet. Given that we could never read all this text, we rely on search engines, such as Google, to help us pinpoint useful or interesting documents. These search engines rely on two main things to find the documents we are interested in: the text and the links between the documents. But what if we could tell them what some of the text actually means?

In the newest version of the HTML specification (which is a work in progress usually referred to as HTML5) web pages can contain semantic information encoded as HTML Microdata. I'm not going to go into the details of how this works as there are already a number of great descriptions available, including this one.

HTML5 Microdata is, in essence, a way of embedding semantic information in a web page, but it doesn't tell a human or a machine what any of the information means, especially as different people could embed the same information using different identifiers or in different ways. What is needed is a common vocabulary that can be used to embed information about common concepts, and currently many of the major search engines have settled on using schema.org as the common vocabulary.

When I first heard about schema.org, back in 2011, I thought it was a great idea, and wrote some code that could be used to embed the output of a GATE application within a HTML page as microdata. Unfortunately the approach I adopted was, to put it bluntly, hacky. So while I had proved it was possible, the code was left to rot in a dark corner of my SVN repository.

I was recently reminded of HTML5 microdata, and schema.org in particular, when one of my colleagues tweeted a link to this interesting article. In response I was daft enough to admit that I had some code that would allow people to automatically embed the relevant microdata into existing web pages. It wasn't long before a number of people had made it clear that they would be interested in me finishing and releasing the code.

I'm currently in a hotel in London as I'm due to teach a two-day GATE course starting tomorrow (if you want to learn all about GATE then you might be interested in our week-long course to be held in Sheffield in June). Rather than watching TV or having a drink in the bar, I thought I'd make a start on tidying up the code I started on almost three years ago.

Before we go any further I should point out that, while the code works, the current interface isn't the most user friendly. As such I've not added this to the main GATE distribution as yet. I'm hoping that any of you who give it a try can leave me feedback so I can finish cleaning things up and integrate it properly. Having said that, here is what I have so far...

I find that worked examples usually help convey my ideas better than prose, so let's start with a simple HTML page:
<html>
  <head>
    <title>This is a schema.org test document</title>
  </head>
  <body>
    <h1>This is a schema.org test document</h1>
    <p>
      Mark Greenwood works in Sheffield for the University of Sheffield. 
      He is currently in a Premier Inn in London, killing time by working on a GATE plugin to allow ANNIE annotations to be embedded within a HTML document using HTML microdata and the schema.org model.
    </p>
  </body>
</html>
As you can see this contains a number of obvious entities (people, organizations and locations) that could be described using the schema.org vocabulary (people, organizations and locations are just some of the schema.org concepts) and which would be found by simply running ANNIE over the document.

Once we have sensible annotations for such a document, probably from running ANNIE, and a mapping between the annotations and their features and the schema.org vocabulary, then it is fairly easy to produce a version of this HTML document with the annotations embedded as microdata. The current version of my code generates the following file:
<html>
  <head>
    <title>This is a schema.org test document</title>
  </head>
  <body>
    <h1>This is a schema.org test document</h1>
    <p>
      <span itemscope="itemscope" itemtype="http://schema.org/Person"><meta content="male" itemprop="gender"/><meta content="Mark Greenwood" itemprop="name"/>Mark Greenwood</span> works in <span itemscope="itemscope" itemtype="http://schema.org/City"><meta content="Sheffield" itemprop="name"/>Sheffield</span> for the <span itemscope="itemscope" itemtype="http://schema.org/Organization"><meta content="University of Sheffield" itemprop="name"/>University of Sheffield</span>.
      He is currently in a <span itemscope="itemscope" itemtype="http://schema.org/Organization"><meta content="Premier" itemprop="name"/>Premier</span> Inn in <span itemscope="itemscope" itemtype="http://schema.org/City"><meta content="London" itemprop="name"/>London</span>, killing time by working on a GATE plugin to allow <span itemscope="itemscope" itemtype="http://schema.org/Person"><meta content="female" itemprop="gender"/><meta content="ANNIE" itemprop="name"/>ANNIE</span> annotations to be embedded within a HTML document using HTML microdata and the schema.org model.
    </p>
  </body>
</html>
This works nicely and the embedded data can be extracted by the search engines, as confirmed using the Google rich snippets tool.
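If you want a quick local check of what a consumer would see in markup like this, the following is a minimal Python sketch using only the standard library. It is not how any search engine works, and it is not part of the plugin. The names `MicrodataExtractor`, `extract_items` and `VOID_TAGS` are made up for illustration, and it assumes the simple shape shown above: flat (non-nested) items whose properties are all `<meta>` tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
# Deliberately incomplete: just enough for simple pages like the example.
VOID_TAGS = {"meta", "br", "hr", "img", "input", "link"}


class MicrodataExtractor(HTMLParser):
    """Collects microdata items whose properties are given as <meta> tags."""

    def __init__(self):
        super().__init__()
        self.items = []    # every item found, in document order
        self._stack = []   # one entry per open element: its item, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in VOID_TAGS:
            # Attach <meta itemprop=...> to the nearest enclosing item, if any.
            current = next((i for i in reversed(self._stack) if i is not None), None)
            if tag == "meta" and "itemprop" in attrs and current is not None:
                current["properties"][attrs["itemprop"]] = attrs.get("content")
            return
        item = None
        if "itemscope" in attrs:
            item = {"type": attrs.get("itemtype"), "properties": {}}
            self.items.append(item)
        self._stack.append(item)

    def handle_endtag(self, tag):
        # Self-closed void tags such as <meta .../> also trigger this; skip them.
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()


def extract_items(html):
    parser = MicrodataExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


sample = (
    '<p><span itemscope="itemscope" itemtype="http://schema.org/Person">'
    '<meta content="male" itemprop="gender"/>'
    '<meta content="Mark Greenwood" itemprop="name"/>Mark Greenwood</span> works in '
    '<span itemscope="itemscope" itemtype="http://schema.org/City">'
    '<meta content="Sheffield" itemprop="name"/>Sheffield</span>.</p>'
)
items = extract_items(sample)
for item in items:
    print(item["type"], item["properties"])
```

This ignores `itemprop` on anything other than `<meta>`, nested items, and badly formed HTML, so treat it as a sanity check rather than a validator.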

As I said earlier, while the code works, the current integration with the rest of GATE definitely needs improving. If you load the plugin (details below) then right clicking on a document will allow you to Export as HTML5 Microdata... but it won't allow you to customize the mapping between annotations and a vocabulary. Currently the ANNIE annotations are mapped to the schema.org vocabulary using a config file in the resources folder. If you want to change the mapping you have to change this file. In the future I plan to add some form of editor (or at least the ability to choose a different file) as well as the ability to export a corpus, not just a single file.
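To give a feel for what a mapping-driven export involves, here is a rough Python sketch. It is an illustration only: the plugin itself runs inside GATE, and the format of its config file is not shown in this post, so the `MAPPING` dictionary, the `embed_microdata` function and the `(start, end, type, features)` annotation tuples are all hypothetical. It also works on plain text rather than an existing HTML document, which sidesteps the harder problem of weaving spans into markup that is already there.

```python
from html import escape

# Hypothetical mapping (NOT the plugin's config format): annotation type ->
# schema.org itemtype, plus which annotation features become which itemprops.
MAPPING = {
    "Person": {"itemtype": "http://schema.org/Person", "props": {"gender": "gender"}},
    "Location": {"itemtype": "http://schema.org/City", "props": {}},
    "Organization": {"itemtype": "http://schema.org/Organization", "props": {}},
}


def embed_microdata(text, annotations, mapping=MAPPING):
    """Wrap each mapped annotation in a microdata <span>.

    annotations: iterable of (start, end, type, features) character offsets.
    Unmapped types and annotations overlapping an earlier one are skipped.
    """
    out, pos = [], 0
    for start, end, ann_type, features in sorted(annotations, key=lambda a: a[0]):
        rule = mapping.get(ann_type)
        if rule is None or start < pos:
            continue
        covered = text[start:end]
        # Translate features to itemprops; the covered text is always the name.
        props = {rule["props"][f]: v for f, v in features.items() if f in rule["props"]}
        props["name"] = covered
        metas = "".join(
            '<meta content="%s" itemprop="%s"/>' % (escape(str(v)), escape(k))
            for k, v in sorted(props.items())
        )
        out.append(escape(text[pos:start], quote=False))
        out.append(
            '<span itemscope="itemscope" itemtype="%s">%s%s</span>'
            % (escape(rule["itemtype"]), metas, escape(covered, quote=False))
        )
        pos = end
    out.append(escape(text[pos:], quote=False))
    return "".join(out)


text = "Mark Greenwood works in Sheffield."
annotations = [
    (0, 14, "Person", {"gender": "male"}),
    (24, 33, "Location", {}),
]
result = embed_microdata(text, annotations)
print(result)
```

Changing the vocabulary then means passing a different dictionary as `mapping`, which is roughly the kind of flexibility an editor or a selectable config file would give you.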

So if you have got all the way to here then you probably want to get your hands on the current plugin, so here it is. Simply load it into GATE in the usual way and it will add the right-click menu option to documents (you'll need to use a nightly build of GATE, or a recent SVN checkout, as it uses the resource helpers that haven't yet made it into a release version).

Hopefully you'll find it useful, but please do let me know what you think and whether you have any suggestions for improvements, especially around the integration with the GATE Developer GUI.