In Out, In Out, Shake It All About

In a very abstract sense, text analysis can be divided into three main tasks: load some text, process it, export the result. Out of the box GATE (both the GUI and the API) provides excellent support for both loading and processing documents, but until now we haven't provided many options when it comes to exporting processed documents.

Traditionally GATE has provided two methods of exporting processed documents: a lossless XML format that can be reloaded into GATE but is rather verbose, or the "save preserving format" option, which essentially outputs XML representing the original document (i.e. the annotations in the Original markups set) plus the annotations generated by your application. Neither of these options was particularly useful if you wanted to pass the output on to some other process and, without a standard export API, this left people having to write custom processing resources just to export their results.

To try and improve the support for exporting documents, recent nightly builds of GATE now include a common export API in the gate.DocumentExporter class. Before we go any further it is worth mentioning that this code is in a nightly build and so is subject to change before the next release of GATE. Having said that, I have now used it to implement exporters for a number of different formats, so I don't expect the API to change drastically.

If you are a GATE user, rather than a software developer, then all you need to know is that an exporter is very similar to the existing idea of document formats. This means that they are CREOLE resources and so new exporters are made available by loading a plugin. Once an exporter has been loaded it will be added to the "Save as..." menu of both documents and corpora, and by default exporters for GATE XML and Inline XML (i.e. the old "Save preserving format") are provided even when no plugins have been loaded.

If you are a developer wanting to make use of an existing exporter, then hopefully the API should be easy to use. For example, to get hold of the exporter for GATE XML and to write a document to a file the following two lines will suffice:
DocumentExporter exporter =
   DocumentExporter.getInstance("gate.corpora.export.GateXMLExporter");

exporter.export(document, file);
There is also a three argument form of the export method that takes a FeatureMap which can be used to configure an exporter. For example, the annotation types the Inline XML exporter saves are configured this way. The possible configuration options for an exporter should be detailed in its documentation, but possibly the easiest way to see how it can be configured is to try it from the GUI.
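For example, assuming the Inline XML exporter is loaded, configuring it via a FeatureMap might look something like the following sketch (both the class name, which I've guessed by analogy with the GATE XML exporter, and the option key are illustrative; check the exporter's documentation for the real ones):
DocumentExporter inline =
   DocumentExporter.getInstance("gate.corpora.export.InlineXMLExporter");

// build the configuration options; the key used here is illustrative
FeatureMap options = Factory.newFeatureMap();
options.put("annotationTypes", "Person;Location;Organization");

// the three argument form of export applies the options
inline.export(document, file, options);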

If you are a developer and want to add a new export format to GATE, then this is fairly straightforward; if you already know how to produce other GATE resources then it should be really easy. Essentially you need to extend gate.DocumentExporter to provide an implementation of its one abstract method. A simple example showing an exporter for GATE XML is given below:
@CreoleResource(name = "GATE XML Exporter",
   tool = true, autoinstances = @AutoInstance, icon = "GATEXML")
public class GateXMLExporter extends DocumentExporter {

  public GateXMLExporter() {
    super("GATE XML", "xml", "text/xml");
  }

  public void export(Document doc, OutputStream out, FeatureMap options)
          throws IOException {
    try {
      DocumentStaxUtils.writeDocument(doc, out, "");
    } catch(XMLStreamException e) {
      throw new IOException(e);
    }
  }
}
As I said earlier this API is still a work in progress and won't be frozen until the next release of GATE, but the current nightly build now contains export support for Fast Infoset compressed XML (I've talked about this before), JSON inspired by the format Twitter uses, and HTML5 Microdata (an updated version of the code I discussed before). A number of other exporters are also under development and will hopefully be made available shortly.

Hopefully, if you use GATE, you will find this new support useful. Please do let us have any feedback you might have so we can improve the support before the next release, when the API will be frozen.

Automatically Generating HTML5 Microdata

The majority of the code I write as part of my day job revolves around trying to extract useful semantic information from text. Typical examples of what is referred to as "semantic annotation" include spotting that a sequence of characters represents the name of a person, organization or location, and probably then linking this to an ontology or some other knowledge source. While extracting the information can be a task in itself, usually you want to do something with it, often to enrich the original text document or make it more accessible in some way. In the applications we develop this usually entails indexing the document along with the semantic annotations to allow for a richer search experience, and I've blogged about this approach before. Such an approach assumes, however, that the consumer of the semantic annotations will be a human, but what if another computer program wants to make use of the information we have just extracted? The answer is to use some form of common machine readable encoding.

While there is already an awful lot of text in the world, more and more is being produced every day, usually in electronic form, and usually published on the internet. Given that we could never read all this text we rely on search engines, such as Google, to help us pinpoint useful or interesting documents. These search engines rely on two main things to find the documents we are interested in: the text and the links between the documents. But what if we could tell them what some of the text actually means?

In the newest version of the HTML specification (a work in progress usually referred to as HTML5) web pages can contain semantic information encoded as HTML Microdata. I'm not going to go into the details of how this works as there are already a number of great descriptions available, including this one.

HTML5 Microdata is, in essence, a way of embedding semantic information in a web page, but it doesn't tell a human or a machine what any of the information means, especially as different people could embed the same information using different identifiers or in different ways. What is needed is a common vocabulary that can be used to embed information about common concepts, and currently many of the major search engines have settled on using schema.org as the common vocabulary.

When I first heard about schema.org, back in 2011, I thought it was a great idea, and wrote some code that could be used to embed the output of a GATE application within a HTML page as microdata. Unfortunately the approach I adopted was, to put it bluntly, hacky. So, while I had proved it was possible, the code was left to rot in a dark corner of my SVN repository.

I was recently reminded of HTML5 microdata, and schema.org in particular, when one of my colleagues tweeted a link to this interesting article. In response I was daft enough to admit that I had some code that would allow people to automatically embed the relevant microdata into existing web pages. It wasn't long before a number of people had made it clear that they would be interested in me finishing and releasing the code.

I'm currently in a hotel in London as I'm due to teach a two day GATE course starting tomorrow (if you want to learn all about GATE then you might be interested in our week long course to be held in Sheffield in June) and rather than watching TV or having a drink in the bar I thought I'd make a start on tidying up the code I started on almost three years ago.

Before we go any further I should point out that while the code works, the current interface isn't the most user friendly. As such I've not added this to the main GATE distribution as yet. I'm hoping that those of you who give it a try can leave me feedback so I can finish cleaning things up and integrate it properly. Having said that, here is what I have so far...

I find that worked examples usually help convey my ideas better than prose, so let's start with a simple HTML page:
<html>
  <head>
    <title>This is a schema.org test document</title>
  </head>
  <body>
    <h1>This is a schema.org test document</h1>
    <p>
      Mark Greenwood works in Sheffield for the University of Sheffield. 
      He is currently in a Premier Inn in London, killing time by working on a GATE plugin to allow annotations to be embedded within a HTML document using HTML microdata and the schema.org model.
    </p>
  </body>
</html>
As you can see, this contains a number of obvious entities (people, organizations and locations, all of which are among the schema.org concepts) that could be described using the schema.org vocabulary and which would be found by simply running ANNIE over the document.

Once we have sensible annotations for such a document, probably from running ANNIE, and a mapping between the annotations (and their features) and the schema.org vocabulary, then it is fairly easy to produce a version of this HTML document with the annotations embedded as microdata. The current version of my code generates the following file:
<html>
  <head>
    <title>This is a schema.org test document</title>
  </head>
  <body>
    <h1>This is a schema.org test document</h1>
    <p>
      <span itemscope="itemscope" itemtype="http://schema.org/Person"><meta content="male" itemprop="gender"/><meta content="Mark Greenwood" itemprop="name"/>Mark Greenwood</span> works in <span itemscope="itemscope" itemtype="http://schema.org/City"><meta content="Sheffield" itemprop="name"/>Sheffield</span> for the <span itemscope="itemscope" itemtype="http://schema.org/Organization"><meta content="University of Sheffield" itemprop="name"/>University of Sheffield</span>.
      He is currently in a <span itemscope="itemscope" itemtype="http://schema.org/Organization"><meta content="Premier" itemprop="name"/>Premier</span> Inn in <span itemscope="itemscope" itemtype="http://schema.org/City"><meta content="London" itemprop="name"/>London</span>, killing time by working on a GATE plugin to allow <span itemscope="itemscope" itemtype="http://schema.org/Person"><meta content="female" itemprop="gender"/><meta content="ANNIE" itemprop="name"/>ANNIE</span> annotations to be embedded within a HTML document using HTML microdata and the schema.org model.
    </p>
  </body>
</html>
This works nicely and the embedded data can be extracted by the search engines, as proved using the Google rich snippets tool.

As I said earlier, while the code works, the current integration with the rest of GATE definitely needs improving. If you load the plugin (details below) then right clicking on a document will allow you to Export as HTML5 Microdata... but it won't allow you to customize the mapping between annotations and a vocabulary. Currently the ANNIE annotations are mapped to the schema.org vocabulary using a config file in the resources folder; if you want to change the mapping you have to change this file. In the future I plan to add some form of editor (or at least the ability to choose a different file) as well as the ability to export a corpus, not just a single file.

So if you have got all the way to here then you probably want to get your hands on the current plugin, so here it is. Simply load it into GATE in the usual way and it will add the right-click menu option to documents (you'll need to use a nightly build of GATE, or a recent SVN checkout, as it uses the resource helpers that haven't yet made it into a release version).

Hopefully you'll find it useful, but please do let me know what you think, and if you have any suggestions for improvements, especially around the integration with the GATE Developer GUI.

At Sixes And Sevens

At work we are slowly getting ready for a major new release of GATE. In preparation for the release I've been doing a bit of code cleanup and upgrading some of the libraries that we use. After every change I've been running the test suite and unfortunately some of the tests would intermittently fail. Given that none of the other members of the team had reported failing tests and that they were always running successfully on our Jenkins build server I decided the problem must be something related to my computer. My solution then was simply to ignore the failing tests as long as they weren't relevant to the code I was working on, and then have the build server do the final test for me. This worked, but it was exceedingly frustrating that I couldn't track down the problem. Yesterday I couldn't ignore the problem any longer because the same tests suddenly started to randomly fail on the build server as well as my computer and so I set about investigating the problem.

The tests in question are all part of a single standard JUnit test suite that was originally written back in 2001 and which has been running perfectly ever since. Essentially these tests run the main ANNIE components over four test documents, checking the result at each stage. Each component is checked in a different test within the suite. Now if you know anything about unit testing you can probably already hear alarm bells ringing. For those of you that don't know what unit testing is, essentially each test should check a single component of the system (i.e. a unit) and should be independent from every other test. In this instance, while each test checked a separate component, each relied on all the previous tests in the suite having run successfully.
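As a minimal illustration of the anti-pattern (this is not the actual GATE code, just a sketch of the shape of the problem), consider something like:
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

public class DependentTests extends TestCase {
  // shared mutable state: the second test silently depends on the first
  private static List<String> tokens;

  public void testTokenizer() {
    tokens = new ArrayList<String>();
    tokens.add("hello");
    assertEquals(1, tokens.size());
  }

  public void testGazetteer() {
    // throws a NullPointerException if testTokenizer hasn't already run
    assertTrue(tokens.contains("hello"));
  }
}
Run testTokenizer first and everything passes; run testGazetteer first and it fails, which is exactly the kind of intermittent failure I was seeing.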

Now while dependencies between tests aren't ideal, this still doesn't explain why the tests should have worked fine for twelve years but were now failing, nor why they started failing on the build server long after they had been failing on my machine. I eventually tracked the change that caused them to fail when run on the build server back to the upgrade from version 4.10 to 4.11 of JUnit, but even with the help of a web search I couldn't figure out what the problem was.

Given that I'd looked at the test results from my machine so many times and not spotted any problems I roped in some colleagues to run the tests for me on their own machines and send me the results to see if I could spot a pattern. The first thing that was obvious was that when using version 4.10 of JUnit the tests only failed for those people running Java 7. GATE only requires Java 6 and those people still with a Java 6 install, which includes the build server (so that we don't accidentally introduce any Java 7 dependencies), were not seeing any failures. If, however, we upgraded JUnit to version 4.11 everyone started to see random failures. The other thing that I eventually spotted was that when the tests failed, the logs seemed to suggest that they had been run in a random order which, given the unfortunate links between the tests, would explain why they then failed. Armed with all this extra information I went back to searching the web and this time I was able to find the problem and an easy solution.

Given that unit tests are meant to be independent from one another, there isn't actually anything within the test suite that stipulates the order in which they should run, but it seems that it always used to be the case that the tests were run in the order in which they were defined in the source code. The tests are identified by looking for all methods whose names start with the word test, and these are extracted from the class definition using the Method[] getDeclaredMethods() method from java.lang.Class. The documentation for this method includes the following description:

Returns an array of Method objects reflecting all the methods declared by the class or interface represented by this Class object. This includes public, protected, default (package) access, and private methods, but excludes inherited methods. The elements in the array returned are not sorted and are not in any particular order.

This makes it more than clear that we should never have assumed that the tests would be run in the same order they were defined, but it turns out that this was the order in which the methods were returned when using the Sun/Oracle versions of Java up to and including Java 6 (update 30 is the last version I've tested). I've written the following simple piece of code that shows the order of the extracted tests as well as info on the version of Java being used:
import java.lang.reflect.Method;

public class AtSixesAndSevens {
  public static void main(String args[]) {
    System.out.println("java version \"" +
      System.getProperty("java.version") + "\"");
    System.out.println(System.getProperty("java.runtime.name") +
      " (build " + System.getProperty("java.runtime.version") + ")");
    System.out.println(System.getProperty("java.vm.name") +
      " (build " + System.getProperty("java.vm.version") + " " +
      System.getProperty("java.vm.info") + ")\n");

    for(Method m : AtSixesAndSevens.class.getDeclaredMethods()) {
      if(m.getName().startsWith("test"))
        System.out.println(m.getName());
    }
  }

  public void testTokenizer() {}
  public void testGazetteer() {}
  public void testSplitter() {}
  public void testTagger() {}
  public void testTransducer() {}
  public void testCustomConstraintDefs() {}
  public void testOrthomatcher() {}
  public void testAllPR() {}
}
Running this on the build server gives the following output:
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02 mixed mode)

testTokenizer
testGazetteer
testSplitter
testTagger
testTransducer
testCustomConstraintDefs
testOrthomatcher
testAllPR
While running it on my machine results in a random ordering of the test methods as you can see here:
java version "1.7.0_51"
OpenJDK Runtime Environment (build 1.7.0_51-b00)
OpenJDK 64-Bit Server VM (build 24.45-b08 mixed mode)

testTagger
testTransducer
testGazetteer
testSplitter
testTokenizer
testCustomConstraintDefs
testOrthomatcher
testAllPR
Interestingly, it would seem that the order only changes when the class is re-compiled, which suggests that the ordering may be related to how the methods are stored in the class file, though understanding the inner workings of the class file format is well beyond me. Even more interestingly, it seems that even with Java 6 you can see a random ordering if you aren't using a distribution from Sun/Oracle; here is the output from running under the Java 6 version of OpenJDK:
java version "1.6.0_30"
OpenJDK Runtime Environment (build 1.6.0_30-b30)
OpenJDK 64-Bit Server VM (build 23.25-b01 mixed mode)

testTagger
testTransducer 
testCustomConstraintDefs
testOrthomatcher
testAllPR
testTokenizer
testGazetteer
testSplitter
So this explains why switching from Java 6 to Java 7 could easily cause these related tests to fail, but why should upgrading from JUnit version 4.10 to 4.11 while staying with Java 6 cause a problem?

It turns out that in the new version of JUnit the developers decided to change the default behaviour away from relying on the non-deterministic method ordering provided by Java. Their default approach is now to use a deterministic ordering to guarantee the tests are always run in the same order; as far as I can tell this orders the methods by sorting on the hashCodes of the method names. While this may at least remove the randomness from the test order, it doesn't keep the tests in the same order they are defined in the source file, and so our tests were always failing under JUnit 4.11. Fortunately the developers also allow you to add a class level annotation to force the methods to be ordered alphabetically. I've now renamed the tests so that when sorted alphabetically they are run in the right order (by adding a three digit number after the initial "test" in their name), and the class definition now looks like:
@FixMethodOrder(MethodSorters.NAME_ASCENDING)
public class TestPR extends TestCase {
  ...
}
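For illustration, the renaming scheme means the test methods now look something like this (the exact names are my reconstruction of the scheme, not copied from the real source):
@FixMethodOrder(MethodSorters.NAME_ASCENDING)
public class TestPR extends TestCase {
  public void test001Tokenizer() { /* as before */ }
  public void test002Gazetteer() { /* as before */ }
  public void test003Splitter() { /* as before */ }
  // and so on, so that alphabetical order matches the pipeline order
}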
So I guess there are two morals to this story. Firstly, unit tests are called unit tests for a reason: they really should be independent of one another. More importantly though, reading the documentation for the language or library you are using, and not making assumptions about how it works (especially when the documentation tells you not to rely on something always being true), would make life easier.

Improving Your Thumbnails

Now before you all get confused this isn't a post about health or beauty tips for your thumbnails, but rather about a little trick I sometimes use to improve the thumbnails Blogger generates for each post.

When you publish a post through Blogger not only does the post appear on your blog but Blogger also inserts the content into two news feeds (RSS and Atom formats) which people can subscribe to in order to know when you have published something new. These feeds are also used to fill in the blog list widget that many people include in their blog template; you can see mine under the heading "I'm Also..." on the right.

If you have included an image in your post, and the image is hosted by Blogger (i.e. has blogspot in the URL) then as well as putting the post content into the news feed Blogger will generate a small thumbnail to represent the post. It appears to do this by creating a 72 pixel square thumbnail from the first image in the post. Specifically it scales the image so that the short edge is 72 pixels long and then crops the other dimension to retain the middle of the image. You can see this working with the image of a Chinese tin of spam from one of my recent posts.


To make the images easier to see I've used a width of 300 pixels instead of 72, but the result is the same. On the left you can see the original image sized to 300 pixels wide, whereas on the right we have a 300 pixel thumbnail generated by scaling the height first and then cropping the width. In this example the cropping isn't too bad as it has retained almost all of the important content, but you can easily imagine images where the cropping could result in thumbnails that were far from ideal; chopping off people's heads is a common example.
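If you're curious, here is a minimal Java sketch of that scale-then-crop logic (my reconstruction of what Blogger appears to do, certainly not its actual code):
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class SquareThumbnail {
  // scale so the short edge matches the target size, then crop the
  // long edge so that only the middle of the image is retained
  public static BufferedImage create(BufferedImage src, int size) {
    double scale = (double) size / Math.min(src.getWidth(), src.getHeight());
    int w = (int) Math.round(src.getWidth() * scale);
    int h = (int) Math.round(src.getHeight() * scale);

    BufferedImage scaled = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = scaled.createGraphics();
    g.drawImage(src, 0, 0, w, h, null);
    g.dispose();

    // keep the centre of whichever dimension is still too long
    return scaled.getSubimage((w - size) / 2, (h - size) / 2, size, size);
  }
}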

Fortunately it is easy to control the thumbnail that is generated by ensuring that the first image in your post is already square and cropped to your satisfaction. Now of course that would often leave you with an image you don't actually want to use, but that is alright as my trick doesn't actually result in the image appearing in the post anyway.

To customize the thumbnail all you have to do is upload the square image you want to use (it can be any size, although as it will always be displayed as a 72 pixel square there isn't much point making it too big) through the Blogger interface and then switch to the HTML editor view. Now we don't need all the HTML that Blogger generates, as all we need is the img tag. So you can remove everything surrounding the image and move the rest to the very beginning of your post. For this post that looks something like (I've trimmed the URL to fit the screen):
<img src="http://3.bp.blogspot.com/.../thumnail.jpeg" />
By placing this at the beginning of the post we ensure that this image is the one Blogger uses when it generates the thumbnail, and we can hide it for all other purposes by adding some CSS to the image as follows:
<img style="display:none;" src="http://3.bp.blogspot.com/.../thumnail.jpeg" />
In this case the CSS is fairly self-explanatory as it simply turns off the display of the image. And that is it: a very simple trick, but one that can make your blog look better in other people's news feeds.

As well as using this to customize a thumbnail that Blogger would already generate, you can of course use it to generate thumbnails in cases where Blogger otherwise wouldn't. The two main cases where this might be useful are firstly where you host your images somewhere else (maybe Flickr) or, and this is where I most often use this trick, where you have embedded a YouTube video instead of an image in your post. In neither case does Blogger generate a thumbnail for you, but you should be able to see how easy it would be to add your own.

The Other Kiwi's Grass Is Greener

I've recently been building a simple proof-of-concept web application which may or may not see the light of day, but essentially it builds an SVG image directly within the browser through user interaction. One useful feature would be to allow the user to download the completed SVG file (as a backup or for use elsewhere). After quite a bit of experimentation with some of the newer HTML features I now have a (mostly) working solution, which I've reduced down to a simple example using the Kiwi to the left (the SVG for this image comes from a useful tutorial by Chris Coyier). If you can't see a Kiwi then your browser doesn't support displaying SVG images and certainly won't support the techniques in the rest of this post, so if you want to follow along you'll need to find a more modern browser.

Given that the SVG image is being built through standard DOM manipulation it is easy to get the text of the SVG file directly via JavaScript. For example, if we assume that the SVG element has the id svg then the following will get the full text of the image.
// get a deep copy (the element and all its children) of the DOM element with the id "svg"
var svg = document.getElementById("svg").cloneNode(true);

// get the text of this element, including the element and all children
var content = svg.outerHTML;

if (content == undefined) {
  // if there is no content then the outerHTML property isn't supported...

  // so create a new DIV element
  var div = document.createElement("div");

  // append the SVG to the DIV
  div.appendChild(svg);

  // use the widely supported innerHTML to get at the text of the SVG
  content = div.innerHTML;
}
This is a little more convoluted than I would like due to the fact that while all browsers support outerHTML, not all of them (specifically Chrome) support it on the SVG element. Either way this code allows us to get the text we need to return as an SVG file to the user. Note that I've cloned the SVG node (line 2) rather than just referencing it directly, because changing a node's parent in Chrome causes it to disappear from the document, which clearly isn't what we want. The more complex part of the problem then is figuring out how to return the text as an SVG file to the user.

Given that I was looking at this in the context of a large web application, probably the easiest option would have been to send the text back to the server, which could then return it to the browser with a sensible filename and mime type (image/svg+xml), but this would clearly be rather inefficient, especially as the SVG image grows in size. It would also preclude the technique from being used in simple client side applications which didn't use a web server.

I stumbled upon an almost working example by accident. It turns out that if you simply return the SVG data from within a JavaScript href link the SVG will open within the browser, allowing the user to simply save the page to download the SVG. So we can wrap up the code above in a function:
function saveAsSVG(id) {
  var svg = document.getElementById(id).cloneNode(true);

  var content = svg.outerHTML;

  if (content == undefined) {
    var div = document.createElement("div");
    div.appendChild(svg);
    content = div.innerHTML;
  }

  return content;
}
And then call this function from a link:
<a href="javascript:saveAsSVG('svg');">Download</a>
And you can see what happens with this live version of the above. If you've just tried that you will have hopefully noticed a few problems:
  • the image is missing all its colour information (the kiwi and grass are both black)
  • when you save the image the browser suggests a HTML filename
  • the file you save is actually a HTML file not an SVG as the browser adds some extra HTML wrapping
While these are all clearly problems, I was encouraged that I was at least heading in roughly the right direction. The missing colours I had expected (and will come back to later) and so I focused on the final two related problems first. After searching the web I came across the HTML5 download attribute which the specification seemed to suggest was exactly what I needed. Essentially the new download attribute, when applied to a link, signals to the browser that the item the link points to should be downloaded rather than displayed. It also allows you to specify a filename that the browser should suggest when the user clicks the link. So using this with the above gets us:
<a href="javascript:saveAsSVG('svg');" download="kiwi.svg">Download</a>
which again you can try for yourself with this live version of the above. Now I don't know which browser you are using, but I was developing this in Firefox, and I was amazed to find that this works: clicking the link pops up a "save as" dialog with the correct file name and the saved file is the correct SVG file. Unfortunately trying the same thing in both Opera and Chrome doesn't work. In Opera the attribute seems to be simply ignored, as it doesn't change the behaviour, while in Chrome when you click the link nothing happens, and no errors are reported. As far as Opera is concerned, the problem seems to be that I'm running the latest version available for Ubuntu, which is v12, while the most recent version for Windows and Mac is 18; so if Opera are abandoning Linux I'm going to abandon testing with Opera.

Now while I could mandate specific browser versions for the app I was working on, I really did want to try and find a more general solution. Also I had noticed that even once you have saved the SVG in Firefox the throbber never stops spinning, which usually suggests that the browser isn't really happy with what you have just done! Fortunately the search that led me to the download attribute also pointed me at a related new HTML feature: the blob.

A blob "represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a JavaScript-native format", which sounds like exactly what I need, especially as a blob can be referenced by a URL. I'm not going to detail all the possible features or ways of using blobs, but the following single line of code can be used to create a blob from the SVG content we already know how to extract:
var blob = new Blob([content], {type: "image/svg+xml"});
If you remember from before, the variable content holds the text of the SVG file we want to create, so this line creates a binary blob from the SVG data with the right mime type. It is also easy to create and destroy URLs that point to these blobs. For example, the following creates and then immediately destroys a URL pointing to the blob.
var url = window.URL.createObjectURL(blob);
window.URL.revokeObjectURL(url);
You can probably get away without explicitly revoking every URL you create, as they should be automatically revoked when the page is unloaded, but it is certainly a good idea to release them if you know that a) the URL is no longer needed and b) you are creating a lot of these URLs. One thing to note is that Chrome (and other WebKit based browsers) uses a vendor prefix for the URL object (i.e. it's called webkitURL) so we need to make sure that regardless of the browser we are using we can access the functions we need. Fortunately this is easy to do by adding the following line (outside of any function):
window.URL = window.URL || window.webkitURL;

So the remaining question is how we string all these bits together so that clicking a link on a page builds a blob, generates a URL, and then allows the user to download the result. All the solutions I saw while searching the web did this in two steps. Essentially they had a link or button that, when clicked, would build the blob, generate the URL, and then add a new link to the document. This new link would use the download attribute and the blob URL to allow the user to download the file. While this works it doesn't seem particularly user friendly. Fortunately it is easy to combine all this functionality into a single link.

The trick to doing everything within a single link is to use the onclick event handler of the link to build the blob and generate the URL, which it then sets as the href attribute of the link. As long as the onclick event handler returns true, the browser will follow the href, which now points at the blob URL; when we combine this with the download attribute the result is that the user is prompted to download the blob. So the function I came up with looks like the following:
function saveAsSVG(id, link) {

  if (link.href != "#") {
    window.URL.revokeObjectURL(link.href);
  }
      
  var svg = document.getElementById(id).cloneNode(true);

  var content = svg.outerHTML;

  if (content == undefined) {
    var div = document.createElement("div");
    div.appendChild(svg);
    content = div.innerHTML;
  }
  
  var blob = new Blob([content], {type: "image/svg+xml"});

  link.href = window.URL.createObjectURL(blob);

  return true;
}
Essentially this is just all the code we have already seen assembled into a single function, although it assumes that the initial href is # so that it doesn't try to revoke an invalid URL (from testing it seems that trying to revoke something that isn't valid is simply ignored, but this is slightly safer). We can then call this function from a link as follows:
<a onclick="return saveAsSVG('svg',this);" href="#" download="kiwi.svg">Download</a>
And again you can try this for yourself using this live version of the above. Now this should work in all browsers that support these new HTML features which, according to the excellent Can I Use... website, should be most modern browsers. The only remaining issue is that the kiwi and grass are still both black in the downloadable SVG.

The reason the colour information is lost is that it was never part of the SVG to start with. In this example the SVG is styled using CSS within the web page rather than directly within the SVG. When you download the SVG file you don't get the associated CSS styles and hence the elements revert to being black. If we had used a self contained SVG image then everything would work and we would have no need to go any further. Fortunately, even when we style the SVG using CSS, it is fairly easy to merge the styles into the SVG before the user downloads the image. In this example the page contains the following CSS style element.
<style title="kiwi" type="text/css">
  .ground {
    fill: #94d31b; 
  }

  .kiwi {
    fill: #C19A6B;
  }
</style>
Fortunately it is easy to add CSS styles to the SVG image before creating the blob object. If we assume that there is a function that returns a DOM style element then we can extend our existing function to embed this into the SVG as follows:
function saveAsStyledSVG(id, styles, link) {

  if (link.href != "#") {
    window.URL.revokeObjectURL(link.href);
  }
      
  var svg = document.getElementById(id).cloneNode(true);
  svg.insertBefore(getStyles(styles), svg.firstChild);

  var content = svg.outerHTML;

  if (content == undefined) {
    var div = document.createElement("div");
    div.appendChild(svg);
    content = div.innerHTML;
  }
  
  var blob = new Blob([content], {type: "image/svg+xml"});

  link.href = window.URL.createObjectURL(blob);

  return true;
}
Here you can see that on line 8 we call a function getStyles to get the DOM element and then insert this as the first child of the root element of the SVG document, but the rest of the function is identical (other than the name, as we can't overload functions in JavaScript). Now all we need to do is define the getStyles function, which we do as follows:
function getStyles(names) {

  // create an empty style DOM element
  var styles = document.createElement("style");

  // set the type attribute to text/css
  styles.setAttribute("type","text/css");

  // get easy access to the Array slice method
  var slice = Array.prototype.slice;

  for (var i = 0 ; i < document.styleSheets.length ; ++i) {
    // for each style sheet in the document
    var sheet = document.styleSheets[i];

    if (names == undefined || names.indexOf(sheet.title) != -1) {
      // if we are including all styles or this sheet is one we want

      slice.call(document.styleSheets[i].cssRules).forEach(
        // slice the style sheet into separate rules
        function(rule){
          // create a new text node with the text of each CSS rule
          var text = document.createTextNode("\n"+rule.cssText);

          // add the rule to the style sheet we are building
          styles.appendChild(text);
        }
      );
    }
  }

  // return the completed style sheet
  return styles;
}
Hopefully the comments make it clear how this works, but essentially it finds all the rules within a named set of style sheets and adds them to a newly created style DOM element. It's worth noting that it is much safer to explicitly specify the style sheets you want to use because a) it will be quicker, b) the resultant file will be smaller, and most importantly c) you can't access the content of style sheets loaded from a different domain to the current page, as this goes against the browser security model.

So given these new functions we can now use a link defined as follows to allow the kiwi to be downloaded in all its glorious colours:
<a onclick="return saveAsStyledSVG('svg',['kiwi'],this);" href="#" download="kiwi.svg">Download</a>
And again here is a live version of the above for you to try, which should allow you to download the full colour SVG.

Hopefully you have found this a useful and interesting exploration of some of the newer features of HTML, but it is worth noting that the techniques discussed could be used for downloading a whole range of different file types without requiring server interaction.

Why Aren't They Spamming The Chinese?

Whilst trying to drink my first cup of coffee this morning, I was rudely interrupted by click-jacking malware affecting my wife’s computer. All she was trying to do was look at some Google search results, but clicking on them would take her to a suspicious looking shopping search site. From a little bit of Googling it looked as if it might be a really nasty trojan which would have taken ages to clean up. Fortunately it turned out that all the pages she was having the problem with had been infected with the same bit of malicious JavaScript. I'm not sure how (probably through a malicious banner ad or something), but a reference to the following JavaScript had been inserted at the very end (after the </html>) of each affected page:
if (navigator.language)
  var language = navigator.language;
else
  var language = navigator.browserLanguage;

if(language.indexOf('zh') == -1) { 
  var regexp = /\.(aol|google|youdao|yahoo|bing|118114|biso|gougou|ifeng|ivc|sooule|niuhu|biso|ec21)(\.[a-z0-9\-]+){1,2}\//ig;
  var where = document.referrer;
  if (regexp.test(where)) {
    window.location.href="http://www.bbc.co.uk/news";
  }
}
To make the script easier to read I've reformatted it, and replaced the redirect with a safe URL (who doesn't trust the BBC?) rather than giving the spammers free advertising, but I haven't changed any of the functional aspects of the script.

Essentially all it does is check the URL that you were on when you clicked the link leading you to the current page, and if that looks like a search results page from one of 14 different companies, then it redirects you. The regular expression it uses to check the referring page is simple yet effective and will catch any of the sub-domains of these search services as well. What I find weird is why the script checks the language of the browser.

The first four lines of the script get the language the browser is using. There are two ways of doing this depending on which browser you are using, hence the if statement. On my machine this gets me en-US (which means I need to figure out why it has switched from en-UK, which is what I thought I'd set it to). Line 6 then checks to make sure the language doesn't include the string zh, which according to Wikipedia is Chinese. I'm assuming that the spammers behind the script are Chinese and don't want to be inconvenienced by their own script, but it seems odd, especially when you consider that at least one of the search engines covered by the regular expression (118114, which appears on many different top-level domains) seems to be a Chinese site.

Looking at this script there is of course another way to defeat it, other than disabling JavaScript. One of the privacy or security options in most browsers concerns the referer (yes, I know it is spelt wrong, but that is the correct spelling in the HTTP spec) header. Essentially this header tells a web server the page you were on when you clicked the link leading to the page you are requesting. Some sites use this to provide functionality, so disabling it can cause problems, but it does mitigate against scripts like this one. Because it can cause problems it's often an advanced setting; for example, here are the details for Firefox.

Serializing To A Human Interface Device

If you've read my previous post you'll know that I've been looking at a cheap and simple way of adding serial communication to a breadboard Arduino clone (such as this one). To summarise the situation so far: adding true RS-232 serial communication is both expensive and difficult as the required part is only available as a surface mount component, but I discovered V-USB which allows me to emulate low speed USB devices. The end result was that I managed to use V-USB to emulate a USB keyboard. Being able to pass data from the Arduino to the PC by simply emulating key presses is useful but a) it is rather slow, b) different keyboard mappings lead to different characters being typed and, more importantly, c) it doesn't allow me to send data to the Arduino. So on we go...

Let's start with what I haven't managed to achieve: a USB CDC ACM device for RS-232 communication. Unfortunately CDC ACM devices require bulk endpoints (these allow for large sporadic transfers using all remaining available bandwidth, but with no guarantees on bandwidth or latency) and these are only officially supported for high speed USB devices. V-USB only allows me to emulate low speed USB devices, and while most operating systems used to allow low speed devices to create bulk endpoints, even though this is contrary to the spec, modern versions of Linux (and possibly Windows) do not. I did manage to get a device configured correctly, but as soon as I plugged it in the bulk endpoints were detected and converted to interrupt endpoints, which stopped the device from working. However, all is not lost as I do have a solution which I think is just as good: serializing data to and from a generic USB Human Interface Device.

The USB specification defines the USB Human Interface Device (HID) class to support, as the name suggests, devices with some form of human interface. This doesn't mean sticking a USB cable into your arm, but rather defines common devices such as keyboards, mice and game controllers as well as devices like exercise machines, audio controllers and medical instruments. While such devices may communicate data in a variety of forms it all passes to and from the device using the same protocol. This means that when you plug any such device into practically any computer with a USB port it will be recognised and basic drivers will be loaded.

Writing code to communicate with a USB HID device isn't that much more complex than interfacing with a classic serial port and given the standard driver support we can rely on the operating system taking care of most of the communication for us.

For what follows I'm assuming the same basic USB circuit that I described in the previous post, as we know it works and it is cheap to build.


Now we have the circuit, let's move on to the software we need to write. Unlike with the USBKeyboard library, which powered The Caffeine Button, we will need both firmware for the Arduino and host software that will run on the PC and interface with the basic HID drivers the operating system provides. Given that you can't test the host software until we have a working device, we'll start by looking at the firmware.

The first thing you have to do when constructing a HID is to define its descriptor. The descriptor is how the device presents itself to the operating system and defines the type of device as well as the size and type of any communication messages. Now you will probably never need to edit this but I thought it was worth showing you the full descriptor we are using:
PROGMEM char usbHidReportDescriptor[USB_CFG_HID_REPORT_DESCRIPTOR_LENGTH] = {
    0x06, 0x00, 0xff,              // USAGE_PAGE (Generic Desktop)
    0x09, 0x01,                    // USAGE (Vendor Usage 1)
    0xa1, 0x01,                    // COLLECTION (Application)
    0x15, 0x00,                    //   LOGICAL_MINIMUM (0)
    0x26, 0xff, 0x00,              //   LOGICAL_MAXIMUM (255)
    0x75, 0x08,                    //   REPORT_SIZE (8)
    0x95, OUT_BUFFER_SIZE,         //   REPORT_COUNT (currently 8)
    0x09, 0x00,                    //   USAGE (Undefined)  
    0x82, 0x02, 0x01,              //   INPUT (Data,Var,Abs,Buf)
    0x95, IN_BUFFER_SIZE,          //   REPORT_COUNT (currently 32)
    0x09, 0x00,                    //   USAGE (Undefined)        
    0xb2, 0x02, 0x01,              //   FEATURE (Data,Var,Abs,Buf)
    0xc0                           // END_COLLECTION
};
In this descriptor we define two reports of different sizes which we will use for transferring data to and from the device. The first thing to point out is that the specification defines input and output with respect to the host PC and not the device, so an input message is actually used for writing out from the device rather than for receiving data. Given this, we can see that the descriptor defines two message types. Firstly (lines 7 to 10) we define an 8 byte input report (OUT_BUFFER_SIZE is defined as 8, and the report size is given in bits, so we have 8 bits times the count to give 8 bytes), which means we can write 8 byte blocks of data back to the PC we are connected to. The second message type is defined as a FEATURE message of 32 bytes (because IN_BUFFER_SIZE is defined as 32 and the REPORT_SIZE hasn't been redefined, so it is still 8 bits), which we will use for passing data from the PC to the USB device.

As I said, you will probably never need to edit this structure, especially as you can tweak the message sizes, if necessary, by adjusting the two constants instead. If you do decide to change the descriptor it is worth noting that some operating systems are more forgiving than others. For example, with Linux, if you have defined a message of 8 bytes but only have two to send then you can do that and everything will work. Under Windows, however, if you only send two bytes the device will simply stop functioning altogether, so you will need to pad the message to be exactly 8 bytes. This also means that it is easy to check that your descriptor matches what you are actually doing by quickly testing under Windows (I've been doing this with a copy of Windows XP running under VirtualBox).

Now that we know the size of the messages we will send and receive, we still need to decide upon their format, i.e. the protocol we will use for our data on top of the USB protocol. If we were only interested in sending textual data then we could send null terminated data (i.e. put a zero value byte into the array after the last byte of data), but if we want to send arbitrary bytes then using 0 as an end of data marker seems an odd choice. For this reason I've opted to set the first byte of each message to the length of the valid data in the array. This is both simple to use and results in firmware code that is slightly simpler (and hence smaller) than checking for a null terminator. It does of course mean that in an 8 byte message we can only fit 7 bytes of actual data plus the length marker (though this is no different than with null terminated data). If you know the messages you want to send will always be of a fixed length then tweaking the buffer sizes to suit might make for a more efficient transfer of data. In general, as you will see shortly, a lot of these details are dealt with for you as a user of the library.
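To make the framing concrete, here is a sketch (in Java, as used by the host software later in this post; the library handles this internally, so these helpers are purely illustrative) of how a message is packed into, and unpacked from, a fixed-size report:
public class ReportFraming {
  // pack data into a fixed-size report: the first byte is the number of
  // valid bytes, the rest is data, zero-padded to the full report size
  // (remember Windows insists on receiving complete reports)
  public static byte[] pack(byte[] data, int reportSize) {
    if (data.length > reportSize - 1) {
      throw new IllegalArgumentException("data too long for a single report");
    }
    byte[] report = new byte[reportSize]; // Java zero-fills new arrays
    report[0] = (byte) data.length;       // the length marker
    System.arraycopy(data, 0, report, 1, data.length);
    return report;
  }

  // unpack a report: read the length marker and copy out just the valid bytes
  public static byte[] unpack(byte[] report) {
    byte[] data = new byte[report[0]];
    System.arraycopy(report, 1, data, 0, data.length);
    return data;
  }
}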

From the very beginning my aim was to provide a drop-in replacement for the standard Arduino Serial object, and so I've made the USBSerial library implement the same Stream interface. This means you can use any of the methods defined in the Stream interface for reading and writing data, and the details about buffer sizes etc. are hidden within the library.

To show how easy the library is to use, here is a simple example where the sketch simply echoes back any bytes that it is sent.
#include <USBSerial.h>

char buffer[IN_BUFFER_SIZE];

void setup() {
    USBSerial.begin();
}

void loop() {
  USBSerial.update();
  if(USBSerial.available() > 0) {
    int size = USBSerial.readBytes(buffer, IN_BUFFER_SIZE);
    if (size > 0) {
      USBSerial.write((const uint8_t*)buffer, size);
    }
  }
}
Note that I've used the same IN_BUFFER_SIZE constant in this example as within the library itself, as there is no reason to define a buffer that is bigger than we can ever expect to fill. The only line that you wouldn't find in a similar example using the standard Serial object is line 10, where we make sure that the USB connection is up to date (you need to do this approximately once every 50ms to keep the connection alive). Before we move on to the host software there are a few things you need to know about using the library.

Unfortunately the V-USB part of the library needs customizing for each project, so you can't simply drop the library into the Arduino sketchbook folder, as the USB manufacturer and product identifiers have to be unique for different devices. These identifiers are set in the usbconfig.h file. V-USB doesn't actually provide a copy of usbconfig.h; what is provided is a file called usbconfig-prototype.h, which you can copy and rename as a starting point. I've already done a lot of the configuration for you by editing usbconfig-prototype.h, leaving just four lines you need to edit for yourself. Firstly you need to set the vendor name property by editing lines 244 and 245:
#define USB_CFG_VENDOR_NAME     'o', 'b', 'd', 'e', 'v', '.', 'a', 't'
#define USB_CFG_VENDOR_NAME_LEN 8
and then the device name by editing lines 254 and 256:
#define USB_CFG_DEVICE_NAME     'T', 'e', 'm', 'p', 'l', 'a', 't', 'e'
#define USB_CFG_DEVICE_NAME_LEN 8
These values have to be changed and can't be set to any random value because as part of the V-USB license agreement you need to conform to the following rules (taken verbatim from the file USB-IDs-for-free.txt):

(2) The textual manufacturer identification MUST contain either an Internet domain name (e.g. "mycompany.com") registered and owned by you, or an e-mail address under your control (e.g. "myname@gmx.net"). You can embed the domain name or e-mail address in any string you like, e.g. "Objective Development http://www.obdev.at/vusb/".

(3) You are responsible for retaining ownership of the domain or e-mail address for as long as any of your products are in use.

(4) You may choose any string for the textual product identification, as long as this string is unique within the scope of your textual manufacturer identification.
Once properly configured you should be able to compile (I recommend using arduino-mk instead of the Arduino IDE) and use the library without issue, and without understanding how it actually works internally (if you are interested in the details, both my code and the V-USB library contain vast amounts of code comments which should help you get a better understanding). So let's move on to looking at the host software.

As I've already mentioned, connecting the device to a PC usually causes generic HID drivers to be loaded by the operating system. This means that you should be able to use any programming language you like to write the host software, as long as it can talk to the generic USB drivers. I've included host software written in Java using javahidapi but, for instance, you could also use PyUSB if you prefer to program in Python. The important thing to remember is the protocol for passing data that we defined earlier: data to the USB device is sent as 32 byte feature requests with the first byte being the length of the valid data in the rest of the array, while data from the USB device arrives in 8 byte chunks, again with the first byte being the length of the valid data.

As with the firmware code we have already discussed, I've written a simple Java library to hide most of the details behind standard interfaces, which allow you to read and write data using the standard Java InputStream and OutputStream interfaces. Full details of the available methods can be found in the Javadoc but a simple echo example shows most of the important details.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;

import englishcoffeedrinker.arduino.USBSerial;

public class SimpleEcho {
  public static void main(String[] args) throws Exception {
    // get an instance of the USBSerial class for the specified device
    USBSerial serial =
        USBSerial.getInstance("englishcoffeedrinker.co.uk", "EchoChamber");

    // open the underlying USB connection to the device
    serial.connect();

    // create an output stream to write characters to the device
    PrintStream out = new PrintStream(serial.getOutputStream());

    // send a simple message
    out.println("hello world!");

    // ensure the message has been sent and not buffered internally somewhere
    out.flush();

    // create a reader for getting characters back from the device
    BufferedReader in =
        new BufferedReader(new InputStreamReader(serial.getInputStream()));

    String line;
    while((line = in.readLine()) == null) {
      // keep checking the device until a line of text is returned
    }

    // display the message sent from the device
    System.out.println(line);

    // we have finished so disconnect our connection to the device
    serial.disconnect();
  }
}
Essentially, lines 10 to 14 get an instance of the USBSerial class for a specific device, in this case found via the manufacturer and product identifiers although other methods are available, and then opens the connection. Lines 16 to 23 then use the OutputStream to write data to the device while lines 26 to 35 read it back, with line 38 cleaning up by closing the connection. For anyone who is happy programming in Java this should look no different than reading or writing to and from any other type of stream, which means it should be easy to integrate within any project where you want to communicate with an Arduino.

To make life a little easier I've also included a slightly more useful demo application that effectively reproduces the Serial Monitor from the Arduino IDE. You can see it running here connected to a device running the simple echo sketch from earlier in this post, but it should work with any device that uses the USBSerial library.

I've included another example with the USBSerial library that shows you can use this for more than just echoing data. This is the CmdMsg example, which I've talked about before on this blog, but this version uses the USBSerial library, and hence can be controlled through this new USB Serial Monitor rather than the standard Serial library.

If you've read all the way to here then I'm guessing you might want to know where you can get the library from; well, it is available under the GPL licence (a restriction imposed because I'm using the free USB vendor and product IDs from Objective Development) from my SVN repository. Do let me know if you find it useful or if you have any suggestions for improvement.

When I set out to try and add serial support to a breadboarded Arduino (specifically this circuit) I did have a device I wanted to build in mind, so I'm sure at some point I'll blog again about using this library in a real device rather than just the simple examples included with the library that do nothing more than prove everything works.