Why Aren't They Spamming The Chinese?

Whilst trying to drink my first cup of coffee this morning, I was rudely interrupted by click-jacking malware affecting my wife’s computer. All she was trying to do was look at some Google search results, but clicking on them would take her to a suspicious looking shopping search site. From a little bit of Googling it looked as if it might be a real nasty trojan which would have taken ages to clean up. Fortunately it turned out that all the pages she was having the problem with had been infected with the same bit of malicious JavaScript. I'm not sure how (probably through a malicious banner ad or something) but a reference to the following JavaScript had been inserted at the very end (after the </html>) of each affected page:
if (navigator.language)
  var language = navigator.language;
else
  var language = navigator.browserLanguage;

if(language.indexOf('zh') == -1) {
  var regexp = /\.(aol|google|youdao|yahoo|bing|118114|biso|gougou|ifeng|ivc|sooule|niuhu|biso|ec21)(\.[a-z0-9\-]+){1,2}\//ig;
  var where = document.referrer;
  if (regexp.test(where)) {
    top.location = 'http://www.bbc.co.uk/';
  }
}
To make the script easier to read I've reformatted it, and replaced the redirect with a safe URL (who doesn't trust the BBC?) rather than giving the spammers free advertising, but I haven't changed any of the functional aspects of the script.

Essentially all it does is check the URL that you were on when you clicked the link leading you to the current page, and if that looks like a search results page from one of 14 different companies, then it redirects you. The regular expression it uses to check the referring page is simple yet effective and will catch any of the sub-domains of these search services as well. What I find weird is why the script checks the language of the browser.
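If you want to experiment with the check yourself, the same test is easy to reproduce outside the browser; here is a sketch in Java (the class and method names are mine, not part of the malware, and Pattern.CASE_INSENSITIVE stands in for the /i flag of the original):

```java
import java.util.regex.Pattern;

public class ReferrerCheck {
    // the spammers' regular expression translated from JavaScript to Java;
    // CASE_INSENSITIVE replaces the /i flag of the original
    static final Pattern SEARCH_REFERRER = Pattern.compile(
        "\\.(aol|google|youdao|yahoo|bing|118114|biso|gougou|ifeng|ivc|sooule|niuhu|biso|ec21)(\\.[a-z0-9\\-]+){1,2}/",
        Pattern.CASE_INSENSITIVE);

    // does the referring URL look like one of the 14 search services?
    static boolean cameFromSearchEngine(String referrer) {
        return SEARCH_REFERRER.matcher(referrer).find();
    }

    public static void main(String[] args) {
        System.out.println(cameFromSearchEngine("http://www.google.co.uk/search?q=test"));  // true
        System.out.println(cameFromSearchEngine("http://www.example.com/index.html"));      // false
    }
}
```

Note that because the pattern requires a dot before the name and one or two further domain components before the slash, a lookalike domain such as googleevil.com doesn't match.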

The first four lines of the script get the language the browser is using. There are two ways of doing this depending on which browser you are using, hence the if statement. On my machine this gets me en-US (which means I need to figure out why it has switched from en-UK which is what I thought I'd set it to). Line 6 then checks to make sure the language doesn't include the string zh, which, according to Wikipedia, is the language code for Chinese. I'm assuming that the spammers behind the script are Chinese and don't want to be inconvenienced by their own script, but it seems odd, especially when you consider that at least one of the search engines covered by the regular expression (118114, on many different top-level domains) seems to be a Chinese site.

Looking at this script there is of course another way to defeat it, other than disabling JavaScript. One of the privacy or security options in most browsers concerns the referer (yes I know it is spelt wrong, but that is the correct spelling in the HTTP spec) header. Essentially this header tells a web server the page you were on when you clicked the link leading to the page you are requesting. Some sites will use this to provide functionality so disabling it can cause problems but it does mitigate against scripts like this one. Because it can cause problems it's often an advanced setting, for example here are the details for Firefox.

Serializing To A Human Interface Device

If you've read my previous post you'll know that I've been looking at a cheap and simple way of adding serial communication to a breadboard Arduino clone (such as this one). To summarise the situation so far: adding true RS-232 serial communication is both expensive and difficult as the required part is only available as a surface mount component, but I discovered V-USB which allows me to emulate low speed USB devices. The end result was that I managed to use V-USB to emulate a USB keyboard. Being able to pass data from the Arduino to the PC by simply emulating key presses is useful but a) it is rather slow, b) different keyboard mappings lead to different characters being typed and, more importantly, c) it doesn't allow me to send data to the Arduino. So on we go...

Let's start with what I haven't managed to achieve; a USB CDC ACM device for RS-232 communication. Unfortunately CDC ACM devices require bulk endpoints (these allow for large sporadic transfers using all remaining available bandwidth, but with no guarantees on bandwidth or latency) and these are not officially supported for low speed USB devices. V-USB only allows me to emulate low speed USB devices, and while most operating systems used to allow low speed devices to create bulk endpoints, even though this is contrary to the spec, modern versions of Linux (and possibly Windows) do not. I did manage to get a device configured correctly but as soon as I plugged it in the bulk endpoints were detected and converted to interrupt endpoints, which stopped the device from working. However, all is not lost as I do have a solution which I think is just as good; serializing data to and from a generic USB Human Interface Device.

The USB specification defines the USB Human Interface Device (HID) class to support, as the name suggests, devices with some form of human interface. This doesn't mean sticking a USB cable into your arm, but rather defines common devices such as keyboards, mice and game controllers as well as devices like exercise machines, audio controllers and medical instruments. While such devices may communicate data in a variety of forms it all passes to and from the device using the same protocol. This means that when you plug any such device into practically any computer with a USB port it will be recognised and basic drivers will be loaded.

Writing code to communicate with a USB HID device isn't that much more complex than interfacing with a classic serial port and given the standard driver support we can rely on the operating system taking care of most of the communication for us.

For what follows I'm assuming the same basic USB circuit that I described in the previous post as we know it works and it is cheap to build.

Now we have the circuit let's move on to the software we need to write. Unlike with the USBKeyboard library, which powered The Caffeine Button, we will need both firmware for the Arduino and host software that will run on the PC and interface with the basic HID drivers the operating system provides. Given that we can't test the host software until we have a working device, we'll start by looking at the firmware.

The first thing you have to do when constructing a HID is to define its descriptor. The descriptor is how the device presents itself to the operating system and defines the type of device as well as the size and type of any communication messages. Now you will probably never need to edit this but I thought it was worth showing you the full descriptor we are using:
PROGMEM const char usbHidReportDescriptor[USB_CFG_HID_REPORT_DESCRIPTOR_LENGTH] = {
    0x06, 0x00, 0xff,              // USAGE_PAGE (Vendor Defined)
    0x09, 0x01,                    // USAGE (Vendor Usage 1)
    0xa1, 0x01,                    // COLLECTION (Application)
    0x15, 0x00,                    //   LOGICAL_MINIMUM (0)
    0x26, 0xff, 0x00,              //   LOGICAL_MAXIMUM (255)
    0x75, 0x08,                    //   REPORT_SIZE (8)
    0x95, OUT_BUFFER_SIZE,         //   REPORT_COUNT (currently 8)
    0x09, 0x00,                    //   USAGE (Undefined)
    0x82, 0x02, 0x01,              //   INPUT (Data,Var,Abs,Buf)
    0x95, IN_BUFFER_SIZE,          //   REPORT_COUNT (currently 32)
    0x09, 0x00,                    //   USAGE (Undefined)
    0xb2, 0x02, 0x01,              //   FEATURE (Data,Var,Abs,Buf)
    0xc0                           // END_COLLECTION
};
In this descriptor we define two reports of different sizes which we will use for transferring data to and from the device. The first thing to point out is that the specification defines input and output with respect to the host PC and not the device. So an input message is actually used for writing out from the device rather than for receiving data. Given this, we can see that the descriptor defines two message types. Firstly (lines 7 to 10) we define an 8 byte (OUT_BUFFER_SIZE is defined as 8) input report (the size is defined in bits so we have 8 bits times the count to give 8 bytes) which means we can write 8 byte blocks of data back to the PC we are connected to. The second message type is defined as a FEATURE message of 32 bytes (because IN_BUFFER_SIZE is defined as 32 and the REPORT_SIZE hasn't been redefined so it is still 8 bits) which we will use for passing data from the PC to the USB device.

As I said you will probably never need to edit this structure, especially as you can tweak the message sizes, if necessary, by adjusting the two constants instead. If you do decide to change the descriptor it is worth noting that some operating systems are more forgiving than others. For example, with Linux if you have defined a message of 8 bytes but only have two to send then you can do that and everything will work. Under Windows, however, if you only send two bytes the device will simply stop functioning altogether, so you will need to pad the message to be exactly 8 bytes. This also means that it is easy to check that your descriptor matches what you are actually doing by quickly testing under Windows (I've been doing this with a copy of Windows XP running under VirtualBox).
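To illustrate the Windows point, the host-side fix is simply to zero pad every message out to the full report size before sending it; a minimal sketch in Java (the class and method names are mine, not part of any library):

```java
import java.util.Arrays;

public class PadReport {
    // the 8-byte report size from the descriptor (OUT_BUFFER_SIZE)
    static final int REPORT_SIZE = 8;

    // pad a short payload with zeros so the report is always exactly REPORT_SIZE bytes
    static byte[] pad(byte[] payload) {
        if (payload.length > REPORT_SIZE) {
            throw new IllegalArgumentException("payload too large for one report");
        }
        // copyOf zero-fills the extra bytes for us
        return Arrays.copyOf(payload, REPORT_SIZE);
    }

    public static void main(String[] args) {
        byte[] report = pad(new byte[] {42, 7});
        System.out.println(report.length);  // always 8, however short the payload
    }
}
```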

Now that we know the size of the messages we will send and receive we still need to decide upon their format, i.e. the protocol we will use for our data that we are sending on top of the USB protocol. If we were only interested in sending textual data then we could send null terminated data (i.e. put a zero value byte into the array after the last byte of data), but if we want to send arbitrary bytes then using 0 as an end of data marker seems an odd choice. For this reason I've opted to set the first byte of each message to the length of the valid data in the array. This is both simple to use and results in firmware code that is slightly simpler (and hence smaller) than checking for the null terminator. This does of course mean that in an 8 byte message we can only fit 7 bytes of actual data plus the length marker (this is no different than with null terminated data of course). If you know the messages you want to send will always be of a fixed length then tweaking the buffer sizes to suit might make for a more efficient transfer of data. In general, as you will see shortly, as a user of the library a lot of these details are dealt with for you.
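To make the framing concrete, here is a rough Java sketch of the length-prefix scheme for the 8 byte device-to-host reports (the helper names are mine; as noted above, the real library deals with all of this for you):

```java
import java.util.Arrays;

public class LengthPrefixed {
    // 8-byte reports from device to host, as defined by the descriptor
    static final int REPORT_SIZE = 8;

    // first byte holds the length, so only REPORT_SIZE - 1 bytes of payload fit
    static byte[] encode(byte[] payload) {
        if (payload.length > REPORT_SIZE - 1) {
            throw new IllegalArgumentException("payload must fit in one report");
        }
        byte[] report = new byte[REPORT_SIZE];  // zero padded, which keeps Windows happy
        report[0] = (byte) payload.length;
        System.arraycopy(payload, 0, report, 1, payload.length);
        return report;
    }

    // read the length byte and return just the valid data
    static byte[] decode(byte[] report) {
        return Arrays.copyOfRange(report, 1, 1 + report[0]);
    }

    public static void main(String[] args) {
        byte[] report = encode("hi".getBytes());
        System.out.println(report[0]);                   // 2
        System.out.println(new String(decode(report)));  // hi
    }
}
```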

From the very beginning my aim was to find a drop-in replacement for the standard Arduino Serial object and so I've made the USBSerial library implement the same Stream interface. This means you can use any of the methods defined in the Stream interface for reading and writing data and the details about buffer sizes etc. are hidden within the library.

To show how easy the library is to use, here is a simple example where the sketch simply echoes back any bytes that it is sent.
#include <USBSerial.h>

char buffer[IN_BUFFER_SIZE];

void setup() {
  USBSerial.begin();
}

void loop() {
  USBSerial.poll();

  if(USBSerial.available() > 0) {
    int size = USBSerial.readBytes(buffer, IN_BUFFER_SIZE);
    if (size > 0) {
      USBSerial.write((const uint8_t*)buffer, size);
    }
  }
}
Note that I've used the same IN_BUFFER_SIZE constant in this example as within the library itself, as there is no reason to define a buffer that is bigger than we can ever expect to fill. The only line that you wouldn't find in a similar example using the standard Serial object is line 10, where we make sure that the USB connection is up to date (you need to do this approximately once every 50ms to keep the connection alive). Before we move on to looking at the host software there are a few things you need to know about using the library.

Unfortunately the V-USB part of the library needs customizing for each project, so you can't simply drop the library into the Arduino sketchbook folder, as the USB manufacturer and product identifiers have to be unique for different devices. These identifiers are set in the usbconfig.h file. V-USB doesn't actually provide a copy of usbconfig.h; what is provided is a file called usbconfig-prototype.h which you can copy and rename as a starting point. I've already done a lot of the configuration for you by editing usbconfig-prototype.h, leaving just four lines you need to edit for yourself. Firstly you need to set the vendor name property by editing lines 244 and 245:
#define USB_CFG_VENDOR_NAME     'o', 'b', 'd', 'e', 'v', '.', 'a', 't'
and then the device name by editing lines 254 and 256:
#define USB_CFG_DEVICE_NAME     'T', 'e', 'm', 'p', 'l', 'a', 't', 'e'
These values have to be changed and can't be set to any random value because as part of the V-USB license agreement you need to conform to the following rules (taken verbatim from the file USB-IDs-for-free.txt):

(2) The textual manufacturer identification MUST contain either an Internet domain name (e.g. "mycompany.com") registered and owned by you, or an e-mail address under your control (e.g. "myname@gmx.net"). You can embed the domain name or e-mail address in any string you like, e.g. "Objective Development http://www.obdev.at/vusb/".

(3) You are responsible for retaining ownership of the domain or e-mail address for as long as any of your products are in use.

(4) You may choose any string for the textual product identification, as long as this string is unique within the scope of your textual manufacturer identification.
Once properly configured you should be able to compile (I recommend using arduino-mk instead of the Arduino IDE) and use the library without issue, and without needing to understand how it actually works internally (if you are interested in the details then both my code and the V-USB library contain vast amounts of code comments which should help you get a better understanding), so let's move on to looking at the host software.

As I've already mentioned connecting the device to a PC usually causes generic HID drivers to be loaded by the operating system. This means that you should be able to use any programming language you like to write the host software as long as it can talk to the generic USB drivers. I've included host software written in Java using javahidapi but, for instance, you could also use PyUSB if you prefer to program using Python. The important thing to remember is the protocol for passing data that we defined earlier: data to the USB device is sent as 32 byte feature requests with the first byte being the length of the valid data in the rest of the array, while data from the USB device is in 8 byte chunks again with the first byte being the length of the valid data.

As with the firmware code we have already discussed, I've written a simple Java library to hide most of the details behind standard interfaces, which allow you to read and write data using the standard Java InputStream and OutputStream interfaces. Full details of the available methods can be found in the Javadoc but a simple echo example shows most of the important details.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;

import englishcoffeedrinker.arduino.USBSerial;

public class SimpleEcho {
  public static void main(String[] args) throws Exception {
    // get an instance of the USBSerial class for the specified device
    USBSerial serial =
        USBSerial.getInstance("englishcoffeedrinker.co.uk", "EchoChamber");

    // open the underlying USB connection to the device
    serial.open();

    // create an output stream to write characters to the device
    PrintStream out = new PrintStream(serial.getOutputStream());

    // send a simple message
    out.println("hello world!");

    // ensure the message has been sent and not buffered internally somewhere
    out.flush();

    // create a reader for getting characters back from the device
    BufferedReader in =
        new BufferedReader(new InputStreamReader(serial.getInputStream()));

    String line;
    while((line = in.readLine()) == null) {
      // keep checking the device until a line of text is returned
    }

    // display the message sent from the device
    System.out.println(line);

    // we have finished so disconnect our connection to the device
    serial.close();
  }
}
Essentially, lines 10 to 14 get an instance of the USBSerial class for a specific device, in this case found via the manufacturer and product identifiers although other methods are available, and then open the connection. Lines 16 to 23 then use the OutputStream to write data to the device while lines 26 to 35 read it back, with line 38 cleaning up by closing the connection. For anyone who is happy programming in Java this should look no different than reading or writing any other type of stream, which means it should be easy to integrate within any project where you want to communicate with an Arduino.

To make life a little easier I've also included a slightly more useful demo application that effectively reproduces the Serial Monitor from the Arduino IDE. You can see it running here connected to a device running the simple echo sketch from earlier in this post, but it should work with any device that uses the USBSerial library.

I've included another example with the USBSerial library that shows you can use this for more than just echoing data. This is the CmdMsg example, which I've talked about before on this blog; this version uses the USBSerial library, and hence can be controlled through this new USB Serial Monitor, rather than using the standard Serial library.

If you've read all the way to here then I'm guessing you might want to know where you can get the library from, well it is available under the GPL licence (a restriction imposed because I'm using the free USB vendor and product IDs from Objective Development) from my SVN repository. Do let me know if you find it useful or if you have any suggestions for improvement.

When I set out to try and add serial support to a breadboarded Arduino (specifically this circuit) I did have a device I wanted to build in mind, so I'm sure at some point I'll blog again about using this library in a real device rather than just the simple examples included with the library that do nothing more than prove everything works.

The Caffeine Button

In a couple of previous posts (here and here) I've shown how easy and cheap it is to go from a prototyped setup using an Arduino to a standalone circuit built from just a handful of components. While those posts were more of an academic exercise to prove it was possible I've also now built, and blogged about, such a circuit that I'm actually using in anger. The problem is that an actual Arduino isn't just an Atmel ATMega328P-PU; it also has accompanying electronics which enable you to talk to a computer via the USB connection, which is great both for debugging and for interfacing external hardware to a PC.

When you connect an Arduino to a PC what actually happens is that the supporting circuitry creates a USB CDC ACM device which emulates a good old fashioned RS-232 serial port. If I wanted to add similar functionality to my standalone circuits then the most common way of doing so would be to use an FT232RL, but the chip alone would almost double the cost of the circuit. It is also only available as a surface mount part, making it difficult to experiment with on a breadboard, and I'm not sure my soldering skills are good enough to deal with surface mount parts either.

After pondering this for a bit and doing a little research I came across a potential solution in the form of V-USB. V-USB is a software only implementation of low speed USB for Atmel microcontrollers, such as the ATMega328P-PU. Unfortunately the distribution doesn't directly support the Arduino (it supports the ATMega328P-PU but not through the Arduino IDE etc.), however, I did find a previous attempt to add Arduino support although this project seems to have been abandoned as it hasn't seen any updates in over three years. It did, however, give me a good point to start from.

So far I haven't managed to emulate a serial port, but I have managed to make the Arduino behave like a USB keyboard which means that I can use it for debugging by having it pretend to press lots of keys in sequence, which is better than nothing. Before we get to looking at how to make use of V-USB we need to wire a USB plug up to the Arduino.

USB Connection Parts List
Part                      Unit Cost    Quantity
USB Socket, Type B        £0.50        1
3.6V, 0.5W Zener Diode    £0.071       2
68Ω Resistor              £0.008       2
2.2kΩ Resistor            £0.008       1
As you can see it isn't a particularly complex circuit to build. Essentially we have the two data lines, D- and D+, linked to pins 2 and 4 of the Arduino, and as the USB spec states that these run on 3.3V we restrict the voltage using two 3.6V zener diodes to step down from the 5V output of the Arduino. The Arduino itself is powered from the other two USB pins which provide 5V/GND (this also means that you could use USB to power a standalone ATMega328P-PU without needing a voltage regulator and smoothing capacitors). The final connection is between pin 5 and D- via a 2.2kΩ pull-up resistor, which allows the connection to be connected/disconnected from within software (in theory you can link this to V+ instead if you don't need this flexibility, but I found that this meant the hardware wasn't always recognized correctly when it was connected to a PC).

For the USB plug itself, I opted to use a Type B plug (the same as the Arduino) so that I could get away with a single cable snaking across my desk, but you should be able to use any USB plug with the same circuit. One thing to be careful with when constructing the circuit is that in most cases the metal case of the USB plug is connected to the ground pin, so be careful that you don't end up with any of the other connections catching the case, otherwise you might end up with them pulled to ground which will cause weird things to happen; depending on which pin is involved either your PC won't recognize you have anything plugged in, or it might end up thinking you have a high speed or high power device connected, and things won't work properly.

Now we have the hardware what we need is some working software we can upload to the Arduino. As I said above I'm using a previous attempt to get this working as my starting point. This code hasn't been updated for over three years and I'm guessing that a number of things have changed within the Arduino libraries since then, as the code didn't just work. After a bit of trial and error I discovered that the problems were mostly related to timer code that had been added to get V-USB to work with the Arduino and which was now doing the exact opposite. Removing this code got me to the point where, when I plugged in the cable, the Arduino was recognized as a USB v1.01 HID Keyboard, which I know is my device because of the vendor and product strings being shown in this screenshot.

Having got the code running I then took the plunge to update the version of V-USB being used to the latest version (currently 20121206) which, after a few little tweaks, was a success. Unfortunately because of the way the library has to be configured a single copy can't be shared between multiple projects, but in the rest of this post I'll explain how to customize the library and show you a fully worked example: TheCaffeineButton.

Firstly I've had problems compiling the library through the Arduino IDE, so I'd recommend compiling any sketches that use the library via the arduino-mk project that I discussed in a previous post. This allows you to have libraries local to a sketch by putting them in a libs subfolder. A basic sketch to use the library then looks like the following:
// pull in the USB Keyboard library
#include <USBKeyboard.h>

void setup() {
  // TODO: setup like stuff in here...
}

void loop() {
  // poll the USB connection
  UsbKeyboard.update();

  // TODO: whatever you want in here...
}
You simply include the library and then poll the connection every time through the main loop. As I mentioned though, the library needs customizing for each project and so if you try and compile the sketch at this point you'll end up with an error:
libs/USBKeyboard/usbdrv.h:12:23: fatal error: usbconfig.h: No such file or directory
V-USB doesn't provide a copy of usbconfig.h as it is project specific; what they do provide is a file called usbconfig-prototype.h which you can copy and rename as a starting point. I've already done a lot of the configuration for you by editing usbconfig-prototype.h, leaving just four lines you need to edit for yourself. Firstly you need to set the vendor name property by editing lines 244 and 245:
#define USB_CFG_VENDOR_NAME     'o', 'b', 'd', 'e', 'v', '.', 'a', 't'
and then the device name by editing lines 254 and 256:
#define USB_CFG_DEVICE_NAME     'T', 'e', 'm', 'p', 'l', 'a', 't', 'e'
These values have to be changed and can't be set to any random value because as part of the V-USB license agreement you need to conform to the following rules (taken verbatim from the file USB-IDs-for-free.txt):

(2) The textual manufacturer identification MUST contain either an Internet domain name (e.g. "mycompany.com") registered and owned by you, or an e-mail address under your control (e.g. "myname@gmx.net"). You can embed the domain name or e-mail address in any string you like, e.g. "Objective Development http://www.obdev.at/vusb/".

(3) You are responsible for retaining ownership of the domain or e-mail address for as long as any of your products are in use.

(4) You may choose any string for the textual product identification, as long as this string is unique within the scope of your textual manufacturer identification.
If you glance back at the screenshot I showed earlier of my example device being recognized then you can see that I followed these rules by setting the vendor name to englishcoffeedrinker.co.uk and the device name to TheCaffeineButton.

So having created a valid usbconfig.h file you can now compile the basic sketch shown above. Of course this does nothing other than allow the Arduino to be recognized as a USB keyboard. If you want to actually do something interesting then you need a little more code. As an example, I'll finally introduce you to The Caffeine Button!

Caffeine (usually in the form of coffee) is my one true addiction and so I thought I'd build a device with a single button that when pressed types out the chemical formula for caffeine: C8H10N4O2. The full sketch is fairly simple and assumes the basic circuit above with a push button between pin 12 and ground (it uses the internal pull-up resistor to keep the part list down to the single button):
// pull in the USB Keyboard library
#include <USBKeyboard.h>

// using Bounce makes working with buttons nice and easy
// http://www.arduino.cc/playground/Code/Bounce
#include <Bounce.h>

// we have the button on pin 12
#define BUTTON_PIN 12

// create a Bounce instance to manage the button
Bounce button = Bounce(BUTTON_PIN, 5);

void setup() {
  // initialise the pin the button is connected to
  pinMode(BUTTON_PIN, INPUT);

  // enable the internal pull-up resistor
  digitalWrite(BUTTON_PIN, HIGH);
}

void loop() {
  // poll the USB connection
  UsbKeyboard.update();

  // check the status of the button
  button.update();

  if (button.fallingEdge()) {
    // if the button has just been pressed then...

    // ...print out the formula for caffeine
    UsbKeyboard.print("C8H10N4O2");
  }
}
There are actually four different methods you can use to emulate pressing keys:
void write(byte keycode);
void write(byte keycode, byte modifiers);
void print(const char* text);
void println(const char* text);
The first two methods send a USB key usage code (with or without a modifier, such as the shift key) and then release the key, while the last two methods translate alphanumeric characters (and space) into a sequence of keystrokes (followed by the enter key in the println case) to make the library a little easier to use. Unfortunately there is no guarantee that my simple sketch will always result in C8H10N4O2 being displayed when you press the button.

When you press a key on a keyboard a keycode is sent to the computer which determines which letter has been pressed. This allows the same physical keyboard to be used for different languages simply by changing the printed labels on each key. Unfortunately this makes it impossible to translate a string into a sequence of keycodes which will always display the same on every computer. For example, if you send keycode 28 on a computer with an English keyboard mapping you'll get a 'y', but if the keyboard mapping is German you'll get a 'z'. I've defined a bunch of constants in USBKeyboard.h (e.g. KEY_A) which work for English, and I've used the same mapping in the print and println methods. If you can set the keyboard mapping for the device to English then these methods will work properly for you; if not, you might need to tweak the mappings to get what you need. You can find the full list of mappings in chapter 10 of the Universal Serial Bus HID Usage Tables document should you need more details.
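A trivial way to convince yourself of the problem is to model two keyboard mappings and look up the same keycode in each; the tiny Java table extract below covers just the y/z swap discussed above (the class and field names are mine):

```java
import java.util.Map;

public class KeycodeDemo {
    // a tiny extract of the keycode-to-character tables; which character a
    // keycode produces depends entirely on the mapping the host has loaded
    static final Map<Integer, Character> ENGLISH = Map.of(28, 'y', 29, 'z');
    static final Map<Integer, Character> GERMAN  = Map.of(28, 'z', 29, 'y');

    public static void main(String[] args) {
        // the same keycode, two different characters
        System.out.println(ENGLISH.get(28)); // y
        System.out.println(GERMAN.get(28));  // z
    }
}
```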

So here we have the final working item. As you can see I built the main USB circuit onto a prototyping shield so I can experiment with lots of different circuits without having to keep recreating the basics every time, and in this case have simply jammed the button between pin 12 and ground.

If you've read all the way to here then I'm guessing you might want to know where you can get the library from, well it is available under the GPL licence (a restriction imposed because I'm using the free USB vendor and product IDs from Objective Development) from my SVN repository.

Whilst I haven't yet been able to emulate a serial port I haven't given up and when/if I'm successful I'm sure there will be a post about it and another Arduino library for you to play with.


Last week I spent two interesting days at the BBC's #newsHACK event at Shoreditch Town Hall in London. The BBC have recently started to open up APIs to a lot of useful indexing and annotation work that they have been doing and this event was aimed at getting news organizations and other interested parties using some of the APIs to produce interesting tech demos.

Our team consisted of Ian Roberts, Dominic Rout and myself from the University of Sheffield and Helen Lippell from the Press Association. We were kind of the odd ones out as our expertise centres around processing large amounts of text to expose interesting information and to make that searchable or useable in some way; which is exactly what some of the BBC APIs already did. None of us claim to be great user interface people, so there was no point us trying to generate a really fancy interface over the BBC APIs. In the end we decided that we would play to our strengths, and so our demo (which you can go and play with) tried to go one step further than the existing BBC APIs by using the power of Mímir to allow for complex searches over text, annotations and Linked Open Data, giving journalists a deeper view into the news archive when writing a story. One of our examples (which seemed to go down well) was: imagine you are writing a story about a CEO who has just been given a £5m bonus and you want to find other people who have been awarded more in the past.

Our demo didn't win any of the categories but we certainly didn't embarrass ourselves and there were some really good ideas presented. It's worth looking at the list of hacks as some of them have really cool demos you can play with.

Savings In Time And Space

In the previous post I talked about a new feature that fixes my main annoyance with the GATE Developer interface. In this post I'm going to address a problem that can affect any use of GATE; the size of GATE XML files.

I'll explain what I mean with this very simple XML document:
<doc>Mark is running a test.</doc>
This document is just 35 bytes in size, and when loaded into GATE contains just one doc annotation in the Original markups set. If we now save this document as GATE XML we end up with this 831 byte file:
<?xml version='1.0' encoding='UTF-8'?>
<GateDocument version="2">
<!-- The document's features-->

<GateDocumentFeatures>
<Feature>
  <Name className="java.lang.String">gate.SourceURL</Name>
  <Value className="java.lang.String">file:/home/mark/Desktop/simple-test.xml</Value>
</Feature>
<Feature>
  <Name className="java.lang.String">MimeType</Name>
  <Value className="java.lang.String">text/xml</Value>
</Feature>
</GateDocumentFeatures>
<!-- The document content area with serialized nodes -->

<TextWithNodes><Node id="0"/>Mark is running a test.<Node id="23"/></TextWithNodes>
<!-- The default annotation set -->
<AnnotationSet>
</AnnotationSet>

<!-- Named annotation set -->
<AnnotationSet Name="Original markups">
<Annotation Id="0" Type="doc" StartNode="0" EndNode="23">
</Annotation>
</AnnotationSet>
</GateDocument>

Clearly we could save a few bytes by removing the comments and line separators but there is nothing much we can do to get the size down anywhere near the size of the original document. Now of course comparing the two documents isn't really fair as they are encoding very different things.
  • The original document uses inline tags to annotate the document, whereas GATE uses a standoff markup for added flexibility (e.g. inline XML can't handle crossing annotations).
  • GATE XML explicitly encodes information that is only implicit in the original document; in this case its MIME type and filename.
Now clearly this is a slightly contrived example where I've tried to make GATE XML look as bad as possible, but this can quickly become a real problem. While it is true that "disk space is cheap", that only holds while reasonably sized disks will do. If you are looking at processing terabytes of text and want to store the output, then you really don't want to find that your processed documents are 24 times the size of the originals! This is a problem we often run into with even reasonably sized corpora, as it becomes difficult to move the processed documents around just due to their size. The solution we usually adopt is to simply zip up a folder of GATE XML documents. While this helps with moving them around, you still need to unpack them if you want to reload them into GATE. One solution would be to create a new GATE document format that required less disk space but was just as flexible. Given the amount of effort that would require, I'm ruling it out without even thinking about how I would do it. A better solution would be file level compression that leaves the documents in a format GATE can still load easily.

My first thought was to simply alter GATE to wrap the input/output streams used for writing GATE XML so that they did gzip compression. This would probably work and would give a reasonable space saving (using gzip to compress the above example brings the file size down to 403 bytes), but it would increase the time taken to load/save documents in GATE due to the compression overhead. Now given that we often want to process thousands or millions of documents time is important so I'd prefer not to introduce code that would slow down large scale processing.
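The stream-wrapping idea is straightforward with the standard java.util.zip classes. This is a minimal round-trip sketch, not the actual GATE code: the writer is wrapped in a GZIPOutputStream on the way out, and the reader in a GZIPInputStream on the way back in, so the code in the middle never knows compression is happening.

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
  public static void main(String[] args) throws IOException {
    String xml = "<doc>Mark is running a test.</doc>";

    // wrap the output stream so the XML is gzip-compressed as it is written
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (Writer out = new OutputStreamWriter(
        new GZIPOutputStream(buffer), "UTF-8")) {
      out.write(xml);
    }
    System.out.println("compressed size: " + buffer.size() + " bytes");

    // wrap the input stream the same way to get the document back unchanged
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new ByteArrayInputStream(buffer.toByteArray())),
        "UTF-8"))) {
      System.out.println(in.readLine());
    }
  }
}
```

Note that for a document this tiny the gzip header and checksum overhead can actually make the "compressed" form larger; the savings only appear on realistically sized files.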

Fortunately I'm not the only person who has ever wanted to compress an XML document and there are a number of libraries out there that claim to be able to efficiently compress XML documents. I settled on looking at Fast Infoset because a) it follows a recognised standard and b) there is an easy to use Java library available.

Fast Infoset is a binary encoding (I was going to say lossless, but strictly it isn't, as the formatting information of the original XML file is lost) of XML. I won't go into the technical details other than to say that it supports both SAX and StAX access and claims to be more efficient than equivalent reading of text based XML documents (see the spec or the Wikipedia page for more details). What this means in GATE is that we can continue to use the existing methods for reading/writing GATE XML documents by just swapping out one SAX/StAX parser/writer for another. So let's have a look at a more "real world" example: the GATE home page.
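The "swap one writer for another" point is easy to see with the standard StAX API. In this stdlib-only sketch the emitting code depends only on the XMLStreamWriter interface; as I understand it the Fast Infoset Java library provides a serializer implementing that same interface, so it could be passed in instead without touching the emitting code (here I just use the plain text writer):

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class StaxSwapDemo {
  // this method only talks to the XMLStreamWriter interface, so the
  // concrete serializer (text XML, Fast Infoset, ...) is interchangeable
  static void writeDoc(XMLStreamWriter out) throws XMLStreamException {
    out.writeStartElement("doc");
    out.writeCharacters("Mark is running a test.");
    out.writeEndElement();
    out.close();
  }

  public static void main(String[] args) throws XMLStreamException {
    StringWriter text = new StringWriter();
    writeDoc(XMLOutputFactory.newInstance().createXMLStreamWriter(text));
    System.out.println(text);
  }
}
```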

I saved a copy of the GATE home page to my local disk (to remove network effects) and then ran a few experiments, which are summarised in the following table.

[Table: File, File Size, Time (ms)]

The original HTML file is clearly the smallest representation and the quickest to load, which, given the previous discussion, shouldn't be surprising. We can also see that the XML encoding brings a huge increase in file size as well as loading time; the XML file is almost 4 times the size of the HTML file. The result of using Fast Infoset, however, is really promising. The file size grew by just 1,921 bytes! Loading is slower than HTML (it takes almost twice as long) but around twice as fast as loading from XML. The times to save the XML and Fast Infoset versions show little difference, which is to be expected (the code is mostly the same, so the difference in speed comes down to the number of bytes written, i.e. the speed of my hard drive). It's also worth noting that the speed results were gathered through the GATE GUI and as such are probably not the most rigorous set of data ever collected, although they seemed to be fairly stable when repeated.

From these numbers we would rightly assume that if we just want to store an HTML document efficiently then we should simply store it as HTML; converting it to a GATE document doesn't actually give us any real benefit, and no matter how we store the document on disk we pay a penalty in space and time. Of course, storing HTML documents isn't really what GATE is designed for, so let's look at a slightly different scenario: load the same HTML document, run ANNIE over it to produce lots of annotations, and then save the processed document. The performance numbers in this scenario are as follows:

[Table: File, File Size, Time (ms)]

In this scenario (where we can't use HTML to save the document) we can see that using Fast Infoset is a lot better than storing the documents as raw XML. Not only can we re-load the document in half the time it would take to load the XML, we also make a space saving of 81%! In case you still need convincing that you really do save a lot of disk space using Fast Infoset, I'll give a final large scale, real world(ish) example.

As part of the Khresmoi project I'd previously processed 10,284 medical Wikipedia articles. Due to disk space requirements I hadn't kept the processed documents and as the Khresmoi application can take a while to run (it's a lot more complex than ANNIE) I've simply run ANNIE over these documents to generate some annotations (it will certainly produce lots of tokens and sentences even if it doesn't find too many people or organizations in the medical documents). The results are as follows (I haven't shown save time as the GUI doesn't give it when saving a corpus, and the load time represents populating the corpus into a datastore):

[Table: File, File Size, Time (s)]

Now if a space saving of 81% (or 11GB) and a time saving of 4 minutes doesn't convince you then I don't know what will!

So assuming you are convinced that storing GATE documents using Fast Infoset instead of standard XML is the way to go, you'll want to have a look at the new Format_FastInfoset plugin (you'll need either a recent nightly build or a recent SVN checkout). Documents with a .finf extension will automatically load correctly once the plugin is loaded; if you want to use a different extension then you will have to specify the application/fastinfoset MIME type. You can save documents/corpora using the new "Save as Fast Infoset XML..." menu option (courtesy of the new resource helper feature), which will also appear when you load the plugin.

Resource Helpers

For a long time there has been one issue with the GATE Developer interface that has really annoyed me; the inability to add new items to the right click menu of a resource (that you didn't develop) without creating a new visual resource and opening that viewer. This is the reason that you can't run an application in GATE without having the application editor open (the editor adds the run option to the menu). I know from a few conversations that this has annoyed other people in the past as well.

Last year I added support for opening MediaWiki documents to GATE. The MediaWiki markup is used on a lot of wikis, including Wikipedia, so having support for it in GATE is really useful as we often end up processing some or all of Wikipedia. There are two main ways you might encounter MediaWiki markup. Firstly you might just have a piece of text with the markup embedded in it (e.g. the text copied out of the page editor on Wikipedia), but you are more likely to have an XML file containing the content of multiple articles in a single file. I added support for both options to GATE, but the support for multiple articles in a single XML file doesn't work very well; you can get the article content, but you lose the title and any other associated metadata. What I really wanted was to add a new option to all corpus resources that would allow you to properly populate the corpus from a MediaWiki XML dump file, but of course I couldn't (at least not without creating a pointless visual resource, and then the option would only be visible if the corpus viewer was open).
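For illustration, pulling article titles and text out of a MediaWiki-style dump is a natural fit for StAX. This is a hedged sketch, not the plugin's actual code: it assumes the usual page/title/text element layout of MediaWiki export files (heavily simplified here, with a tiny inline string standing in for a real dump):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DumpReader {
  public static void main(String[] args) throws XMLStreamException {
    // a tiny stand-in for a real MediaWiki XML dump
    String dump =
      "<mediawiki><page><title>GATE</title>" +
      "<revision><text>Some ''wiki'' markup.</text></revision>" +
      "</page></mediawiki>";

    XMLStreamReader in = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader(dump));

    String title = null;
    while (in.hasNext()) {
      if (in.next() == XMLStreamConstants.START_ELEMENT) {
        if ("title".equals(in.getLocalName())) {
          // remember the title so it isn't lost...
          title = in.getElementText();
        } else if ("text".equals(in.getLocalName())) {
          // ...when each article's content becomes one document
          System.out.println(title + ": " + in.getElementText());
        }
      }
    }
  }
}
```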

A few days ago I was thinking about this problem again when I had a burst of inspiration and realised that there was a fairly easy way to add support, for what I'm calling Resource Helpers, to GATE. Essentially I'm re-using the idea of a GATE resource being a tool (usually used to add items to the Tools menu) to allow any resource to provide items for the right-click menu of any other resource. This did require some minor changes to the core GATE code, so if you want to try this yourself you will need to use a nightly build (or a recent SVN checkout). There will eventually be some details in the userguide, but in essence all you need to know can be gleaned from the following simple example:

package gate.test;

import gate.Document;
import gate.Resource;
import gate.creole.metadata.AutoInstance;
import gate.creole.metadata.CreoleResource;
import gate.gui.MainFrame;
import gate.gui.NameBearerHandle;
import gate.gui.ResourceHelper;

import java.awt.event.ActionEvent;
import java.util.ArrayList;
import java.util.List;

import javax.swing.AbstractAction;
import javax.swing.Action;
import javax.swing.JOptionPane;

@CreoleResource(name = "Resource Helper Test Tool",
  tool = true,
  autoinstances = @AutoInstance)
public class HelperTest extends ResourceHelper {

  protected List<Action> buildActions(final NameBearerHandle handle) {
    // a list to hold any actions we want to add
    List<Action> actions = new ArrayList<Action>();

    // if the resource isn't a document then we are finished
    if(!(handle.getTarget() instanceof Document)) return actions;

    // create a new action that will add a menu item labelled "Helper Test"
    // which will show a simple message dialog box
    actions.add(new AbstractAction("Helper Test") {
      public void actionPerformed(ActionEvent arg0) {
        JOptionPane.showMessageDialog(MainFrame.getInstance(),
          "Testing the resource helper for " +
          ((Document)handle.getTarget()).getName());
      }
    });

    // return the list of actions
    return actions;
  }
}
If you already understand how tool support in GATE works then this should be easy to follow, but basically we are creating a new GATE resource that is marked as a tool, a single instance of which is automatically created. The resource extends ResourceHelper, which forces us to implement the single buildActions method. The example implementation of buildActions shown here simply checks to see if we are being asked to help a Document instance, and if so adds a single new menu item, labelled "Helper Test", which when clicked pops up a dialog box showing the name of the document.

The one thing to note is that the buildActions method is only ever called once per resource, as the return value is cached for performance reasons. If you want a truly dynamic menu then you will also need to look at overriding the getActions method of ResourceHelper, but that is beyond the scope of this post.

Having added this new functionality, as you will have already guessed, I've now added a new "Populate from MediaWiki XML Dump" option to every corpus instance; you just need to load the MediaWiki document format plugin for it to appear.

Arduino Without The IDE

Some of you may remember that I've previously blogged about shrinking down an Arduino by buying the ATmega328P-PU and associated components. This is a great way of using the Arduino in a more permanent setting as it is much cheaper than using a full Arduino. I did, however, have a problem in that the Arduino IDE seemed to use the wrong upload speed when trying to burn the bootloader or upload a sketch. This meant I had to mess around copying the command out of the IDE and into a terminal, and then tweaking it so that it used the right upload speed. Honestly, while the IDE is easy to use (especially for novices), it isn't really a very good IDE and it suffers from a number of problems that really annoy me. So I set out to look for an alternative way of compiling and uploading code to the Arduino.

Given that I already knew that the Arduino IDE is just a front-end to a number of other tools which are responsible for the compiling and uploading (which is why I could copy the relevant commands into a terminal and tweak them) I knew that it was likely someone else had already figured out how to use the Arduino without the IDE. In fact a couple of people have published details on how to do this, but the most comprehensive approach I found was by Tim Marston.

Tim's arduino-mk project supports compiling Arduino code, uploading it to an Arduino, and has a serial monitor as well. As the project name suggests this is all implemented as a classic Makefile meaning that it works from the command line, leaving me free to choose whichever code editor I want to use. The really impressive part though was that it worked first time; not only to upload code to my Arduino, but also to upload code to my hand-built replacement. There were, however, a few missing features.

Firstly, the build process didn't pull in any libraries in the user's sketchbook folder (which the IDE does), meaning some code I had didn't initially compile. Secondly, the project didn't have any support for burning the bootloader to new chips, something the IDE should be able to do, although in my experience it doesn't work properly (the same upload speed problem). Fortunately arduino-mk is an open-source project and Tim is more than happy to accept patches for bug fixes or new features. Whilst I don't think I've ever written a Makefile from scratch before, I know enough about how they work to add to an existing file, and so with a little effort I managed to add support both for libraries in the sketchbook folder and for burning bootloaders. Both features have now been incorporated into the latest release, which means everyone else now has access to them as well.

If you are happy working from the command line and aren't really a fan of the Arduino IDE then I would definitely recommend trying out arduino-mk.

Silence, Most Definitely, Isn't Golden

I travel quite a bit for work, which means I end up spending quite a lot of time in hotel rooms, often working late into the night. These long evenings are made slightly better with a little music, and I prefer not to spend my time wearing earphones. Fortunately my laptop has a decent sounding set of speakers... when they work at all.

When I first got the laptop the speakers were working, but the first thing I did was to do a fresh install of Ubuntu (version 12.04 at the time). It took me a few weeks to notice, but the fresh install had left me with a laptop with no sound.

Periodically I tried to fix the problem, always without success. The final straw came when I had a long morning in Hannover with only a little work to do before the meeting started in the afternoon. It was too cold to spend long outside, so I set about finding a fix.

Three hours and a lot of swearing and rebooting later, I finally had sound working. Annoyingly, I didn't write down exactly what I did to fix the problem, so when I upgraded Ubuntu to the newest version (13.04) last week I found that I had no sound and couldn't remember how to fix it. It's taken me almost two hours to find the solution again, so this time I'm writing this post both to help anyone else having the same problem and to stop me having to figure it all out for a third time next time I upgrade Ubuntu.

I followed almost every suggestion I could find on the web for diagnosing sound problems, yet in most cases my laptop was already configured as suggested. I was almost ready to give up when I came across a page suggesting disabling the Auto-Mute Mode in the alsamixer control panel; you can see this in the screenshot. This looked promising, but there was one problem: when I ran alsamixer I didn't have an Auto-Mute Mode option.

Another forty-five minutes of fruitless web searching followed before I eventually discovered a way of changing the audio driver to show all the available options. Unfortunately this uses the daily build of ALSA, which I expect is why it got disabled after the upgrade to the newest version of Ubuntu. Anyway, the trick is to run the following commands to add the relevant repository and then install the daily build of ALSA.
sudo apt-add-repository ppa:ubuntu-audio-dev/alsa-daily
sudo apt-get update
sudo apt-get install oem-audio-hda-daily-dkms
Once completed you can run alsamixer again and you should find that the Auto-Mute Mode option has now helpfully appeared. Simply disable this option and you should find that your sound starts working. As far as I understand, this option is related to sensing whether you have plugged headphones in and whether the internal speakers should be disabled. This means (although I haven't tested it) that if you disable Auto-Mute then plugging headphones in may not stop sound coming out of the speakers, but that is an easy problem to deal with.

(File) Size Does Matter!

I like PDF files. Everyone (or as near as makes no difference) can open them, and you can be fairly certain that the file will look the same on every computer. Sometimes though I struggle to produce PDF files that are of a high enough quality, but small enough to e-mail or host on a website.

The most recent problem I've faced involved producing a PDF from an SVG file (authored using Inkscape) which included linked digital photos. The SVG file itself is just 77kB in size, and the five photos it links to total just over 7MB. Even if the images were simply included as is, I wouldn't expect the resulting PDF to be more than 8MB in size, although I would hope that some reasonable downsampling would be applied. Printing to a PDF from within Inkscape, however, results in a file of 22MB! That's over three times the size of the included photos, which is just ridiculous. Exporting to a PDF instead of printing results in a file of the same size, so isn't much help.

Now I don't know enough about the inner workings of a PDF file, but a 22MB file (which is incidentally a single A4 page) just isn't right, and certainly isn't sensible for trying to send around via e-mail etc. I've tried searching the web for suggestions, but none of the ideas I've found have turned out to be much use; either they don't alter the size of the file, or they reduce the quality so far that the text becomes unreadable.

The first alternative I tried was to produce a PNG image from the SVG instead of a PDF. While an image file wasn't really what I wanted, I thought it might be a good place to start. Using Inkscape to produce a 300 DPI image resulted in a file of just 1.7MB; a file, I should add, whose quality is perfectly adequate. The problem with an image file is that it is easy to alter and can be a pain to print at the right size, so I'd still like to use a PDF file. Fortunately, on Linux at least, it is easy to create a PDF from a PNG, using the simple command
convert input.png output.pdf
Given that Inkscape can also be controlled from the command line I can easily write a reusable script that converts an SVG direct to a PDF of a reasonable size:

inkscape -f "$1.svg" -e out.png -C -d 300
convert out.png "$1.pdf"
rm out.png
I can then use this by passing in the name of the SVG file (minus the .svg extension) and out will pop a PDF file. Now I know this approach isn't perfect but given my requirements (a PDF of a reasonable size that is really only needed for printing) this works well. I'm intending to see if I can find a better solution (one that will keep the text as text) so if you have any suggestions please feel free to leave a comment.

CORPSE: COld stoRage backuP SoftwarE

I've recently started using Amazon Glacier to store the backup of last resort of all my digital photos (I've discussed the motivation behind this on my other blog). While I think the idea of cold storage, and Glacier in particular, is great there is just one problem; Amazon don't provide a friendly user interface for Glacier. Glacier is part of the Amazon Web Service (AWS) framework, and as such is designed as a service around which other people can develop applications.

A quick web search will bring up links to quite a few different tools for using Glacier. I'm currently uploading my data using SAGU (the Simple Amazon Glacier Uploader). While SAGU does make uploading data to Glacier easy, it has a number of shortcomings which the developer is working to address. I've proposed a couple of patches, but I'm also looking at implementing more useful functionality (for example, range based downloads are a must have feature if you want to keep the costs under control). While I'm hoping that some of the ideas I'm working on might be incorporated into SAGU, I'm currently developing them in a separate project which I'm calling CORPSE (which of course stands for COld stoRage backuP SoftwarE).

Even if CORPSE doesn't result in a full application, I'm using it to a) experiment with a number of Java libraries I've not used before (including AWS itself, Jackson, and Flyway) and b) try out git for version control. In the past I've always used a centralised version control system (be that CVS, PVCS or SVN), but the decentralised nature of git is intriguing and with a free account on GitHub I don't have anything to lose. I'm sure I'll mention CORPSE again on this blog in the future, but if you want to take a look at what is there now or to follow development then you can find it in this GitHub project.

To Track, Or Not To Track, That Is The Question

If you read any of my other blogs, then you may well be aware that I've recently opened a shopfront on Shapeways to make a few 3D models available for sale. While I really like this feature of Shapeways, I wanted to customize the experience a little and so built a separate, more customizable, shopfront. One of the options when setting up the Shapeways store was to provide a Google Analytics tracking code. I've never used any tracking code on any of the websites I've built, as I've never really seen a strong reason to. Given I was being prompted to, though, I thought I'd investigate, especially as it would allow me to see how often people look at my models, separately from how often people buy them.

Registering for a Google Analytics account is free and easy (as with most Google services) and within a few minutes I had the tracking information to register with Shapeways. Having got this far I decided to also add the tracking information to the separate shop front I'd built, and of course this is where life got interesting.

The first problem I faced was that I couldn't get Google to recognise that I'd added the tracking code to my website. No matter what I tried, it simply kept telling me that I hadn't yet added the relevant JavaScript to my pages, yet I knew it was there. After a lot of head scratching and web searching I eventually found the problem and an easy solution. The domain I purchased doesn't actually include the initial www of most web addresses, but the way I have the DNS records and forwarding configured, if you type the bare domain name you will be forwarded to the www version automatically. So when completing the Google Analytics form I included the www, which it turns out was my mistake. I'm still not sure why this should make a difference (I'm guessing that Google isn't following the forwarding information when checking the page), but anyway, removing the www from the default URL field fixed the problem.

The second problem was that I then wanted to turn the tracking off. Not so that I could comply with the EU directive on cookies or anything (I'm taking the implied consent route, on the basis that if you visit an online shop you have to assume, at the bare minimum, that your page views are being tracked), but so that I wouldn't end up tracking my own page views.

I tend to have two copies of every website I develop running at the same time; one is the publicly visible (hopefully stable) version, while the second is my internal development version. Once I'm happy with updates to the development version I then push them to the live version. So when it comes to tracking page views not only do I not want to track my page views on the live site, but I definitely don't want to track page views on the development version.

My solution to this is to use server side code to only embed the tracking related JavaScript into the page, if it should be used. For this I wrote a utility function to determine if I should be tracking the page view or not:
public static boolean trackPageViews(HttpServletRequest request)
      throws Exception {

   //get the address the request is coming from
   InetAddress client = InetAddress.getByName(request.getRemoteHost());

   //the client is the same machine (an address such as 127.0.0.1)
   if (client.isLoopbackAddress()) return false;

   //the client is within the same site (i.e. not using a world visible IP)
   if (client.isSiteLocalAddress()) return false;

   //the client is at the same world visible IP address as the server
   if (client.getHostAddress().equals(getServerIP(false))) return false;

   //track all other page views
   return true;
}
Essentially this code says not to track a page view if the request comes from a loopback address (i.e. 127.0.0.1, which always refers to the local machine no matter what that machine is), from a site-local address (i.e. within the same local network, and hence using a non-public IP address, often something in the 192.168.*.* range), or from the same public IP address as the server itself. The first two checks use standard Java library methods and catch all page views of the development versions of my websites (which are only ever accessible from within my local network). The third check is, however, slightly more complex.
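The first two checks are easy to see in action with the standard InetAddress class (8.8.8.8 is just an arbitrary public address used for illustration; no network access is needed, since literal IPs aren't resolved via DNS):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class AddressChecks {
  public static void main(String[] args) throws UnknownHostException {
    // 127.0.0.1 always refers to the local machine
    System.out.println(
        InetAddress.getByName("127.0.0.1").isLoopbackAddress());   // true

    // 192.168.*.* addresses are only routable within the local network
    System.out.println(
        InetAddress.getByName("192.168.1.1").isSiteLocalAddress()); // true

    // a public address passes both checks, so it would be tracked
    System.out.println(
        InetAddress.getByName("8.8.8.8").isSiteLocalAddress());     // false
  }
}
```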

The problem is that your machine itself probably doesn't have a world visible IP address; generally a home network is connected to a broadband router, and it is the router that has the world visible address. So what we need to do is find the external IP address of the router. There are a number of websites that will show you the world visible IP from which you are connecting, and so we could use one of those to figure out the external address of the server, or we could write our own.
package utils;

import java.io.IOException;
import java.net.InetAddress;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class IPAddressServlet extends HttpServlet {
   public void doGet(HttpServletRequest request,
      HttpServletResponse response) throws IOException {

      InetAddress client = InetAddress.getByName(request.getRemoteHost());

      //echo the world visible address the request came from
      response.getWriter().print(client.getHostAddress());
   }
}
Assuming you access this servlet via a public IP address, it will simply echo that address back to you. We can now fill in the missing getServerIP method from our trackPageViews method.
private static String myip = null;

private static String getServerIP(boolean refresh)
      throws IOException {

   if (myip == null || refresh) {
      URL url = new URL("http", HOST, "/myip");
      BufferedReader in = new BufferedReader(
         new InputStreamReader(url.openStream()));
      myip = in.readLine();
      in.close();
   }

   return myip;
}
I'm assuming this will work in all cases where you connect to the server via its standard web address, as that will always be converted by a DNS server to a world visible IP address; if you have a strange network setup with multiple world visible IPs through which you could connect, then it might not work correctly. The important point is that it works for me: I can now happily browse the public and development versions of the site from anywhere within my home network, safe in the knowledge that my own page views aren't being tracked, so any analytics Google collects for me are from real visitors; albeit only those with JavaScript enabled, but that's an issue for another day.

Before I forget, thanks to GB for the photo I used to brighten up this post!

Generating Sitemaps

A long time ago (okay, 2009) on an entirely different blog, I introduced v0.1 of the SitemapGenerator. This was a command line tool that I was running once an hour to monitor my blog; if anything had changed it would re-generate a sitemap and notify Google of the changes. I wanted this predominantly so that a Google custom search engine would stay up-to-date, something I no longer need to worry about given that I use Postvorta to index all my blogs. I haven't used the code in a number of years, but recently I again needed to generate a sitemap. This time, though, I wanted to generate the file from within the web application I was building rather than via a separate recurring script.

I did a quick search for a better version than my original, and while I found a few Java libraries none really met my requirements; I wanted something lightweight that would allow me to take advantage of the extensions to the protocol Google has introduced for documenting images, videos and news articles. So I went back and re-wrote the guts of my original application. The new and improved version now supports a simple API that I can call from within my web application (although I've maintained the command line tool in case anybody was still using it).
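The core of the sitemap protocol is tiny: a urlset element containing url/loc entries. This is a hedged, stdlib-only sketch of that output (it is not the SitemapGenerator API, and the example.com URLs are placeholders):

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class SitemapSketch {
  public static void main(String[] args) throws XMLStreamException {
    StringWriter buffer = new StringWriter();
    XMLStreamWriter out = XMLOutputFactory.newInstance()
        .createXMLStreamWriter(buffer);

    // the minimal sitemap protocol: a urlset of url/loc entries
    out.writeStartDocument("UTF-8", "1.0");
    out.writeStartElement("urlset");
    out.writeDefaultNamespace("http://www.sitemaps.org/schemas/sitemap/0.9");
    for (String page : new String[] {
        "http://www.example.com/", "http://www.example.com/blog/" }) {
      out.writeStartElement("url");
      out.writeStartElement("loc");
      out.writeCharacters(page);
      out.writeEndElement();  // loc
      out.writeEndElement();  // url
    }
    out.writeEndElement();  // urlset
    out.writeEndDocument();
    out.close();

    System.out.println(buffer);
  }
}
```

The image, video and news extensions mentioned above just add further namespaced child elements to each url entry in the same way.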

If you are interested then you can grab the new version direct from SVN.