Why Aren't They Spamming The Chinese?

Whilst trying to drink my first cup of coffee this morning, I was rudely interrupted by click-jacking malware affecting my wife’s computer. All she was trying to do was look at some Google search results, but clicking on them would take her to a suspicious looking shopping search site. From a little bit of Googling it looked as if it might be a real nasty trojan which would have taken ages to clean up. Fortunately it turned out that all the pages she was having the problem with had been infected with the same bit of malicious JavaScript. I'm not sure how (probably through a malicious banner ad or something) but a reference to the following JavaScript had been inserted at the very end (after the </html>) of each affected page:
if (navigator.language)
  var language = navigator.language;
else
  var language = navigator.browserLanguage;

if(language.indexOf('zh') == -1) { 
  var regexp = /\.(aol|google|youdao|yahoo|bing|118114|biso|gougou|ifeng|ivc|sooule|niuhu|biso|ec21)(\.[a-z0-9\-]+){1,2}\//ig;
  var where = document.referrer;
  if (regexp.test(where)) {
    window.location.href="http://www.bbc.co.uk/news";
  }
}
To make the script easier to read I've reformatted it, and replaced the redirect with a safe URL (who doesn't trust the BBC?) rather than giving the spammers free advertising, but I haven't changed any of the functional aspects of the script.

Essentially all it does is check the URL that you were on when you clicked the link leading you to the current page, and if that looks like a search results page from one of 14 different companies, then it redirects you. The regular expression it uses to check the referring page is simple yet effective and will catch any of the sub-domains of these search services as well. What I find weird is why the script checks the language of the browser.

The first four lines of the script get the language the browser is using. There are two ways of doing this depending on which browser you are using hence the if statement. On my machine this gets me en-US (which means I need to figure out why it has switched from en-UK which is what I thought I'd set it to). Line 6 then checks to make sure the language doesn't include the string zh, which according to Wikipedia is Chinese. I'm assuming that the spammers behind the script are Chinese and don't want to be inconvenienced by their own script, but it seems odd, especially when you consider that at least one of the search engines covered by the regular expression (118114 on many different top-level domains) seems to be a Chinese site.

Looking at this script there is of course another way to defeat it, other than disabling JavaScript. One of the privacy or security options in most browsers concerns the referer (yes I know it is spelt wrong, but that is the correct spelling in the HTTP spec) header. Essentially this header tells a web server the page you were on when you clicked the link leading to the page you are requesting. Some sites will use this to provide functionality so disabling it can cause problems but it does mitigate against scripts like this one. Because it can cause problems it's often an advanced setting, for example here are the details for Firefox.

Serializing To A Human Interface Device

If you've read my previous post you'll know that I've been looking at a cheap and simple way of adding serial communication to a breadboard Arduino clone (such as this one). To summarise the situation so far; adding true RS-232 serial communication is both expensive and difficult as the required part is only available as a surface mount component but I discovered V-USB which allows me to emulate low speed USB devices. The end result was that I managed to use V-USB to emulate a USB keyboard. Being able to pass data from the Arduino to the PC by simply emulating key presses is useful but a) it is rather slow and b) different keyboard mappings lead to different characters being and typed and more importantly c) it doesn't allow me to send data to the Arduino. So on we go...

Let's start with what I haven't managed to achieve; a USB CDC ACM device for RS-232 communication. Unfortunately CDC ACM devices require bulk endpoints (these allow for large sporadic transfers using all remaining available bandwidth, but with no guarantees on bandwidth or latency) and these are only officially supported for high speed USB devices. V-USB only allows me to emulate low speed USB devices, and while most operating systems used to allow low speed devices to create bulk endpoints, even though this is contrary to the spec, modern versions of Linux (and possibly Windows) do not. I did manage to get a device configured correctly but as soon as I plugged it in the bulk endpoints were detected and converted to interrupt endpoints which stopped the device from working. However, all is not lost as I do have a solution which I think is just as good; serializing data to and from a generic USB Human Interface Device.

The USB specification defines the USB Human Interface Device (HID) class to support, as the name suggests, devices with some form of human interface. This doesn't mean sticking a USB cable into your arm, but rather defines common devices such as keyboards, mice and game controllers as well as devices like exercise machines, audio controllers and medical instruments. While such devices may communicate data in a variety of forms it all passes to and from the device using the same protocol. This means that when you plug any such device into practically any computer with a USB port it will be recognised and basic drivers will be loaded.

Writing code to communicate with a USB HID device isn't that much more complex than interfacing with a classic serial port and given the standard driver support we can rely on the operating system taking care of most of the communication for us.

For what follows I'm assuming the same basic USB circuit that I described in the previous post as we know it works and it is cheap to build.


Now we have the circuit let's move on to the software we need to write. Unlike with the USBKeyboard library, that powered The Caffeine Button, we will need both firmware for the Arduino and host software that will run on the PC and interface with the basic HID drivers the operating system provides. Given that you can't test the host software until we have a working device we'll start by looking at the firmware.

The first thing you have to do when constructing a HID is to define its descriptor. The descriptor is how the device presents itself to the operating system and defines the type of device as well as the size and type of any communication messages. Now you will probably never need to edit this but I thought it was worth showing you the full descriptor we are using:
PROGMEM char usbHidReportDescriptor[USB_CFG_HID_REPORT_DESCRIPTOR_LENGTH] = {
    0x06, 0x00, 0xff,              // USAGE_PAGE (Generic Desktop)
    0x09, 0x01,                    // USAGE (Vendor Usage 1)
    0xa1, 0x01,                    // COLLECTION (Application)
    0x15, 0x00,                    //   LOGICAL_MINIMUM (0)
    0x26, 0xff, 0x00,              //   LOGICAL_MAXIMUM (255)
    0x75, 0x08,                    //   REPORT_SIZE (8)
    0x95, OUT_BUFFER_SIZE,         //   REPORT_COUNT (currently 8)
    0x09, 0x00,                    //   USAGE (Undefined)  
    0x82, 0x02, 0x01,              //   INPUT (Data,Var,Abs,Buf)
    0x95, IN_BUFFER_SIZE,          //   REPORT_COUNT (currently 32)
    0x09, 0x00,                    //   USAGE (Undefined)        
    0xb2, 0x02, 0x01,              //   FEATURE (Data,Var,Abs,Buf)
    0xc0                           // END_COLLECTION
};
In this descriptor we define two reports of different sizes which we will use for transferring data to and from the device. The first thing to point out is that the specification defines input and output with respect to the host PC and not the device. So an input message is actually used for writing out from the device rather than for receiving data. Given this, we can see that the descriptor defines two message types. Firstly (lines 7 to 10) we define an 8 byte (OUT_BUFFER_SIZE is defined as 8) input report (the size is defined in bits so we have 8 bits times the count to give 8 bytes) which means we can write 8 byte blocks of data back to the PC we are connected to. The second message type is defined as a FEATURE message of 32 bytes (because IN_BUFFER_SIZE is defined as 32 and the REPORT_SIZE hasn't been redefined so it still 8 bits) which we will use for passing data from the PC to the USB device. As I said you will probably never need to edit this structure especially as you can tweak the message sizes, if necessary, by adjusting the two constants instead. If you do decided to change the descriptor it is worth noting that some operating systems are more forgiving than others. For example, with Linux if you have a defined a message of 8 bytes but only have two to send then you can do that and everything will work. Under Windows, however, if you only send two bytes the device will simply stop functioning altogether so you will need to pad the message to be exactly 8 bytes. This also means that it is easy to check that your descriptor matches what you are actually doing by quickly testing under Windows (I've been doing this with a copy of Windows XP running under VirtualBox).

Now that we know the size of the messages we will send and receive we still need to decide upon their format, i.e. the protocol we will use for our data that we are sending on top of the USB protocol. If we were only interested in sending textual data then we could send null terminated data (i.e. put a zero value byte into the array after the last byte of data), but if we want to send arbitrary bytes then using 0 as an end of data marker seems an odd choice. For this reason I've opted to set the first byte of each message to the length of the valid data in the array. This is both simple to use and results in firmware code that is slightly simpler (and hence smaller) than checking for the null terminator. This does of course mean that in an 8 byte message we can only fit 7 bytes of actual data plus the length marker (this is no different than with null terminated data of course). If you know the messages you want to send will always be of a fixed length then tweaking the buffer sizes to suit might make for a more efficient transfer of data. In general, as you will see shortly, as a user of the library a lot of these details are dealt with for you.

From the very beginning my aim was to find a drop-in replacement for the standard Arduino Serial object and so I've made the USBSerial library implement the same Stream interface. This means you can use any of the methods defined in the Stream interface for reading and writing data and the details about buffer sizes etc. are hidden within the library.

To show how easy the library is to use, here is a simple example where the sketch simply echos back any bytes that it is sent.
#include <USBSerial.h>

char buffer[IN_BUFFER_SIZE];

void setup() {
    USBSerial.begin();
}

void loop() {
  USBSerial.update();
  if(USBSerial.available() > 0) {
    int size = USBSerial.readBytes(buffer, IN_BUFFER_SIZE);
    if (size > 0) {
      USBSerial.write((const uint8_t*)buffer, size);
    }
  }
}
Note that I've used the same IN_BUFFER_SIZE constant in this example as within the library itself, as there is no reason to define a buffer that is bigger than we can ever expect to fill. The only line that you wouldn't find in a similar example using the standard Serial object is line 10 where we make sure that the USB connection is up to date (you need to do this approximately once every 50ms to keep the connection alive). Before we move on to looking at the host software there are a few things you need to know before trying to use the library.

Unfortunately the V-USB part of the library needs customizing for each project, so you can't simply drop the library into the Arduino sketchbook folder, as the USB manufacturer and product identifiers have to be unique for different devices. These identifiers are set in the usbconfig.h file. V-USB doesn't actually provide a copy of usbconfig.h what is provided is a file called usbconfig-prototype.h which you can copy and rename as a starting point. I've already done a lot of the configuration for you by editing usbconfig-prototype.h leaving just four lines you need to edit for yourself. Firstly you need to set the vendor name property by editing lines 244 and 245:
#define USB_CFG_VENDOR_NAME     'o', 'b', 'd', 'e', 'v', '.', 'a', 't'
#define USB_CFG_VENDOR_NAME_LEN 8
and then the device name by editing lines 254 and 256:
#define USB_CFG_DEVICE_NAME     'T', 'e', 'm', 'p', 'l', 'a', 't', 'e'
#define USB_CFG_DEVICE_NAME_LEN 8
These values have to be changed and can't be set to any random value because as part of the V-USB license agreement you need to conform to the following rules (taken verbatim from the file USB-IDs-for-free.txt):

(2) The textual manufacturer identification MUST contain either an Internet domain name (e.g. "mycompany.com") registered and owned by you, or an e-mail address under your control (e.g. myname@gmx.net"). You can embed the domain name or e-mail address in any string you like, e.g. "Objective Development http://www.obdev.at/vusb/".

(3) You are responsible for retaining ownership of the domain or e-mail address for as long as any of your products are in use.

(4) You may choose any string for the textual product identification, as long as this string is unique within the scope of your textual manufacturer identification.
Once properly configured you should be able to compile (I recommend using arduino-mk instead of the Arduino IDE) and use the library without issue, and without understanding how it actually works internally (if you are interested in the details then both my code and the V-USB library contain vast amounts of code comments which should help you get a better understanding) so let's move on to looking at the host software.

As I've already mentioned connecting the device to a PC usually causes generic HID drivers to be loaded by the operating system. This means that you should be able to use any programming language you like to write the host software as long as it can talk to the generic USB drivers. I've included host software written in Java using javahidapi but, for instance, you could also use PyUSB if you prefer to program using Python. The important thing to remember is the protocol for passing data that we defined earlier: data to the USB device is sent as 32 byte feature requests with the first byte being the length of the valid data in the rest of the array, while data from the USB device is in 8 byte chunks again with the first byte being the length of the valid data.

As with the firmware code we have already discussed, I've written a simple Java library to hide most of the details behind standard interfaces, which allow you to read and write data using the standard Java InputStream and OutputStream interfaces. Full details of the available methods can be found in the Javadoc but a simple echo example shows most of the important details.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;

import englishcoffeedrinker.arduino.USBSerial;

public class SimpleEcho {
  public static void main(String[] args) throws Exception {
    // get an instance of the USBSerial class for the specified device
    USBSerial serial =
        USBSerial.getInstance("englishcoffeedrinker.co.uk", "EchoChamber");

    // open the underlying USB connection to the device
    serial.connect();

    // create an output stream to write characters to the device
    PrintStream out = new PrintStream(serial.getOutputStream());

    // send a simple message
    out.println("hello world!");

    // ensure the message has been sent and not buffered internally somewhere
    out.flush();

    // create a reader for getting characters back from the device
    BufferedReader in =
        new BufferedReader(new InputStreamReader(serial.getInputStream()));

    String line;
    while((line = in.readLine()) == null) {
      // keep checking the device until a line of text is returned
    }

    // display the message sent from the device
    System.out.println(line);

    // we have finished so disconnect our connection to the device
    serial.disconnect();
  }
}
Essentially, lines 10 to 14 get an instance of the USBSerial class for a specific device, in this case found via the manufacturer and product identifiers although other methods are available, and then opens the connection. Lines 16 to 23 then use the OutputStream to write data to the device while lines 26 to 35 read it back, with line 38 cleaning up by closing the connection. For anyone who is happy programming in Java this should look no different than reading or writing to and from any other type of stream, which means it should be easy to integrate within any project where you want to communicate with an Arduino.

To make life a little easier I've also included a slightly more useful demo application that effectively reproduces the Serial Monitor from the Arduino IDE. You can see it running here connected to a device running the simple echo sketch from earlier in this post, but it should work with any device that uses the USBSerial library.

I've included another example with the USBSerial library that shows you can use this for more than just echo commands. This is the CmdMsg example, which is an example I've talked about before on this blog, but this version uses the USBSerial library, and hence can be controlled through this new USB Serial Monitor, rather than using the standard Serial library.

If you've read all the way to here then I'm guessing you might want to know where you can get the library from, well it is available under the GPL licence (a restriction imposed because I'm using the free USB vendor and product IDs from Objective Development) from my SVN repository. Do let me know if you find it useful or if you have any suggestions for improvement.

When I set out to try and add serial support to a breadboarded Arduino (specifically this circuit) I did have a device I wanted to build in mind, so I'm sure at some point I'll blog again about using this library in a real device rather than just the simple examples included with the library that do nothing more than prove everything works.