Raspberry Flavoured Information Extraction

So in my post yesterday about the Raspberry Pi I mentioned that it could run GATE. Before the Raspberry Pi was released a lot of people were of the opinion that it wouldn't be powerful enough to run languages that need a virtual machine, such as Java, so I decided that one of the first things I'd do with mine was show how wrong that is by running GATE. Having booted up the machine and logged in, I installed Java with the single command:
sudo apt-get install openjdk-6-jdk
This took a while as it downloaded and installed all the required packages. I then grabbed the binary-only copy of GATE's nightly build (while I wanted to run GATE, I didn't want to try compiling it). Once it had downloaded I started up X to give me a graphical interface (the default distribution uses LXDE) and then started GATE from a terminal. The splash screen came up almost immediately, but then there was a bit of a wait while it built the developer interface. Once GATE had started I loaded up a copy of ANNIE (the default information extraction pipeline), which is the point at which I took the photo on the left. Now because Amazon seem to have mislaid a parcel I ordered, I'm stuck connecting the Raspberry Pi to my TV via a composite cable (I wanted to use a HDMI to DVI-D cable to attach it to my nice monitor) and I can't get the screen to scale correctly. This means that the top and bottom are cut off slightly and the picture doesn't fill the width. Anyway, at least you can see GATE loaded and the Raspberry Pi resting on the box of chocolates on the floor.

I was intending to take some proper screenshots to show GATE working but unfortunately I couldn't figure out how (it turns out that taking screenshots under LXDE is a slightly convoluted process), so instead here is a close-up photo of the TV after GATE had done a small amount of information extraction.

I'm not sure what I'm going to try doing with the Raspberry Pi next so if anyone has anything they'd like to test let me know and I'll see what I can do.

Raspberry Pi: Small Computer, Big Ideas

When I was growing up, computers were only just starting to make inroads into schools and homes, mostly via affordable machines such as the BBC Micro and the ZX Spectrum (not affordable to everyone, true, but to a larger slice of the population than previously). As I've mentioned before, my first serious forays into computer programming were actually with an Acorn Electron (a cut-down, and hence cheaper, version of the BBC Micro). While these machines were often used as glorified typewriters or for playing games, they were also easy to programme. When you turned them on they didn't boot straight into some complex graphical operating system, but to a simple prompt. You could enter commands that would load programmes, or you could enter simple commands that would make the computer do something. If you messed up you just turned the machine off and then on again. There was no danger of you messing the machine up in any way that would stop it working perfectly the next time you turned it on. Combined, these two things led to a generation of kids who messed around programming computers and who have grown up and become software engineers.

Unfortunately, as computers have become more complex, the ability to experiment with programming has reduced. It is also much easier to mess up a modern computer so that it requires a complete re-install rather than just a quick on-off of the power switch. Together these things seem to have led to a severe drop in the number of kids who are learning to programme. Of course they aren't helped by the focus of IT lessons in schools on knowing how to use office software rather than on teaching computing skills.

Having been involved in university lab classes and marking assignments over the last decade, it is clear even to me that most students arriving for an undergraduate course in computing have very few existing skills. Unfortunately they tend to graduate without learning too many more. Yes they can knock up a programme to solve a particular assignment, but often they pay no attention to details such as maintainability or efficiency (time or space). Let's just say that if I had to build up a team of programmers I doubt I'd be willing to hire most current computer science graduates, given the level of programming ability I've seen. Now I'm not really in a position to affect more than a few students by giving constructive feedback on assignments etc. Fortunately there are plenty of other people in the UK who agree that the level of computing knowledge among today's kids has fallen so far that we are in danger of not producing enough qualified graduates. Their solution to the problem is the Raspberry Pi.

I've been following the Raspberry Pi project for a while, and when it went on sale at the end of February I got my order in at the first possible moment. The overwhelming demand for the device almost brought down the websites of the two distributors. Anyway, I must have been one of the earliest through the ordering process because my Raspberry Pi was one of just 2,000 that have so far been sent out to customers. So what exactly is a Raspberry Pi?

Put simply, a Raspberry Pi is a fully fledged computer. It has all the same fundamental components as any regular desktop computer but costs, wait for it, just $35 and is the size of a credit card! The idea is that it is as cheap to kit out an entire class with a Raspberry Pi each as it would be with a textbook each. The computer has 256MB of RAM (not all of it is available as some is used by the GPU), a 700MHz ARM CPU (similar to that which you'd find in a smartphone), and an SD card for storage. All you need to do is add a keyboard (and a mouse if you are using a GUI) and a display, and then power it using a mobile phone charger. For a display you can use either HDMI or composite, which allows you to plug it into old and modern TVs as well as modern computer monitors (via a HDMI to DVI-D cable). The SD card contains the whole operating system and is easy to re-image (i.e. recreate) if it gets messed up (it took about 30 seconds for me to set up my card using the default Debian distribution).

The idea then is to get these into schools where kids can learn to programme, and learn how computers work, without worrying about messing up a family PC or a PC required for standard ICT lessons. Currently though, most of the Raspberry Pis will have been bought by people like me who are interested in technology and grew up programming simpler machines than we commonly use today. We will iron out any bugs in the hardware and write software for the device so that when, later in the year, a cased version is launched for schools there will be teaching material and applications available. I'll certainly try and help the community in any way I can, and for the first time in over a decade I have hope that maybe the level of computing skills I see in undergraduates might actually go up rather than down!

Oh, and from a quick play this afternoon, it can run GATE! (For those who don't know, GATE is how I earn my living.)

A Non-Conformant Head

UPDATE 21st April: It would appear that the snippet of malformed HTML is no longer being included in our blogs! The information doesn't seem to be included in any other fashion, so I'm assuming it will be back once Blogger have decided how they are going to do it properly.

I've been messing about with Blogger templates over the last few days and I've spotted that Blogger seem to have broken the HTML of every blog they host! They have added an invalid element within the head section of each page which causes the premature closing of the head section. Depending on what scripts etc. you use in your template this may cause a problem.

The new code that Blogger have added to our templates is aimed at adding extra metadata to each page, which in turn will enable Google to have more information about each page within a blog when they are included in a search result. They are using the schema.org metadata format to achieve this. Specifically each page of this blog now contains the following in the head section:
<itemscopetag itemscope='itemscope' itemtype='http://schema.org/Blog'>
   <meta content='Code from an English Coffee Drinker' itemprop='name'/>
As you can probably gather, this snippet essentially tags the page as being from a blog whose title is "Code from an English Coffee Drinker". From the full Blog schema you can see that there is actually a whole set of properties that Google could set for each blog, and I'm guessing that at some point in the future they will add more information, which in turn will enrich their search result pages. Now I'm all for adding extra metadata (I've even written a GATE application that runs ANNIE over webpages and then embeds appropriate schema.org metadata), but unfortunately Blogger have messed up their implementation.

The problem is that they have used an itemscopetag tag, which isn't valid in any version of the HTML specification. The specification also tells us that if, when parsing the head section of a page, we encounter an unknown tag we should "act as if an end tag token with the tag name 'head' had been seen, and reprocess the current token". This essentially causes the premature closing of the head section, with everything that follows now part of the body instead. Depending on what has been forced out of head and into body, and which browser you are using, you may see different results. For example, it looks as if links to the Chrome Web Store are broken by this.
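
To make the effect of that parsing rule a little more concrete, here is a minimal Python sketch. It is not a real HTML5 parser and certainly not what any browser runs; it just uses the standard library's html.parser to walk the start tags in the head section and, as soon as it meets a tag the head section doesn't recognise, treats everything up to the real closing head tag as having spilled into the body. The HEAD_TAGS set is my reading of the specification's "in head" rules, and the names HeadSplitter, split_head and SAMPLE (including the made-up title and stylesheet link) are purely illustrative.

```python
from html.parser import HTMLParser

# Start tags the "in head" parsing rules accept without leaving head
# (an assumption based on my reading of the HTML specification).
HEAD_TAGS = {"base", "basefont", "bgsound", "link", "meta", "title",
             "noscript", "noframes", "style", "script", "template"}


class HeadSplitter(HTMLParser):
    """Record which start tags stay in head and which spill into body."""

    def __init__(self):
        super().__init__()
        self.state = "before"   # before -> head -> spilled -> after
        self.kept = []          # tags that remain inside head
        self.spilled = []       # tags pushed out by the early close

    def handle_starttag(self, tag, attrs):
        if self.state == "before" and tag == "head":
            self.state = "head"
        elif self.state == "head":
            if tag in HEAD_TAGS:
                self.kept.append(tag)
            else:
                # Unknown tag: behave as if </head> had been seen.
                self.state = "spilled"
                self.spilled.append(tag)
        elif self.state == "spilled":
            self.spilled.append(tag)

    def handle_endtag(self, tag):
        # The author's real </head> ends the region we care about.
        if tag == "head" and self.state in ("head", "spilled"):
            self.state = "after"


def split_head(html):
    """Return (tags kept in head, tags that end up in body)."""
    splitter = HeadSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.kept, splitter.spilled


# Hypothetical page: only the itemscopetag and meta lines come from Blogger.
SAMPLE = """<html><head>
<title>Blog</title>
<itemscopetag itemscope='itemscope' itemtype='http://schema.org/Blog'>
<meta content='Code from an English Coffee Drinker' itemprop='name'/>
<link rel='stylesheet' href='style.css'/>
</head><body><p>Hi</p></body></html>"""

print(split_head(SAMPLE))
```

On this sample only the title stays in head; the itemscopetag, the meta and the stylesheet link that follow it are all reported as having been pushed into the body, which is why the position of the invalid tag within head matters.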

What Blogger should have done was add the information to the body tag or to one of the main content div tags instead. For example, they could have started the body as follows:
<body itemscope='itemscope' itemtype='http://schema.org/Blog'>
   <meta content='Code from an English Coffee Drinker' itemprop='name'/>
This would have embedded exactly the same metadata but in a format conformant with the HTML specification, and which follows the instructions given on the schema.org site.
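
As a rough check on the claim that the metadata is the same either way, here is a small Python sketch that reads the itemtype and itemprop values out of a snippet while ignoring which tag carries them. It is only a toy reader built on the standard library's html.parser (real microdata processing has more rules than this), and the names MicrodataReader and read_microdata are my own, not part of any library.

```python
from html.parser import HTMLParser


class MicrodataReader(HTMLParser):
    """Collect the first itemtype and all itemprop values, ignoring tag names."""

    def __init__(self):
        super().__init__()
        self.itemtype = None
        self.props = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The tag carrying itemscope declares the type of the item.
        if "itemscope" in attrs and self.itemtype is None:
            self.itemtype = attrs.get("itemtype")
        # Each itemprop contributes one named property of that item.
        if "itemprop" in attrs:
            self.props[attrs["itemprop"]] = attrs.get("content")


def read_microdata(html):
    """Return (itemtype, {property name: content}) for a snippet of HTML."""
    reader = MicrodataReader()
    reader.feed(html)
    reader.close()
    return reader.itemtype, reader.props


BLOGGER_VERSION = """<itemscopetag itemscope='itemscope' itemtype='http://schema.org/Blog'>
   <meta content='Code from an English Coffee Drinker' itemprop='name'/>"""

BODY_VERSION = """<body itemscope='itemscope' itemtype='http://schema.org/Blog'>
   <meta content='Code from an English Coffee Drinker' itemprop='name'/>"""

print(read_microdata(BLOGGER_VERSION) == read_microdata(BODY_VERSION))
```

Both snippets give the same type and the same name property, so the only thing that changes by moving the attributes onto body is whether the page is valid HTML.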

Unfortunately there doesn't appear to be any way to remove this code from our blogs. The best we can do is to move the piece of template code that generates the invalid tag (as well as lots of other code) as late in the head section as possible, so that it pushes the least possible code into the body. To do this you need to edit the HTML version of your template and move the line:
<b:include data='blog' name='all-head-content'/>
to just before the closing head tag so it looks like:
   <b:include data='blog' name='all-head-content'/>
</head>
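
If you would rather not do the move by hand, the same edit can be sketched in a few lines of Python. This is only an illustration working on the template as a plain string: it assumes the include line and the closing head tag each sit on a line of their own, and the function name move_include_to_end_of_head and the little sample template are made up for the example rather than taken from a real Blogger template.

```python
INCLUDE = "<b:include data='blog' name='all-head-content'/>"


def move_include_to_end_of_head(template):
    """Move the all-head-content include so it sits just before </head>."""
    lines = template.splitlines()
    stripped = [line.strip() for line in lines]
    if INCLUDE not in stripped or "</head>" not in stripped:
        raise ValueError("expected the include line and </head> on their own lines")
    # Take the include out of its current position, keeping its indentation.
    include_line = lines.pop(stripped.index(INCLUDE))
    # Re-insert it immediately before the closing head tag.
    closing = [line.strip() for line in lines].index("</head>")
    lines.insert(closing, include_line)
    return "\n".join(lines)


# Hypothetical, heavily trimmed template used only to show the effect.
TEMPLATE = """<head>
   <b:include data='blog' name='all-head-content'/>
   <title>My Blog</title>
   <script src='example.js'></script>
</head>"""

print(move_include_to_end_of_head(TEMPLATE))
```

After the move the title and script come first and the include is the last thing in head, so the invalid tag it generates has nothing left to push into the body.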
Hopefully Blogger will fix the code they generate soon, but until then we just have to minimize the damage they inflict on our blogs any way we can.