Code from an English Coffee Drinker: February 2012

Disable Dynamic Views: An Update

Just over a month ago I blogged about a small GreaseMonkey script that you could install so that you didn't have to see any of Blogger's new dynamic view templates ever again. Well it turns out that there was a small bug in that version of the script.

If you remember, the script essentially works by tricking Blogger into thinking you are viewing the blog without JavaScript by adding v=0 to the query string of each link within the blog. There was one case where this wasn't working properly: the link that jumps you to the comments on the post-specific page. The problem was that these links point not only to a page but to a specific anchor within it, by adding #comments to the URL. Unfortunately my script was adding the query string after the anchor reference (i.e. post-page.html#comments?v=0) when it should add it to the page location (i.e. post-page.html?v=0#comments). Everything after the # is treated as part of the anchor and is never sent to the server, so in the broken form Blogger never saw the v=0 at all. I've updated the script to v1.1, which contains a fix for this. In theory, if you have already installed the script, your browser should eventually pick up the new version. But if you haven't yet installed the script, or just want to make sure you have the latest version, then you can install/upgrade by simply clicking this link.
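The script itself is JavaScript, but the fix is easy to illustrate in a few lines of Java. This is only a sketch of the idea, not the code from the script: the class and method names are made up for the example, and it assumes the link contains at most one # character.

```java
class DisableDynamicViews {
    // Hypothetical helper (not taken from the actual script): adds v=0 to the
    // query string, keeping any #anchor at the very end of the URL.
    static String addNoScriptParam(String url) {
        int hash = url.indexOf('#');
        String page = hash >= 0 ? url.substring(0, hash) : url;
        String anchor = hash >= 0 ? url.substring(hash) : "";
        // Use & if the page part already has a query string, ? otherwise.
        String separator = page.contains("?") ? "&" : "?";
        return page + separator + "v=0" + anchor;
    }

    public static void main(String[] args) {
        // prints post-page.html?v=0#comments
        System.out.println(addNoScriptParam("post-page.html#comments"));
    }
}
```

Splitting the URL at the # before touching the query string is what keeps the anchor at the end, regardless of whether the link already had a query string.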

One Byte At A Time

Whilst working on Postvorta one of the things I've tried to do is to make the code as efficient as possible so that search results are returned as quickly as possible. Mostly this has involved caching data where possible as well as using efficient data structures and algorithms. Of course, with Postvorta being a web application, part of the time taken to show search results depends on the amount of data that is actually returned to the browser, including HTML pages, JavaScript files, style sheets, and images. I am already using JAWR to minify and compress JavaScript and CSS files, which makes a real difference to the amount of data that you have to download each time you search. In this post, though, I want to talk about a small issue I uncovered when trying to trim just a few bytes from the HTML pages.

I've recently been reading a book on Java Performance by Charlie Hunt and Binu John. While it covers quite a few aspects of performance that I was already aware of there is also quite a lot of information that is new to me. One chapter is devoted to performance tuning for web applications and as well as mentioning minifying and compressing static files (JavaScript, CSS etc.) it devotes a section to considering whitespace in the dynamically generated pages.

When you save a file of text, whitespace characters (spaces, tabs, new lines) all take up the same amount of disk space as any other character, i.e. 1 byte (I know this isn't entirely accurate but I don't want to get into a long discussion of line endings and encoding formats so this assumption will suffice for what follows). This is acceptable if you want to use whitespace for formatting but HTML specifically doesn't use whitespace in this way. In normal flowing content, any sequence of whitespace in an HTML file is converted by the browser into a single space character, so it is wasteful to transmit more whitespace than is needed for the page to be understood and rendered. Of course most people use whitespace not just for formatting but to make the HTML code easier to understand and debug. There are filters that I could add to Postvorta that would strip out all extraneous whitespace before transmitting the results back to the browser but a) this would make debugging the page tricky and b) each filter I add has its own performance overhead. My plan, therefore, has been to try to re-work the code where possible to eliminate some whitespace while leaving the code readable, without adding an extra filter. In most cases this is easy, but there is one area where eliminating whitespace is more difficult.
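To get a feel for how much of a page is just formatting, something like the following rough sketch can be used to compare the size of a fragment of HTML before and after collapsing its whitespace. It is purely a measuring aid and not something Postvorta does: the class and method names are invented for the example, and the collapsing is deliberately naive (it would mangle the contents of pre or textarea elements and inline scripts).

```java
import java.nio.charset.StandardCharsets;

class WhitespaceCost {
    // Naive: collapses every run of whitespace into a single space.
    // Not safe for <pre>, <textarea> or inline scripts.
    static String collapse(String html) {
        return html.replaceAll("\\s+", " ");
    }

    // Size of the text in bytes when encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // A made-up, nicely indented fragment of HTML.
        String html = "<ul>\n    <li>one</li>\n    <li>two</li>\n</ul>\n";
        System.out.println(utf8Bytes(html) + " bytes before, "
                + utf8Bytes(collapse(html)) + " bytes after");
    }
}
```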

When switching between HTML and Java in a JSP page, whitespace is often inserted to ensure that the resulting page can be properly interpreted. Unfortunately in almost every case this whitespace is superfluous and can be removed. Fortunately the blank lines that this whitespace introduces into the output are easy to remove. The simplest way is to add the following page directive to a JSP page:

<%@ page trimDirectiveWhitespaces="true" %>

As Postvorta currently only contains two pages (the results page and the advanced syntax page) this is easy to do, but in a more complex application there may be tens or hundreds of pages, at which point this approach becomes less appealing. You can, however, enable the same feature for every page by editing the application's web.xml to add the following:

<jsp-config>
  <jsp-property-group>
    <url-pattern>*.jsp</url-pattern>
    <trim-directive-whitespaces>true</trim-directive-whitespaces>
  </jsp-property-group>
</jsp-config>

I tried both approaches and they do indeed produce the same output, which in my test case brought the page size down to 13,238 bytes from the original 13,300, saving me a total of 62 bytes! Now 62 bytes might not be very much but this is per page view and so can quickly mount up. Looking at the differences between the old and new pages I noticed that there were still quite a large number of blank lines in the head section of the HTML file that I thought should have been removed. It turns out that the problem is related to how I style the pages but is easy to solve.

I use SiteMesh (I'm using v2.4.1) to style all the pages within Postvorta. This allows me to define the main layout of the pages once and then use this to display all pages. For those of you who use Blogger, you can consider a SiteMesh layout to be equivalent to your blog template. The layout is applied via a SiteMesh-specific filter, and it appears that, when using the web.xml approach to enable the trimming of whitespace, the layout is applied after the whitespace has been trimmed. This means that whitespace within the main body of the page is removed but not within the head section. The trick is to use the page directive approach within the SiteMesh layout. This has the advantage that it applies to the entire page and only needs to be specified once. In my test case it saves another 15 bytes, which brings the total page size down by 77 bytes to 13,223 bytes.
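To put those numbers in context, here is a small sketch that works through the arithmetic. The three page sizes are the ones from my test case above; the number of page views is an entirely hypothetical figure chosen to make the sums easy, and the class and method names are made up for the example.

```java
class PageSavings {
    // Page sizes in bytes, taken from the test case described above.
    static final int ORIGINAL_BYTES = 13_300;
    static final int AFTER_WEB_XML_BYTES = 13_238;
    static final int AFTER_LAYOUT_DIRECTIVE_BYTES = 13_223;

    // Total bytes saved over a given number of page views.
    static long totalBytesSaved(long pageViews) {
        return pageViews * (ORIGINAL_BYTES - AFTER_LAYOUT_DIRECTIVE_BYTES);
    }

    public static void main(String[] args) {
        long pageViews = 1_000_000L; // hypothetical traffic, not a real figure
        System.out.println("web.xml setting saves "
                + (ORIGINAL_BYTES - AFTER_WEB_XML_BYTES) + " bytes per page");
        System.out.println("directive in the layout saves a further "
                + (AFTER_WEB_XML_BYTES - AFTER_LAYOUT_DIRECTIVE_BYTES) + " bytes per page");
        System.out.println("over " + pageViews + " page views that is "
                + totalBytesSaved(pageViews) + " bytes");
    }
}
```

At 77 bytes per page, a million page views works out at 77,000,000 bytes of whitespace that never has to be sent, assuming the sizes above are what actually goes over the wire.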

The total savings are small, but if Postvorta ever becomes really popular, shaving a few bytes here and there might well make a noticeable difference to performance.

Postvorta Mk II: Faster And With More Features!

The Spitfire Mk II was essentially the same as the original model, just with an upgraded Merlin engine. Today I've done something similar to Postvorta, my "intelligent blog search engine".

Those of you who read the initial blog posting I did on Postvorta may remember that underneath it all Postvorta relies on GATE Mímir for indexing and search. Yesterday we (i.e. the GATE group at the University of Sheffield) released new versions of most of the software we develop, including Mímir 4. While I'm not heavily involved with the development of Mímir, I do use it quite substantially at work and I've been slowly updating all the systems I'm involved with, including Postvorta, to use this new version. Not only is this new version of Mímir faster, it also takes a slightly different approach to handling search results, one which is more suited to Postvorta than the old approach. I've also added some extra code to Postvorta to cache more information locally. Together these changes have resulted in Postvorta returning results an awful lot faster than before, and switching between pages of results is significantly faster too. Of course all these changes are "under the hood" so, just like with the Spitfire Mk II, Postvorta should look roughly the same but work much faster. There is one new feature, though, that is worth talking about: result visualization.

When you search a blog using Postvorta it returns a list of relevant documents ordered from most recent to oldest. Combined with the different ways you can search a blog, this ordering is usually the most useful. In some cases, however, especially when a search returns lots of results, it can be difficult to hunt through the posts to find the one you are interested in. To help with this I've started to think about different ways of visualizing the results. Whilst I've had a number of ideas, the first to make it as far as a working, stable implementation is a date distribution graph.

A date distribution graph (in this context anyway) is simply a vertical bar chart showing how the results are distributed by month. The graph, just like the results, works backwards so the most recent month is on the left. The bars of the graph can be clicked on to go to the result page containing the first result for that month. Essentially it allows you to jump to posts from a given month directly without having to page through lots of irrelevant results. Currently, depending upon the number of search results, the graph can take a moment or two to be produced, but this is done asynchronously to the normal page loading to allow you to see the actual results as soon as possible.
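The data behind such a graph is simple enough to sketch. The following is not the code Postvorta uses, just a minimal illustration of the idea under a few assumptions: each result is represented only by its post date, the results are already ordered from most recent to oldest, and the class name, method names and page size are all invented for the example.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DateDistribution {
    // Counts the results in each month, with the most recent month first
    // (the same direction as the results themselves).
    static Map<YearMonth, Integer> countByMonth(List<LocalDate> postDates) {
        Map<YearMonth, Integer> counts = new TreeMap<>(Comparator.reverseOrder());
        for (LocalDate date : postDates) {
            counts.merge(YearMonth.from(date), 1, Integer::sum);
        }
        return counts;
    }

    // Finds the (1-based) results page holding the first result for a month,
    // i.e. where clicking on that month's bar should take you.
    // Returns -1 if there are no results in that month.
    static int pageOfFirstResult(List<LocalDate> resultsNewestFirst,
                                 YearMonth month, int pageSize) {
        for (int i = 0; i < resultsNewestFirst.size(); i++) {
            if (YearMonth.from(resultsNewestFirst.get(i)).equals(month)) {
                return i / pageSize + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up post dates, newest first.
        List<LocalDate> results = List.of(
                LocalDate.of(2012, 2, 20), LocalDate.of(2012, 2, 3),
                LocalDate.of(2012, 1, 15), LocalDate.of(2011, 11, 30));
        countByMonth(results).forEach(
                (month, count) -> System.out.println(month + ": " + count));
        System.out.println("January 2012 starts on page "
                + pageOfFirstResult(results, YearMonth.of(2012, 1), 2));
    }
}
```

Keeping the counts in a map sorted in reverse order means the bars come out in the same newest-first order as the results, without a separate sorting step.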

As always I'd be interested in knowing what you think of this new feature. You can play with it either by searching this blog (the search box is just over on the right), or your own blog.