The Shape Of The Word

The first time I was paid to develop software coincided with the first Browser War at the end of the 1990s. I worked for three summers in a row as part of a team developing on-line learning software, a lot of which was web based. Whilst those years saw dramatic improvements in web technology, trying to develop a web application that would work reliably and consistently across the different browsers that were available was a real nightmare. The problem was that browser developers tended to add new features to attract users rather than focusing on correctly implementing the existing web standards. Fortunately the world has moved on in the decade or so since then and most web browsers will now render the same, standards compliant, page in the same way (assuming that we ignore the abomination that is Internet Explorer 6). Unfortunately I was reminded of browser inconsistency recently, when I found that, just like HTML pages during the Browser War, SVG images can appear completely different depending on the application used to render them.

SVG images, for those who don't already know, are vector rather than raster images. A raster image encodes the colour of each pixel within the image, whereas a vector image is a list of instructions for drawing the image. SVG images, therefore, scale well as the co-ordinates etc. within the image can be scaled before drawing. For this reason I've been using SVG images in the applications I've been developing for a while now, as they give me the flexibility to change the interface without having to re-draw toolbar images etc. Those of you who have been reading this blog for a while might remember that I translate the SVG images into Java2D code using SVGRoundTrip. I developed SVGRoundTrip out of some code from the now abandoned Flamingo component suite, and have known for a while that it has a few limitations. Until recently though those limitations haven't actually caused me any problems and so I've not been motivated to fix them; if it ain't broke don't fix it!
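
To make the scaling point concrete, here is a minimal Java2D sketch. It is not taken from SVGRoundTrip; the class name and the triangle are made up for illustration. It only shows that a shape stored as co-ordinates can be enlarged by transforming those co-ordinates, with no pixels involved.

```java
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

class VectorScalingSketch {

    /** Builds a small triangle described purely by co-ordinates (hypothetical icon). */
    static Path2D triangle() {
        Path2D path = new Path2D.Double();
        path.moveTo(0, 0);
        path.lineTo(16, 0);
        path.lineTo(8, 16);
        path.closePath();
        return path;
    }

    /** Returns a copy of the shape with every co-ordinate multiplied by factor. */
    static Shape scale(Shape shape, double factor) {
        return AffineTransform.getScaleInstance(factor, factor)
                .createTransformedShape(shape);
    }

    public static void main(String[] args) {
        // The 16x16 outline becomes a 64x64 outline; nothing is resampled.
        Rectangle2D bounds = scale(triangle(), 4.0).getBounds2D();
        System.out.println(bounds.getWidth() + " x " + bounds.getHeight());
    }
}
```

The same outline can be drawn at toolbar size or dock size simply by choosing a different factor before calling g.fill(shape).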

The main limitation was that text within SVG images wasn't really supported. I did add some code that would render the text to a PNG image, and then load and render the image in the right place when drawing the generated Java2D code. While this works, the whole point of using SVG images is to allow for scaling the images without loss of quality, and scaling a PNG as necessary definitely breaks that philosophy. To get around this, one option is to convert the text to curves when editing the image in Inkscape, which works well. The problem is that this makes it impossible to edit the text at a later date. If you might need to edit the image then you need to keep two copies, one with the text as text and one with it as curves, which seems a little silly.

Some of the recent development work I've done (as part of my day job) on GATE has involved GUI changes that have required new icons. Given that you never know how a UI might change in the future I decided to use SVG images for all the new icons and slipped SVGRoundTrip into the build process. In the new plugin manager UI I needed a copy of the main application icon, so once I had a usable SVG image (I converted the G to curves in Inkscape) I started to use it throughout the UI for consistency. This worked perfectly until one of my colleagues decided that it would be useful, when the icon was used in a larger version (a dock icon etc.), to include version information within the icon. The current development version of GATE is 7.1-SNAPSHOT and so he created the icon seen to the left of this paragraph. Converting the text to curves in Inkscape produced an SVG file that I could render correctly with SVGRoundTrip, but I decided to have a go at adding support for text to SVGRoundTrip instead.

I started out with a simpler SVG file that just contained a single piece of text and, after reading through the API for Apache Batik (the SVG parser underlying SVGRoundTrip), I discovered that the following (fairly simple) method would suffice:
private void transcodeTextNode(TextNode node, String comment)
{  
  //this is needed otherwise we get null text runs
  node.getTextPainter().getOutline(node);
     
  @SuppressWarnings("unchecked")
  List<TextRun> runs = node.getTextRuns();
  if (runs != null) {
    for (TextRun run : runs) {
      transcodeShape(run.getLayout().getGlyphVector().getOutline());
   
      TextPaintInfo paintInfo =
          (TextPaintInfo)run.getACI().getAttribute(TextNode.PAINT_INFO);

      // fill the text if we need to
      if (paintInfo.fillPaint != null) {
        transcodePaint(paintInfo.fillPaint);
        printWriter.println("g.setPaint(paint);");
        printWriter.println("g.fill(shape);");
      }
   
      // draw the outline if we need to
      if (paintInfo.strokeStroke != null &&
          paintInfo.strokePaint != null) {
        transcodePaint(paintInfo.strokePaint);
        transcodeStroke(paintInfo.strokeStroke);
        printWriter.println("g.setStroke(stroke);");
        printWriter.println("g.setPaint(paint);");
        printWriter.println("g.draw(shape);");
      }
    }
  }
}
Essentially this method gets each TextRun (i.e. a contiguous run of text without line breaks and which is in the same font and style), converts the shape of the text into Java2D code and then converts the fill and stroke commands as well. After initial success I tried varying the fonts and style (including rotating the text) and everything seemed to work. So I moved on to trying to convert the GATE icon.
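
For anyone wondering what "converting a shape into Java2D code" amounts to, the sketch below walks the outline of any Shape and emits source lines that rebuild it. This is an illustration under my own assumptions and not the actual transcodeShape from SVGRoundTrip: the class name, the variable name shape in the emitted code and the output format are all made up, and it ignores the winding rule.

```java
import java.awt.Shape;
import java.awt.geom.Path2D;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

class ShapeToSourceSketch {

    /** Walks the outline of a shape and returns Java source lines that rebuild it. */
    static List<String> toSource(Shape shape) {
        List<String> lines = new ArrayList<>();
        lines.add("GeneralPath shape = new GeneralPath();");
        float[] c = new float[6];
        for (PathIterator it = shape.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(c)) {
                case PathIterator.SEG_MOVETO:
                    lines.add("shape.moveTo(" + coords(c, 2) + ");");
                    break;
                case PathIterator.SEG_LINETO:
                    lines.add("shape.lineTo(" + coords(c, 2) + ");");
                    break;
                case PathIterator.SEG_QUADTO:
                    lines.add("shape.quadTo(" + coords(c, 4) + ");");
                    break;
                case PathIterator.SEG_CUBICTO:
                    lines.add("shape.curveTo(" + coords(c, 6) + ");");
                    break;
                case PathIterator.SEG_CLOSE:
                    lines.add("shape.closePath();");
                    break;
                default:
                    throw new IllegalStateException("unknown segment type");
            }
        }
        return lines;
    }

    /** Formats the first count co-ordinates as Java float literals. */
    private static String coords(float[] c, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            if (i > 0) sb.append(", ");
            sb.append(c[i]).append('f');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Path2D path = new Path2D.Double();
        path.moveTo(0.0, 0.0);
        path.lineTo(10.0, 0.0);
        path.closePath();
        toSource(path).forEach(System.out::println);
    }
}
```

A glyph outline is just a longer sequence of the same five segment types, which is why text reduces to the existing shape handling once it has been turned into an outline.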

The result of converting the GATE icon (shown to the right) unfortunately wasn't quite what I was expecting. Clearly it had converted the text into curves, but it looked as if it was drawing each letter in the wrong place. Essentially SVGRoundTrip parses the SVG file and converts it into Java2D commands, so there were really only three possible places things could go wrong:
  1. Batik was parsing the SVG file incorrectly
  2. SVGRoundTrip was incorrectly converting the shape into Java2D commands
  3. There was a problem with the SVG file
After some debugging it was clear that "7.1" and "SNAPSHOT" were each being treated as a single shape and so it was unlikely that the problem was occurring within the SVGRoundTrip code. I didn't really want to start digging into the Batik source code so I went back to have a look at the SVG file.

Both pieces of text were, according to Inkscape, displayed using the DejaVu Sans font using the Bold Semi-Condensed style. After some experimentation I found that if I just set the style to Bold then the image generated by SVGRoundTrip was the same as with Bold Semi-Condensed. This was definitely useful progress! It turns out that when the font information is broken down into CSS for inclusion in the SVG file it actually produces the following XML snippet (I've simplified this to just focus on the font attributes):
<text 
    y="263.66687"
    x="117.51996"
    style="font-size:6px;
           font-style:normal;
           font-variant:normal;
           font-weight:bold;
           font-stretch:semi-condensed;
           font-family:DejaVu Sans;">
  <tspan>SNAPSHOT</tspan>
</text>
Now I'd never heard of font-stretch before so I looked it up in the SVG specification which simply pointed me to the CSS specification which states:

The 'font-stretch' property selects a normal, condensed, or extended face from a font family. Absolute keyword values have the following ordering, from narrowest to widest: ultra-condensed, extra-condensed, condensed, semi-condensed, normal, semi-expanded, expanded, extra-expanded, ultra-expanded.
Now to my ears that sounds more like a hint on how to select a font than a direct instruction, especially as, while most (if not all) fonts will have bold and italic variants, I don't imagine that they will all have nine different widths. To check what versions of the DejaVu Sans font I had installed I had a look at the font options in LibreOffice (an open-source Office package), which lists DejaVu Sans, DejaVu Sans Condensed, DejaVu Sans Light and DejaVu Sans Mono. So no semi-condensed variant is present on my machine. A quick manual tweak to the SVG file to use the condensed version instead and suddenly the text rendered properly in both Inkscape and via SVGRoundTrip! Unfortunately if you then started editing the text (the whole reason for this entire article) in Inkscape it seemed to lose the font-stretch setting altogether, which wasn't exactly ideal.
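
To show why a missing width ends up as a judgement call for the renderer, here is a sketch of one possible fallback rule. As I understand it, the later CSS Fonts Level 3 text says that for normal or condensed requests narrower faces are tried first and then wider ones, and the reverse for expanded requests. I have no evidence that Batik, Inkscape or GIMP actually follow this rule; the class and method names are invented for the example.

```java
import java.util.EnumSet;
import java.util.Set;

class FontStretchSketch {

    /** The nine keywords, declared narrowest to widest so ordinal() gives the ordering. */
    enum Stretch {
        ULTRA_CONDENSED, EXTRA_CONDENSED, CONDENSED, SEMI_CONDENSED, NORMAL,
        SEMI_EXPANDED, EXPANDED, EXTRA_EXPANDED, ULTRA_EXPANDED
    }

    /** Picks the installed width to use when the wanted one may be missing. */
    static Stretch resolve(Stretch wanted, Set<Stretch> installed) {
        if (installed.isEmpty()) {
            throw new IllegalArgumentException("no faces installed");
        }
        if (installed.contains(wanted)) {
            return wanted;
        }
        Stretch[] all = Stretch.values();
        boolean narrowerFirst = wanted.ordinal() <= Stretch.NORMAL.ordinal();
        // Two passes: the preferred direction first, then the opposite one.
        for (int pass = 0; pass < 2; pass++) {
            boolean narrower = (pass == 0) == narrowerFirst;
            if (narrower) {
                for (int i = wanted.ordinal() - 1; i >= 0; i--) {
                    if (installed.contains(all[i])) return all[i];
                }
            } else {
                for (int i = wanted.ordinal() + 1; i < all.length; i++) {
                    if (installed.contains(all[i])) return all[i];
                }
            }
        }
        throw new AssertionError("unreachable: installed is non-empty");
    }

    public static void main(String[] args) {
        // Only a normal and a condensed face are available, as on my machine.
        Set<Stretch> installed = EnumSet.of(Stretch.CONDENSED, Stretch.NORMAL);
        System.out.println(resolve(Stretch.SEMI_CONDENSED, installed));
    }
}
```

Under this rule a semi-condensed request with only normal and condensed faces installed would fall back to condensed, but a renderer that picks the nearest wider face, or ignores the property entirely, would produce a different image from the same file.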

By this time I'd come to the decision that the font-stretch property was something I'd rather avoid, as there didn't seem to be any way to ensure that the hint it provided was rendered consistently. Fortunately there is a way to simulate something similar using negative values for the letter-spacing CSS property. The nice thing about this property is that I can give it pixel values which should be interpreted the same by any renderer.
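
I made the change by hand in the SVG file, but the edit is mechanical enough to sketch in code. The helper below is hypothetical (it is not part of SVGRoundTrip or Inkscape) and the -0.5px value in the example is a placeholder rather than the spacing I actually used; it simply drops font-stretch from an inline style string and adds a letter-spacing declaration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class LetterSpacingSketch {

    /** Replaces any font-stretch declaration in an inline style with letter-spacing. */
    static String replaceStretch(String style, String letterSpacing) {
        // LinkedHashMap keeps the declarations in their original order.
        Map<String, String> declarations = new LinkedHashMap<>();
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed fragments
            }
            declarations.put(declaration.substring(0, colon).trim(),
                             declaration.substring(colon + 1).trim());
        }
        declarations.remove("font-stretch");
        declarations.put("letter-spacing", letterSpacing);

        StringJoiner out = new StringJoiner(";");
        declarations.forEach((name, value) -> out.add(name + ":" + value));
        return out.toString();
    }

    public static void main(String[] args) {
        String style = "font-size:6px;font-weight:bold;"
                + "font-stretch:semi-condensed;font-family:DejaVu Sans;";
        // -0.5px is an arbitrary illustrative value.
        System.out.println(replaceStretch(style, "-0.5px"));
    }
}
```

A simple split on semicolons is enough for the flat style strings Inkscape writes, but it would not cope with values that themselves contain semicolons.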


Unfortunately as you can see from the above images the icon still isn't rendered consistently across different applications. Fortunately it renders the same in Inkscape (the left hand image) and SVGRoundTrip (the centre image), and is editable in Inkscape without any problems. The third image shows what happens if I try loading the SVG image into GIMP (an open-source image editor) -- I really don't know what's gone wrong here, it almost looks as if it's picked the wrong font size. Fortunately I don't need to use GIMP to edit or render the image so I'm calling this success.

So SVGRoundTrip now has support for converting text to Java2D curve commands. Because there seem to be quite a few issues with how text is rendered across different SVG tools I'm currently calling this code experimental, so you need to enable it with the -e command line switch. I haven't done a full release of SVGRoundTrip but you can grab this improved version from SVN.

Postvorta: WordPress Support

Just a quick Postvorta related announcement (which I'm sure you'll have figured out from the title and image); Postvorta now supports indexing WordPress blogs! Currently this is limited to blogs hosted on WordPress.com although I'm hoping to extend the support to self hosted WordPress blogs shortly.

Whilst the search works the same as it does for Blogger hosted blogs there are a couple of things to note. Firstly there are currently no image thumbnails in the search results. While WordPress blogs do include image information within the data I use to index the blogs it's not in the same format or as easily usable as the support provided by Blogger. I'll try and rectify this at some point, as I extend support to self hosted WordPress blogs and possibly other blogging platforms. Also free WordPress.com blogs don't allow custom gadgets to include JavaScript or HTML forms and so I don't currently have a way of providing a search gadget like I do on Blogger. Other than those two issues it seems to work fine and I've already had positive feedback from one happy user.

So if you have a WordPress.com blog and you'd like to use Postvorta then feel free to sign up to the beta programme.

Disposable Memories

This post is going to be a very technical look at GATE's memory consumption. This will involve discussion of Java's class loading mechanism and garbage collection. If none of that interests you, or if that first sentence didn't make sense, then can I make a friendly suggestion that you stop reading about now!

Before we dive in you need a little background on how memory is used in Java. This short description isn't entirely accurate (a full and accurate description would probably require an entire book) but will suffice for the purposes of this post.

There are, broadly speaking, two types of information that Java keeps in memory when running an application: the definitions of each class that has been used and the information about each instance of a particular class. The instance level information is created when you use the new keyword and is made available for garbage collection when there is no longer any way of accessing the specific instance. The class definitions on the other hand are only ever released when the classloader instance which loaded them into memory is garbage collected. If the Java application doesn't do any custom class loading then all class definitions will be loaded into the system classloader which is never garbage collected. Class definitions are stored in the PermGen area of the Java heap which is why, if you load too many class definitions, you will eventually get the message:
Exception in thread "main" java.lang.OutOfMemoryError: PermGen space
The common way of avoiding this problem is simply to increase the amount of available memory (either the total heap size or just the PermGen area). While this often makes the problem go away, it actually just postpones the point at which the problem will occur.
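
If you don't have VisualVM to hand, the standard management API can report how full the class definition area is. The sketch below is a generic illustration and not code from GATE. It lists every pool the JVM classifies as non-heap, which is where HotSpot reports the PermGen; pool names differ between JVMs and versions (for example "PS Perm Gen" or "Perm Gen" on older HotSpot releases, "Metaspace" on Java 8 and later), so I have not hard-coded one.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class ClassMemorySketch {

    /** Returns one line per non-heap memory pool with its current usage. */
    static List<String> nonHeapPools() {
        List<String> report = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.NON_HEAP) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            long max = usage.getMax(); // -1 means no maximum is defined
            report.add(pool.getName() + ": used=" + usage.getUsed()
                    + " max=" + (max < 0 ? "undefined" : String.valueOf(max)));
        }
        return report;
    }

    public static void main(String[] args) {
        nonHeapPools().forEach(System.out::println);
    }
}
```

Printing this periodically from a long running application gives a rough picture of whether class definitions are accumulating or being released.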

GATE itself contains a lot of classes (over 2000 just in the main source tree) but also supports dynamically loading plugins and compiling new classes from JAPE grammars. This all means that there is no maximum number of classes that might be loaded and hence no way of ensuring that the PermGen is always big enough. Fortunately GATE doesn't dynamically load classes into the system classloader but into a custom classloader as you can see from this diagram. Unfortunately this classloader is a singleton instance which is never released.

One of the side effects of this is that if you re-initialize (or close and re-create) a JAPE transducer a new copy of the class definitions is created and added to the PermGen area. This is one of the reasons that when using GATE as part of a web service we suggest using a pool of pipelines rather than creating a new pipeline for each request. Not only does this reduce response times (assuming an adequately sized pool) but it also prevents exhaustion of the PermGen area. To show just how quickly this can become a problem I wrote the following short piece of code.
Gate.runInSandbox(true);
Gate.init();

Gate.getCreoleRegister().registerDirectories(
  (new File("/home/mark/gate-top/externals/gate/plugins/ANNIE/"))
    .toURI().toURL());

FeatureMap params = Factory.newFeatureMap();
Transducer jape = (Transducer)Factory.createResource(
  "gate.creole.ANNIETransducer", params);

long count = 1;
while (true) {
  System.out.println(count);
  jape.reInit();
  ++count;
}
This simply initializes GATE, creates a single instance of the ANNIE NE Transducer and then repeatedly re-initializes it. On my machine Java defaults to using 82MB for the PermGen and this was exhausted after loading the transducer just 104 times. I was monitoring the performance using VisualVM and you can see from the following screen shot how the memory was quickly exhausted.


There are a number of other issues with using a singleton classloader but they boil down to the fact that once a class has been defined it can never be forgotten or redefined. The practical aspects of this are that unloading a plugin doesn't actually result in the class definitions being forgotten, so you can't unload, recompile, reload a plugin. If you want to make a change you have to close and restart GATE. Another problem is that if two plugins try and load different versions of the same class only the first version will be used. This is particularly problematic when dealing with complex plugins which may use multiple third-party libraries. With the plugins in the main distribution we try and keep to just one version of each library but clearly we have no control over third-party plugins.

These issues have annoyed me for a while, but I haven't had either the time (this has all been done in my own time) or in some cases the technical knowledge to do anything about it. A few weeks ago, having read a book on Java Performance, a couple of pieces of the puzzle started to fall into place and I realized that I could probably have a good crack at a new classloader architecture that would solve all of these problems. You can see the architecture I've adopted in the diagram to the right. There are two things this diagram doesn't show. Firstly there can be any number of plugin or JAPE classloaders, and secondly those classloaders are what is known as parent last. The reason for having any number of plugin or JAPE classloaders (they are actually the same class; I'm just splitting them up on the diagram to show that it handles both plugin loading and compiling JAPE grammars) should be obvious, as it allows us to throw one away when it is no longer required (i.e. we don't want to have to unload all plugins just to unload one). The idea of parent last classloading, however, requires a more detailed explanation.

Traditionally classloaders in Java take a parent first approach. This means that when they are asked to load a class they start by asking their parent classloader (follow the arrows upwards) to load the class, which in turn asks its parent etc. It's only if this fails that a classloader will itself try and load a class. Changing to using parent last means that two classloaders can now have different copies of the same class defined within them, and hence we can support different versions of the same third party library appearing within different plugins. This works because when a class in a plugin tries to load a class it looks within its own plugin before looking in either the main GATE classloader or the classloaders associated with other plugins or JAPE grammars. This mechanism also allows classes in different plugins to refer to one another (i.e. a JAPE grammar can refer to classes loaded via a plugin).
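
For readers who haven't met the idea before, a minimal parent last classloader can be sketched as below. This is not the classloader from the GATE branch: the class name and the tryLoad helper are invented for the example, and it only covers the "look in my own URLs first, then ask the parent" half of the story, not the lookup across sibling plugin classloaders.

```java
import java.net.URL;
import java.net.URLClassLoader;

class ParentLastClassLoader extends URLClassLoader {

    ParentLastClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Reuse a definition this loader has already produced.
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Parent last: search our own URLs first...
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // ...and only then fall back to the normal parent first lookup.
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    /** Convenience for demos: returns null instead of throwing when a class is missing. */
    static Class<?> tryLoad(ClassLoader loader, String name) {
        try {
            return loader.loadClass(name);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = new ParentLastClassLoader(
                new URL[0], ClassLoader.getSystemClassLoader());
        // With no URLs of its own, every request falls through to the parent.
        System.out.println(tryLoad(loader, "java.lang.String") == String.class);
    }
}
```

Core classes such as java.lang.String still come from the parent because they are never found in the plugin's own URLs, whereas a library jar listed in those URLs would be defined locally, once per classloader instance.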

To ensure that classloaders can be released and garbage collected properly, I've also made a change to the way plugins are unloaded. Currently in GATE unloading a plugin simply removes the definition of the resources it contains from the CREOLE register, but it doesn't unload any instances of the resources that are currently in use. This does tend to lead to some funny behaviour (often weird AWT error messages) and to ensure a thorough cleanup I've updated this code so that it does now unload any instances of resources that depend on the plugin before unloading the plugin itself.

It's taken quite a bit of time hunting through heap dumps using VisualVM but I'm now confident that this new approach works well and that classloaders, and hence classes, can be thrown away when they are no longer needed. As an example, we can use this modified version of GATE along with the sample code I showed before (to re-initialize a JAPE grammar over and over again). This time I let it run for over 1000 iterations (almost 10 minutes) before I stopped it as there was no sign of it running out of memory. A quick look at VisualVM explains why.


You can clearly see that every time the PermGen is almost exhausted garbage collection kicks in freeing some of the previously loaded and discarded class definitions.

This work has all been done in this separate SVN branch so as not to disrupt the main source tree. This means that to try this modified, memory friendly version of GATE you will need to check out the branch and compile it yourself. Mind you if you have read all the way to here then I'm going to guess that won't be a problem. My hope is that this will eventually be merged back into the trunk but I'd prefer to wait until this branch has been tested by a number of other people and not just me. Also there are currently 9 tests that fail when using this branch, although I've looked at the failing code and in every case it's actually the test that is at fault. Essentially the tests load each plugin twice, once into the system classloader (as the classes are on the main Java classpath) and once into the GATE classloader (when the plugin is loaded via the API). Because of the change to parent last classloading, this means that two different definitions of the same class can end up being used which results in apparently nonsensical error messages such as: Error converting class gate.chineseSeg.RunMode to class gate.chineseSeg.RunMode. This situation never happens during normal use of GATE Developer, and can be fixed by altering the creole.xml files used to define the plugins. In other words the failed tests shouldn't stop you trying this branch, although I also wouldn't suggest using this branch in a production environment until it has seen further testing.

So on that note, please try the branch if you have been concerned about memory consumption or had problems with clashing library versions and let me know if you find any problems, or if you have any further suggestions for improvements.