And now for something completely different -- a posting not at all related to Blogger!
I've recently been spending quite a bit of my free time playing around with
GATE Mímir (the reasons for which will become clear in a later post). As I've
mentioned before, Mímir is a multi-paradigm indexing and retrieval system which allows us to combine text, annotations and knowledge base data in a single index. Text within Mímir is indexed using
MG4J and by default is processed (at both indexing and search time) by a
DowncaseTermProcessor which ensures that searches are case insensitive. Unfortunately, while case-insensitive searching is great, there are other common problems when searching text collections, one of which can be nicely illustrated just from the name Mímir.
Whilst the name of the system, Mímir, contains an accented character, most people when searching would probably not go to the bother of figuring out how to enter the accented i and would instead try searching for Mimir. But just as Mímir and Mimir are visually different, so are they different when stored in an MG4J index. In other words if we search using the unaccented version we won't get any results! Whilst Mímir is a slightly unusual case I'm sure that we can all agree that a search for cafe should also bring back documents which mention café.
Now for Latin alphabets I could come up with a mapping that would reduce most accented characters down to an unaccented version, but it would be time consuming to build and wouldn't handle the different ways in which accented characters can be encoded using Unicode. So I had a bit of a hunt around and discovered a simple and, I think, elegant way of converting accented characters to their unaccented forms courtesy of a
blog posting by Guillaume Laforge. Creating a custom MG4J term processor using this code was trivial and so I now have a way of ensuring that accented characters don't cause me any problems. The one issue was getting Mímir to use the new term processor.
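To give a flavour of the approach, here is a minimal sketch of accent stripping using the standard java.text.Normalizer class. The class and method names (AccentStripper, stripAccents) are made up for illustration, and this is not the actual term processor or necessarily the exact code from the blog post; it just shows the general idea of decomposing characters and then discarding the combining marks.

```java
import java.text.Normalizer;

class AccentStripper {
    /** Returns the input with combining diacritical marks removed (hypothetical helper). */
    static String stripAccents(String term) {
        // NFD decomposition splits each accented character into a base letter
        // followed by one or more combining marks
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // Drop the combining marks, leaving only the base letters
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("M\u00edmir"));  // precomposed accented i
        System.out.println(stripAccents("caf\u00e9"));   // precomposed accented e
        System.out.println(stripAccents("cafe\u0301"));  // e followed by a combining acute accent
    }
}
```

Because the text is normalized before the marks are removed, the precomposed and decomposed encodings of the same accented character both end up as the same unaccented term.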
I deploy Mímir in a
Tomcat instance by building a WAR file and while I could simply add a JAR file containing my custom term processor to the WEB-INF/lib folder before creating the WAR I'd prefer not to have to. If I included my code within the Mímir WAR then every change would mean rebuilding and redeploying the WAR, which seems to be more work than necessary. Fortunately the Mímir config file allows you to specify GATE plugins that should be loaded when the web app is started. So it is trivial to create a GATE plugin which references a JAR containing my custom term processor. Unfortunately when I tried this, MG4J threw a ClassNotFoundException. The problem is that Java never looks down.
On the right you can see the ClassLoader hierarchy that is created when Mímir is deployed in Tomcat -- I've added little icons to show which are created by the Java runtime environment, which by Tomcat and which by the GATE embedded within Mímir. As you can see the GATE classloader, which is responsible for loading the plugin containing my custom term processor, is right at the bottom of the hierarchy. The MG4J libraries live in the Mímir WEB-INF/lib folder, which is the responsibility of the Web App classloader. Each classloader only knows about its parent and not about any children, and when asked to load a class it first asks its parent classloader and only if the parent cannot load the class does it then try loading it itself. The problem I was facing was that when MG4J tried to load my custom term processor it did so by asking the Web App classloader, and as the term processor is loaded by a child classloader the class couldn't be found and hence a ClassNotFoundException was thrown. Rather than giving up and simply adding the term processor to the WEB-INF/lib folder I decided to see if I could find a way of injecting the term processor into the right classloader.
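If you want to see the hierarchy for yourself, the following minimal sketch walks up from the classloader of any class to the top of the chain. The class and method names (LoaderChain, chainOf) are invented for illustration and the output will depend entirely on where the code runs; the point is simply that getParent() is the only link available, so there is no way to walk downwards.

```java
import java.util.ArrayList;
import java.util.List;

class LoaderChain {
    /** Lists the classloader of c followed by each of its ancestors (hypothetical helper). */
    static List<ClassLoader> chainOf(Class<?> c) {
        List<ClassLoader> chain = new ArrayList<>();
        // getClassLoader() returns null for classes loaded by the bootstrap loader,
        // and getParent() returns null once we reach the top of the hierarchy
        for (ClassLoader cl = c.getClassLoader(); cl != null; cl = cl.getParent()) {
            chain.add(cl);
        }
        return chain;
    }

    public static void main(String[] args) {
        // Print the chain for this class, child first
        for (ClassLoader cl : chainOf(LoaderChain.class)) {
            System.out.println(cl);
        }
    }
}
```

Calling chainOf from inside a GATE plugin class and from an MG4J class should, assuming the hierarchy described above, show the GATE classloader only in the first list.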
Now before we go any further I should point out that one of my colleagues has described what follows as evil, and I have to say I agree with him. That said, it works, and given the way Mímir works I can't see any problems arising, but I wouldn't suggest this as a general solution to the class loading problem described above for reasons I'll detail later. However...
Each class in Java knows which ClassLoader instance was responsible for creating it and we can use this information to forcibly inject code into the right place, using the following method.
import java.io.InputStream;
import java.lang.reflect.Method;
import java.lang.reflect.UndeclaredThrowableException;

private static void codeInjector(Class<?> c1, Class<?> c2)
{
    try
    {
        // Get the class loader which loaded MG4J
        ClassLoader loader = c2.getClassLoader();
        if (!loader.equals(c1.getClassLoader()))
        {
            // Assuming we aren't running inside the MG4J class loader...
            // get an input stream we can use to read the byte definition of this class
            InputStream inp = c1.getClassLoader().getResourceAsStream(c1.getName().replace('.', '/') + ".class");
            if (inp != null)
            {
                // If we could get an input stream then...
                // read the class definition into a byte array
                byte[] buf = new byte[1024 * 100]; // assume that the class is no larger than 100KB, this one is only 3.5KB
                int n = 0;
                try
                {
                    // a single read() may return fewer bytes than the whole file,
                    // so keep reading until we reach the end of the stream
                    int r;
                    while (n < buf.length && (r = inp.read(buf, n, buf.length - n)) != -1)
                    {
                        n += r;
                    }
                }
                finally
                {
                    inp.close();
                }
                // get the defineClass method
                Method method = ClassLoader.class.getDeclaredMethod("defineClass", String.class, byte[].class, int.class, int.class);
                // defineClass is protected so we have to make it accessible before we can call it
                method.setAccessible(true);
                try
                {
                    // call defineClass to inject ourselves into the MG4J class loader
                    method.invoke(loader, null, buf, 0, n);
                }
                finally
                {
                    // set the defineClass method back to being protected
                    method.setAccessible(false);
                }
            }
        }
    }
    catch (Exception e)
    {
        // hmm, something has gone badly wrong so throw the exception
        throw new UndeclaredThrowableException(e, "Unable to inject " + c1.getName() + " into the same class loader as " + c2.getName());
    }
}
In essence this method injects the definition of class c1 into the classloader responsible for class c2. So in my term processor I call it from a static initializer as follows:
static
{
    codeInjector(NormalizingTermProcessor.class,
                 it.unimi.dsi.mg4j.index.Index.class);
}
So how does this all work? Well, hopefully the code comments will help, but...
- Firstly we check that the classes are defined by different classloaders (the loader.equals test).
- Then we convert the class name into the path to the class file and try to open a stream to read from that file (the getResourceAsStream call). If we can't read the class file then it means we have already injected the class, which is why the classloader can't find the class file.
- We then read the class file into a byte array (the inp.read call).
- To inject code into a classloader we need to use the defineClass method, which unfortunately is protected. So we retrieve a handle to the method and remove the protected restriction (getDeclaredMethod followed by setAccessible(true)).
- We now call defineClass on the classloader we want to know about the class, passing in the bytes we read in from the original class file (the method.invoke call).
- Finally we put the protected restriction back so we leave things as they were when we found them (setAccessible(false) in the finally block).
Now there are a couple of things to be aware of which could trip you up if you try and do something similar:
- If there is a security manager in place then you may find that you can't call the defineClass method even when the protected restriction is removed.
- This code will result in the same class being defined in two classloaders (which after all was the whole point), but instances of the class cannot be shared between the classloaders. If you try to, you will get an exception (I can't actually remember which one, but most likely a ClassCastException, as the JVM treats the two definitions as different types).
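The second point is easy to reproduce outside of Mímir. The sketch below is purely illustrative: TwoLoadersDemo, Payload and IsolatingLoader are names I've made up, and rather than using reflection it defines the second copy through a small ClassLoader subclass that refuses to delegate for one particular class. It assumes the compiled class file for Payload can be found as a resource through the parent classloader.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class TwoLoadersDemo {
    /** Trivial class that will be defined a second time in another classloader. */
    public static class Payload {
        public Payload() {}
    }

    /** Hypothetical loader that defines Payload itself rather than asking its parent. */
    static class IsolatingLoader extends ClassLoader {
        IsolatingLoader(ClassLoader parent) { super(parent); }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(Payload.class.getName())) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] b = readClassBytes(name);
                    c = defineClass(name, b, 0, b.length);
                }
                return c;
            }
        }

        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) throw new ClassNotFoundException(name);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int r;
                while ((r = in.read(chunk)) != -1) out.write(chunk, 0, r);
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads a second definition of Payload through a fresh IsolatingLoader. */
    static Class<?> loadCopy() {
        try {
            ClassLoader parent = TwoLoadersDemo.class.getClassLoader();
            return new IsolatingLoader(parent).loadClass(Payload.class.getName());
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    /** Returns true if casting an instance of the copy to the original type fails. */
    static boolean castFails() {
        try {
            Object copy = loadCopy().getDeclaredConstructor().newInstance();
            Payload p = (Payload) copy; // same name, different defining loader
            return p == null;           // only reached if the cast unexpectedly succeeds
        } catch (ClassCastException e) {
            return true;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("same Class object: " + (loadCopy() == Payload.class));
        System.out.println("cast fails: " + castFails());
    }
}
```

The two Class objects share a name but not a defining classloader, which is why an instance created from one cannot be cast to the other.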
Neither of these seems to be an issue with loading custom MG4J term processors into Mímir, so this seems to be a nice, albeit evil, way of allowing me to add functionality without having to add to the Mímir WAR file. Success!