At Sixes And Sevens

At work we are slowly getting ready for a major new release of GATE. In preparation for the release I've been doing a bit of code cleanup and upgrading some of the libraries that we use. After every change I've been running the test suite, and unfortunately some of the tests would intermittently fail. Given that none of the other members of the team had reported failing tests, and that they were always running successfully on our Jenkins build server, I decided the problem must be something related to my computer. My solution was simply to ignore the failing tests as long as they weren't relevant to the code I was working on, and to have the build server do the final test for me. This worked, but it was exceedingly frustrating that I couldn't track down the problem. Yesterday I couldn't ignore it any longer, because the same tests suddenly started to randomly fail on the build server as well as on my computer, and so I set about investigating the problem.

The tests in question are all part of a single standard JUnit test suite that was originally written back in 2001 and which has been running perfectly ever since. Essentially these tests run the main ANNIE components over four test documents, checking the result at each stage. Each component is checked in a different test within the suite. Now if you know anything about unit testing you can probably already hear alarm bells ringing. For those of you who don't know what unit testing is, essentially each test should check a single component of the system (i.e. a unit) and should be independent of every other test. In this instance, while each test checked a separate component, each relied on all the previous tests in the suite having run successfully.
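To make that concrete, here is a minimal sketch of the kind of coupling involved; it's my own invented example rather than the real GATE code, but it shows how tests that share mutable state only pass when run in the order they are written:

import junit.framework.TestCase;

// A deliberately bad example: both tests share mutable static state, so
// testGazetteer only passes if testTokenizer has already run.
public class OrderDependentExample extends TestCase {
  private static StringBuilder document = new StringBuilder("some text");

  public void testTokenizer() {
    // pretend this adds token annotations to the shared document
    document.append(" [tokens]");
  }

  public void testGazetteer() {
    // silently assumes testTokenizer ran first
    assertTrue(document.toString().contains("[tokens]"));
    document.append(" [lookups]");
  }
}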

Now while dependencies between tests aren't ideal, that still doesn't explain why the tests should have worked fine for twelve years but were now failing. And why did they start failing on the build server long after they had been failing on my machine? I eventually tracked the change that caused them to fail when run on the build server back to the upgrade from version 4.10 to 4.11 of JUnit, but even with the help of a web search I couldn't figure out what the problem was.

Given that I'd looked at the test results from my machine so many times and not spotted any problems I roped in some colleagues to run the tests for me on their own machines and send me the results to see if I could spot a pattern. The first thing that was obvious was that when using version 4.10 of JUnit the tests only failed for those people running Java 7. GATE only requires Java 6 and those people still with a Java 6 install, which includes the build server (so that we don't accidentally introduce any Java 7 dependencies), were not seeing any failures. If, however, we upgraded JUnit to version 4.11 everyone started to see random failures. The other thing that I eventually spotted was that when the tests failed, the logs seemed to suggest that they had been run in a random order which, given the unfortunate links between the tests, would explain why they then failed. Armed with all this extra information I went back to searching the web and this time I was able to find the problem and an easy solution.

Given that unit tests are meant to be independent of one another, there isn't actually anything within the test suite that stipulates the order in which they should run, but it seems that it always used to be the case that the tests were run in the order in which they were defined in the source code. The tests are located by looking for all methods whose names start with the word test, and these are pulled from the class definition using the Method[] getDeclaredMethods() method from java.lang.Class. The documentation for this method includes the following description:

Returns an array of Method objects reflecting all the methods declared by the class or interface represented by this Class object. This includes public, protected, default (package) access, and private methods, but excludes inherited methods. The elements in the array returned are not sorted and are not in any particular order.

This makes it more than clear that we should never have assumed that the tests would be run in the same order they were defined, but it turns out that this was the order in which the methods were returned when using the Sun/Oracle versions of Java up to and including Java 6 (update 30 is the last version I've tested). I've written the following simple piece of code that shows the order of the extracted tests as well as info on the version of Java being used:
import java.lang.reflect.Method;

public class AtSixesAndSevens {
  public static void main(String[] args) {
    // report exactly which Java runtime and VM is being used
    System.out.println("java version \"" +
      System.getProperty("java.version") + "\"");
    System.out.println(System.getProperty("java.runtime.name") +
      " (build " + System.getProperty("java.runtime.version") + ")");
    System.out.println(System.getProperty("java.vm.name") +
      " (build " + System.getProperty("java.vm.version") + " " +
      System.getProperty("java.vm.info") + ")\n");

    // print the test methods in whatever order getDeclaredMethods() returns them
    for(Method m : AtSixesAndSevens.class.getDeclaredMethods()) {
      if(m.getName().startsWith("test"))
        System.out.println(m.getName());
    }
  }

  public void testTokenizer() {}
  public void testGazetteer() {}
  public void testSplitter() {}
  public void testTagger() {}
  public void testTransducer() {}
  public void testCustomConstraintDefs() {}
  public void testOrthomatcher() {}
  public void testAllPR() {}
}
Running this on the build server gives the following output:
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02 mixed mode)

testTokenizer
testGazetteer
testSplitter
testTagger
testTransducer
testCustomConstraintDefs
testOrthomatcher
testAllPR
Running it on my machine, on the other hand, results in a random ordering of the test methods, as you can see here:
java version "1.7.0_51"
OpenJDK Runtime Environment (build 1.7.0_51-b00)
OpenJDK 64-Bit Server VM (build 24.45-b08 mixed mode)

testTagger
testTransducer
testGazetteer
testSplitter
testTokenizer
testCustomConstraintDefs
testOrthomatcher
testAllPR
Interestingly, it would seem that the order only changes when the class is re-compiled, which suggests that the ordering may be related to how the methods are stored in the class file, but understanding the inner workings of the class file format is well beyond me. Even more interestingly, it seems that even with Java 6 you can see a random ordering if you aren't using a distribution from Sun/Oracle; here is the output from running under the Java 6 version of OpenJDK:
java version "1.6.0_30"
OpenJDK Runtime Environment (build 1.6.0_30-b30)
OpenJDK 64-Bit Server VM (build 23.25-b01 mixed mode)

testTagger
testTransducer 
testCustomConstraintDefs
testOrthomatcher
testAllPR
testTokenizer
testGazetteer
testSplitter
So this explains why switching from Java 6 to Java 7 could easily cause these related tests to fail, but why should upgrading from JUnit version 4.10 to 4.11 while staying with Java 6 cause a problem?

It turns out that in the new version of JUnit the developers decided to change the default behaviour away from relying on the non-deterministic method ordering provided by Java. Their default approach is now to use a deterministic ordering to guarantee the tests are always run in the same order; as far as I can tell this orders the methods by sorting on the hashCodes of the method names. While this may at least remove the randomness from the test order, it doesn't keep the tests in the same order they are defined in the source file, and so our tests were always failing under JUnit 4.11. Fortunately the developers also allow you to add a class-level annotation to force the methods to be ordered alphabetically. I've now renamed the tests so that when sorted alphabetically they are run in the right order (by adding a three-digit number after the initial test in their names), and the class definition now looks like:
@FixMethodOrder(MethodSorters.NAME_ASCENDING)
public class TestPR extends TestCase {
  ...
}
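The renamed methods themselves aren't shown above, so here is a sketch of the idea; the method names below are purely illustrative (the real names in TestPR may differ), but they show how embedding a three-digit number after the initial test makes the alphabetical ordering match the order the ANNIE pipeline needs:

import junit.framework.TestCase;
import org.junit.FixMethodOrder;
import org.junit.runners.MethodSorters;

// Illustrative only: not the real TestPR method names. The number embedded
// after "test" means NAME_ASCENDING runs the methods in pipeline order, and
// leaving gaps of ten makes it easy to slot new tests in later.
@FixMethodOrder(MethodSorters.NAME_ASCENDING)
public class RenamedOrderExample extends TestCase {
  public void test010Tokenizer() {}
  public void test020Gazetteer() {}
  public void test030Splitter() {}
  public void test040Tagger() {}
  public void test050Transducer() {}
  public void test060CustomConstraintDefs() {}
  public void test070Orthomatcher() {}
  public void test080AllPR() {}
}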
So I guess there are two morals to this story. Firstly, unit tests are called unit tests for a reason and they really should be independent of one another. Secondly, and more importantly, reading the documentation for the language or library you are using, and not making assumptions about how it works (especially when the documentation tells you not to rely on something always being true), would make life easier.