Tuesday 4 December 2012

Annotating JDK default data

A common issue in Java development is use of the default Locale, default TimeZone and default CharSet.

JDK defaults

The JDK has a number of defaults that apply to a running JVM. The most well known are the default Locale, default TimeZone and default CharSet.

These defaults are very useful for getting systems up and running quickly. When a developer new to Java writes some code, they should get the localized answer they expect. However, in any larger environment, especially code intended to run on a server, the defaults cause a problem.

The classic example of this is the String.toLowerCase() method. Many developers use str1.toLowerCase().equals(str2.toLowerCase()) to check if two strings are equal ignoring case. But this code is not valid is all Locales! It turns out that in Turkey, there are two different upper case versions of the letter I and two different lower case versions. This causes lots of problems.

The problem applies more generally to server side applications. Any server-side code that relies on a JDK method using one of the defaults will alter its behaviour based on where and how the server is setup. This is clearly undesirable.

As is often the case, there are two competing forces - ease of use for newcomers, and bugs in larger applications. There is also the issue of backwards compatibility, meaning that the JDK methods depending on the defaults cannot be removed.

One approach has been taken by the Lucene project, scanning code using ASM. The link/tool also provides a useful list of JDK methods affected by this problem:

  java.lang.String#<init>(byte[])
  java.lang.String#<init>(byte[],int)
  java.lang.String#<init>(byte[],int,int)
  java.lang.String#<init>(byte[],int,int,int)
  java.lang.String#getBytes()
  java.lang.String#getBytes(int,int,byte[],int) 
  java.lang.String#toLowerCase()
  java.lang.String#toUpperCase()
  java.lang.String#format(java.lang.String,java.lang.Object[])

  java.io.FileReader
  java.io.FileWriter
  java.io.ByteArrayOutputStream#toString()
  java.io.InputStreamReader#(java.io.InputStream)
  java.io.OutputStreamWriter#(java.io.OutputStream)
  java.io.PrintStream#(java.io.File)
  java.io.PrintStream#(java.io.OutputStream)
  java.io.PrintStream#(java.io.OutputStream,boolean)
  java.io.PrintStream#(java.lang.String)
  java.io.PrintWriter#(java.io.File)
  java.io.PrintWriter#(java.io.OutputStream)
  java.io.PrintWriter#(java.io.OutputStream,boolean)
  java.io.PrintWriter#(java.lang.String)
  java.io.PrintWriter#format(java.lang.String,java.lang.Object[])
  java.io.PrintWriter#printf(java.lang.String,java.lang.Object[])

  java.nio.charset.Charset#displayName()

  java.text.BreakIterator#getCharacterInstance()
  java.text.BreakIterator#getLineInstance()
  java.text.BreakIterator#getSentenceInstance()
  java.text.BreakIterator#getWordInstance()
  java.text.Collator#getInstance()
  java.text.DateFormat#getTimeInstance()
  java.text.DateFormat#getTimeInstance(int)
  java.text.DateFormat#getDateInstance()
  java.text.DateFormat#getDateInstance(int)
  java.text.DateFormat#getDateTimeInstance()
  java.text.DateFormat#getDateTimeInstance(int,int)
  java.text.DateFormat#getInstance()
  java.text.DateFormatSymbols#()
  java.text.DateFormatSymbols#getInstance()

  java.text.DecimalFormat#()
  java.text.DecimalFormat#(java.lang.String)
  java.text.DecimalFormatSymbols#()
  java.text.DecimalFormatSymbols#getInstance()
  java.text.MessageFormat#(java.lang.String)
  java.text.NumberFormat#getInstance()
  java.text.NumberFormat#getNumberInstance()
  java.text.NumberFormat#getIntegerInstance()
  java.text.NumberFormat#getCurrencyInstance()
  java.text.NumberFormat#getPercentInstance()
  java.text.SimpleDateFormat#()
  java.text.SimpleDateFormat#(java.lang.String)

  java.util.Calendar#()
  java.util.Calendar#getInstance()
  java.util.Calendar#getInstance(java.util.Locale)
  java.util.Calendar#getInstance(java.util.TimeZone)
  java.util.Currency#getSymbol()
  java.util.GregorianCalendar#()
  java.util.GregorianCalendar#(int,int,int)
  java.util.GregorianCalendar#(int,int,int,int,int)
  java.util.GregorianCalendar#(int,int,int,int,int,int)
  java.util.GregorianCalendar#(java.util.Locale)
  java.util.GregorianCalendar#(java.util.TimeZone)

  java.util.Scanner#(java.io.InputStream)
  java.util.Scanner#(java.io.File)
  java.util.Scanner#(java.nio.channels.ReadableByteChannel)
  java.util.Formatter#()
  java.util.Formatter#(java.lang.Appendable)
  java.util.Formatter#(java.io.File)
  java.util.Formatter#(java.io.File,java.lang.String)
  java.util.Formatter#(java.io.OutputStream)
  java.util.Formatter#(java.io.OutputStream,java.lang.String)
  java.util.Formatter#(java.io.PrintStream)
  java.util.Formatter#(java.lang.String)
  java.util.Formatter#(java.lang.String,java.lang.String)

Thinking about the two competing forces, it seems to me that the best option would be a new annotation.

Imagine the JDK adds an annotation @DependsOnJdkDefaults. This would be used to annotate all methods in the JDK that directly or indirectly rely on a default, including all of those above. Developers outside the JDK could of course also use the annotation to mark their methods relying on defaults.

  @DependsOnJdkDefaults
  public String toLowerCase() {
    return toLowerCase(Locale.getDefault());
  }

Tooling like Checkstyle, IDEs and perhaps even the JDK compiler could then use the annotation to warn developers that they should not use the method. It would be possible to even envisage an IDE setting to hide the methods from auto-complete.

This would seem to balance the competing forces by providing information that could be widely used and very valuable.

Summary

I think marking methods that rely on JDK defult Locale/TimeZone/CharSet with an annotation would be a powerful tool. What do you think?