Tuesday 4 December 2012

Annotating JDK default data

A common issue in Java development is use of the default Locale, default TimeZone and default CharSet.

JDK defaults

The JDK has a number of defaults that apply to a running JVM. The most well known are the default Locale, default TimeZone and default CharSet.

These defaults are very useful for getting systems up and running quickly. When a developer new to Java writes some code, they should get the localized answer they expect. However, in any larger environment, especially code intended to run on a server, the defaults cause a problem.

The classic example of this is the String.toLowerCase() method. Many developers use str1.toLowerCase().equals(str2.toLowerCase()) to check if two strings are equal ignoring case. But this code is not valid is all Locales! It turns out that in Turkey, there are two different upper case versions of the letter I and two different lower case versions. This causes lots of problems.

The problem applies more generally to server side applications. Any server-side code that relies on a JDK method using one of the defaults will alter its behaviour based on where and how the server is setup. This is clearly undesirable.

As is often the case, there are two competing forces - ease of use for newcomers, and bugs in larger applications. There is also the issue of backwards compatibility, meaning that the JDK methods depending on the defaults cannot be removed.

One approach has been taken by the Lucene project, scanning code using ASM. The link/tool also provides a useful list of JDK methods affected by this problem:

  java.lang.String#<init>(byte[])
  java.lang.String#<init>(byte[],int)
  java.lang.String#<init>(byte[],int,int)
  java.lang.String#<init>(byte[],int,int,int)
  java.lang.String#getBytes()
  java.lang.String#getBytes(int,int,byte[],int) 
  java.lang.String#toLowerCase()
  java.lang.String#toUpperCase()
  java.lang.String#format(java.lang.String,java.lang.Object[])

  java.io.FileReader
  java.io.FileWriter
  java.io.ByteArrayOutputStream#toString()
  java.io.InputStreamReader#(java.io.InputStream)
  java.io.OutputStreamWriter#(java.io.OutputStream)
  java.io.PrintStream#(java.io.File)
  java.io.PrintStream#(java.io.OutputStream)
  java.io.PrintStream#(java.io.OutputStream,boolean)
  java.io.PrintStream#(java.lang.String)
  java.io.PrintWriter#(java.io.File)
  java.io.PrintWriter#(java.io.OutputStream)
  java.io.PrintWriter#(java.io.OutputStream,boolean)
  java.io.PrintWriter#(java.lang.String)
  java.io.PrintWriter#format(java.lang.String,java.lang.Object[])
  java.io.PrintWriter#printf(java.lang.String,java.lang.Object[])

  java.nio.charset.Charset#displayName()

  java.text.BreakIterator#getCharacterInstance()
  java.text.BreakIterator#getLineInstance()
  java.text.BreakIterator#getSentenceInstance()
  java.text.BreakIterator#getWordInstance()
  java.text.Collator#getInstance()
  java.text.DateFormat#getTimeInstance()
  java.text.DateFormat#getTimeInstance(int)
  java.text.DateFormat#getDateInstance()
  java.text.DateFormat#getDateInstance(int)
  java.text.DateFormat#getDateTimeInstance()
  java.text.DateFormat#getDateTimeInstance(int,int)
  java.text.DateFormat#getInstance()
  java.text.DateFormatSymbols#()
  java.text.DateFormatSymbols#getInstance()

  java.text.DecimalFormat#()
  java.text.DecimalFormat#(java.lang.String)
  java.text.DecimalFormatSymbols#()
  java.text.DecimalFormatSymbols#getInstance()
  java.text.MessageFormat#(java.lang.String)
  java.text.NumberFormat#getInstance()
  java.text.NumberFormat#getNumberInstance()
  java.text.NumberFormat#getIntegerInstance()
  java.text.NumberFormat#getCurrencyInstance()
  java.text.NumberFormat#getPercentInstance()
  java.text.SimpleDateFormat#()
  java.text.SimpleDateFormat#(java.lang.String)

  java.util.Calendar#()
  java.util.Calendar#getInstance()
  java.util.Calendar#getInstance(java.util.Locale)
  java.util.Calendar#getInstance(java.util.TimeZone)
  java.util.Currency#getSymbol()
  java.util.GregorianCalendar#()
  java.util.GregorianCalendar#(int,int,int)
  java.util.GregorianCalendar#(int,int,int,int,int)
  java.util.GregorianCalendar#(int,int,int,int,int,int)
  java.util.GregorianCalendar#(java.util.Locale)
  java.util.GregorianCalendar#(java.util.TimeZone)

  java.util.Scanner#(java.io.InputStream)
  java.util.Scanner#(java.io.File)
  java.util.Scanner#(java.nio.channels.ReadableByteChannel)
  java.util.Formatter#()
  java.util.Formatter#(java.lang.Appendable)
  java.util.Formatter#(java.io.File)
  java.util.Formatter#(java.io.File,java.lang.String)
  java.util.Formatter#(java.io.OutputStream)
  java.util.Formatter#(java.io.OutputStream,java.lang.String)
  java.util.Formatter#(java.io.PrintStream)
  java.util.Formatter#(java.lang.String)
  java.util.Formatter#(java.lang.String,java.lang.String)

Thinking about the two competing forces, it seems to me that the best option would be a new annotation.

Imagine the JDK adds an annotation @DependsOnJdkDefaults. This would be used to annotate all methods in the JDK that directly or indirectly rely on a default, including all of those above. Developers outside the JDK could of course also use the annotation to mark their methods relying on defaults.

  @DependsOnJdkDefaults
  public String toLowerCase() {
    return toLowerCase(Locale.getDefault());
  }

Tooling like Checkstyle, IDEs and perhaps even the JDK compiler could then use the annotation to warn developers that they should not use the method. It would be possible to even envisage an IDE setting to hide the methods from auto-complete.

This would seem to balance the competing forces by providing information that could be widely used and very valuable.

Summary

I think marking methods that rely on JDK defult Locale/TimeZone/CharSet with an annotation would be a powerful tool. What do you think?

26 comments:

  1. Hey Stephen, very nice blog post. I like the idea! Unfortunately this needs to be communicated through JCP...

    ReplyDelete
  2. A great annotation for these methods already exists: @Deprecated

    ReplyDelete
  3. @Robert: Unfortunately @Deprecated must also be communicated through JCP...

    ReplyDelete
  4. FWIW, Findbugs issues a warning when using the default charset (http://findbugs.sourceforge.net/bugDescriptions.html#DM_DEFAULT_ENCODING) - I don't think it does anything for default locale / timezone.

    ReplyDelete
  5. @Anonymous: The method list FindBugs is on is not complete (it in fact only catches the most common ones). We investigated FindBugs and PMD at Lucene; but then moved to the 100 times faster ASM based checker with a full-featured charset/locale/TZ list. We were also able to disable other horrible code patterns like printStackTrace().

    ReplyDelete
    Replies
    1. Probably a bit late now, but it would have been possible to implement a custom FindBugs detector that used ASM and thus leveraged existing build infrastructure (ant/maven/etc). Taking it one step further, if there's a precedence for this check in FindBugs, it might be nice to coordinate with the FindBugs team to have it included by default.

      Delete
  6. @Uwe, the JCP isn't really relevant here. Its mostly a rubber stamping body on Java SE matters. It certainly doesn't initiate ideas. This needs to go into the OpenJDK processes if its going to happen. It needs a JEP and a plan for JDK1.9.

    ReplyDelete
    Replies
    1. @Stephen: JEP is also fine, I just don't know all those terms. From my perspective it has to go through some standardization gremium and that's the biggest issue here.

      @Nathan: "The information implied by the annotation may be misleading. For example, many constructors of Formatter are on the list, because they obtain the default locale." -> From my perspective the ctor is still affected because it returns to you an instance of Formatter that is depending on default locale, so it is affected by default locale, just not direct but indirect. Unless you have a separate subclass like DefaultLocaleFormatter which has all methods marked as unsafe, this is the only way to mark those APIs.

      @Frisian: I would be happy if PMD or FindBugs would check for this correctly, but unfortunately their support for dsetecting this type of problems is horrible incomplete. This is why Lucene implemented its own checker as ANT task using bytecode analysis which is horribly fast (and a MAVEN task is also available by some 3rd party people that forked our code). With the introduction of the annotation this byte code analysis could be extended to go away from static lists, but instead just check for annotations on every invoke VM instruction.

      Delete
    2. I agree that the constructor is affected. By this point I meant only to indicate that the annotation may be misleading. Consider the annotation on String.toLowerCase() and on DateFormat.getInstance(). In the former case, the annotation seems to apply because of what state the method changes. In the latter case, the annotation seems to apply because of what value the method returns. In actuality, the annotation applies in both cases because each method calls (either directly or indirectly) a variant of Locale.getDefault. I think it's the "depends on" wording that gives me some hesitation. Maybe something like @ObtainsJdkDefault could be more narrowly interpreted.

      Delete
  7. Three potential issues:

    (1) The annotation indicates that zero or more paths may be default dependent, not that all paths are default dependent. In this way it may be overly safe. Some checked exceptions suffer from this. (MalformedURLException comes to mind.)

    (2) The information implied by the annotation may be misleading. For example, many constructors of Formatter are on the list, because they obtain the default locale. However, the default locale is not *used* by these constructors. It is the methods of such an instance of formatter that have potentially unsafe behavior.

    (3) The information implied by the annotation may be lost by certain abstractions. For example, String.CASE_INSENSITIVE_ORDER.compare(String, String) depends on String.toLowerCase(). Transitively, String.CASE_INSENSITIVE_ORDER.compare(String, String) is default dependent. Transitively, anything that invokes String.CASE_INSENSITIVE_ORDER.compare(String, String) is default dependent. However, the latter dependency may not be knowable, since the type of the instance of Comparator may not be knowable.

    ReplyDelete
    Replies
    1. (3): This is not true, see javadocs: CASE_INSENSITIVE_ORDER - A Comparator that orders String objects as by compareToIgnoreCase -> referring to -> Compares two strings lexicographically, ignoring case differences. This method returns an integer whose sign is that of calling compareTo with normalized versions of the strings where case differences have been eliminated by calling Character.toLowerCase(Character.toUpperCase(character)) on each character. Note that this method does not take locale into account, and will result in an unsatisfactory ordering for certain locales. The java.text package provides collators to allow locale-sensitive ordering.

      So this constant/method is 100% safe and works as if it would use Locale.ROOT (which uses the string lowercasing implied by locale-independent char-by-char lowercasing, e.g. also implemented by Lucene's LowerCase text analysis filters). Please note: There is no Locale-based character lowercasing, locales only affect strings. Because of this lowercasing every single char is totally different to lowercasing a string (which is another confusing thing for beginners).

      Delete
    2. You are correct. I was mistaking Character.toLowerCase for String.toLowerCase, which are not generally equivalent, of course. Still, the point stands that such an annotation on an implementing or overriding method is effectively lost if the overridden method is virtually invoked. This kind of transitivity is more difficult to discover than it is for other effect annotations like checked exceptions, because the same notion of covariance does not apply.

      Delete
    3. I don't know if it's correct to say that it's 100% safe just because it doesn't use the default locale. Sometimes using Locale.ROOT by default is the bug.

      Delete
  8. Given, that these methods are perfectly fine for a client-side application, wouldn't it be enough to check them with PMD or Checkstyle? The JDK's source code is available and I don't think that a lot locale-dependent methods will be added to it in the future.
    I like the annotation, though. But I'd like to see a more generic variant: @DependsOn("JdkDefaults")

    ReplyDelete
    Replies
    1. "Given, that these methods are perfectly fine for a client-side application"

      This is not always true! A client application is also affected by this! Default charsets and Locales should only be used when the user opens the file from his local filesystem or when you present some formatted text to the user. But this is not true for another very common case - the best example I can give is listed in my own blog post about this (referred to by Stephen): Your application may ship with some text-file resource inside the JAR file (e.g. in Lucene those are language specific stop word files), that is read using Class#getResourceAsStream() wrapped by a Reader without charset. In this case, you are the provider of the resource file inside your application's JAR file and so *you* define the charset! The default charset is in 99% of all cases the wrong one!

      Delete
  9. Great idea, but it seems wasteful to use it for JDK defaults only. You are basically talking about a kind of method effect annotation. Frisian in the comment above had suggested @DependsOn("JdkDefault"), but more correct and fully generic form should be more like @DependsOn("Locale.default"), @DependsOn("TimeZone.default") or @DependsOn("Locale.default", "TimeZone.default", ...), etc using the appropriate "import" information from the source file to shorten the package names. One should be able to use this annotation for defaults in other frameworks and to otherwise annotate method's dependence on a mutable static variables.

    However, just like with other effects annotations (and as with "throws" which is also a special kind of effects annotation) there is a scalability problem. Checked exceptions and "throws" somewhat work around this problem by providing a way of handling exception and/or wrapping them into other types of exceptions. Even though checked exceptions are a powerful and very helpful tool in skillful hands precisely because they can be used as an effect annotation tool that is being enforced by compiler, they still invoke a lot of criticism from many people.

    One should figure first how the corresponding scalability problem should be addressed before implementing any kind of @DependsOn annotation.

    ReplyDelete
    Replies
    1. Annotations need not take Strings, they could take a class name. I don't know if that would apply in a yet designed @DependsOn

      Delete
  10. I wouldn't consider this annotation to be like checked exceptions. It is additional info provided for use. It would be totally up to applications whether they apply the annotation to their own code. I don't see the JDK trying to pass the info on.

    I'm not a fan of string based annotations here either. The string is a magic constant that developers have to learn and is not directly compile time checked.

    ReplyDelete
    Replies
    1. The obvious choice would be to use an enum for that. But, because enums are final and annotations can't be used with a type parameter, said enum could only be extended by the JDK developers. The concept of a @DependsOn annotation is IMHO so useful, that application developers should be able to introduce application-specific categories.
      Besides, @SupressWarnings takes a String, too, and there don't seem to be too many complaints about this fact.

      Delete
  11. Good idea. In a similar vein, I'd love to see compiler/tool support for recognising when you've created a thread or thread pool with a default name. Having usefully-named threads makes diagnosis with VisualVM so much less painful.

    ReplyDelete
    Replies
    1. Hi,
      the tool that is used by Lucene to detect those names (the list with signatures above comes from our tool) can also be used to "disallow/forbid" those signatures, in fact, the Lucene SVN also contains a second signature file containing those methods: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/tools/forbiddenApis/executors.txt?view=markup

      I am planning to release the tool as a custom ANT or Maven plugin (currently we only have a ANT version available). It is configureable with own signature lists. See my blog about this. We run the check after compiling our code on the bytecode of the generated classes.

      Delete
    2. That's good news, Uwe, I'll keep an eye on your blog.

      Delete
  12. I don't get it:

    1) all defaults can be set with system properties + for Locale & TimeZone there are also specific methods to set the default. So for server-side applications you can easily force the defaults you want.

    2) I didn't check all of them, but (almost?) every method on the list has an overloaded method with Locale as a parameter. So for a developer it ought to be obvious that if they don't pass a Locale, Java will have to somehow figure out a "default" value for it.

    3) what if Java gets default parameters some day? Annotate every such method with @DependsOnDefaultValuesOfParametersWhenNotGiven?

    So can someone explain me what all the fuss is about?

    ReplyDelete
    Replies
    1. Hi:

      "2) I didn't check all of them, but (almost?) every method on the list has an overloaded method with Locale as a parameter. So for a developer it ought to be obvious that if they don't pass a Locale, Java will have to somehow figure out a "default" value for it."

      The problem here is that e.g. IDEs like eclipse always provide the simpliest method signature to the user (e.g. Eclipse never offers String.toLowerCase(Locale) by itsself). This leads to code horrible broken in the Turkish Locale.

      Delete
    2. Thus the annotation is useful, not as compiler warning like Deprecated, but useful for and IDE to either flag or resort the possibly choices.

      To the point of using @Deprecated. No, the idea is NOT that you should never use these calls and that there is something new and improved, but that such a call has some limitations. Many of the more explicit calls where designed at the same time as no arg version, so deprecated suggests something that is not true.

      Maybe a lesson for API developers is that a call that can assume defaults might be better represented as an API that allows nulls for certain arguments, or require a call to the pass the actual system default (Timezone.getDefault() etc.) than a version that has NO or fewer args.

      System properties for the whole server is NOT always the answer. Sometimes it is a problem of the data comes from users with different locales.

      Delete
  13. Hi, the Lucene forbidden API checker was forked and released as Ant, Maven or CLI task at https://code.google.com/p/forbidden-apis/, available via Maven Central.

    ReplyDelete

Please be aware that by commenting you provide consent to associate your selected profile with your comment. Long comments or those with excessive links may be deleted by Blogger (not me!). All spam will be deleted.