Saturday 25 September 2021

Big problems at the timezone database

The last time I wrote about the timezone database on this blog, the database was under threat from a lawsuit. Fortunately that lawsuit went away relatively quickly as the company involved got the message that their action was a big mistake. Unfortunately this time the mess is internal.

Paul Eggert is the project lead of the timezone database hosted at IANA, a position referred to as the TZ Coordinator. He is an expert in the field, having been involved in documenting timezone data for decades. Unfortunately, he is currently ignoring all objections to an action only he seems intent on making to solve an invented problem that only he sees as important.

The database is the world's principle source of timezone information. The data is included in everything from operating systems to smartphones to programming language development kits such as the JDK. While you may never have heard of it, the sheer pervasiveness of the data makes the potential impact of change or damage pretty huge.

The timezone database contains information about how clocks have varies in each region around the world. The mandate of the project is to record this information from 1970 onwards.

Of course, computers being what they are, a function that returns the timezone for a given date can be passed in a pre-1970 date as well as a post-1970 one. For this, and reasons of completeness, the timezone database contains pre-1970 data as well as post-1970 data. If you go to your JDK or operating system and ask for the timezone offset for 1920-01-01 for the ID "Europe/Oslo" or "Europe/Berlin" you will get an answer:

  DateTimeZone oslo = DateTimeZone.forID("Europe/Oslo");
  System.out.println(oslo.getOffset(new DateTime(1948, 6, 1, 12, 0)));  //prints 3600000
  DateTimeZone berlin = DateTimeZone.forID("Europe/Berlin");
  System.out.println(berlin.getOffset(new DateTime(1948, 6, 1, 12, 0)));  //prints 7200000

The proposed change is to downgrade "Europe/Oslo" to be merely an alias for "Europe/Berlin". The rationale is that since the two regions have the same data post-1970 there should only be one ID. The issue with this is that querying "Europe/Oslo" in pre-1970 (as above) will now return the data from Berlin. ie. the well researched pre-1970 data for Oslo will be replaced by that of Berlin.

The situation with Joda-Time is even worse. With Joda-Time, aliases (also known as Links) are actively resolved. Before the proposed change this test case passes, after the proposed change the test case fails:

  assertEquals("Europe/Oslo", DateTimeZone.forID("Europe/Oslo").getID());

In other words, it will be impossible for a Joda-Time user to hold the ID "Europe/Oslo" in memory. This could be pretty catastrophic for systems that rely on timezone management, particularly ones that end up storing that data in a database. To mitigate this to some degree, I've added a test case and released a version that has such a test case in it, but obviously users of Joda-Time that haven't upgraded to the latest version may still see problems if they update the tzdb version.

But what about backwards compatibility? Well it seems that the TZ Coordinator just doesn't see this as being important. To him, it doesn't matter that users may be relying on this data.

(Technically, the data has moved, not been deleted. But the file containing the moved data is never normally used by downstream systems, thus to all intents and purposes it has been deleted.)

The question you might ask is why is Berlin favoured over Oslo? Why can Berlin keep its status and full history, but Oslo gets effectively deleted? The answer is that Berlin has the greater population - would you have guessed that if you didn't read it here?

From my perspective, I cannot see how it is not incredibly unfair that the timezone ID that represents the country of Norway, "Europe/Oslo", is treated as less important than the ID that represents the country of Germany, "Europe/Berlin". Yes, there is a technical rationale around 1970 and population, but that is really pretty arcane.

In fact it goes further than this. The TZ Coordinator does not really believe that there should be an ID for Oslo/Norway at all. (The official project rules say that there should only be one ID for locations where timezone data is the same post-1970. Country borders simply dont matter.)

Some of the 30 IDs proposed for downgrade are "Europe/Oslo", "Europe/Stockholm", "Europe/Copenhagen", "Europe/Amsterdam", "Europe/Luxembourg", "Europe/Monaco", "Atlantic/Reykjavik" and "Indian/Mahe". If you regularly use any of these IDs you may be affected by this change. Iceland is a classic case - "Atlantic/Reykjavik" is to be downgraded in favour of "Africa/Abidjan". Bonus points if you know which country Abidjan is in!

What is driving the change? Well this is where it gets really weird. The TZ Coordinator's argument is that there is a fairness/equity problem if Oslo is allowed to keep its pre-1970 history but other locations (typically in Africa) are not. I have two problems with this. Firstly, I consider it to also be unfair and inequitable that Berlin gets to have pre-1970 history and Oslo does not. Secondly, the correct approach to solving a fairness/equity problem is to level up, not level down. ie. Most leaders would want to improve the worse performers on the team, not force the best performers to be as bad as the worst.

So, we have a change with terrible downstream effects, including a potential fork of a major global data set and broken end-user applications, to make a problem that no one was complaining about a whole lot worse.

I've spent months trying to stop this happening, but appear to have lost the battle. This is despite near unaminity on the mailing list requesting the changes to be paused. Tonight 9 of the 30 changes have been included in release 2021b. These are not the ones affecting Europe.

I still hope that a solution can be found that the TZ Coordinator is happy with that avoids an impact on countries like Norway, Sweden, Denmark or the Netherlands. In the medium term, I hope that funding can be found for the CLDR project to take on the timezone database (as CLDR has a much better record at managing data like this).

Stay tuned as I try and work out how best to resolve this completely unecessary drama.