Sunday 21 February 2010

Serialization - shared delegates

I've been working on Joda-Money as a side project and have been investigating serialization, with a hope of improving JSR-310.

Small serialization

Joda-Money has two key classes: BigMoney, which can store an amount to any scale, and Money, which is limited to the correct number of decimal places for the currency.

 public class BigMoney {
   private final CurrencyUnit currency;
   private final BigDecimal amount;
 }
 public class Money {
   private final BigMoney money;
 }

A default application of serialization to these classes will generate 525 bytes for BigMoney and 599 bytes for Money. This is a lot of data to be sending for objects that seem quite simple.

Where does the size go?

Well, each serialized class has to write a header to state what the class is. For something like Money, the stream has to contain a header for Money itself, BigMoney, CurrencyUnit, BigDecimal and BigInteger. The header also includes the serialization version number and the names of each field.

Of course, serialization is designed to handle complex cases where the versions of the class file differ on two JVMs. Data is populated into the right fields using the field name. But for simple classes like money, the data isn't going to change over time.

One interesting fact is that the class header is only sent once per stream for a class. As a result, the size is reduced for each subsequent object after the first. For default serialization of a subsequent BigMoney the size is 59 bytes and for Money it is 65 bytes. Clearly, the header is a major overhead.
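The once-per-stream header cost can be observed by writing two objects to the same stream and measuring how many bytes each one adds. This is a minimal sketch: `Amount` and `StreamSizes` are hypothetical stand-ins invented for the illustration, not Joda-Money classes, so the byte counts will differ from the figures quoted in this post.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.math.BigDecimal;

// Hypothetical value class using default serialization (not the real BigMoney).
class Amount implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String currencyCode;
    private final BigDecimal amount;

    Amount(String currencyCode, BigDecimal amount) {
        this.currencyCode = currencyCode;
        this.amount = amount;
    }
}

class StreamSizes {
    // Writes each object to the same stream and returns the bytes each one added.
    // The first entry also includes the small stream header written on construction.
    static int[] sizes(Object... objects) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int[] added = new int[objects.length];
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            int before = 0;
            for (int i = 0; i < objects.length; i++) {
                out.writeObject(objects[i]);
                out.flush(); // push buffered data through so size() is accurate
                added[i] = bytes.size() - before;
                before = bytes.size();
            }
        }
        return added;
    }

    public static void main(String[] args) throws IOException {
        int[] added = sizes(
                new Amount("GBP", new BigDecimal("12.34")),
                new Amount("GBP", new BigDecimal("56.78")));
        System.out.println("first=" + added[0] + " bytes, subsequent=" + added[1] + " bytes");
    }
}
```

The second object should come out much smaller than the first, because the class descriptors are replaced by back-references within the stream.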

Making the data smaller

The key to this is using a serialization delegate class. The delegate is a class that is written into the output stream in place of the original class. This approach is required because the fields are final, which prevents a sensible data format from being written/read by the class itself.

 public class Money {
   private final BigMoney money;
   private Object writeReplace() {
     return new Ser( ... );
   }
 }

So, there is a new class Ser which will appear in the stream wherever the Money class would have been. The name Ser is deliberately short, as each letter takes up space in the stream.

The delegate class is usually written as a static inner class:

 public class Money implements Serializable {
   private final BigMoney money;
   private Object writeReplace() {
     return new Ser(this);
   }
   private static class Ser implements Serializable {
     private Money obj;
     private Ser(Money money) {obj = money;}
     private void writeObject(ObjectOutputStream out) throws IOException {
       // write data to stream, avoiding defaultWriteObject()
       // this writes the currency code and amount directly
     }
     private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
       // read data from stream to obj variable, avoiding defaultReadObject()
     }
     private Object readResolve() {
       return obj;
     }
   }
 }

The delegate class uses the low level writeObject and readObject to control the data in the stream. The readResolve method then returns the correct object back for the serialization mechanism to put in the object structure. The class is static to ensure a stable serialized form.

Simply taking control of the stream in this way will greatly reduce the overall size. The biggest gain is in writing out the BigDecimal in an efficient manner.
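One possible way to fill in the delegate's write and read methods is to send the BigDecimal as its unscaled value plus its scale. This is a sketch under assumptions, not the actual Joda-Money format: `Price` and `PriceDemo` are hypothetical names, and the currency is simplified to a plain String code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical simplified value class standing in for BigMoney.
final class Price implements Serializable {
    private static final long serialVersionUID = 1L;
    final String currencyCode;
    final BigDecimal amount;

    Price(String currencyCode, BigDecimal amount) {
        this.currencyCode = currencyCode;
        this.amount = amount;
    }

    // Substitute the compact delegate whenever a Price is serialized.
    private Object writeReplace() {
        return new Ser(this);
    }

    private static final class Ser implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient Price obj;

        Ser(Price obj) {
            this.obj = obj;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            // Assumed format: currency code, unscaled value bytes, then scale.
            out.writeUTF(obj.currencyCode);
            byte[] unscaled = obj.amount.unscaledValue().toByteArray();
            out.writeInt(unscaled.length);
            out.write(unscaled);
            out.writeInt(obj.amount.scale());
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            String code = in.readUTF();
            byte[] unscaled = new byte[in.readInt()];
            in.readFully(unscaled);
            int scale = in.readInt();
            obj = new Price(code, new BigDecimal(new BigInteger(unscaled), scale));
        }

        // Hand the rebuilt Price back in place of the delegate.
        private Object readResolve() {
            return obj;
        }
    }
}

class PriceDemo {
    static Object roundTrip(Object value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Price copy = (Price) roundTrip(new Price("GBP", new BigDecimal("12.34")));
        System.out.println(copy.currencyCode + " " + copy.amount);
    }
}
```

Writing the unscaled bytes and scale avoids the BigDecimal and BigInteger class headers entirely, which is where most of the saving in this sketch comes from.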

Even better?

My investigation has shown a technique to make the stream even smaller.

Firstly, rather than using a static inner class, use a top-level package scoped class. This will have a shorter fully qualified class name, thus a shorter header.

Secondly, look at the other classes in the package. If there are more classes that need the same treatment, why not use a single delegate class for all of them?

 public class BigMoney {
   private final CurrencyUnit currency;
   private final BigDecimal amount;
   private Object writeReplace() {
     return new Ser(Ser.BIG_MONEY, this);
   }
 }
 public class Money {
   private final BigMoney money;
   private Object writeReplace() {
     return new Ser(Ser.MONEY, this);
   }
 }
 class Ser implements Externalizable {
   static final byte BIG_MONEY = 0;
   static final byte MONEY = 1;
   private byte type;
   private Object obj;
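   public Ser() {}  // Externalizable requires a public no-arg constructor
   Ser(byte type, Object obj) {this.type = type; this.obj = obj;}  // package scoped so BigMoney and Money can call it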
   public void writeExternal(ObjectOutput out) throws IOException {
     out.writeByte(type);
     switch (type) {
       // write data to stream based on the type
       // this writes the currency code and amount directly
     }
   }
   public void readExternal(ObjectInput in) throws IOException {
     type = in.readByte();
     switch (type) {
       // read data from stream to obj variable based on the type
     }
   }
   private Object readResolve() {
     return obj;
   }
 }

So, both classes are sharing the same serialization delegate, using a single byte type to distinguish them. Since the header is written once per class per stream, there is now only one header written whether your stream contains BigMoney, Money or both.

I've also switched to using Externalizable rather than Serializable. Despite the public methods, these cannot be called by users of the general API because this is a package scoped class. This change doesn't affect the stream size, but should perform faster (untested!) as there is less reflection involved.
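A filled-in version of the shared delegate might look like the following. It is only a sketch under assumptions: the currency is simplified to a String code rather than a CurrencyUnit, the wire format (type byte, code, unscaled bytes, scale) is invented for the illustration, and `SharedDelegateDemo` is a hypothetical helper.

```java
import java.io.*;
import java.math.BigDecimal;
import java.math.BigInteger;

// Simplified stand-ins: the currency is a plain String code, not a CurrencyUnit.
final class BigMoney implements Serializable {
    private static final long serialVersionUID = 1L;
    final String currencyCode;
    final BigDecimal amount;

    BigMoney(String currencyCode, BigDecimal amount) {
        this.currencyCode = currencyCode;
        this.amount = amount;
    }

    private Object writeReplace() {
        return new Ser(Ser.BIG_MONEY, this);
    }
}

final class Money implements Serializable {
    private static final long serialVersionUID = 1L;
    final BigMoney money;

    Money(BigMoney money) {
        this.money = money;
    }

    private Object writeReplace() {
        return new Ser(Ser.MONEY, this);
    }
}

// One package-scoped delegate shared by both classes.
final class Ser implements Externalizable {
    private static final long serialVersionUID = 1L;
    static final byte BIG_MONEY = 0;
    static final byte MONEY = 1;
    private byte type;
    private Object obj;

    public Ser() {} // Externalizable needs a public no-arg constructor

    Ser(byte type, Object obj) {
        this.type = type;
        this.obj = obj;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeByte(type);
        // Both types carry the same data here, so both write the amount directly.
        BigMoney bm = (type == MONEY) ? ((Money) obj).money : (BigMoney) obj;
        out.writeUTF(bm.currencyCode);
        byte[] unscaled = bm.amount.unscaledValue().toByteArray();
        out.writeInt(unscaled.length);
        out.write(unscaled);
        out.writeInt(bm.amount.scale());
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        type = in.readByte();
        String code = in.readUTF();
        byte[] unscaled = new byte[in.readInt()];
        in.readFully(unscaled);
        int scale = in.readInt();
        BigMoney bm = new BigMoney(code, new BigDecimal(new BigInteger(unscaled), scale));
        switch (type) {
            case BIG_MONEY: obj = bm; break;
            case MONEY: obj = new Money(bm); break;
            default: throw new StreamCorruptedException("Unknown type: " + type);
        }
    }

    private Object readResolve() {
        return obj;
    }
}

class SharedDelegateDemo {
    static Object roundTrip(Object value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Money copy = (Money) roundTrip(new Money(new BigMoney("GBP", new BigDecimal("12.34"))));
        System.out.println(copy.money.currencyCode + " " + copy.money.amount); // prints "GBP 12.34"
    }
}
```

Note that the constructors used by the delegate are package scoped in this sketch, which is the loss of encapsulation that comes with moving serialization out of the class.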

With these changes, the stream size for sending one BigMoney or Money drops to 58 bytes from 525/599 bytes. Sending a subsequent object of the same type drops to 24 bytes, whereas the default would be 59/65 bytes.

The single shared delegate approach also results in a smaller jar file, as there is a large jar file size overhead for each separate class. (We've replaced two delegates by one, so the jar is smaller).

One downside with this approach is that serialization is no longer encapsulated within the class being serialized. This may result in a constructor becoming package scoped rather than private.

The approach is also only recommended where the class and serialized format are stable, as you are fully responsible for evolution over time of the data format.

A final downside is that the object identity of objects might not be preserved. For example, if the data of the BigDecimal is written out rather than a reference to the object then a new BigDecimal object will be created for each BigMoney deserialized. The extent to which this is a problem is dependent on the memory structure being serialized.

The same problem applies to multiple Money objects backed by the same BigMoney. The default serialized size for the second would be just 10 bytes, whereas the basic shared delegate approach would be 24 bytes.

As a result, I recommend only writing the base class, BigMoney in this case, directly using its contents. Other classes that contain the base class, Money in this case, should write out a reference to the BigMoney from the shared delegate. This approach means that the second Money takes 14 bytes when the BigMoney is shared and 34 bytes when it isn't.
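The recommendation above can be sketched as follows: the delegate writes BigMoney by its contents, but writes Money as a reference to its BigMoney using writeObject, so the stream can back-reference a shared instance. As before this is an illustration under assumptions (String currency code, invented wire format, hypothetical `SharedReferenceDemo` helper), not the real Joda-Money code.

```java
import java.io.*;
import java.math.BigDecimal;
import java.math.BigInteger;

final class BigMoney implements Serializable {
    private static final long serialVersionUID = 1L;
    final String currencyCode; // simplified: a String code, not a CurrencyUnit
    final BigDecimal amount;

    BigMoney(String currencyCode, BigDecimal amount) {
        this.currencyCode = currencyCode;
        this.amount = amount;
    }

    private Object writeReplace() {
        return new Ser(Ser.BIG_MONEY, this);
    }
}

final class Money implements Serializable {
    private static final long serialVersionUID = 1L;
    final BigMoney money;

    Money(BigMoney money) {
        this.money = money;
    }

    private Object writeReplace() {
        return new Ser(Ser.MONEY, this);
    }
}

final class Ser implements Externalizable {
    private static final long serialVersionUID = 1L;
    static final byte BIG_MONEY = 0;
    static final byte MONEY = 1;
    private byte type;
    private Object obj;

    public Ser() {} // Externalizable needs a public no-arg constructor

    Ser(byte type, Object obj) {
        this.type = type;
        this.obj = obj;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeByte(type);
        if (type == MONEY) {
            // Write a reference, so a shared BigMoney is only written in full once.
            out.writeObject(((Money) obj).money);
            return;
        }
        BigMoney bm = (BigMoney) obj;
        out.writeUTF(bm.currencyCode);
        byte[] unscaled = bm.amount.unscaledValue().toByteArray();
        out.writeInt(unscaled.length);
        out.write(unscaled);
        out.writeInt(bm.amount.scale());
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        type = in.readByte();
        if (type == MONEY) {
            obj = new Money((BigMoney) in.readObject());
        } else if (type == BIG_MONEY) {
            String code = in.readUTF();
            byte[] unscaled = new byte[in.readInt()];
            in.readFully(unscaled);
            int scale = in.readInt();
            obj = new BigMoney(code, new BigDecimal(new BigInteger(unscaled), scale));
        } else {
            throw new StreamCorruptedException("Unknown type: " + type);
        }
    }

    private Object readResolve() {
        return obj;
    }
}

class SharedReferenceDemo {
    // Writes all values to one stream, then reads them back in the same order.
    static Object[] roundTripAll(Object... values) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (Object value : values) {
                out.writeObject(value);
            }
        }
        Object[] copies = new Object[values.length];
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            for (int i = 0; i < copies.length; i++) {
                copies[i] = in.readObject();
            }
        }
        return copies;
    }

    public static void main(String[] args) throws Exception {
        BigMoney shared = new BigMoney("GBP", new BigDecimal("12.34"));
        Object[] copies = roundTripAll(new Money(shared), new Money(shared));
        boolean same = ((Money) copies[0]).money == ((Money) copies[1]).money;
        System.out.println("shared BigMoney preserved: " + same);
    }
}
```

The trade-off is one extra level of object in the stream for every Money, which is why an unshared Money costs a few more bytes than in the basic shared delegate approach.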

Using this final approach, the figures (in bytes) are as follows:

                              Default serialization     Shared delegate
Object                        First sent  Subsequent    First sent  Subsequent
BigMoney                      525         59            58          24
Money                         599         65            68          34
Money with shared BigMoney    599         10            68          14

Summary

The shared delegate technique offers one route to the smallest stream size for serialization. The data size for the first object was roughly a tenth of the original, and roughly halved for subsequent objects, the exception being a Money that shares its BigMoney, where the default form is slightly smaller (10 bytes against 14). However, I would recommend this as a specialist technique for low level value objects rather than general beans.

So is this worth applying to JSR-310? Feedback welcome!

Monday 8 February 2010

New job - impact on JSR-310

This is a quick blog to outline my upcoming job change and how it affects JSR-310.

For many years I've worked for SITA, global leader in air transport communications and IT solutions. But the time has come to move on, so from the 1st of March I'm starting a new job at a London startup, OpenGamma.

So, what can I tell you about OpenGamma? Well, not too much just yet, as it's only just coming out of stealth mode. I can say they're led by Kirk Wylie, they're building technology for the financial industry, and I'm excited about their big idea! Oh, and they're hiring (London only).

And how does this affect JSR-310?

Well, OpenGamma will be actively supporting my work on JSR-310 in work time! Clearly this will have a big impact on development pace, and we may yet make JDK 7 (but of course that's up to the SunOracle).

In the meantime, watch out for the Early Draft Review of JSR-310 where I'll need maximum feedback!