Sunday, 21 February 2010

Serialization - shared delegates

I've been working on Joda-Money as a side project and have been investigating serialization, with a hope of improving JSR-310.

Small serialization

Joda-Money has two key classes: BigMoney, capable of storing an amount to any scale, and Money, limited to the correct number of decimal places for the currency.

 public class BigMoney {
   private final CurrencyUnit currency;
   private final BigDecimal amount;
 }
 public class Money {
   private final BigMoney money;
 }

A default application of serialization to these classes will generate 525 bytes for BigMoney and 599 bytes for Money. This is a lot of data to be sending for objects that seem quite simple.

Where does the size go?

Well, each serialized class has to write a header to state what the class is. For something like Money, it has to write a header for itself, BigMoney, CurrencyUnit, BigDecimal and BigInteger. The header also includes the serialization version number and the names of each field.

Of course, serialization is designed to handle complex cases where the versions of the class file differ on two JVMs. Data is populated into the right fields using the field name. But for simple classes like money, the data isn't going to change over time.

One interesting fact is that the class header is only sent once per stream for a class. As a result, each subsequent object of that class after the first takes less space. For default serialization of a subsequent BigMoney the size is 59 bytes and for Money it is 65 bytes. Clearly, the header is a major overhead.
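The effect of the once-per-stream header can be observed with a small measurement harness. The following is a minimal sketch, not Joda-Money code: PlainAmount is a hypothetical stand-in for BigMoney (with a String in place of CurrencyUnit) and StreamSizes is a hypothetical helper, so the byte counts it prints will differ from the figures quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.math.BigDecimal;

// Hypothetical stand-in for BigMoney that relies on default serialization.
final class PlainAmount implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String currencyCode; // assumption: a String replaces CurrencyUnit
    private final BigDecimal amount;

    PlainAmount(String currencyCode, BigDecimal amount) {
        this.currencyCode = currencyCode;
        this.amount = amount;
    }
}

// Hypothetical helper for measuring how a stream grows as objects are written.
final class StreamSizes {
    // Writes all objects to ONE stream and returns the total stream length
    // (in bytes) observed after each object has been written.
    static int[] cumulativeSizes(Object... objects) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int[] sizes = new int[objects.length];
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < objects.length; i++) {
                out.writeObject(objects[i]);
                out.flush(); // push buffered data through before measuring
                sizes[i] = bytes.size();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] sizes = cumulativeSizes(
                new PlainAmount("GBP", new BigDecimal("12.34")),
                new PlainAmount("USD", new BigDecimal("56.78")));
        // The first figure includes the stream header and all class headers.
        System.out.println("first object:  " + sizes[0] + " bytes");
        System.out.println("second object: " + (sizes[1] - sizes[0]) + " bytes");
    }
}
```

The second object should come out much smaller than the first, because the class descriptors are replaced by short back-references within the same stream.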

Making the data smaller

The key to this is using a serialization delegate class. The delegate is a class that is written into the output stream in place of the original class. This approach is required because the fields are final which prevents a sensible data format from being written/read by the class itself.

 public class Money {
   private final BigMoney money;
   private Object writeReplace() {
     return new Ser( ... );
   }
 }

So, there is a new class Ser which will appear in the stream wherever the Money class would have been. The name Ser is deliberately short, as each letter takes up space in the stream.

The delegate class is usually written as a static inner class:

 public class Money implements Serializable {
   private final BigMoney money;
   private Object writeReplace() {
     return new Ser(this);
   }
   private static class Ser implements Serializable {
     private Money obj;
     private Ser(Money money) {obj = money;}
     private void writeObject(ObjectOutputStream out) throws IOException {
       // write data to stream, avoiding defaultWriteObject()
       // this writes the currency code and amount directly
     }
     private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
       // read data from stream to obj variable, avoiding defaultReadObject()
     }
     private Object readResolve() {
       return obj;
     }
   }
 }

The delegate class uses the low level writeObject and readObject to control the data in the stream. The readResolve method then returns the correct object back for the serialization mechanism to put in the object structure. The class is static to ensure a stable serialized form.

Simply taking control of the stream in this way will greatly reduce the overall size. The biggest gain is in writing out the BigDecimal in an efficient manner.
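To make the two comment placeholders above concrete, here is one way the delegate's writeObject and readObject might be filled in. This is a sketch under assumptions rather than the actual Joda-Money code: CompactMoney is a hypothetical single class standing in for Money, the currency is a plain String, and the wire format (UTF currency code, unscaled value bytes, scale) is just one reasonable choice. The delegate's field is marked transient so that it does not add to the class header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical stand-in for Money; a String replaces CurrencyUnit.
final class CompactMoney implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String currencyCode;
    private final BigDecimal amount;

    CompactMoney(String currencyCode, BigDecimal amount) {
        this.currencyCode = currencyCode;
        this.amount = amount;
    }

    String getCurrencyCode() { return currencyCode; }
    BigDecimal getAmount() { return amount; }

    // The delegate is written to the stream in place of this object.
    private Object writeReplace() { return new Ser(this); }

    private static final class Ser implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient CompactMoney obj; // transient: not part of the header

        private Ser(CompactMoney obj) { this.obj = obj; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            // Write the BigDecimal as raw unscaled bytes plus a scale,
            // instead of letting BigDecimal/BigInteger serialize themselves.
            byte[] unscaled = obj.amount.unscaledValue().toByteArray();
            out.writeUTF(obj.currencyCode);
            out.writeInt(unscaled.length);
            out.write(unscaled);
            out.writeInt(obj.amount.scale());
        }

        private void readObject(ObjectInputStream in) throws IOException {
            String code = in.readUTF();
            byte[] unscaled = new byte[in.readInt()];
            in.readFully(unscaled);
            int scale = in.readInt();
            obj = new CompactMoney(code, new BigDecimal(new BigInteger(unscaled), scale));
        }

        // Hands the rebuilt object back to the serialization mechanism.
        private Object readResolve() { return obj; }
    }
}

// Hypothetical round-trip helper used to exercise the sketch.
final class CompactRoundTrip {
    static byte[] toBytes(Object obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } catch (ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Calling CompactRoundTrip.fromBytes(CompactRoundTrip.toBytes(money)) should yield a CompactMoney with the same currency code and amount, without any BigDecimal or BigInteger class header appearing in the stream.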

Even better?

My investigation has shown a technique to make the stream even smaller.

Firstly, rather than using a static inner class, use a top-level package scoped class. This will have a shorter fully qualified class name, thus a shorter header.

Secondly, look at the other classes in the package. If there are more classes that need the same treatment, why not use a single delegate class for all of them?

 public class BigMoney {
   private final CurrencyUnit currency;
   private final BigDecimal amount;
   private Object writeReplace() {
     return new Ser(Ser.BIG_MONEY, this);
   }
 }
 public class Money {
   private final BigMoney money;
   private Object writeReplace() {
     return new Ser(Ser.MONEY, this);
   }
 }
 class Ser implements Externalizable {
   static final byte BIG_MONEY = 0;
   static final byte MONEY = 1;
   private byte type;
   private Object obj;
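   public Ser() {}  // Externalizable requires a public no-arg constructor
   // package scoped (not private) so that BigMoney and Money can call it
   Ser(byte type, Object obj) {this.type = type; this.obj = obj;}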
   public void writeExternal(ObjectOutput out) throws IOException {
     out.writeByte(type);
     switch (type) {
       // write data to stream based on the type
       // this writes the currency code and amount directly
     }
   }
   public void readExternal(ObjectInput in) throws IOException {
     type = in.readByte();
     switch (type) {
       // read data from stream to obj variable based on the type
     }
   }
   private Object readResolve() {
     return obj;
   }
 }

So, both classes are sharing the same serialization delegate, using a single byte type to distinguish them. Since the header is written once per class per stream, there is now only one header written whether your stream contains BigMoney, Money or both.

I've also switched to using Externalizable rather than Serializable. Despite the public methods, these cannot be called on the general API because this is a package scoped class. This change doesn't affect the stream size, but should perform faster (untested!) as there is less reflection involved.
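A fuller version of the shared delegate, with the switch bodies filled in, might look like the sketch below. The names DemoBigMoney, DemoMoney, SharedSer and SharedSerDemo are hypothetical stand-ins, a String replaces CurrencyUnit, and the wire format is an assumption. Note the public no-argument constructor, which Externalizable needs in order to create the delegate when reading.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.StreamCorruptedException;
import java.io.UncheckedIOException;
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical stand-ins for BigMoney and Money.
final class DemoBigMoney implements Serializable {
    private static final long serialVersionUID = 1L;
    final String currencyCode;
    final BigDecimal amount;
    DemoBigMoney(String currencyCode, BigDecimal amount) {
        this.currencyCode = currencyCode;
        this.amount = amount;
    }
    private Object writeReplace() { return new SharedSer(SharedSer.BIG_MONEY, this); }
}

final class DemoMoney implements Serializable {
    private static final long serialVersionUID = 1L;
    final DemoBigMoney money;
    DemoMoney(DemoBigMoney money) { this.money = money; }
    private Object writeReplace() { return new SharedSer(SharedSer.MONEY, this); }
}

// One package-scoped delegate shared by both classes.
final class SharedSer implements Externalizable {
    private static final long serialVersionUID = 1L;
    static final byte BIG_MONEY = 0;
    static final byte MONEY = 1;
    private byte type;
    private Object obj;

    public SharedSer() { } // required by Externalizable

    SharedSer(byte type, Object obj) { this.type = type; this.obj = obj; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeByte(type);
        // Both types are written as currency code + unscaled bytes + scale.
        DemoBigMoney big = (type == MONEY) ? ((DemoMoney) obj).money : (DemoBigMoney) obj;
        byte[] unscaled = big.amount.unscaledValue().toByteArray();
        out.writeUTF(big.currencyCode);
        out.writeInt(unscaled.length);
        out.write(unscaled);
        out.writeInt(big.amount.scale());
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        type = in.readByte();
        String code = in.readUTF();
        byte[] unscaled = new byte[in.readInt()];
        in.readFully(unscaled);
        int scale = in.readInt();
        DemoBigMoney big = new DemoBigMoney(code, new BigDecimal(new BigInteger(unscaled), scale));
        switch (type) {
            case BIG_MONEY: obj = big; break;
            case MONEY: obj = new DemoMoney(big); break;
            default: throw new StreamCorruptedException("Unknown type: " + type);
        }
    }

    private Object readResolve() { return obj; }
}

// Hypothetical round-trip helper used to exercise the sketch.
final class SharedSerDemo {
    static Object roundTrip(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } catch (ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

In this sketch the only class header in the stream is the one for SharedSer, whichever of the two value classes is written.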

With these changes, the stream size for sending one BigMoney or Money drops to 58 bytes from 525/599 bytes. Sending a subsequent object of the same type drops to 24 bytes, whereas the default would be 59/65 bytes.

The single shared delegate approach also results in a smaller jar file, as there is a large jar file size overhead for each separate class. (We've replaced two delegates by one, so the jar is smaller).

One downside with this approach is that serialization is no longer encapsulated within the class being serialized. This may result in a constructor becoming package scoped rather than private.

The approach is also only recommended where the class and serialized format is stable, as you are fully responsible for evolution over time of the data format.

A final downside is that the object identity of objects might not be preserved. For example, if the data of the BigDecimal is written out rather than a reference to the object then a new BigDecimal object will be created for each BigMoney deserialized. The extent to which this is a problem is dependent on the memory structure being serialized.

The same problem applies to multiple Money objects backed by the same BigMoney. The default serialized size for the second would be just 10 bytes, whereas the basic shared delegate approach would be 24 bytes.

As a result, I recommend only writing the base class, BigMoney in this case, directly using its contents. Other classes that contain the base class, Money in this case, should write out a reference to the BigMoney from the shared delegate. This approach means that the second Money takes 14 bytes when the BigMoney is shared and 34 bytes when it isn't.
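The recommendation above can be sketched as follows, again with hypothetical stand-in classes (RefBigMoney, RefMoney, RefSer, RefDemo) and an assumed wire format. The only change from a contents-only delegate is that the MONEY case calls writeObject on the wrapped base object, so the stream can emit a back-reference when the same instance is written twice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.StreamCorruptedException;
import java.io.UncheckedIOException;
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical stand-ins for BigMoney and Money; a String replaces CurrencyUnit.
final class RefBigMoney implements Serializable {
    private static final long serialVersionUID = 1L;
    final String currencyCode;
    final BigDecimal amount;
    RefBigMoney(String currencyCode, BigDecimal amount) {
        this.currencyCode = currencyCode;
        this.amount = amount;
    }
    private Object writeReplace() { return new RefSer(RefSer.BIG_MONEY, this); }
}

final class RefMoney implements Serializable {
    private static final long serialVersionUID = 1L;
    final RefBigMoney money;
    RefMoney(RefBigMoney money) { this.money = money; }
    private Object writeReplace() { return new RefSer(RefSer.MONEY, this); }
}

final class RefSer implements Externalizable {
    private static final long serialVersionUID = 1L;
    static final byte BIG_MONEY = 0;
    static final byte MONEY = 1;
    private byte type;
    private Object obj;

    public RefSer() { } // required by Externalizable

    RefSer(byte type, Object obj) { this.type = type; this.obj = obj; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeByte(type);
        switch (type) {
            case BIG_MONEY: { // base class: write its contents directly
                RefBigMoney big = (RefBigMoney) obj;
                byte[] unscaled = big.amount.unscaledValue().toByteArray();
                out.writeUTF(big.currencyCode);
                out.writeInt(unscaled.length);
                out.write(unscaled);
                out.writeInt(big.amount.scale());
                break;
            }
            case MONEY: // wrapper: write a reference to the base object
                out.writeObject(((RefMoney) obj).money);
                break;
            default:
                throw new InvalidClassException("Unknown type: " + type);
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        type = in.readByte();
        switch (type) {
            case BIG_MONEY: {
                String code = in.readUTF();
                byte[] unscaled = new byte[in.readInt()];
                in.readFully(unscaled);
                int scale = in.readInt();
                obj = new RefBigMoney(code, new BigDecimal(new BigInteger(unscaled), scale));
                break;
            }
            case MONEY:
                obj = new RefMoney((RefBigMoney) in.readObject());
                break;
            default:
                throw new StreamCorruptedException("Unknown type: " + type);
        }
    }

    private Object readResolve() { return obj; }
}

// Hypothetical round-trip helper used to exercise the sketch.
final class RefDemo {
    static Object roundTrip(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

If two RefMoney objects wrapping the same RefBigMoney are written to one stream (for example as elements of an Object[]), the deserialized pair should again share a single RefBigMoney instance.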

Using this final approach, the figures (in bytes) are as follows:

Object                       Default serialization      Shared delegate
                             First sent   Subsequent    First sent   Subsequent
BigMoney                     525          59            58           24
Money                        599          65            68           34
Money with shared BigMoney   599          10            68           14

Summary

The shared delegate technique offers one route to the smallest stream size for serialization. The data size for the first object was roughly a tenth of the original, and roughly halved for subsequent objects. However, I would recommend this as a specialist technique for low level value objects rather than general beans.

So is this worth applying to JSR-310? Feedback welcome!

5 comments:

  1. I think that if you drop the byte field Ser#type, and instead have one top level serialization delegate per class, then the size of the serialized data would be smaller on average.

  2. A typo in the last example: The assignment to field "type" is missing in class "Ser".

    Shouldn't "type" be native?

    What about object creation count during serialization? Have you done any performance analysis?

  3. Nice to see serialisation format being looked at, rather than slapping on `implements Serializable` in a base class and forgetting.

    Other common JDK (and JSR) classes don't do this sort of optimisation, so I wonder if it is worth it. Also note that adding classes and reflection artifacts will up the per-process overhead (although there may be benefits through lazy loading and the like).

    API serial format documentation may be difficult. Even non-public/protected classes have a public serial format, which obviously needs to be specified by the JSR (including the integer constants).

    Default access instances of Externalisable implementations can have their readExternal/writeExternal methods called. Deserialisation generally allows creating such instances, although the forward process of serialisation shouldn't expose internal instances. It doesn't look like there is a particular problem with that in this case.

    The serialisation spec (section 1.10) says that serialisable inner classes are strongly discouraged. I think this should apply to nested classes as well. http://java.sun.com/javase/6/docs/platform/serialization/spec/serial-arch.html#4539

    readObject/writeObject are required (section 3.4) to call either defaultReadObject/defaultWriteObject or readFields/putFields. OTOH, many JDK classes do not do this. http://java.sun.com/javase/6/docs/platform/serialization/spec/input.html#2971

  4. Hi Stephen,

    I like your basic idea, it's similar to Josh Bloch's "serialization proxy". However, from the perspective of object-oriented design, a single Ser class with a switch doesn't look appealing to me. I'd prefer Ser as base class for MoneySer and BigMoneySer, which may or may not be nested in Money and BigMoney respectively. I don't see why nested class would be discouraged by serialization spec - Josh Bloch uses them in his pattern, and he is proficient in JVM internals.

    Yardena.

  5. Stephen Colebourne, 23 February 2010 13:14

    @Craig - Having one top level delegate per class produces a one byte smaller stream at the expense of a much larger jar file.

    @Tom - I'm pretty sure that static nested classes are fine for serialization - it's inner classes that aren't. I tested readObject/writeObject, and if all fields are transient, then it makes no difference to the stream.

    @Yardena, having an abstract superclass makes matters worse as there are now object headers for Ser, MoneySer and BigMoneySer. Plus there isn't anything of value to share. Sometimes, OO isn't the right solution.
