I've been working on Joda-Money as a side project and have been investigating serialization, in the hope of improving JSR-310.
Small serialization
Joda-Money has two key classes: BigMoney, capable of storing information to any scale, and Money, limited to the correct number of decimal places for the currency.
    public class BigMoney {
        private final CurrencyUnit currency;
        private final BigDecimal amount;
    }

    public class Money {
        private final BigMoney money;
    }
Applying default serialization to these classes generates 525 bytes for BigMoney and 599 bytes for Money.
This is a lot of data to be sending for objects that seem quite simple.
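A rough way to reproduce this kind of measurement is to serialize an object into a byte array and count the bytes. The sketch below uses hypothetical stand-in classes (`Unit`, `Big`, `Wrapped`) rather than the real Joda-Money classes, so the byte counts it prints will not match the figures quoted here; it only shows the measuring technique.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.math.BigDecimal;

class SizeDemo {
    // Hypothetical stand-in for CurrencyUnit.
    static final class Unit implements Serializable {
        private final String code;
        Unit(String code) { this.code = code; }
    }

    // Hypothetical stand-in for BigMoney, relying on default serialization.
    static final class Big implements Serializable {
        private final Unit currency;
        private final BigDecimal amount;
        Big(Unit currency, BigDecimal amount) {
            this.currency = currency;
            this.amount = amount;
        }
    }

    // Hypothetical stand-in for Money, which simply wraps a Big.
    static final class Wrapped implements Serializable {
        private final Big money;
        Wrapped(Big money) { this.money = money; }
    }

    // Serializes one object into a fresh stream and returns the byte count.
    static int serializedSize(Object obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Big big = new Big(new Unit("GBP"), new BigDecimal("12.34"));
        System.out.println("Big:     " + serializedSize(big) + " bytes");
        System.out.println("Wrapped: " + serializedSize(new Wrapped(big)) + " bytes");
    }
}
```

The wrapper comes out larger than the object it wraps, because its stream contains everything the wrapped object writes plus one more class header.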
Where does the size go?
Well, each serialized class has to write a header to state what the class is.
For something like Money, it has to write a header for itself, BigMoney, CurrencyUnit, BigDecimal and BigInteger.
The header also includes the serialization version number and the names of each field.
Of course, serialization is designed to handle complex cases where the versions of the class file differ on two JVMs. Data is populated into the right fields using the field name. But for simple classes like money, the data isn't going to change over time.
One interesting fact is that the class header is only sent once per stream for a class.
As a result, the size is reduced for each subsequent object after the first.
For default serialization of a subsequent BigMoney the size is 59 bytes, and for Money it is 65 bytes. Clearly, the header is a major overhead.
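The once-per-stream behaviour can be observed with nothing but JDK classes. This sketch (the class and method names are my own, purely for illustration) writes several distinct objects to one stream, flushing after each, and records how many bytes each one added.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.math.BigDecimal;

class HeaderOnceDemo {
    // Returns the number of bytes each object adds to a single shared stream.
    static int[] incrementalSizes(Object... objects) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int[] sizes = new int[objects.length];
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.flush(); // push the stream's own header so it is not counted
            int previous = bytes.size();
            for (int i = 0; i < objects.length; i++) {
                out.writeObject(objects[i]);
                out.flush();
                sizes[i] = bytes.size() - previous;
                previous = bytes.size();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Two distinct objects of the same class; the second reuses the class headers.
        int[] sizes = incrementalSizes(new BigDecimal("12.34"), new BigDecimal("56.78"));
        System.out.println("first:  " + sizes[0] + " bytes");
        System.out.println("second: " + sizes[1] + " bytes");
    }
}
```

The objects must be distinct instances. Writing the same instance twice would only emit a back-reference the second time, which measures something different.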
Making the data smaller
The key to this is using a serialization delegate class.
The delegate is a class that is written into the output stream in place of the original class.
This approach is required because the fields are final, which prevents a sensible data format from being written/read by the class itself.
    public class Money {
        private final BigMoney money;

        private Object writeReplace() {
            return new Ser( ... );
        }
    }
So, there is a new class Ser which will appear in the stream wherever the Money class would have been.
The name Ser is deliberately short, as each letter takes up space in the stream.
The delegate class is usually written as a static inner class:
    public class Money implements Serializable {
        private final BigMoney money;

        private Object writeReplace() {
            return new Ser(this);
        }

        private static class Ser implements Serializable {
            private Money obj;

            private Ser(Money money) {
                obj = money;
            }

            private void writeObject(ObjectOutputStream out) throws IOException {
                // write data to stream, avoiding defaultWriteObject()
                // this writes the currency code and amount directly
            }

            private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
                // read data from stream to obj variable, avoiding defaultReadObject()
            }

            private Object readResolve() {
                return obj;
            }
        }
    }
The delegate class uses the low level writeObject and readObject to control the data in the stream. The readResolve method then returns the correct object back for the serialization mechanism to put in the object structure.
The class is static to ensure a stable serialized form.
Simply taking control of the stream in this way will greatly reduce the overall size.
The biggest gain is in writing out the BigDecimal in an efficient manner.
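To make the delegate pattern concrete, here is a minimal runnable sketch. `Cash` is a hypothetical stand-in for Money, and the wire format (currency code as UTF, then the length-prefixed unscaled value, then the scale) is an assumption chosen for illustration, not the format Joda-Money actually uses. I have also marked the delegate's field `transient` so it does not appear in the class header, which is a choice of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.math.BigDecimal;
import java.math.BigInteger;

class DelegateDemo {
    // Hypothetical stand-in for Money with final fields.
    static final class Cash implements Serializable {
        private final String currencyCode;
        private final BigDecimal amount;

        Cash(String currencyCode, BigDecimal amount) {
            this.currencyCode = currencyCode;
            this.amount = amount;
        }

        String currencyCode() { return currencyCode; }
        BigDecimal amount() { return amount; }

        // The delegate is written to the stream in place of this object.
        private Object writeReplace() {
            return new Ser(this);
        }

        private static final class Ser implements Serializable {
            private static final long serialVersionUID = 1L;
            private transient Cash obj; // transient: keeps the field out of the header

            private Ser(Cash obj) { this.obj = obj; }

            private void writeObject(ObjectOutputStream out) throws IOException {
                // Assumed format: code, unscaled value bytes, scale.
                out.writeUTF(obj.currencyCode);
                byte[] unscaled = obj.amount.unscaledValue().toByteArray();
                out.writeInt(unscaled.length);
                out.write(unscaled);
                out.writeInt(obj.amount.scale());
            }

            private void readObject(ObjectInputStream in) throws IOException {
                String code = in.readUTF();
                byte[] unscaled = new byte[in.readInt()];
                in.readFully(unscaled);
                int scale = in.readInt();
                obj = new Cash(code, new BigDecimal(new BigInteger(unscaled), scale));
            }

            // Hands the rebuilt Cash back to the serialization mechanism.
            private Object readResolve() { return obj; }
        }
    }

    // Serializes and deserializes an object through an in-memory stream.
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Calling `DelegateDemo.roundTrip(new DelegateDemo.Cash("GBP", new BigDecimal("12.34")))` should return a new `Cash` with the same code and amount, without the BigDecimal and BigInteger headers ever entering the stream.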
Even better?
My investigation has shown a technique to make the stream even smaller.
Firstly, rather than using a static inner class, use a top-level package scoped class. This will have a shorter fully qualified class name, thus a shorter header.
Secondly, look at the other classes in the package. If there are more classes that need the same treatment, why not use a single delegate class for all of them?
    public class BigMoney implements Serializable {
        private final CurrencyUnit currency;
        private final BigDecimal amount;

        private Object writeReplace() {
            return new Ser(Ser.BIG_MONEY, this);
        }
    }

    public class Money implements Serializable {
        private final BigMoney money;

        private Object writeReplace() {
            return new Ser(Ser.MONEY, this);
        }
    }

    class Ser implements Externalizable {
        static final byte BIG_MONEY = 0;
        static final byte MONEY = 1;

        private byte type;
        private Object obj;

        // Externalizable requires a public no-arg constructor for deserialization
        public Ser() {
        }

        // package scoped so that BigMoney and Money can create the delegate
        Ser(byte type, Object obj) {
            this.type = type;
            this.obj = obj;
        }

        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeByte(type);
            switch (type) {
                // write data to stream based on the type
                // this writes the currency code and amount directly
            }
        }

        public void readExternal(ObjectInput in) throws IOException {
            type = in.readByte();
            switch (type) {
                // read data from stream to obj variable based on the type
            }
        }

        private Object readResolve() {
            return obj;
        }
    }
So, both classes share the same serialization delegate, using a single byte type to distinguish them.
Since the header is written once per class per stream, there is now only one header written whether your stream contains BigMoney, Money or both.
I've also switched to using Externalizable rather than Serializable.
Despite the public methods, these cannot be called via the general API because Ser is a package scoped class.
This change doesn't affect the stream size, but should perform faster (untested!) as there is less reflection involved.
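A filled-in version of the shared delegate might look like the following sketch. `BigAmount` and `Amount` are hypothetical stand-ins for BigMoney and Money, and the byte layout is again an assumption for illustration only. Note the public no-arg constructor on `Ser`, which Externalizable needs in order to create the delegate when reading.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.StreamCorruptedException;
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical stand-in for BigMoney.
final class BigAmount implements Serializable {
    final String currencyCode;
    final BigDecimal amount;

    BigAmount(String currencyCode, BigDecimal amount) {
        this.currencyCode = currencyCode;
        this.amount = amount;
    }

    private Object writeReplace() { return new Ser(Ser.BIG_AMOUNT, this); }
}

// Hypothetical stand-in for Money.
final class Amount implements Serializable {
    final BigAmount money;

    Amount(BigAmount money) { this.money = money; }

    private Object writeReplace() { return new Ser(Ser.AMOUNT, this); }
}

// Single package-scoped delegate shared by both classes.
final class Ser implements Externalizable {
    static final byte BIG_AMOUNT = 0;
    static final byte AMOUNT = 1;

    private byte type;
    private Object obj;

    public Ser() { } // required by Externalizable

    Ser(byte type, Object obj) {
        this.type = type;
        this.obj = obj;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeByte(type);
        // In this basic version both types write the same contents.
        BigAmount big = (type == BIG_AMOUNT) ? (BigAmount) obj : ((Amount) obj).money;
        out.writeUTF(big.currencyCode);
        byte[] unscaled = big.amount.unscaledValue().toByteArray();
        out.writeInt(unscaled.length);
        out.write(unscaled);
        out.writeInt(big.amount.scale());
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        type = in.readByte();
        String code = in.readUTF();
        byte[] unscaled = new byte[in.readInt()];
        in.readFully(unscaled);
        int scale = in.readInt();
        BigAmount big = new BigAmount(code, new BigDecimal(new BigInteger(unscaled), scale));
        switch (type) {
            case BIG_AMOUNT: obj = big; break;
            case AMOUNT: obj = new Amount(big); break;
            default: throw new StreamCorruptedException("Unknown type: " + type);
        }
    }

    private Object readResolve() { return obj; }
}

class SharedDelegateDemo {
    // Serializes and deserializes an object through an in-memory stream.
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

The type byte is the only thing that tells the reader which class to rebuild, so an unknown value is treated as a corrupt stream rather than silently ignored.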
With these changes, the stream size for sending one BigMoney or Money drops to 58 bytes from 525/599 bytes.
Sending a subsequent object of the same type drops to 24 bytes, whereas the default would be 59/65 bytes.
The single shared delegate approach also results in a smaller jar file, as there is a large jar file size overhead for each separate class. (We've replaced two delegates by one, so the jar is smaller).
One downside with this approach is that serialization is no longer encapsulated within the class being serialized. This may result in a constructor becoming package scoped rather than private.
The approach is also only recommended where the class and serialized format is stable, as you are fully responsible for evolution over time of the data format.
A final downside is that the object identity of objects might not be preserved.
For example, if the data of the BigDecimal is written out rather than a reference to the object, then a new BigDecimal object will be created for each BigMoney deserialized.
The extent to which this is a problem depends on the memory structure being serialized.
The same problem applies to multiple Money objects backed by the same BigMoney.
The default serialized size for the second such Money would be just 10 bytes, whereas the basic shared delegate approach would be 24 bytes.
As a result, I recommend only writing the base class, BigMoney in this case, directly using its contents. Other classes that contain the base class, Money in this case, should write out a reference to the BigMoney from the shared delegate.
This approach means that the second Money takes 14 bytes when the BigMoney is shared and 34 bytes when it isn't.
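The recommendation above can be sketched as follows. `Big` and `Wrapped` are hypothetical stand-ins for BigMoney and Money, nested inside a demo class purely to keep the example in one file (a real delegate would be top-level to keep its name short). The base type writes its contents directly, while the wrapper calls `writeObject` on its base object, so the stream can emit a back-reference when that object has already been written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.StreamCorruptedException;
import java.math.BigDecimal;
import java.math.BigInteger;

class ReferenceDemo {
    // Hypothetical stand-in for BigMoney: written directly from its contents.
    static final class Big implements Serializable {
        final String currencyCode;
        final BigDecimal amount;
        Big(String currencyCode, BigDecimal amount) {
            this.currencyCode = currencyCode;
            this.amount = amount;
        }
        private Object writeReplace() { return new Ser(Ser.BIG, this); }
    }

    // Hypothetical stand-in for Money: written as a reference to its Big.
    static final class Wrapped implements Serializable {
        final Big money;
        Wrapped(Big money) { this.money = money; }
        private Object writeReplace() { return new Ser(Ser.WRAPPED, this); }
    }

    static final class Ser implements Externalizable {
        static final byte BIG = 0;
        static final byte WRAPPED = 1;
        private byte type;
        private Object obj;

        public Ser() { } // required by Externalizable
        Ser(byte type, Object obj) { this.type = type; this.obj = obj; }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeByte(type);
            if (type == BIG) {
                Big big = (Big) obj;
                out.writeUTF(big.currencyCode);
                byte[] unscaled = big.amount.unscaledValue().toByteArray();
                out.writeInt(unscaled.length);
                out.write(unscaled);
                out.writeInt(big.amount.scale());
            } else {
                // A Big already in the stream is written as a short back-reference.
                out.writeObject(((Wrapped) obj).money);
            }
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
            type = in.readByte();
            if (type == BIG) {
                String code = in.readUTF();
                byte[] unscaled = new byte[in.readInt()];
                in.readFully(unscaled);
                int scale = in.readInt();
                obj = new Big(code, new BigDecimal(new BigInteger(unscaled), scale));
            } else if (type == WRAPPED) {
                obj = new Wrapped((Big) in.readObject());
            } else {
                throw new StreamCorruptedException("Unknown type: " + type);
            }
        }

        private Object readResolve() { return obj; }
    }

    // Writes all objects to one stream, then reads them back in order.
    static Object[] roundTrip(Object... objects) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                for (Object obj : objects) {
                    out.writeObject(obj);
                }
            }
            Object[] result = new Object[objects.length];
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                for (int i = 0; i < result.length; i++) {
                    result[i] = in.readObject();
                }
            }
            return result;
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

With this layout, two `Wrapped` objects backed by the same `Big` should deserialize to two wrappers that still share a single `Big` instance, which is the identity-preserving behaviour that writing raw contents loses.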
Using this final approach, the figures (in bytes) are as follows:
Object | Default serialization, first sent | Default serialization, subsequent | Shared delegate, first sent | Shared delegate, subsequent
---|---|---|---|---
BigMoney | 525 | 59 | 58 | 24
Money | 599 | 65 | 68 | 34
Money with shared BigMoney | 599 | 10 | 68 | 14
Summary
The shared delegate technique offers one route to the smallest stream size for serialization. The data size for the first object was roughly a tenth of the original, and roughly halved for subsequent objects that do not share a BigMoney. However, I would recommend this as a specialist technique for low level value objects rather than general beans.
So is this worth applying to JSR-310? Feedback welcome!