Tinkering with CompressionLib (Part 1)

13 Jul

There is a new system library in OS X 10.11 – CompressionLib.

For the compression algorithm itself, CompressionLib exposes 5 choices:

COMPRESSION_LZ4
COMPRESSION_ZLIB
COMPRESSION_LZMA
COMPRESSION_LZ4_RAW
COMPRESSION_LZFSE // Apple specific

lzfse is the interesting one here. It’s an Apple-developed algorithm that is faster than zlib (the previous de facto standard) and generally compresses better. Because it’s Apple specific, it isn’t for you if you need cross-platform support. For now, at least. Nothing has been announced, but I’m very hopeful Apple will choose to open source it. I think it’s in Apple’s best interest to get lzfse used as widely as possible.

CompressionLib’s public interface is small, but nicely thought out. It basically breaks down into two ways of dealing with compression, buffer based and stream based. There are, quite literally, only 7 functions:

compression_encode_buffer
compression_decode_buffer
compression_encode_scratch_buffer_size
compression_decode_scratch_buffer_size

compression_stream_init
compression_stream_process
compression_stream_destroy

You can safely ignore the compression_encode_scratch_buffer_size and compression_decode_scratch_buffer_size functions. CompressionLib will automatically create the scratch buffer on your behalf if you pass a NULL scratch buffer to the encode / decode buffer functions. That knocks it down to just 2 functions for buffer based or 3 for stream based.

While watching WWDC ’15 – Session 712 “Low Energy High Performance: Compression and Accelerate” I decided to play around with the buffer functions.

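size_t compression_encode_buffer(uint8_t * __restrict dst_buffer, size_t dst_size,
                                 const uint8_t * __restrict src_buffer, size_t src_size,
                                 void * __restrict scratch_buffer,
                                 compression_algorithm algorithm);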

The encode function, compression_encode_buffer, and its decode counterpart, compression_decode_buffer, take the same parameters and do exactly what you’d expect. You have to specify the dst_buffer’s size (dst_size). During compression this isn’t a real limitation, as you can reasonably expect the worst-case scenario to be that the dst_buffer is the same size as the src_buffer (plus a few bytes of overhead when dealing with a very small piece of data). I really like this interface. It helps to make the memory management very clear. You create the buffers, you own them, and it’s your responsibility to free them. You can be reasonably sure of that just by looking at the function prototypes. The argument names also make it very clear exactly what each function needs. No need for a lot of documentation and reading here.

What about decoding?

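size_t compression_decode_buffer(uint8_t * __restrict dst_buffer, size_t dst_size,
                                 const uint8_t * __restrict src_buffer, size_t src_size,
                                 void * __restrict scratch_buffer,
                                 compression_algorithm algorithm);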

This brings me to what I believe is a pretty big limitation. When using compression_decode_buffer you don’t have any reasonable expectation of what the uncompressed (dst_buffer) size should be. And if dst_buffer is too small, compression_decode_buffer will simply truncate the result to dst_size. This is friendly; at least you don’t have to worry about a buffer overflow. But you must know, beyond a reasonable doubt, exactly what size your uncompressed data will be. I halfway expected compression_decode_buffer to return the full size of the uncompressed data so you could increase the dst_buffer size and retry if needed. That would be wasteful, and it doesn’t. compression_decode_buffer returns the size of the data written to the buffer; if the output is truncated it simply returns dst_size. This is pretty clear in the header:

@return
The number of bytes written to the destination buffer if the 
input is successfully decompressed. If there is not enough 
space in the destination buffer to hold the entire expanded 
output, only the first dst_size bytes will be written to the 
buffer and dst_size is returned.
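
Given that return value, the only size-agnostic option with the _buffer interface is to guess, decode, and grow the buffer whenever the result comes back exactly full. Here is a minimal sketch of that idea, not anything CompressionLib provides. To keep it independent of the library, the decoder is passed in as a function pointer; decode_fn, decode_with_retry, copy_decode and the capacity parameters are all hypothetical names, and copy_decode is only a stand-in that mimics the truncation behaviour. A thin wrapper that calls compression_decode_buffer with a NULL scratch buffer and a fixed algorithm should fit the same callback shape. Treating a return of 0 as failure is also an assumption of this sketch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type: any decoder that writes at most dst_size
 * bytes and returns the number of bytes written (dst_size if truncated). */
typedef size_t (*decode_fn)(uint8_t *dst, size_t dst_size,
                            const uint8_t *src, size_t src_size);

/* Stand-in decoder for demonstration only: "decompresses" by copying,
 * truncating to dst_size the way the header describes. */
static size_t copy_decode(uint8_t *dst, size_t dst_size,
                          const uint8_t *src, size_t src_size)
{
    size_t n = src_size < dst_size ? src_size : dst_size;
    memcpy(dst, src, n);
    return n;
}

/* Decode src into a freshly malloc'd buffer, growing the capacity while the
 * output may have been truncated. Returns NULL on failure. On success the
 * caller owns the returned buffer and *out_size holds the decoded length. */
static uint8_t *decode_with_retry(decode_fn decode,
                                  const uint8_t *src, size_t src_size,
                                  size_t initial_capacity, size_t max_capacity,
                                  size_t *out_size)
{
    size_t capacity = initial_capacity ? initial_capacity : 1;
    if (capacity > max_capacity)
        return NULL;

    for (;;) {
        uint8_t *dst = malloc(capacity);
        if (dst == NULL)
            return NULL;

        size_t written = decode(dst, capacity, src, src_size);
        if (written == 0) {          /* assumption: 0 means the decode failed */
            free(dst);
            return NULL;
        }
        if (written < capacity) {    /* buffer not full, so nothing was cut off */
            *out_size = written;
            return dst;
        }

        /* written == capacity: an exact fit and a truncation look identical,
         * so throw the result away and try again with a bigger buffer. */
        free(dst);
        if (capacity >= max_capacity)
            return NULL;             /* give up at the caller's ceiling */
        capacity = (capacity > max_capacity / 2) ? max_capacity : capacity * 2;
    }
}
```

This is wasteful in exactly the way mentioned above: every retry decodes from scratch, and an output that happens to fill the buffer exactly costs one extra round. The max_capacity ceiling is there so a bad guess or hostile input can't grow the buffer without bound.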

I did some digging and couldn’t find any way of getting the expected uncompressed size. I then thought about inspecting the lzfse archive header, but there isn’t any published header spec (at least that I could find – if I’m wrong please let me know @leemorgan).

I think it would be highly beneficial if an API was provided to determine the expected uncompressed buffer size. Perhaps a function like the following:

extern size_t compression_decode_uncompressed_size(const uint8_t * __restrict src_buffer, size_t src_size, compression_algorithm algorithm);

I’ve filed an enhancement request asking for the ability to get the expected uncompressed data’s size. rdar://21787153

With that said, I need to thank Stephen Canon. We had a short conversation on Twitter earlier today about CompressionLib and this “limitation”. While playing with CompressionLib I had been focused on lzfse and had neglected to think about how the library needs to provide support for other compression algorithms as well. I had (perhaps incorrectly) assumed that lzfse had this kind of metadata readily available in the archive’s header, but I failed to consider how it would work for archives that don’t have the expected uncompressed size stored internally. Stephen pointed out that the _buffer interfaces are very low-level building blocks, and that it’s assumed that the higher-level callers will keep this metadata around. The problem with this, though, is that the higher-level callers will implement this differently (and likely incompatibly). I might write one archiver that stores this data as extended attributes, while someone else might choose to wrap the archive itself with their own header.
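
To make the “wrap the archive with your own header” option concrete, here is a minimal sketch of one such wrapper. The layout (an 8-byte little-endian uncompressed size followed by the compressed payload) and the names frame_write, frame_read and FRAME_HEADER_SIZE are made up for illustration. They are not part of CompressionLib or of any lzfse format, which is exactly the incompatibility problem: another caller would likely pick a different layout.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: 8-byte little-endian uncompressed size, then payload. */
enum { FRAME_HEADER_SIZE = 8 };

/* Write the size header followed by the compressed payload into dst.
 * Returns the total number of bytes written, or 0 if dst is too small. */
static size_t frame_write(uint8_t *dst, size_t dst_size,
                          uint64_t uncompressed_size,
                          const uint8_t *payload, size_t payload_size)
{
    if (dst_size < FRAME_HEADER_SIZE ||
        payload_size > dst_size - FRAME_HEADER_SIZE)
        return 0;

    /* Store the size byte by byte so the result does not depend on the
     * host's endianness. */
    for (int i = 0; i < FRAME_HEADER_SIZE; i++)
        dst[i] = (uint8_t)(uncompressed_size >> (8 * i));

    memcpy(dst + FRAME_HEADER_SIZE, payload, payload_size);
    return FRAME_HEADER_SIZE + payload_size;
}

/* Read the header back. On success returns 1 and sets *uncompressed_size,
 * plus *payload / *payload_size to the compressed bytes inside src.
 * Returns 0 if src is too short to contain a header. */
static int frame_read(const uint8_t *src, size_t src_size,
                      uint64_t *uncompressed_size,
                      const uint8_t **payload, size_t *payload_size)
{
    if (src_size < FRAME_HEADER_SIZE)
        return 0;

    uint64_t size = 0;
    for (int i = 0; i < FRAME_HEADER_SIZE; i++)
        size |= (uint64_t)src[i] << (8 * i);

    *uncompressed_size = size;
    *payload = src + FRAME_HEADER_SIZE;
    *payload_size = src_size - FRAME_HEADER_SIZE;
    return 1;
}
```

With something like this, the caller reads the header first, allocates a dst_buffer of exactly uncompressed_size bytes, and then hands the payload to compression_decode_buffer. A real format would probably also want a magic number and a sanity limit on the stored size before trusting it for an allocation.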

But Stephen raises many good points. In lieu of the ability to get at the uncompressed data size directly, I think a standard way of accessing any available headers would be beneficial. This would keep the burden of dealing with the sizes at the higher level caller, while at the same time providing the higher level caller a standard way to get at any headers they might know about and be able to use (such as the uncompressed size).

I readily admit I might be making a bigger deal of this than it needs to be. One could always fall back to the stream functions. But the _buffer functions just look so damn beautiful, I want to use them.