The problem:
When compressing data, what you really want is the smallest possible output that takes the least possible time to produce. You either want to compress it to store it in as small a place as possible or compress it to transfer it to somewhere else quickly. This is a big problem in computer science (and probably in mathematics) and generally, you have to sacrifice one for the other. Usually, a smaller output takes a cleverer algorithm to compute and so takes longer to produce. Sometimes this is desirable as it can make the output significantly smaller without significantly increasing decompression time, meaning that most of the overhead is only done once and people using the compressed data don’t have to worry about it. For example, BZip2 is great for compression but takes significantly longer than gzip to compress and quite a while longer to decompress.
There are several ways of organising compression formats. A few months ago I took a long hard look at the existing popular compression formats and decided that none of them did exactly what I wanted for both storage and data transfer at the same time. In general, they were too complicated.
The common “zip” file format stores its metadata dictionary at the end of the file, which means that the data cannot start to be processed until the all of the data is available.
Both gzip and BZip2 compression are only suitable for single files, which means that they are usually used in union with tar, an archive format that does not make use of any compression itself. The general use of these is to “tar” many files into one archive and then to compress the archive. This leads to an interesting question: should the files be individually compressed and then archived together or should they be archived together and then compressed as a whole?
The “zip” format compresses each file individually while the “tar.gz” and “tar.bz2″ formats archive and then compress the resulting archive. Both have their obvious advantages and disadvantages and these can all be applied to other areas such as whether individual files should be encrypted and then used in archives or archived and then encrypted together. When individually compressing files, a file may be extracted by only decompressing that entry in the archive and ignoring the rest. This means that extracting a small file from a very large archive is fast. However, when archiving many similar files and then compressing them, data that is already known from one file may assist in compressing a similar file further, resulting in a smaller output size as less effort needs to be duplicated. When extracting one of these files, the entire archive (at least up to the end of that file) needs to be decompressed.
These formats also tend to add far too much overhead in the headers and footers of the total archive or the individual entries. In fact, gzip and zip both use the DEFLATE algorithm to actually compress the data. The differences between the two are only in the structure of the format and the different headers and footers used. gzip contains a few headers such as file name, which compression algorithm is used (always DEFLATE), file permissions, etc while also using a file size and an extra CRC32 checksum in the footer. The computation of this checksum is a major overhead over the standard DEFLATE format, which already contains a faster (but less accurate) Adler32 checksum.
The solution:
As DEFLATE is a well known, well documented compression algorithm with many existing implementations such as zlib, that is what we shall use for compression. DEFLATE has a few small headers in the order of a few bytes total just to set up the compression and an Adler32 checksum in the footer to check for corruption very quickly (and quite roughly). DEFLATE comes with 3 basic compression rates: none, normal and best. The common value used for this is “normal” but as modern machines are more than capable of performing “best” compression very quickly, we will use “best”.
In my tests I have found that the ability to extract individual files almost always outweighs the extra compression achieved by archiving several files and then compressing them together. Usually, the latter creates files less than 1% smaller than the former but in the archives that make the most difference to size, the average file size is so large that having to decompress other files in the same archive takes significantly longer. Of course, this is not a problem if every single file in the archive is required anyway. DEFLATE contains a rarely used starting dictionary in one of its headers that may allow nearly equal compression in both cases with the former retaining its ability to extract individual files quickly. For these reasons, the files will be compressed individually.
Some DEFLATE headers that we do not intend on using or that will always remain the same could be dropped to shave a few bytes but to make implementation easier, we will leave them in for now. This allows existing libraries such as zlib to be used without modification for the decompression stage.
As mentioned above, any additional steps such as encryption could either be done on the individual files, allowing them to be extracted and decrypted individually, or on the final archive. Either way, they are out of scope here so I will not say anything else about them.
The archive will use a structure similar to a linked list. There will be no header for the archive as a whole and each individual entry will contain a header for the size of that entry rather than a pointer to the next. This means that entries can be skipped quickly if the underlying stream allows seeking or random access. Of course, if there is no way to request an individual entry over the stream, it cannot be skipped to until it is already present. Removing a header for the archive as a whole means that two archives may be merged simply by concatenating them. This can be achieved on Unix/Linux with the “cat” command and on Windows/DOS with the “copy” command. To allow for streaming, there will be no footers at all other than the existing DEFLATE Adler32 footer, which cannot be used until all of that entry is already available anyway.
Many headers and footers such as file permissions are not uniform across platforms (compare Windows ACLs with Unix “chmod” style permissions, for example), are not available for streaming (e.g. file permissions are not needed when sending some compressed data across a network where it will never be saved and where the file’s owner may not even exist), etc and so have been dropped. This format is meant to be as simple as possible, remember.
The result:
The resulting file format is described as follows:
- 8 bits (1 byte) for a version number to allow for later extensions and backwards compatibility. Version numbers range from 0 to 255 and are only major version changes to the format which would require a new way of parsing the rest of the entry. Entries in the same archive may use different version numbers e.g. if an old archive is merged with a new archive or if new features are only required in some entries.
- 32 bits (4 bytes) original (inflated) data size to enhance decompression and let the user know how big the entry will be. If this does not benefit decompression at all, there is no problem caused by putting it in an entry’s footer as gzip already does, which would allow it to be set based on how much data was compressed even if the amount of data to be compressed was not known before compression started.
- 32 bits (4 bytes) compressed (deflated) data size to enhance decompression – this being in a header rather than a footer may cause problems for streaming, as an entry needs to be compressed entirely before it can start to be sent. However, a possible extension such as setting it to “0″ to represent “unknown” or changing a version number or flag (if flags are already added in a later version) could avoid this if compressing the entire entry before streaming is not possible. If this is set to “unknown” then skipping an entry is not possible without decompressing it first to know where it ends.
- n*16 bits for an optional 0-terminated text descriptor. In the case of files, this may be the Unicode file name. In the case of other data, it may be an identifier, a description or a string of metadata such as CSV values or an XML structure. If it is to be omitted, a 0 length string should be used. This should consist solely of a Unicode null terminator (\u0000)
- The DEFLATE compressed data (starting with the first DEFLATE header and ending with the last DEFLATE footer)
Example:
01 0D 00 00 00 15 00 00 00 73 00 61 00 6D 00 70 00 6C 00 65 00 69 00 2E 00 74 00 78 00 74 00 00 00 78 DA F3 48 CD C9 C9 D7 51 08 CF 2F CA 49 51 04 00 1F 9E 04 6A
This version 1 entry (13 bytes originally, 21 bytes when compressed) describes a file called “samplei.txt”. If you decompress the last line (21 bytes) with DEFLATE you get the ASCII text “Hello, World!”, which is 13 bytes long. In this example, the compressed data is actually larger than the uncompressed data due to overheads (headers and footers) and made even larger if you include all of the headers added for this entry in the archive (to 54 bytes) but when the original file is a few dozen bytes larger, the compressed file is smaller.
If we remove the optional descriptor (file name) the resulting data is just 32 bytes long.
To compare overheads of the same file in different algorithms with and without archiving:
- samplei.txt.gz = 46 bytes (normal and best)
- samplei.txt.bz2 = 58 bytes (normal and -9)
- samplei.txt.tar.gz = 159 bytes (normal) or 153 bytes (best)
- samplei.txt.tar.bz2 = 149 bytes (normal and -9)
- samplei.zip = 133 bytes (normal)
If we were just transferring the data “Hello, World!” across a network, we can transmit it twice simply by joining them together into a single archive, as you would expect from transmitting the same data twice (optional text descriptor removed for brevity):
01 0D 00 00 00 15 00 00 00 00 00 78 DA F3 48 CD C9 C9 D7 51 08 CF 2F CA 49 51 04 00 1F 9E 04 6A 01 0D 00 00 00 15 00 00 00 00 00 78 DA F3 48 CD C9 C9 D7 51 08 CF 2F CA 49 51 04 00 1F 9E 04 6A
This means that data can be transmitted using this compression with no alteration and without needing to know anything about the other packets. If, for example, each packet were limited to a maximum of 23 bytes, we could transmit the above data in the following packets, just as if it were one archive of unknown length that we were sending, rather than 2 separate entries of data:
01 0D 00 00 00 15 00 00 00 00 00 78 DA F3 48 CD C9 C9 D7 51 08 CF 2F CA 49 51 04 00 1F 9E 04 6A 01 0D 00 00 00 15 00 00 00 00 00 78 DA F3 48 CD C9 C9 D7 51 08 CF 2F CA 49 51 04 00 1F 9E 04 6A .. ..
As soon as enough data is received for the first packet, that packet can be extracted and used. The same can be said for every following packet. We can keep adding to these packets on the sending end indefinitely. This is very useful for both storing data and transmitting it, our initial goals.
This means that resources needed by an application can be stored in this format and retrieved quickly and easily while at the same time providing a way of compressing sequential packets that may also be used in the application. These are both features commonly needed when developing games, for example.
We can also start to extract the initial entries in an archive before the archive has even completed downloading, allowing for multiple files to be downloaded at once in a single response, which could have useful applications on the web rather than having to send many requests for individual files and provide many individual responses.
Well, that’s my generic compression format useful for both streaming and storage. Any questions, comments or criticisms are welcome.