Rsync Style Differential Downloads Using a Static Web Server Capable of Serving Up Byte-Ranges

Preamble

RZ is [was] my attempt to improve download speeds of Debian Packages files over a miserable 56K modem line. Such files change little from week to week, yet in order to track testing or unstable you must frequently re-download them in their entirety. Distributing patch files is the usual way of reducing this bandwidth load, but it takes a burdensome amount of bookkeeping to work properly. Rsync is the better approach: it compares the files on either end of the connection, eliminating all bookkeeping.

Rsync's standard implementation, however, puts a computational burden on the central server. RZ alleviates that burden; it allows a client to use the rsync algorithm to improve download speeds from any web server that need only support standard HTTP byte-range downloads. To facilitate this, files are analyzed and compressed before being made available on the server --- the client can then use the precomputed checksum data to fetch only those blocks of the file it needs to reconstruct the original completely.
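The heart of that comparison is rsync's weak rolling checksum, which lets the client scan its local copy one byte at a time at low cost: sliding the window forward costs a handful of additions rather than a full recomputation. A minimal C++ sketch of such a checksum (the struct name and the 16-bit a/b split are illustrative, not taken from the actual rz code):

```cpp
#include <cstddef>
#include <cstdint>

// Weak rolling checksum in the style of rsync's adler-32 variant:
// a = sum of bytes, b = position-weighted sum, both mod 2^16.
struct Rolling {
    uint32_t a = 0, b = 0;
    size_t len = 0;

    // Compute the checksum of a fresh window of n bytes.
    void init(const unsigned char *p, size_t n) {
        a = b = 0;
        len = n;
        for (size_t i = 0; i < n; ++i) {
            a = (a + p[i]) & 0xffff;
            b = (b + (uint32_t)(n - i) * p[i]) & 0xffff;
        }
    }

    // Slide the window one byte: drop `out`, append `in`.
    // Unsigned wrap-around is harmless because we mask to 16 bits.
    void roll(unsigned char out, unsigned char in) {
        a = (a + in - out) & 0xffff;
        b = (b + a - (uint32_t)len * out) & 0xffff;
    }

    uint32_t digest() const { return (b << 16) | a; }
};
```

Rolling the window and recomputing from scratch give the same digest, which is what makes a byte-by-byte scan of the local file affordable.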

Because the analysis need only occur once, when the file is prepared, there is no computational burden on the web server, nor any need to modify the web server in any way to support differential downloads. Further, the rz file format contains, in addition to the checksum data, all of the original data in a compressed form that compares well with other file compression formats. (The zlib library is used for compression.)
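To give a feel for what that one-time analysis produces, here is a hypothetical C++ sketch: the file is cut into fixed-size blocks and a weak plus a strong checksum is recorded for each. The block size, struct names, and the use of std::hash as a stand-in for a real strong checksum are all assumptions for illustration (and the zlib compression of the block data is omitted); none of this is the actual rz format.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// One table entry per fixed-size block of the prepared file.
struct BlockSum {
    uint32_t weak;    // cheap rolling checksum, as in rsync
    uint64_t strong;  // collision-resistant check; std::hash as a stand-in
};

// Weak checksum of a whole block (non-rolling form).
static uint32_t weak_sum(const std::string &blk) {
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < blk.size(); ++i) {
        unsigned char c = (unsigned char)blk[i];
        a = (a + c) & 0xffff;
        b = (b + (uint32_t)(blk.size() - i) * c) & 0xffff;
    }
    return (b << 16) | a;
}

// The analysis pass: checksum every block so the client can later
// decide which blocks it already has and which it must fetch by range.
static std::vector<BlockSum> analyze(const std::string &data, size_t bs) {
    std::vector<BlockSum> table;
    for (size_t off = 0; off < data.size(); off += bs) {
        std::string blk = data.substr(off, bs);
        table.push_back({weak_sum(blk),
                         (uint64_t)std::hash<std::string>{}(blk)});
    }
    return table;
}
```

The client downloads this small table first, matches it against its local copy, and then issues byte-range requests only for the blocks whose checksums it could not match.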

The client code is interesting because the computational cost of the rsync algorithm is quite high, and it is not inconceivable that, on a fast connection, the computations might fail to keep up with the network. To deal with this the programme runs two simultaneous threads of execution: one to work out what is already available locally, and the other to retrieve data off the net. The code handles this in traditional Unix fashion, using forked processes, pipes and a select loop to keep things humming.
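A stripped-down sketch of that arrangement, with two forked workers feeding a select loop in the parent. The worker payloads and function names are invented for illustration; in the real client one worker would be scanning the local file while the other pulls byte ranges off the net.

```cpp
#include <cstring>
#include <initializer_list>
#include <string>
#include <sys/select.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork a worker that writes its result down a pipe; return the read end.
static int spawn_worker(const char *msg) {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    if (fork() == 0) {              // child: the "worker"
        close(fds[0]);
        ssize_t n = write(fds[1], msg, strlen(msg));
        (void)n;
        close(fds[1]);
        _exit(0);
    }
    close(fds[1]);                  // parent keeps only the read end
    return fds[0];
}

// Drain both workers concurrently with a select() loop, as the
// client does, so a slow computation never stalls the network side.
static std::string run_demo() {
    int local_fd = spawn_worker("local-blocks;");
    int net_fd   = spawn_worker("net-blocks;");
    std::string out;
    int open_fds = 2;
    while (open_fds > 0) {
        fd_set rd;
        FD_ZERO(&rd);
        int maxfd = -1;
        for (int fd : {local_fd, net_fd})
            if (fd >= 0) { FD_SET(fd, &rd); if (fd > maxfd) maxfd = fd; }
        if (select(maxfd + 1, &rd, nullptr, nullptr, nullptr) < 0) break;
        for (int *fdp : {&local_fd, &net_fd}) {
            if (*fdp >= 0 && FD_ISSET(*fdp, &rd)) {
                char buf[64];
                ssize_t n = read(*fdp, buf, sizeof buf);
                if (n <= 0) { close(*fdp); *fdp = -1; --open_fds; }
                else out.append(buf, (size_t)n);
            }
        }
    }
    while (wait(nullptr) > 0) {}    // reap both children
    return out;
}
```

The select loop is what lets the parent interleave whichever worker is ready, instead of blocking on one while the other sits idle.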

Seriously? Perl?

The first working code was written entirely in Perl, but was slower than molasses. The Inline C package dramatically improved the performance of the analysis code, but still left a great deal to be desired in the client's synthesizing code, which could not use inlining as effectively. As it now stands, the code is a mishmash of Perl and C++ roughly glued together. Performance is as good as it is going to get, but the code is unmaintainably ugly and virtually undistributable. My intent was to migrate the code wholly to C++ (without changing the design), but that never happened.