Compressing as Single Archive or Chunks – Use Cases

November 10, 2010

We get a ton of feature requests from our customers, and most, if not all, will make it into the FileCatalyst product line at some point.  Recently, though, we have been missing the opportunity to fully explain new features to the rest of our customers and help them understand how they may or may not benefit.  To remedy this, we are starting a series of blog posts about new features we release that we feel warrant more than a “bullet point” in our release notes.  We’ve had a lot of questions recently about our “Compress into Single Archive” feature, so let’s start there.

Many file transfer protocols—including FTP—suffer from very poor performance when transferring many small files.  The reason is that a couple of commands typically need to be sent before each file transfer to initialize it.  For example, in FTP you would need to send a PASV command, followed by a STOR command, to start each transfer.  Now imagine that the network RTT is 200 ms (typical for a transfer from North America to Asia): each command costs a round trip, so you spend 400 ms setting up each transfer.  Suppose your file set is one thousand files.  You are now wasting 400 seconds (close to 7 minutes) on nothing but initialization time.
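To make the overhead concrete, here is a minimal sketch of a naive per-file FTP upload loop using Python's standard ftplib module (an illustration of the protocol behaviour only, not how FileCatalyst moves data; the server address, credentials, and file names are placeholders).  Each storbinary() call negotiates a fresh data connection with PASV and then issues STOR, so every file pays those round trips before a single byte of payload moves.

```python
# Naive per-file upload over plain FTP.  Every storbinary() call sends
# PASV (to open a new data connection) and then STOR, so each tiny file
# costs at least two command round trips before its data starts flowing.
# Host, credentials, and file names below are hypothetical.
from ftplib import FTP

with FTP("ftp.example.com") as ftp:
    ftp.login("user", "password")
    for i in range(1000):
        name = f"file{i:04d}.dat"                # one of a thousand tiny files
        with open(name, "rb") as fh:
            ftp.storbinary(f"STOR {name}", fh)   # PASV + STOR for every file
```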

If your transfer was going to take several hours anyway, maybe a few minutes isn’t a big deal.  But what if your files are tiny, say a thousand 50-byte files?  You are only sending 50 KB of data, and yet it is going to take you 7 minutes.  On your 45 Mbps connection, you are now getting an effective throughput of 1 Kbps!  The issue is becoming apparent.  Scale the number of files to tens or hundreds of thousands and you have a real problem.
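For readers who want to check the math, the back-of-the-envelope numbers from the two paragraphs above work out like this (a quick calculation, not a benchmark):

```python
# Pure command overhead for 1,000 files at a 200 ms RTT, and the resulting
# effective throughput when each file is only 50 bytes.
rtt = 0.200                # seconds, typical North America to Asia
round_trips_per_file = 2   # e.g. PASV, then STOR
files = 1000
file_size = 50             # bytes

setup_time = files * round_trips_per_file * rtt   # 400 seconds (~6.7 minutes)
payload_bits = files * file_size * 8              # 400,000 bits (~50 KB of data)
effective_bps = payload_bits / setup_time         # ~1,000 bits/s, i.e. 1 Kbps

print(f"setup overhead: {setup_time:.0f} s, "
      f"effective throughput: {effective_bps / 1000:.1f} Kbps")
```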

The way FileCatalyst deals with this issue is with the Single Archive feature, combined with another feature we call “Progressive Transfers”.  With Single Archive enabled, FileCatalyst builds a single zip file that contains all of the small files.  With Progressive enabled, FileCatalyst doesn’t wait for the archive to be completely built before the transfer starts; instead, it starts transferring immediately.  On a high-bandwidth link, the transfer may complete shortly after the archive is finished being built.  The archive is then extracted on the destination side to reproduce the original file set, and only a couple of commands were needed to transfer the entire data set.
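As a rough illustration of the concept only (an assumed sketch in Python, not FileCatalyst's implementation, which uses its own accelerated transport), the snippet below builds a zip archive on one thread while another thread reads the bytes and "sends" them as soon as they are produced, so the transfer overlaps with the archiving.  The directory name and the send function are placeholders.

```python
# Sketch of "progressive" archiving: build the zip into a pipe while a
# second thread ships the bytes immediately, instead of waiting for the
# whole archive to exist on disk first.
import glob
import os
import threading
import zipfile

def build_archive(write_fd, paths):
    # zipfile can write to a non-seekable stream (it falls back to data
    # descriptors), so the archive can go straight into the pipe.
    with os.fdopen(write_fd, "wb") as pipe_out:
        with zipfile.ZipFile(pipe_out, "w", zipfile.ZIP_DEFLATED) as archive:
            for path in paths:
                archive.write(path)

def send_archive(read_fd, send_chunk):
    # Stream whatever the builder has produced so far; this loop ends at
    # EOF when the builder closes its end of the pipe.
    with os.fdopen(read_fd, "rb") as pipe_in:
        while chunk := pipe_in.read(64 * 1024):
            send_chunk(chunk)

paths = sorted(glob.glob("small_files/*"))   # hypothetical folder of tiny files
read_fd, write_fd = os.pipe()
builder = threading.Thread(target=build_archive, args=(write_fd, paths))
builder.start()
send_archive(read_fd, lambda chunk: None)    # replace with a real network send
builder.join()
```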

Using this feature in the above scenario would likely reduce transfer time to a couple of seconds instead of 7 minutes.  These are pretty amazing results in the right scenarios!

Zip Chunking

Recently a customer extended the case to include several TB of data spread across hundreds of thousands of files.  The feature, out of the box, did what it advertised; however, this use case exposed a few drawbacks.  The main drawback of a huge single archive is that the entire archive must arrive before it can be extracted; if the transfer takes 2 days, none of the small files are available until it ends.  Furthermore, if network problems interrupt the transfer partway through, a partial—and useless—zip file is left on the destination side.

As of FileCatalyst 2.8, there is a new feature to break the archive into fixed size chunks.  For example, if you specify a chunk size of 500 MB, FileCatalyst will create several zip files that are approximately the size you specify (sizes are approximate because each will contain only full files).  It will then transfer and extract each zip file before moving to the next one.  Now instead of waiting a couple of days for any files to become available as in the scenario above, files will become available in 500 MB increments.  This value can be tuned to suit the specific use case.
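For illustration, here is one way the chunking idea could be approximated in Python: accumulate whole files until a target size is reached, close that zip, and start the next one.  This is an assumed sketch of the general approach rather than FileCatalyst's actual logic, and the file names and "bundle" prefix are placeholders.

```python
# Split a file set into zip archives of roughly a target size, each
# containing only whole files (so actual sizes are approximate).
import os
import zipfile

def write_chunk(paths, archive_name):
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            archive.write(path)
    return archive_name

def chunked_archives(paths, target_bytes=500 * 1024 * 1024, prefix="bundle"):
    chunk, chunk_bytes, index = [], 0, 0
    for path in paths:
        size = os.path.getsize(path)
        if chunk and chunk_bytes + size > target_bytes:
            yield write_chunk(chunk, f"{prefix}-{index:04d}.zip")
            index += 1
            chunk, chunk_bytes = [], 0
        chunk.append(path)
        chunk_bytes += size
    if chunk:
        yield write_chunk(chunk, f"{prefix}-{index:04d}.zip")

# Each archive can be transferred and extracted as soon as it is finished,
# so the receiver sees files arrive in roughly 500 MB increments.
```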
