Re: Amanda, bsdtar, and libarchive

From: Don Lewis <truckman_at_FreeBSD.org> Date: Sat, 24 Apr 2004 14:52:08 -0700 (PDT)

On 24 Apr, Tim Kientzle wrote:
> A few people have commented about mixing Amanda and bsdtar.
> Here's my take on some of the issues:
> 
> Don Lewis wrote:
>> On 23 Apr, Tim Kientzle wrote:
>>
>>>Hmmm... How accurate does [--totals] need to be?
>> 
>> ... not terribly accurate.  ... not so much [uncertainty] as to cause Amanda to
>> underestimate the amount of tape needed, ...
> 
> In particular, it sounds like a simple sum of file
> sizes would actually be a useful proxy.  It's very, very easy to scan
> a directory hierarchy to collect such estimates.  I have some (very
> simple) code sitting around that does this if anyone wants to
> investigate incorporating it into Amanda.
> 
>> On the other hand, it does look bad if you archive to a file and the
>> output of --totals doesn't match the archive file size.
> 
> This is hard.  Remember that libarchive supports a lot
> more than just tar format, so any mechanism would have to
> work correctly for formats as different as tar, cpio, shar,
> and even zip or ar.  With shar format, for example, you cannot
> get an accurate archive size without reading the files,
> because the final size varies depending on the actual
> file content.
> 
> Putting all of this knowledge into bsdtar is simply out
> of the question.  The whole point of the bsdtar/libarchive
> distinction is that bsdtar doesn't know anything about
> archive formats.
> 
> If you want reasonably accurate estimates without reading the
> file data, you could either use proxy data (feed the right amount
> of random data to libarchive in place of the actual file
> data), or build some sort of "size estimate" capability into
> libarchive that would build and measure headers and allow
> the format-specific logic to estimate what happens to the
> file data.  (Which is, admittedly, pretty simple for tar and cpio.)

Tar and cpio are probably the two most important formats for getting
quick estimates.

There are three variations on the archive size estimate:
	Fast and exact
	Fast and approximate
	Slow and exact

Only some formats (uncompressed tar and cpio) would support fast and
exact estimates.  For those formats the exact and approximate cases give
the same result.

Allow the user to specify whether he wants a slow or fast estimate
rather than deciding based on whether or not the output is going to
/dev/null.

For the fast estimates I'd put a format-specific file size estimator
into libarchive.  For each file in the archive, call the estimator
function with the file name, file size, and (user-specified?) estimated
compression ratio.  Add the returned values to any format-specific
overall archive header and trailer sizes.
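
A rough sketch of that idea, in Python rather than inside libarchive.
All of the names here (ustar_entry_size, newc_entry_size, ESTIMATORS,
estimate_archive_size) are hypothetical and not part of any libarchive
API.  It assumes plain ustar headers (names short enough to need no
extension records), the "newc" variant of cpio, and it ignores any
padding of the finished archive out to a full output record.

```python
import math

BLOCK = 512  # tar block size in bytes


def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


def ustar_entry_size(name, size, ratio=1.0):
    # One 512-byte header plus the file data padded to a whole block.
    # Assumption: the name fits in a plain ustar header.
    return BLOCK + round_up(math.ceil(size * ratio), BLOCK)


def newc_entry_size(name, size, ratio=1.0):
    # 110-byte ASCII header followed by the NUL-terminated name,
    # padded to 4 bytes; the data is padded to 4 bytes as well.
    header = round_up(110 + len(name.encode()) + 1, 4)
    return header + round_up(math.ceil(size * ratio), 4)


# format name -> (per-entry estimator, fixed trailer size in bytes)
ESTIMATORS = {
    "ustar": (ustar_entry_size, 2 * BLOCK),  # two zero-filled blocks
    "newc": (newc_entry_size, newc_entry_size("TRAILER!!!", 0)),
}


def estimate_archive_size(fmt, files, ratio=1.0):
    """files is an iterable of (name, size) pairs; no file data is read."""
    entry_size, trailer = ESTIMATORS[fmt]
    return sum(entry_size(n, s, ratio) for n, s in files) + trailer
```

The ratio stays at 1.0 for the uncompressed formats, which is where the
fast estimate is also the exact one.  The file name is passed in because
some formats (cpio here) store it with variable length.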

For slow estimates, go through the motions of creating the archive, but
toss the output.
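
As an illustration of the slow case, here is a small sketch that uses
Python's standard tarfile module as a stand-in for libarchive.  The
CountingSink class and measure_tar_size function are made up for this
example.  The archive is really built, but every byte written is counted
and then discarded.

```python
import io
import tarfile


class CountingSink:
    """File-like object that throws writes away but counts the bytes."""

    def __init__(self):
        self.count = 0

    def write(self, data):
        self.count += len(data)
        return len(data)


def measure_tar_size(members):
    """members is an iterable of (name, bytes) pairs.

    Returns the size of the ustar archive that would have been written.
    """
    sink = CountingSink()
    # "w|" is tarfile's non-seekable streaming write mode, so the sink
    # only needs a write() method.
    with tarfile.open(fileobj=sink, mode="w|",
                      format=tarfile.USTAR_FORMAT) as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return sink.count
```

Because the real writer runs, the result includes whatever end-of-archive
padding that writer applies (Python's tarfile pads to a 10240-byte
record), and the same approach would work with a compressor in the
pipeline at the cost of reading all the file data.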