Re: Amanda, bsdtar, and libarchive

From: Tim Kientzle <tim_at_kientzle.com>
Date: Sat, 24 Apr 2004 13:04:13 -0700
A few people have commented about mixing Amanda and bsdtar.
Here's my take on some of the issues:

Don Lewis wrote:
> On 23 Apr, Tim Kientzle wrote:
>
>>Hmmm... How accurate does [--totals] need to be?
> 
> ... not terribly accurate.  ... not so much [uncertainty] as to cause Amanda to
> underestimate the amount of tape needed, ...

In particular, it sounds like a simple sum of file
sizes would actually be a useful proxy.  It's very, very easy to scan
a directory hierarchy to collect such estimates.  I have some (very
simple) code sitting around that does this if anyone wants to
investigate incorporating it into Amanda.
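
Just to give a sense of how little work that scan is, here's a
minimal sketch (using nftw(3); this isn't the code I mentioned,
just an illustration of the idea):

#include <sys/stat.h>
#include <ftw.h>
#include <stdio.h>
#include <stdint.h>

static uintmax_t total;

/* Add the size of every regular file seen during the walk. */
static int
add_size(const char *path, const struct stat *st, int type,
    struct FTW *ftwbuf)
{
    (void)path; (void)ftwbuf;
    if (type == FTW_F)
        total += (uintmax_t)st->st_size;
    return (0);
}

int
main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: estimate <dir>\n");
        return (1);
    }
    /* FTW_PHYS: don't follow symlinks, so a link loop can't
     * inflate the total. */
    if (nftw(argv[1], add_size, 32, FTW_PHYS) != 0) {
        perror(argv[1]);
        return (1);
    }
    printf("%ju bytes\n", total);
    return (0);
}

It double-counts hard links and ignores per-entry header overhead,
but as a "how much tape will I need" estimate it's in the right
ballpark.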

> On the other hand, it does look bad if you archive to a file and the
> output of --totals doesn't match the archive file size.

This is hard.  Remember that libarchive supports a lot
more than just tar format, so any mechanism would have to
work correctly for formats as different as tar, cpio, shar,
and even zip or ar.  With shar format, for example, you cannot
get an accurate archive size without reading the files,
because the final size varies depending on the actual
file content.

Putting all of this knowledge into bsdtar is simply out
of the question.  The whole point of the bsdtar/libarchive
distinction is that bsdtar doesn't know anything about
archive formats.

If you want reasonably accurate estimates without reading the
file data, you could either use proxy data (feed the right amount
of random data to libarchive in place of the actual file
data), or build some sort of "size estimate" capability into
libarchive that would build and measure headers and allow
the format-specific logic to estimate what happens to the
file data.  (Which is, admittedly, pretty simple for tar and cpio.)
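
For tar, for instance, the format-specific arithmetic is about this
much (a sketch for plain ustar framing only; long pathnames, pax
extensions, and the other formats each need their own estimate):

#include <stdint.h>

#define TAR_BLK 512

/* One 512-byte header per entry, plus the file data rounded up
 * to a whole 512-byte block. */
static uintmax_t
ustar_entry_size(uintmax_t filesize)
{
    return TAR_BLK + ((filesize + TAR_BLK - 1) / TAR_BLK) * TAR_BLK;
}

/* A finished archive also carries two zero end-of-archive blocks
 * and is normally padded out to a full record (10240 bytes by
 * default). */
static uintmax_t
ustar_archive_size(uintmax_t sum_of_entries)
{
    uintmax_t total = sum_of_entries + 2 * TAR_BLK;
    return ((total + 10240 - 1) / 10240) * 10240;
}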

Richard Coleman suggested:
> ... a version of Amanda that natively uses libarchive ...

Now *that's* a worthwhile idea.  (And that is, after all,
the whole point of libarchive: programs can use it to
build and extract archives directly, without going
through a separate program.)
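
To make that concrete, here's roughly what it looks like for a
program to write an entry through libarchive itself, with no
external tar process involved.  (A sketch only; the function names
follow the current libarchive API, not anything in Amanda.)

#include <archive.h>
#include <archive_entry.h>
#include <stddef.h>
#include <stdint.h>

/* Write one in-memory buffer into a ustar archive. */
static int
write_one_file(const char *arname, const char *name,
    const char *data, size_t len)
{
    struct archive *a = archive_write_new();
    struct archive_entry *e;

    archive_write_set_format_ustar(a);
    if (archive_write_open_filename(a, arname) != ARCHIVE_OK)
        return (-1);

    e = archive_entry_new();
    archive_entry_set_pathname(e, name);
    archive_entry_set_size(e, (int64_t)len);
    archive_entry_set_filetype(e, AE_IFREG);
    archive_entry_set_perm(e, 0644);

    archive_write_header(a, e);
    archive_write_data(a, data, len);

    archive_entry_free(e);
    archive_write_close(a);
    archive_write_free(a);
    return (0);
}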

Richard Coleman also observed:
> Until libarchive gets support for sparse files, it's probably better to stick with gtar or rdump with Amanda. 

A very good point.  I've been studying sparse file issues,
though, and even gtar doesn't entirely do the "right thing,"
partly because FreeBSD doesn't provide any way to query the
layout of a sparse file.  At best, gtar can guess, but that
requires scanning the entire file twice to identify large blocks
of zeros, which is a performance problem for large files.
Also, gtar's sparse file storage doesn't really scale well to
very large numbers of holes.  Joerg Schilling (author of "star") and
I have traded some ideas about approaches that might scale to
petabyte files with millions of holes, but nothing concrete
enough to actually implement yet.
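
For reference, the guessing gtar is reduced to is essentially a
block-by-block scan for runs of zeros, something like the sketch
below; reading every byte of a mostly-empty multi-gigabyte file
just to find the holes is exactly the performance problem.  (This
is only an illustration, not gtar's actual code.)

#include <stdio.h>
#include <string.h>

#define BLK 512

static void
scan_for_holes(FILE *f)
{
    static const char zero[BLK];    /* all-zero reference block */
    char buf[BLK];
    long blkno = 0, hole_start = -1;
    size_t n;

    while ((n = fread(buf, 1, BLK, f)) > 0) {
        int is_zero = (n == BLK && memcmp(buf, zero, BLK) == 0);
        if (is_zero && hole_start < 0)
            hole_start = blkno;              /* a run of zeros begins */
        else if (!is_zero && hole_start >= 0) {
            printf("hole: blocks %ld-%ld\n", hole_start, blkno - 1);
            hole_start = -1;                 /* the run ends */
        }
        blkno++;
    }
    if (hole_start >= 0)
        printf("hole: blocks %ld-%ld\n", hole_start, blkno - 1);
}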

If you actually have large sparse database files, I strongly
suggest that you:
   1) Flag the database files themselves as "nodump" and use a backup
      program that honors that flag, which bsdtar, star, and rdump
      all do.  (A small chflags(2) sketch follows below.)
   2) Use a database-specific tool to dump the database to one or
      more non-sparse files that will get picked up by the
      backup program.
This approach also allows you to run backups while the database
is running, as the database dumps themselves aren't changing during
the backup.  Backing up database storage while the database is
running is a very good way to create completely useless backups.
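
For step 1, the nodump flag can be set programmatically on FreeBSD
with chflags(2); something like this sketch (the only subtlety is
preserving whatever flags the file already has):

#include <sys/stat.h>
#include <stdio.h>
#include <unistd.h>

/* Mark a file with the user "nodump" flag so that flag-aware
 * backup tools will skip it. */
static int
mark_nodump(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror(path);
        return (-1);
    }
    /* Preserve any flags already set on the file. */
    if (chflags(path, st.st_flags | UF_NODUMP) != 0) {
        perror(path);
        return (-1);
    }
    return (0);
}

(From the command line, chflags(1) does the same thing.)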

Tim
Received on Sat Apr 24 2004 - 11:04:15 UTC
