On Wed, 21 Jul 2004, Daniel Lang wrote:

DL>Hi,
DL>
DL>Jan Grant wrote on Wed, Jul 21, 2004 at 02:44:42PM +0100:
DL>[..]
DL>> You're correct, in that filesystem semantics don't require an archiver
DL>> to recreate holes. There are storage efficiency gains to be made in
DL>> identifying holes, that's true - particularly in the case of absolutely
DL>> whopping but extremely sparse files. In those cases, a simple
DL>> userland-view-of-the-filesystem-semantics approach to identifying areas
DL>> that _might_ be holes (just for archive efficiency) can still be
DL>> expensive and might involve the scanning of multiple gigabytes of
DL>> "virtual" zeroes.
DL>>
DL>> Solaris offers an fcntl to identify holes (IIRC) for just this purpose.
DL>> If the underlying filesystem can't be made to support it, there's an
DL>> efficiency loss but otherwise it's no great shakes.
DL>
DL>I don't get it.
DL>
DL>I assume that for any consumer it is completely transparent whether
DL>possibly existing chunks of 0-bytes are actually blocks full of
DL>zeroes or just non-allocated blocks, correct?
DL>
DL>Second, it is true that there is a gain in terms of occupied disk
DL>space if chunks of zeroes are not allocated at all, correct?
DL>
DL>So, from my point of view, when a sparse file is archived and then
DL>extracted, it is completely irrelevant whether the areas that contain
DL>zeroes end up as unallocated blocks in exactly the same way or not.
DL>
DL>So, I guess all an archiver must do is:
DL>
DL> - read the file
DL> - scan the file for consecutive blocks of zeroes
DL> - archive these blocks in an efficient way
DL> - on extraction, create a sparse file with the previously
DL>   identified empty blocks, regardless of whether these blocks
DL>   were 'sparse' blocks in the original file or not.
DL>
DL>I do not see why it is important whether the original file was sparse
DL>at all, or maybe sparse in different places.

It may simply be a good deal faster to take existing hole information
(if it exists) than to scan the file.

Also, there is a difference between holes and actual zeroes: it's like
overcommitting memory. You may have a 1TB file consisting of a large
hole on a 10GB disk. As soon as you write something to it you will get
an error at some point, even when writing into the middle of the file,
just because the FS needs to allocate blocks. I could imagine an
application that knows its access pattern on a large sparse file
allocating zeroed blocks in advance, while skipping the blocks it knows
it will never write, just to make sure the blocks are there when it
writes to them later on. But that's a rather hypothetical application.

harti
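
As an illustration of the scan-for-zeroes scheme in the quoted list
above, here is a minimal stand-alone sketch in C. It assumes nothing
beyond plain POSIX read/write/lseek/ftruncate; the 64 kB block size,
the file-copy framing and the names are illustrative choices only, not
anything prescribed in this thread or by any real archiver.

/*
 * Hypothetical example, not from the thread: copy a file and turn runs
 * of zero bytes in the source into holes in the destination by seeking
 * instead of writing.  This is the userland-only approach: every byte
 * of the source is read, whether it is backed by a block or by a hole.
 */
#include <sys/types.h>

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

#define	BLKSZ	65536		/* arbitrary scan granularity */

static int
all_zero(const char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i] != 0)
			return (0);
	return (1);
}

int
main(int argc, char **argv)
{
	static char buf[BLKSZ];
	ssize_t n;
	off_t size = 0;
	int in, out;

	if (argc != 3)
		errx(1, "usage: %s infile outfile", argv[0]);
	if ((in = open(argv[1], O_RDONLY)) == -1)
		err(1, "%s", argv[1]);
	if ((out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1)
		err(1, "%s", argv[2]);

	while ((n = read(in, buf, sizeof(buf))) > 0) {
		if (all_zero(buf, (size_t)n)) {
			/* Seek past the zero run; no blocks are allocated. */
			if (lseek(out, (off_t)n, SEEK_CUR) == -1)
				err(1, "lseek");
		} else {
			if (write(out, buf, (size_t)n) != n)
				err(1, "write");
		}
		size += n;
	}
	if (n == -1)
		err(1, "read");

	/*
	 * A trailing zero run only moved the file pointer; ftruncate()
	 * extends the file to its full length without allocating blocks.
	 */
	if (ftruncate(out, size) == -1)
		err(1, "ftruncate");

	close(in);
	close(out);
	return (0);
}

Note that this still has to read every "virtual" zero byte, which is
exactly the cost Jan points out; an interface that reports holes
directly, such as the SEEK_HOLE/SEEK_DATA whence values for lseek()
found on later systems, lets an archiver skip that scan where the
filesystem supports it.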