UFS2 Snapshots in 6.1-Beta4 - Confirmed Problems

From: John Kozubik <john_at_kozubik.com>
Date: Tue, 21 Mar 2006 01:36:44 -0800 (PST)

Hello,

By request of the participants of freebsd-fs, I have been testing the
behavior and stability of UFS2 snapshots on FreeBSD 6.1-BETA4 for several
days now.

Unfortunately, I have confirmed that behavior from existing PRs (circa
6.0) still manifests itself, as well as some additional bad behavior that
has not yet been documented in PRs.

I hope that I am making this information available soon enough so that it
may be acted upon prior to the release of 6.1.  It would make me very
happy to confidently use snapshots on FreeBSD (something that has never
been possible in the past).

-----

Here is the behavior I have witnessed:


First, I have confirmed that multiple, rapid file deletions on a
filesystem that has multiple snapshots will cause the system to hang.  I
had witnessed this before, but had never confirmed it or documented it in
a PR.  Now that I have confirmed this behavior, I have documented it in:
kern/94769

This is a serious problem because, in addition to making it nearly
impossible to run a system with multiple snapshots, it is conceivable that
a system that otherwise uses no snapshots, but has a single snapshot on it
due to a background fsck, could see enough rapid file deletions to hang.
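
A rough sketch of the kind of deletion load described above, for anyone
who wants to attempt a reproduction.  The function name `churn`, the file
naming, and the file count are assumptions made for illustration; this is
not the exact test recorded in kern/94769.  It only generates the
create-then-delete load, so the snapshots must already exist on the target
filesystem, and it should only be run on a scratch machine.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): create COUNT small files under
# DIR, then delete them one after another as fast as the shell allows.
churn() {
    dir=$1
    count=$2

    # Phase 1: populate the directory with small files.
    i=0
    while [ "$i" -lt "$count" ]; do
        echo data > "$dir/churn.$i"
        i=$((i + 1))
    done

    # Phase 2: remove every file again, one unlink per file.
    i=0
    while [ "$i" -lt "$count" ]; do
        rm -f "$dir/churn.$i"
        i=$((i + 1))
    done
}
```

For example, `churn /mnt/data1/scratch 100000`, where the `scratch`
directory is a made-up name on the snapshotted filesystem.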



Second, kern/92292 is still a problem.  I have reproduced this error in
6.1-BETA4 (and have seen it happening since 5.1).  The (small) difference
is that the cp process seems to stick in the flswai state instead of
biowr.



This next one is complicated, and I haven't submitted a PR for it yet, but
I believe it is quite serious for reasons I will expand on below.

The problem is:  If you completely fill a filesystem (109% usage in `df`
on most systems) that has a snapshot on it, the system becomes very
unresponsive - all interactive and disk response lags terribly and,
although the system is not hung, it is in many cases unusable.

I believe this is serious because it is conceivable that the snapshots on
a filesystem contain critical data, while at the same time the system they
are on is a critical system.  If one allows a snapshotted filesystem to
fill, one is faced with the difficult choice of deleting a snapshot with
potentially critical data on it, or sacrificing the use of that _entire
computer system_.  Data cannot be deleted from that filesystem to free up
space because that data continues to reside on the snapshots.

This behavior makes it imperative that an administrator never allow a
snapshotted filesystem to become full or close to full, which is perhaps
unreasonable.
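
A minimal sketch of the sort of check an administrator would need to run
regularly to stay clear of this condition.  The helper names
`df_capacity` and `fs_near_full` are hypothetical, and the sketch assumes
a `df` that accepts the POSIX `-P` flag and prints Capacity as the fifth
column; filesystem names containing spaces would break the parsing.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` style output on stdin and print the
# Capacity value of the last data line, without the trailing % sign.
df_capacity() {
    awk 'NR > 1 { gsub(/%/, "", $5); pct = $5 } END { print pct }'
}

# Hypothetical helper: exit 0 if the filesystem holding path $1 is at or
# above $2 percent capacity according to df, non-zero otherwise.
fs_near_full() {
    pct=$(df -P "$1" | df_capacity)
    [ -n "$pct" ] && [ "$pct" -ge "$2" ]
}
```

For example, `fs_near_full /mnt/data1 90 && echo "free up space now"`
could be run from cron; the 90 percent threshold is an arbitrary choice.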




Related to the last problem: a filesystem without snapshots that is
completely full will still sync properly with no errors.  However, when
one fills up a filesystem that already has snapshots living on it, sync
fails with the message:

/mnt/data1: write failed, filesystem is full

I do not know whether the sync has actually failed when this appears.



In essence, the current behavior means that a filesystem that has
snapshots on it experiences a point of no return if that file system ever
fills up completely.  A snapshot _must be deleted_ to allow the system to
return to reasonable performance.



I have not been able to determine if kern/92272 still exists on FreeBSD
6.1-BETA4.  It looks like it does, but I haven't had time to test
conclusively.




Finally, and most trivially, an attempted snapshot that fails due to
insufficient space on the target filesystem fails with the error:

mksnap_ffs: Cannot create /mnt/data1/.snap/almost_full: No space left on device

However, it still creates a zero byte file.  I think it should create no
file at all.
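
Until that changes, a wrapper along these lines could clean up after a
failed attempt.  This is only a sketch: `snap_or_cleanup` is a
hypothetical name, and it assumes that an empty regular file left at the
snapshot path after a failed command is the leftover described above.  A
pre-existing empty file at that path would also be removed.

```shell
#!/bin/sh
# Hypothetical wrapper: $1 is the snapshot path, the remaining arguments
# are the command that creates it.  If the command fails and an empty
# regular file is left at the snapshot path, remove that file.  The
# command's exit status is passed through unchanged.
snap_or_cleanup() {
    snap=$1
    shift
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ] && [ -f "$snap" ] && [ ! -s "$snap" ]; then
        rm -f "$snap"
    fi
    return "$rc"
}
```

The wrapper is invoked as `snap_or_cleanup SNAPSHOT_PATH COMMAND [ARGS...]`,
with the snapshot command written exactly as it would be run by hand, for
example the mksnap_ffs invocation that produced the error above.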




Thank you for reviewing this - please contact me if there are further
tests that I can run, or additional details I can provide.


P.S. The items above related to bad behavior when disks fill are items I
am having a hard time recreating perfectly.  That is to say, every time I
combine snapshots with full disks on 6.1-BETA4, bad things happen
(terrible performance or hangs, or both), but I am having trouble
recreating exact scenarios.  YMMV.



-----
John Kozubik - john_at_kozubik.com - http://www.kozubik.com
Received on Tue Mar 21 2006 - 08:36:55 UTC
