Strange ZFS filesystem corruption

From: Paul Mather <paul_at_gromit.dlib.vt.edu>
Date: Mon, 3 Oct 2011 14:21:05 -0400
I wasn't sure whether to post this here or on stable_at_freebsd.org.  The system now runs RELENG_9, but the ZFS pool exhibiting problems was created, IIRC, under 9-CURRENT.  I believe RELENG_9 is sufficiently close to HEAD at this stage that this list is probably the correct place for this message.

I have a raidz2 ZFS pool on a system that I have recently been using as a mirror for about 6.5 TiB of data.  The data are mirrored nightly using rsync.  During these nightly rsync runs I noticed errors like this:

=====
file has vanished: "/backups/storage/san/DLA/DLA_Records/05DLAAdmin"
rsync: stat "/backups/storage/san/DLA/DLA_Records/05DLAAdmin" failed: No such file or directory (2)
rsync: recv_generator: mkdir "/backups/storage/san/DLA/DLA_Records/05DLAAdmin/05DI_business copy" failed: No such file or directory (2)
*** Skipping any contents from this failed directory ***
=====
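To collect the affected paths from a nightly log, something like the following Python sketch might help.  It is only an illustration: the function name failed_paths is made up, and the patterns are assumed from the three message shapes quoted above, so other rsync versions or other error types may need different expressions.

```python
import re

# Patterns assumed from the rsync messages quoted above; they are not
# an exhaustive description of rsync's error output.
_PATTERNS = [
    re.compile(r'^file has vanished: "(?P<path>.*)"$'),
    re.compile(r'^rsync: (?:stat|recv_generator: mkdir) "(?P<path>.*)" failed: .* \(\d+\)$'),
]


def failed_paths(lines):
    """Return the distinct paths named in rsync failure lines, in first-seen order."""
    seen = []
    for line in lines:
        line = line.strip()
        for pattern in _PATTERNS:
            match = pattern.match(line)
            if match:
                path = match.group("path")
                if path not in seen:
                    seen.append(path)
                break  # one line matches at most one pattern
    return seen
```

Feeding it the lines of a saved rsync log (for example, an open file object) yields each problem path once, which gives a short list of directories to inspect by hand.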

It appears that 05DLAAdmin is a corrupted directory.  It shows up in "ls", but any attempt to descend into it or read its attributes fails with a "No such file or directory" error.  Furthermore, I cannot delete this directory (even with "rm -rf").  E.g.:

=====
tape# pwd
/backups/storage/san/DLA
tape# whoami
root
tape# rm -rf DLA_Records
rm: DLA_Records/07DLAAdmin/07Digital_Imaging_Work: Directory not empty
rm: DLA_Records/07DLAAdmin/FY07IAWAprep: Directory not empty
rm: DLA_Records/07DLAAdmin: Directory not empty
rm: DLA_Records: Directory not empty
tape# cd DLA_Records
tape# ls
05DLAAdmin      07DLAAdmin
tape# ls -l
ls: 05DLAAdmin: No such file or directory
total 3
drwxrws---  4 500  501  4 Oct  3 11:53 07DLAAdmin
tape# file 05DLAAdmin
05DLAAdmin: cannot open `05DLAAdmin' (No such file or directory)
tape# ls -R 07DLAAdmin
07Digital_Imaging_Work  FY07IAWAprep

07DLAAdmin/07Digital_Imaging_Work:
ls: 07Proposals: No such file or directory

07DLAAdmin/FY07IAWAprep:
ls: Budget: No such file or directory
tape# ls 07DLAAdmin
07Digital_Imaging_Work  FY07IAWAprep
tape# ls 07DLAAdmin/07Digital_Imaging_Work
07Proposals
tape# ls -l 07DLAAdmin/07Digital_Imaging_Work/07Proposals
ls: 07DLAAdmin/07Digital_Imaging_Work/07Proposals: No such file or directory
tape# ls 07DLAAdmin/FY07IAWAprep
Budget
tape# ls 07DLAAdmin/FY07IAWAprep/Budget
ls: 07DLAAdmin/FY07IAWAprep/Budget: No such file or directory
tape# file 07DLAAdmin/FY07IAWAprep/Budget
07DLAAdmin/FY07IAWAprep/Budget: cannot open `07DLAAdmin/FY07IAWAprep/Budget' (No such file or directory)
tape# cd 05DLAAdmin
05DLAAdmin: No such file or directory.
=====
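The pattern in the session above (a name is returned by a directory listing, but a stat of that name fails) can be searched for across a whole tree.  The following Python sketch does that; the function name find_unstatable is hypothetical, and it only reports such entries, it does not attempt any repair.

```python
import errno
import os
import stat


def find_unstatable(root):
    """Yield (path, errno_name) for entries that a directory listing
    returns but lstat() rejects, and for directories that cannot be read."""
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            names = os.listdir(directory)
        except OSError as exc:
            yield directory, errno.errorcode.get(exc.errno, str(exc.errno))
            continue
        for name in names:
            path = os.path.join(directory, name)
            try:
                info = os.lstat(path)
            except OSError as exc:
                # Listed by the directory, yet not stat-able: the symptom
                # shown for 05DLAAdmin, 07Proposals and Budget above.
                yield path, errno.errorcode.get(exc.errno, str(exc.errno))
                continue
            if stat.S_ISDIR(info.st_mode):
                stack.append(path)  # descend without following symlinks
```

Iterating over find_unstatable("/backups/storage/san/DLA") and printing each pair would be expected to list the same entries that "ls -l" complains about, each with an error name such as ENOENT.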

The pool itself reports no errors.  I performed a scrub on the pool, yet this bizarre filesystem corruption persists:

=====
tape# zpool status backups
  pool: backups
 state: ONLINE
 scan: scrub repaired 15K in 7h33m with 0 errors on Sat Oct  1 19:22:35 2011
config:

        NAME            STATE     READ WRITE CKSUM
        backups         ONLINE       0     0     0
          raidz2-0      ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0
            gpt/disk03  ONLINE       0     0     0
            gpt/disk04  ONLINE       0     0     0
            gpt/disk05  ONLINE       0     0     0
            gpt/disk06  ONLINE       0     0     0
            gpt/disk07  ONLINE       0     0     0

errors: No known data errors
tape# uname -a
FreeBSD tape.private.lib.vt.edu 9.0-BETA3 FreeBSD 9.0-BETA3 #0: Wed Sep 28 15:18:59 EDT 2011     pmather_at_tape.private.lib.vt.edu:/usr/obj/usr/src/sys/TAPE  amd64
tape# zpool get all backups
NAME     PROPERTY       VALUE       SOURCE
backups  size           10.9T       -
backups  capacity       62%         -
backups  altroot        -           default
backups  health         ONLINE      -
backups  guid           1352318175125790395  default
backups  version        28          default
backups  bootfs         -           default
backups  delegation     on          default
backups  autoreplace    off         default
backups  cachefile      -           default
backups  failmode       wait        default
backups  listsnapshots  off         default
backups  autoexpand     off         default
backups  dedupditto     0           default
backups  dedupratio     1.00x       -
backups  free           4.07T       -
backups  allocated      6.80T       -
backups  readonly       off         -
tape# zfs get all backups/storage
NAME             PROPERTY              VALUE                  SOURCE
backups/storage  type                  filesystem             -
backups/storage  creation              Fri Sep  2 14:43 2011  -
backups/storage  used                  4.26T                  -
backups/storage  available             2.60T                  -
backups/storage  referenced            4.26T                  -
backups/storage  compressratio         1.51x                  -
backups/storage  mounted               yes                    -
backups/storage  quota                 none                   default
backups/storage  reservation           none                   default
backups/storage  recordsize            128K                   default
backups/storage  mountpoint            /backups/storage       default
backups/storage  sharenfs              off                    default
backups/storage  checksum              fletcher4              local
backups/storage  compression           gzip-9                 local
backups/storage  atime                 on                     default
backups/storage  devices               on                     default
backups/storage  exec                  off                    local
backups/storage  setuid                on                     default
backups/storage  readonly              off                    default
backups/storage  jailed                off                    default
backups/storage  snapdir               hidden                 default
backups/storage  aclmode               discard                default
backups/storage  aclinherit            restricted             default
backups/storage  canmount              on                     default
backups/storage  xattr                 off                    temporary
backups/storage  copies                1                      default
backups/storage  version               5                      -
backups/storage  utf8only              off                    -
backups/storage  normalization         none                   -
backups/storage  casesensitivity       sensitive              -
backups/storage  vscan                 off                    default
backups/storage  nbmand                off                    default
backups/storage  sharesmb              off                    default
backups/storage  refquota              none                   default
backups/storage  refreservation        none                   default
backups/storage  primarycache          all                    default
backups/storage  secondarycache        all                    default
backups/storage  usedbysnapshots       0                      -
backups/storage  usedbydataset         4.26T                  -
backups/storage  usedbychildren        0                      -
backups/storage  usedbyrefreservation  0                      -
backups/storage  logbias               latency                default
backups/storage  dedup                 off                    default
backups/storage  mlslabel                                     -
backups/storage  sync                  standard               default
backups/storage  refcompressratio      1.51x                  -
=====

I know ZFS does not have an fsck utility ("because it doesn't need one" :), but does anyone know of any way of fixing this corruption short of destroying the pool, creating a new one, and restoring from backup?  Is there some way of exporting and re-importing the pool that has the side effect of doing some kind of fsck-like repair of subtle corruption like this?

Cheers,

Paul.
Received on Mon Oct 03 2011 - 17:19:32 UTC
