Re: readdir/telldir/seekdir problem (i think)

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Sat, 25 Apr 2015 09:04:32 -0400 (EDT)
Julian Elischer wrote:
> On 4/25/15 4:28 AM, John Baldwin wrote:
> > On Saturday, April 25, 2015 02:36:24 AM Julian Elischer wrote:
> >> On 4/25/15 1:30 AM, Julian Elischer wrote:
> >>> On 4/24/15 10:59 PM, John Baldwin wrote:
> >>>> On Friday, April 24, 2015 01:02:39 PM Julian Elischer wrote:
> >>>>> On 4/23/15 9:54 PM, John Baldwin wrote:
> >>>>>> On Thursday, April 23, 2015 05:02:08 PM Julian Elischer wrote:
> >>>>>>> On 4/23/15 11:20 AM, Julian Elischer wrote:
> >>>>>>>> I'm debugging a problem being seen with samba 3.6.
> >>>>>>>>
> >>>>>>>> basically  telldir/seekdir/readdir don't seem to work as
> >>>>>>>> advertised..
> >>>>>>> ok so it looks like readdir() (and friends) is totally broken
> >>>>>>> in
> >>>>>>> the face
> >>>>>>> of deletes unless you read the entire directory at once or
> >>>>>>> reset
> >>>>>>> to the
> >>>>>>> the first file before the deletes, or earlier.
> >>>>>> I'm not sure that Samba isn't assuming non-portable behavior.
> >>>>>> For example:
> >>>>>>
> >>>>>> >From
> >>>>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir_r.html
> >>>>>>
> >>>>>>
> >>>>>> If a file is removed from or added to the directory after the
> >>>>>> most recent call
> >>>>>> to opendir() or rewinddir(), whether a subsequent call to
> >>>>>> readdir() returns an
> >>>>>> entry for that file is unspecified.
> >>>>>>
> >>>>>> While this doesn't speak directly to your case, it does note
> >>>>>> that
> >>>>>> you will
> >>>>>> get inconsistencies if you scan a directory concurrent with
> >>>>>> add
> >>>>>> and remove.
> >>>>>>
> >>>>>> UFS might kind of work actually since deletes do not compact
> >>>>>> the
> >>>>>> backing
> >>>>>> directory, but I suspect NFS and ZFS would not work.  In
> >>>>>> addition, our
> >>>>>> current NFS support for seekdir is pretty flaky and can't be
> >>>>>> fixed without
> >>>>>> changes to return the seek offset for each directory entry (I
> >>>>>> believe that
> >>>>>> the projects/ino64 patches include this since they are
> >>>>>> breaking
> >>>>>> the ABI of
> >>>>>> the relevant structures already).  The ABI breakage makes this
> >>>>>> a
> >>>>>> very
> >>>>>> non-trivial task.  However, even if you have that per-item
> >>>>>> cookie, it is
> >>>>>> likely meaningless in the face of filesystems that use any
> >>>>>> sort
> >>>>>> of more
> >>>>>> advanced structure than an array (such as trees, etc.) to
> >>>>>> store
> >>>>>> directory
> >>>>>> entries.  POSIX specifically mentions this in the rationale
> >>>>>> for
> >>>>>> seekdir:
> >>>>>>
> >>>>>> One of the perceived problems of implementation is that
> >>>>>> returning
> >>>>>> to a given point in a directory is quite difficult to describe
> >>>>>> formally, in spite of its intuitive appeal, when systems that
> >>>>>> use
> >>>>>> B-trees, hashing functions, or other similar mechanisms to
> >>>>>> order
> >>>>>> their directories are considered. The definition of seekdir()
> >>>>>> and
> >>>>>> telldir() does not specify whether, when using these
> >>>>>> interfaces,
> >>>>>> a given directory entry will be seen at all, or more than
> >>>>>> once.
> >>>>>>
> >>>>>> In fact, given that quote, I would argue that what Samba is
> >>>>>> doing is
> >>>>>> non-portable.  This would seem to indicate that a conforming
> >>>>>> seekdir could
> >>>>>> just change readdir to immediately return EOF until you call
> >>>>>> rewinddir.
> >>>>> how does readdir know that the directory has been changed?
> >>>>> fstat?
> >>>> Undefined.  FreeBSD's libc doesn't cache the entire directory
> >>>> (unless you
> >>>> are using a union mount), instead it just caches the most recent
> >>>> call to
> >>>> getdirentries().  It then serves up entries from that block
> >>>> until
> >>>> you hit
> >>>> the end.  When you hit the end it reads the next block, etc.
> >>>> When you
> >>>> call telldir(), the kernel saves the seek offset from the start
> >>>> of the
> >>>> current block and the offset within that block and returns a
> >>>> cookie to
> >>>> you.  seekdir() looks up the cookie to find the (seek offset,
> >>>> block
> >>>> offset)
> >>>> tuple.  If the location matches the directory's current location
> >>>> it
> >>>> doesn't
> >>>> do anything, otherwise it seeks to the seek offset and reads a
> >>>> new
> >>>> block via
> >>>> getdirentries().  There is no check for seeing if a directory is
> >>>> changed.  Instead, you can only be stale by one "block".  When
> >>>> you
> >>>> read
> >>>> a new block it is relative to the directory's state at that
> >>>> time.
> >>>>
> >>>> Rick's suggestion of reusing the block for a seek within a block
> >>>> would be
> >>>> fairly easy to implement, I think it would just be:
> >>>>
> >>>> Index: head/lib/libc/gen/telldir.c
> >>>> ===================================================================
> >>>> --- head/lib/libc/gen/telldir.c (revision 281929)
> >>>> +++ head/lib/libc/gen/telldir.c (working copy)
> >>>> _at__at_ -101,8 +101,10 _at__at_
> >>>>                   return;
> >>>>           if (lp->loc_loc == dirp->dd_loc && lp->loc_seek ==
> >>>> dirp->dd_seek)
> >>>>                   return;
> >>>> -       (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek,
> >>>> SEEK_SET);
> >>>> -       dirp->dd_seek = lp->loc_seek;
> >>>> +       if (lp->loc_seek != dirp->dd_seek) {
> >>>> +               (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek,
> >>>> SEEK_SET);
> >>>> +               dirp->dd_seek = lp->loc_seek;
> >>>> +       }
> >>> yes I did that yesterday but it still fails when you transition
> >>> blocks.. (badly).
> >>>
> >>> I also tried bigger blocks.. also fails (eventually)
> >>>
> >>> I did find a way to make it work...  you had to seek back
> >>> to the first block you deleted on each set..
> >>> then work forward from there again..  unfortunately since
> >>> I'm trying to make a microsoft program not fail (via samba)
> >>> I have no control over how it does things and seekdir doesn't
> >>> know what was deleted anyway... (so the fix is fine for  the
> >>> test program but not for real life)
> >>>
> >>> I think I can make the BSD one act like the linux one by changing
> >>> the lseek being done to use the offset (loc) plus the buffer seek
> >>> address of the target, instead of just going for the buffer base
> >>> and
> >>> stepping forward through the entries..
> >>>
> >>> maybe tomorrow.
> >>>
> >> The following conditional code makes ours behave the same as the
> >> linux
> >> one.
> >> it breaks several 'rules' but works where ours is clean but
> >> fails..
> >> as Rick said..  "maybe that's what we should do too."
> >>
> >>
> >> this is at the end of seekdir()
> >>
> >>
> >> The new code does what linux does.. and shouldn't work.. but does
> >>               // at least in the limited conditions I need it to.
> >>               // We'll probably need to do this at work...:
> >>
> >>
> >> The original code is what we have now, but gets mightily confused
> >> sometimes.
> >>          // This is clean(er) but fails in specific
> >>          situations(when
> >> doing commands
> >>          // from Microft windows, via samba).
> >>
> >>
> >> root_at_vps1:/tmp # diff -u dir.c.orig dir.c
> >> --- dir.c.orig    2015-04-24 11:29:36.855317000 -0700
> >> +++ dir.c    2015-04-24 11:15:49.058500000 -0700
> >> _at__at_ -1105,6 +1105,13 _at__at_
> >>            dirp->dd_loc = lp->loc_loc;
> >>            return;
> >>        }
> >> +#ifdef GLIBC_SEEK
> >> +    (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek + lp->loc_loc,
> >> SEEK_SET);
> >> +    dirp->dd_seek = lp->loc_seek + lp->loc_loc;
> >> +    dirp->dd_loc = 0;
> >> +    lp->loc_seek = dirp->dd_seek;
> >> +    lp->loc_loc = 0;
> >> +#else
> >>        (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
> >>        dirp->dd_seek = lp->loc_seek;
> >>        dirp->dd_loc = 0;
> >> _at__at_ -1114,6 +1121,7 _at__at_
> >>            if (dp == NULL)
> >>                break;
> >>        }
> >> +#endif
> >>    }
> > Yes, this isn't at all safe.  There's no guarantee whatsoever that
> > the offset on the directory fd that isn't something returned by
> > getdirentries has any meaning.
> the value returned by getdirentries is the original seek for the
> block, and the  offset is not sent obviosly.. but we could
> do somethign where getdirentries 'remembers' all teh offsets it sent
> for the last set of entries, of we could make it return a a seekable
> cookie for every entry...
> 
The problem is that there are two separate sets of offsets. There are
the byte offsets in the "logical UFS-like directory blocks" generated
by the file systems for VOP_READDIR() and then there are the "physical
offset cookies" which are opaque bits for anything but the underlying
file system. The latter is a location for the directory entry within
the directory structure it maintains.
My thinking was that d_off would be this latter physical offset cookie
and that a new getdirenties(2) syscall and a new VOP_READDIR() would
allow passage of this to the file system, so that it could return
directory entries starting at that place in its directory.
I have never gotten around to looking closely at the libc stuff, to
try and determine if this can all be made to work correctly.

rick
ps: This is all "future" stuff and probably doesn't help come up
    with a better set of libc functions for what is in head/current.
    (Didn't want to hijack the discussion, but it is in a sense
     relevant.)

> >   In particular, the size of the
> > directory entry in a random filesystem might be a different size
> > than the structure returned by getdirentries (since it converts
> > things into a FS-independent format).
> 
> yes I understand all that.. :-)
> >
> > This might work for UFS by accident, but this is probably why ZFS
> > doesn't work.
> actually ZFS seems to work too which stunned me..
> 
> > However, this might be properly fixed by the thing that ino64 is
> > doing where each directory entry returned by getdirentries gives
> > you a seek offset that you _can_ directly seek to (as opposed to
> > seeking to the start of the block and then walking forward N
> > entries until you get an inter-block entry that is the same).
> 
> yeah that might be nice.
> 
> >
> 
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to
> "freebsd-current-unsubscribe_at_freebsd.org"
> 
Received on Sat Apr 25 2015 - 11:04:35 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:57 UTC