Re: readdir/telldir/seekdir problem (i think)

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Sat, 25 Apr 2015 08:16:21 -0400 (EDT)
Julian Elischer wrote:
> On 4/25/15 9:39 AM, Rick Macklem wrote:
> > Jilles Tjoelker wrote:
> >> On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
> >>> Yes, this isn't at all safe.  There's no guarantee whatsoever that
> >>> the offset on the directory fd that isn't something returned by
> >>> getdirentries has any meaning.  In particular, the size of the
> >>> directory entry in a random filesystem might be a different size
> >>> than the structure returned by getdirentries (since it converts
> >>> things into a FS-independent format).
> >>> This might work for UFS by accident, but this is probably why ZFS
> >>> doesn't work.
> >>> However, this might be properly fixed by the thing that ino64 is
> >>> doing where each directory entry returned by getdirentries gives
> >>> you a seek offset that you _can_ directly seek to (as opposed to
> >>> seeking to the start of the block and then walking forward N
> >>> entries until you get an inter-block entry that is the same).
> >> The ino64 branch only reserves space for d_off and does not use it in
> >> any way. This is appropriate since actually using d_off is a major
> >> feature addition.
> >>
> > Well, at some point ino64 will need to define a new getdirentries(2)
> > syscall and I believe this new syscall can have different/additional
> > arguments.
> yes, posix only specifies 2 mandatory fields (d_ino and d_name) and
> everything else is implementation dependent.
> > I'd suggest that the new getdirentries(2) syscall should return a
> > flag to indicate that the underlying file system is filling in
> > d_off.
> > Then the libc functions can use d_off if it is available.
> > (They will still need to "work" at least as well as they do now if
> >   the file system doesn't support d_off. The old getdirentries(2)
> >   syscall will be returning the old/current "struct dirent" which
> >   doesn't have the field anyhow.)
> >
> > Another bit of fun is that the argument for seekdir()/telldir() is a
> > long and ends up 32bits for some arches. d_off is 64bits, since that
> > is what some file systems require.
> what does linux use?
> ------
>        In glibc up to version 2.1.1, the return type of telldir() was
>        off_t.  POSIX.1-2001 specifies long, and this is the type used
>        since glibc 2.1.2.
> 
> also from the linux man page: this is interesting..
> 
> --------
>        In early filesystems, the value returned by telldir() was a
>        simple file offset within a directory.  Modern filesystems use
>        tree or hash structures, rather than flat tables, to represent
>        directories.  On such filesystems, the value returned by
>        telldir() (and used internally by readdir(3)) is a "cookie" that
>        is used by the implementation to derive a position within a
>        directory.  Application programs should treat this strictly as
>        an opaque value, making no assumptions about its contents.
> ------
> but glibc uses the contents in a nonopaque (and possibly wrong) way
> itself in seekdir.
> (not following their own advice.)
> 
> 
> > Maybe the library code can only use d_off if it is a 64bit arch and
> > the file system is filling it in. (Or maybe the library can keep
> > track of 32<->64bit mappings for the offsets. I haven't looked at the
> > libc functions for a while, so I can't remember what they keep track
> > of.)
> 
> one supposes a 32 bit system would not have such large file systems on
> it.. (maybe?)
For NFS, the cookie is always an opaque 64-bit value. These cookies are cached
in the kernel by the client, with one for each "logical UFS-like directory
block generated for getdirentries(2)". As such, the NFS case does a 64bit->32bit
mapping. (Because of endianness etc., there is no guarantee that the high order
32 bits of most of these cookies are 0.)
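
Since the high order 32 bits cannot simply be dropped, any 64bit->32bit
mapping has to remember the full cookie somewhere. A minimal user-level
sketch of that idea follows. It is not the NFS client's per-block cookie
cache; struct cookie_map, cookie_to_token(), token_to_cookie() and the
fixed COOKIE_MAP_MAX capacity are all hypothetical names chosen for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COOKIE_MAP_MAX 64	/* hypothetical fixed capacity for the sketch */

struct cookie_map {
	uint64_t cookies[COOKIE_MAP_MAX];	/* full 64-bit cookies seen so far */
	size_t	 used;				/* number of valid slots */
};

/*
 * Return a small token (>= 1) that stands for the 64-bit cookie, adding
 * the cookie to the table if it is new. Returns 0 if the table is full.
 */
static uint32_t
cookie_to_token(struct cookie_map *m, uint64_t cookie)
{
	size_t i;

	assert(m != NULL);
	for (i = 0; i < m->used; i++)
		if (m->cookies[i] == cookie)
			return ((uint32_t)(i + 1));
	if (m->used == COOKIE_MAP_MAX)
		return (0);
	m->cookies[m->used++] = cookie;
	return ((uint32_t)m->used);
}

/*
 * Look up the 64-bit cookie for a token handed out earlier.
 * Returns 0 on success, -1 if the token was never issued.
 */
static int
token_to_cookie(const struct cookie_map *m, uint32_t token, uint64_t *cookie)
{
	assert(m != NULL && cookie != NULL);
	if (token == 0 || token > m->used)
		return (-1);
	*cookie = m->cookies[token - 1];
	return (0);
}
```

The token fits in a 32-bit long regardless of what the cookie's high
order bits contain, at the cost of keeping the table alive for as long
as tokens may be presented back.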

I need to look at the library code (both glibc and ours) before I understand
this better and can say more.

Have fun with it, rick
ps: The ino64 stuff will never be MFC'd, so it would be nice to "improve"
    what the libc functions do without use of d_off.

> >
> > rick
> >
> >> A proper d_off would still be useful even if UFS's readdir keeps
> >> masking off the offset so a directory read always starts at the
> >> beginning of a 512-byte directory block, since this allows more
> >> distinct offset values than safely using getdirentries()'s *basep.
> >> With d_off, one outer loop must read at least one directory block to
> >> avoid spinning indefinitely, while using getdirentries()'s *basep
> >> requires reading the whole getdirentries() buffer.
> >>
> >> Some Linux filesystems go further and provide a unique d_off for each
> >> entry.
> >>
> >> Another idea would be to store the last d_ino instead of dd_loc into
> >> the struct ddloc. On seekdir(), this would seek to loc_seek as before
> >> and skip entries until that d_ino is found, or to the start of the
> >> buffer if not found (and possibly return some entries again that
> >> should not be returned, but Samba copes with that).
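
A rough sketch of the skip-until-d_ino idea described above, not the
libc code. struct sketch_ent and resume_index() are hypothetical
stand-ins for the buffered dirents and the seekdir() logic, and the
sketch assumes the saved d_ino belongs to the last entry already
returned, so reading resumes with the entry after it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffered directory entry. */
struct sketch_ent {
	uint64_t    ino;	/* plays the role of d_ino */
	const char *name;	/* plays the role of d_name */
};

/*
 * After seeking to the saved block offset and refilling the buffer,
 * return the index of the first entry to hand out next: the one after
 * the entry whose inode number equals last_ino, or 0 (start of the
 * buffer) if that inode number is no longer present.
 */
static size_t
resume_index(const struct sketch_ent *buf, size_t n, uint64_t last_ino)
{
	size_t i;

	assert(buf != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (buf[i].ino == last_ino)
			return (i + 1);
	return (0);	/* not found: some entries may be returned again */
}
```

Note that two hard links to the same file in one directory share an
inode number, so a match on d_ino alone can land on the wrong entry;
the sketch stops at the first match.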
> >>
> >> --
> >> Jilles Tjoelker
> >>
> >
> 
> 
Received on Sat Apr 25 2015 - 10:16:24 UTC
