Re: why GNU grep is fast

From: Garrett Wollman <wollman_at_hergotha.csail.mit.edu>
Date: Sun, 22 Aug 2010 13:04:00 -0400 (EDT)
In article <20100822163644.GU2978_at_hoeg.nl> you write:

>I think that implementing a simple fgrep boils down to mmap()ing a file
>and calling memmem() on the mapping to search for the input string. Of
>course this relies on having an efficient memmem() implementation, for
>example using one of the algorithms mentioned in this thread.
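
For concreteness, the quoted approach is roughly the following (a
sketch only; the function name and the pared-down error handling are
mine, and it assumes a non-empty regular file):

    #define _GNU_SOURCE     /* memmem() is a GNU/BSD extension */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int mmap_fgrep(const char *path, const char *needle)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;

        if (fd == -1 || fstat(fd, &st) == -1)
            return -1;

        void *map = mmap(NULL, (size_t)st.st_size, PROT_READ,
                         MAP_PRIVATE, fd, 0);
        close(fd);
        if (map == MAP_FAILED)
            return -1;

        int found = memmem(map, (size_t)st.st_size, needle,
                           strlen(needle)) != NULL;
        munmap(map, (size_t)st.st_size);
        return found;   /* 1 if the fixed string occurs, else 0 */
    }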

It's actually more complicated than that, because you have to ensure
that you are not matching in the middle of a multibyte character when
the current locale specifies a character set with a multibyte
encoding.  The same caution applies when searching for the newlines
that delimit the matched line.  (I'm not sure whether FreeBSD
supports any character encodings that would be ambiguous in that
way.)  I don't think this was considered an issue when Mike Haertel
was developing GNU grep.
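
(Shift-JIS is the classic example of such an ambiguous encoding: the
second byte of a double-byte character can fall in the ASCII range,
so a byte-level search can "find" a pattern that isn't actually
there.  A minimal demonstration, with the specific bytes chosen just
for illustration:)

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* 0x95 0x5C is the single Shift-JIS character U+8868; its
           trail byte 0x5C is also the ASCII backslash. */
        const unsigned char text[] = { 0x95, 0x5c, 0x00 };
        const unsigned char *hit = memchr(text, '\\', 2);

        if (hit != NULL)    /* fires: a mid-character false match */
            printf("false match at offset %td\n", hit - text);
        return 0;
    }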

It seems reasonable to implement BMG or some other fast search in
memmem().  Note that if you can't (or don't want to) mmap the whole
file at once, you'll need special handling for the boundary
conditions -- both at the string-search level and in the search for
line delimiters.  This is much easier in the fgrep case, obviously,
since the length of the query puts a finite upper bound on the amount
of the old buffer you need to keep.  With regexps you really need
your regexp engine to be able to report its matching state, or else
you have to limit your input to strictly conforming POSIX text files
(i.e., line lengths limited to {LINE_MAX}).
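
A rough sketch of that boundary handling for the fgrep case (names
and buffer size are mine, not GNU grep's; line-delimiter handling is
omitted, and the needle is assumed to be far smaller than the
buffer):

    #define _GNU_SOURCE     /* memmem() is a GNU/BSD extension */
    #include <string.h>
    #include <unistd.h>

    static int chunked_search(int fd, const char *needle)
    {
        size_t nlen = strlen(needle);
        size_t keep = nlen > 1 ? nlen - 1 : 0; /* old-buffer bound */
        char buf[65536];
        size_t have = 0;
        ssize_t got;

        while ((got = read(fd, buf + have, sizeof buf - have)) > 0) {
            size_t len = have + (size_t)got;

            if (memmem(buf, len, needle, nlen) != NULL)
                return 1;   /* found, possibly across a boundary */

            /* Carry the tail forward: a match straddling two reads
               needs at most nlen - 1 bytes of the old buffer. */
            if (len > keep) {
                memmove(buf, buf + len - keep, keep);
                have = keep;
            } else {
                have = len;
            }
        }
        return 0;
    }

Keeping nlen - 1 bytes suffices because any match lying wholly
inside the carried tail was already present in the previous buffer
and would have been reported then; with a regexp engine, as noted
above, there is no such fixed bound.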

-GAWollman
