Re: why GNU grep is fast

From: Mike Haertel <mike_at_ducky.net> Date: Sun, 22 Aug 2010 11:22:12 -0700 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:06 UTC

Dag-Erling Sm�rgrav <des_at_des.no> writes:
> Mike Haertel <mike_at_ducky.net> writes:
> > GNU grep uses the well-known Boyer-Moore algorithm, which looks
> > first for the final letter of the target string, and uses a lookup
> > table to tell it how far ahead it can skip in the input whenever
> > it finds a non-matching character.
> 
> Boyer-Moore is for fixed search strings.  I don't see how that
> optimization can work with a regexp search unless the regexp is so
> simple that you break it down into a small number of cases with known
> length and final character.

GNU grep uses heuristics to look for a fixed string that any string
matching the regex *must* contain, and uses that fixed string as the
bases for its initial Boyer-Moore search.

For example if your regex is /foo.*bar/, the initial Boyer-Moore search
is (probably) searching for foo.

If the initial search succeeds, GNU grep isolates the containing line,
and then runs the full regex matcher on that line to make sure.

This is the sort of thing that a good regex library could do internally.

Unfortunately, you can'd do this with a library that conforms to
the !_at_#%$!_at_#% POSIX regex API.

The problem is that regexec()'s interface is based on NUL-terminated
strings, rather than byte-counted buffers.  So POSIX regexec() is
necessarily character-at-a-time, because it has to look for that
input-terminating NUL byte, and also you can't use it to search binary
data that might contain NULs.  (GNU grep works fine with arbitrary
input files, as long as it can allocate enough memory to hold the
longest line.)

For these reasons a good grep implementation is pretty muched doomed
to bundle its own regex matcher.