Re: why GNU grep is fast

From: Tim Kientzle <tim_at_kientzle.com> Date: Sun, 22 Aug 2010 10:31:46 -0700 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:06 UTC

On Aug 22, 2010, at 8:02 AM, Garrett Wollman wrote:
> In article <86k4nikglg.fsf_at_ds4.des.no> you write:
>> Mike Haertel <mike_at_ducky.net> writes:
>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks
>>> first for the final letter of the target string, and uses a lookup
>>> table to tell it how far ahead it can skip in the input whenever
>>> it finds a non-matching character.
>> 
>> Boyer-Moore is for fixed search strings.  I don't see how that
>> optimization can work with a regexp search unless the regexp is so
>> simple that you break it down into a small number of cases with known
>> length and final character.
> 
> The common case of regexps used in the "grep" utility (and, for
> obvious reasons, universal in the "fgrep" utility) is fixed-length
> search strings.  Even non-fixed-length regexps typically consist of
> one one or two variable-length parts.

This is an important point:  A good grep implementation will
use different strategies depending on the input regexp.
Fixed-string matching is a very important special case.

>  Matching a completely
> variable-length regexp is just hard, computationally, 

See Russ Cox' articles for why this is not true.  It does require
considerable sophistication to build an efficient DFA but the
actual matcher, once built, can run very fast indeed:

  http://swtch.com/~rsc/regexp/

Tim