Re: why GNU grep is fast

From: Dag-Erling Smørgrav <des_at_des.no> Date: Mon, 23 Aug 2010 10:12:12 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:06 UTC

Mike Haertel <mike_at_ducky.net> writes:
> Dag-Erling Smørgrav <des_at_des.no> writes:
> > You don't really need to "isolate the containing line" unless you have
> > an actual match, do you?
> Theoretically no.  However, suppose the pattern was /foo.*blah/.
> The Boyer-Moore search will be for "blah", since that's the longest
> fixed substring.  But verifying a match for the full regexp either
> requires a regexp matcher with the feature "start here, at THIS point
> in the middle of the RE and THAT point in the middle of the buffer,
> and match backwards and forwards", or else running a more standard
> RE matcher starting from the beginning of the line.

Stupidly enough, I only considered the case where the fixed search
string is at the end of the regexp :)  For the general case, you'd have
to either build a second FSA that mirrors the first, or design your
engine and data structures in such a way that the FSA can be traversed
in either direction.  Then you note which states correspond to the
beginning and end of the fixed substring, and run the FSA forward from
the end and backward from the beginning.  That could prove advantageous
when the input consists of very long lines and the frequency of partial
matches is relatively high; large files with very long lines are very
common in bioinformatics.

> As you mentioned, you can avoid searching forwards for the next
> newline if your RE matcher supports using newline as an exit marker.

Umm, I don't understand why it needs to support using *anything* as an
exit marker.  The newline character will simply not match any of the
transitions from whichever state the FSA is currently in, and neither
will the null character.  The only bit that needs to know about them is
the bit that handles end-of-line and end-of-word anchors.

DES
-- 
Dag-Erling Smørgrav - des_at_des.no