Re: why GNU grep is fast

From: Sean C. Farley <scf_at_FreeBSD.org> Date: Sun, 22 Aug 2010 20:29:01 -0500 (CDT) · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:06 UTC

On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:

> "Sean C. Farley" <scf_at_FreeBSD.org> writes:
>> Some algorithms:
>> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
>
> Aho-Corasick is not really a search algorithm, but an algorithm for 
> constructing a table-driven finite state machine that will match 
> either of the search strings you fed it.  I believe it is less 
> efficient than Boyer-Moore for small numbers of search terms, since it 
> scans the entire input.  I don't see the point in using it in grep, 
> because grep already has an algorithm for constructing finite state 
> machines: regcomp(3).

especially those that could be useful for fgrep functionality

I was mainly talking about algorithms useful for the fgrep portion 
within FreeGrep.  fgrep would run (still runs?) over the same text for 
each pattern.

Therefore, Aho–Corasick had to be mentioned for the reason referenced 
within the link:
     The Aho–Corasick string matching algorithm formed the basis of the
     original Unix command fgrep.

>> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
>
> It doesn't seem to compare favorably to the far older Aho-Corasick. 
> It uses slightly less memory, but memory is usually not an issue with 
> grep.

I agree, yet I like to keep alternative algorithms in mind in case a 
variant would be useful.

>> 4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore
>> variant)
>
> Glimpse is a POS...  and not really comparable, because grep is 
> designed to search for a single search string in multiple texts, while 
> glimpse is designed to search a large amount of text over and over 
> with different search strings.  I believe it uses suffix tables to 
> construct its index, and Boyer-Moore only to locate specific matches, 
> since the index lists only files, not exact positions.  For anything 
> other than fixed strings, it reverts to agrep, but I assume (I haven't 
> looked at the code) that if the regexp has one or more fixed 
> components, it uses those to narrow the search space before running 
> agrep.

Glimpse may be a POS; I have not used it personally.  I only noted its 
algorithm for possible use within fgrep.

Of course, there may be much better algorithms out there to boost 
fgrep's speed, but these are what I had found at one time.

Sean
-- 
scf_at_FreeBSD.org