grep and Regular expression correctness

From: <chrisk-freebsd_at_list.mightyreason.com>
Date: Mon, 23 Aug 2010 12:51:40 +0100
I saw the discussion on this list with the subject "why GNU grep is
fast" and I thought I would contribute to a discussion of the
correctness of the POSIX regular expressions.

Gabor wrote:
>> I believe Gabor is considering TRE for a good replacement regex library.
> Yes. Oniguruma is slow, Google RE2 only supports Perl and fgrep syntax 
> but not standard regex and Plan 9 implementation iirc only supports 
> fgrep syntax and Unicode but not wchar_t in general.

I specifically need to mention that the ideas used in the TRE algorithm
can work, but that libtre is very buggy (see footnote [1]).

The hardest part of POSIX regular expressions is in picking out the
correct captured subexpressions.  This makes programs like "sed"
vulnerable to the bad regex.h engine that comes with the operating system.

Luckily grep does not need the captured subexpressions, and thus does
not need the complexity that comes from the ideas behind TRE.

I use OS X which inherits an older freebsd suite of tools with bugs like
this:

> echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'

[XababaYababaZ] => [X][ab][aba][Y][b][Z]

Where the [b] between [Y] and [Z] is completely (insanely) wrong.
It cannot even get the leftmost-longest correct:

> echo "ab" | sed -E 's/(()|.)(b)/[\1][\3]/'
a[][b]

Where the answer should have been the same as

> echo "ab" | sed -E 's/(.|())(b)/[\1][\3]/'
[a][b]

I have no idea if these bugs are still present in freebsd-current.

Cheers,
  Chris Kuklewicz



[1] libtre has problems such as this (from Haskell's regex-tre wrapper)
>> "searchme" =~ "s(()|^)e" :: Array Int (MatchOffset,MatchLength)
> array (0,2) [(0,(0,2)),(1,(1,0)),(2,(1,0))]

The above looks sane but re-ordering the alternative causes the match to
fail:

>> "searchme" =~ "s(^|())e" :: Array Int (MatchOffset,MatchLength)
> array (1,0) []

And the problems with ^ and $ are not the only ones:

>> "searchme" =~ "((s)|(e)|(a))*" :: Array Int (MatchOffset,MatchLength)
> array (0,4) [(0,(0,3)),(1,(2,1)),(2,(-1,0)),(3,(-1,0)),(4,(2,1))]

The above is correct, but this is not:

>> "searchme" =~ "((s)|(e)|())*" :: Array Int (MatchOffset,MatchLength)
> array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(1,1)),(4,(2,0))]
Received on Mon Aug 23 2010 - 10:03:01 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:06 UTC