Re: bug with special bracket expressions in regular expressions

From: Kimmo Paasiala <kpaasial_at_gmail.com>
Date: Mon, 2 Sep 2013 20:52:18 +0300
On Mon, Sep 2, 2013 at 7:45 PM, Andriy Gapon <avg_at_freebsd.org> wrote:
> on 02/09/2013 17:54 Andriy Gapon said the following:
>>
>> re_format(7) says:
>>      There are two special cases‡ of bracket expressions: the bracket
>>      expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>>      beginning and end of a word respectively.  A word is defined as a
>>      sequence of word characters which is neither preceded nor followed by
>>      word characters.  A word character is an alnum character (as defined
>>      by ctype(3)) or an underscore.  This is an extension, compatible with
>>      but not specified by IEEE Std 1003.2 (“POSIX.2”), and should be used
>>      with caution in software intended to be portable to other systems.
>>
>> However I observe the following:
>> $ echo "cd0 cd1 xx" | sed 's/cd[0-9][^ ]* *//g'
>> xx
>> $ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
>> cd1 xx
>>
>> In my opinion '[[:<:]]' should not affect how the pattern is matched in this case.
>
> It seems that the code works like this:
> - first it matches "cd0 " and "removes" it
> - then it passes "cd1 xx" for matching, with a flag saying that this is not
>   a real start of the line
> - thus the matching code
>  o knows that this is not a real line start, so it cannot match [[:<:]]
>    on that basis alone
>  o does _not_ know which character preceded the start of the given
>    substring, so it cannot tell whether [[:<:]] could match there
>
> So matching fails.
> I am not sure whether this is an internal problem of regex(3) or a problem
> with how sed(1) uses regex(3).
>
> --
> Andriy Gapon

In my opinion this is a bug. The [[:<:]] operator is documented to match
the empty string at the beginning of a word, with no mention that the
word has to be at the beginning of the whole string being matched. The
OS X version of sed(1) behaves differently:

$ echo "cd0 cd1 xx" | sed 's/cd[0-9][^ ]* *//g'
xx
$ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
xx
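
For illustration only, here is a toy Python model of the explanation
quoted above. It is not the regex(3) or sed(1) code: the helper names
(at_word_start, find_match, delete_all) and the pass_prev_char switch are
hypothetical, and the rest of the pattern is handled by Python's re
module. It assumes that each round of the global substitution hands only
the remaining text to the matcher, together with a "not a real line
start" flag.

```python
import re

# The part of the pattern that follows [[:<:]].
BODY = re.compile(r"cd[0-9][^ ]* *")


def is_word_char(ch):
    # re_format(7): a word character is an alnum character or an underscore.
    return ch.isalnum() or ch == "_"


def at_word_start(buf, i, notbol, prev_char):
    """Toy model of [[:<:]] at offset i of the buffer given to the matcher."""
    if i >= len(buf) or not is_word_char(buf[i]):
        return False
    if i > 0:
        return not is_word_char(buf[i - 1])
    if not notbol:
        return True  # real start of line
    # Not a real line start: the answer depends on the character before the
    # buffer, which the matcher may not have been given (prev_char is None).
    return prev_char is not None and not is_word_char(prev_char)


def find_match(buf, notbol, prev_char):
    """Return (start, end) of the first match in buf, or None."""
    for i in range(len(buf)):
        if at_word_start(buf, i, notbol, prev_char):
            m = BODY.match(buf, i)
            if m:
                return i, m.end()
    return None


def delete_all(line, pass_prev_char):
    """Model of s/[[:<:]]cd[0-9][^ ]* *//g applied to one line."""
    out, pos = [], 0
    while pos < len(line):
        buf = line[pos:]
        prev = line[pos - 1] if (pass_prev_char and pos > 0) else None
        hit = find_match(buf, notbol=pos > 0, prev_char=prev)
        if hit is None:
            out.append(buf)
            break
        start, end = hit
        out.append(buf[:start])
        pos += end
    return "".join(out)


print(delete_all("cd0 cd1 xx", pass_prev_char=False))  # cd1 xx
print(delete_all("cd0 cd1 xx", pass_prev_char=True))   # xx
```

With pass_prev_char=False the model reproduces the FreeBSD output
("cd1 xx"); letting the matcher see the preceding character gives the
result that OS X prints ("xx").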

-Kimmo
Received on Mon Sep 02 2013 - 15:52:20 UTC
