Re: bug with special bracket expressions in regular expressions

From: Andriy Gapon <avg_at_FreeBSD.org>
Date: Mon, 02 Sep 2013 19:45:02 +0300

on 02/09/2013 17:54 Andriy Gapon said the following:
> 
> re_format(7) says:
>      There are two special cases‡ of bracket expressions: the bracket expres‐
>      sions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the beginning and
>      end of a word respectively.  A word is defined as a sequence of word
>      characters which is neither preceded nor followed by word characters.  A
>      word character is an alnum character (as defined by ctype(3)) or an
>      underscore.  This is an extension, compatible with but not specified by
>      IEEE Std 1003.2 (“POSIX.2”), and should be used with caution in software
>      intended to be portable to other systems.
> 
> However I observe the following:
> $ echo "cd0 cd1 xx" | sed 's/cd[0-9][^ ]* *//g'
> xx
> $ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
> cd1 xx
> 
> In my opinion '[[:<:]]' should not affect how the pattern is matched in this case.

It seems that the code works like this:
- first it matches "cd0 " and "removes" it
- then it passes "cd1 xx" for matching with a flag indicating that this is not
  the real start of the string
- thus the matching code
 o knows that this is not a real line start, so it cannot match [[:<:]]
   on that basis alone
 o does _not_ know which character preceded the start of the given
   substring, so it cannot tell whether [[:<:]] could match there

So matching fails.
I am not sure whether this is an internal problem of regex(3) or a problem
with how sed(1) uses regex(3).
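
The steps above can be modelled with a small toy in Python. This is only a
sketch of the suspected behaviour, not the actual sed(1) or regex(3) code: the
names find() and delete_all() and the "blind" parameter are made up for
illustration, and "blind" stands in for the "not a real start of the string"
flag (presumably REG_NOTBOL, which is an assumption on my part). The word
boundary check is done by hand and Python's re module only matches the rest of
the pattern.

```python
import re

# The part of the pattern that follows [[:<:]].
REST = re.compile(r"cd[0-9][^ ]* *")


def is_word_char(c):
    # Word character as described in re_format(7): alnum or underscore.
    return c.isalnum() or c == "_"


def find(text, start, blind):
    """Leftmost match of [[:<:]]cd[0-9][^ ]* * at or after `start`.

    blind=True models a matcher that is told "this is not the real start
    of the string" but cannot see the character before `start`.
    """
    for pos in range(start, len(text)):
        if not is_word_char(text[pos]):
            continue
        if pos == 0:
            word_start = True            # real beginning of the line
        elif pos == start and blind:
            word_start = False           # previous character is unknown
        else:
            word_start = not is_word_char(text[pos - 1])
        if word_start:
            m = REST.match(text, pos)
            if m:
                return m.span()
    return None


def delete_all(line, blind):
    """Toy equivalent of s/[[:<:]]cd[0-9][^ ]* *//g."""
    out, start = [], 0
    while True:
        span = find(line, start, blind)
        if span is None:
            break
        out.append(line[start:span[0]])  # keep text before the match
        start = span[1]                  # resume after the match
    out.append(line[start:])
    return "".join(out)


print(delete_all("cd0 cd1 xx", blind=True))   # cd1 xx  (observed behaviour)
print(delete_all("cd0 cd1 xx", blind=False))  # xx      (expected behaviour)
```

With blind=False the matcher is allowed to look at the space before "cd1", so
the second word is recognised as a word start and removed as well. Whether the
real fix belongs in regex(3) or in the way sed(1) calls it is left open here.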

-- 
Andriy Gapon
Received on Mon Sep 02 2013 - 14:46:27 UTC
