Re: bug with special bracket expressions in regular expressions

From: Chris Rees <crees_at_physics.org>
Date: Tue, 01 Oct 2013 21:01:43 +0100
On 02/09/2013 16:09, Damian Weber wrote:
>
> On Mon, 2 Sep 2013, Andriy Gapon wrote:
>
>> re_format(7) says:
>>       There are two special cases of bracket expressions: the bracket
>>       expressions '[[:<:]]' and '[[:>:]]' match the null string at the
>>       beginning and end of a word respectively.  A word is defined as a
>>       sequence of word characters which is neither preceded nor followed by
>>       word characters.  A word character is an alnum character (as defined
>>       by ctype(3)) or an underscore.  This is an extension, compatible with
>>       but not specified by IEEE Std 1003.2 ("POSIX.2"), and should be used
>>       with caution in software intended to be portable to other systems.
>>
>> However I observe the following:
>> $ echo "cd0 cd1 xx" | sed 's/cd[0-9][^ ]* *//g'
>> xx
>> $ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
>> cd1 xx
>>
>> In my opinion '[[:<:]]' should not affect how the pattern is matched in this case.
>>
>> Any thoughts, suggestions?
> there are two simpler expressions, whose difference I don't understand either
> (tested on 8.4-PRERELEASE)
>
> $ echo "cd0 cd1 xx" | sed 's/cd[0-9] //g'
> xx
> $ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9] //g'
> cd1 xx

Well, I agree with your analysis, and I think it's certainly a bug.

Do you think that the BUGS line in regex(3) should perhaps be extended
to say "never works properly"? It currently reads:
"""
Word-boundary matching does not work properly in multibyte locales.
"""
[[:<:]] can be replaced by \b in a PCRE, which works perfectly fine (of
course):

$ echo "this word word should be deleted" | perl -pe 's,\bword ,,g'
this should be deleted

Chris

Received on Tue Oct 01 2013 - 18:02:21 UTC
