Re: grep extremely slow for LC_CTYPE=C?

From: Kyle Evans <kevans_at_freebsd.org>
Date: Thu, 3 May 2018 09:41:25 -0500

On Thu, May 3, 2018 at 9:08 AM, Stefan Esser <se_at_freebsd.org> wrote:
> Hi all,
>
> while working on a new portmaster version, I found that bsdgrep is much
> faster in a UTF-8 locale than in the C locale, much to my surprise.
>
> I have uploaded a small shell-script with test data that can be fetched
> from:
>
>         https://people.freebsd.org/~se/grep-test.txz
>
> The script uses "grep -v -f patternfile datafile" to select from datafile
> the lines that are not matched by any pattern in patternfile:
>
> #-------------------------------------------------------------------
> #!/bin/sh
>
> LANG=en_US.UTF-8
> LC_CTYPE=en_US.UTF-8
>
> export LANG LC_CTYPE
>
> time grep -v -f grep-test-pattern grep-test-data
>
> LANG=C
> LC_CTYPE=C
> #unset LANG LC_CTYPE # is an alternative leading to the same result ...
>
> time grep -v -f grep-test-pattern grep-test-data
> #-------------------------------------------------------------------
>
> The first "grep" needs 3.5 seconds to finish on my system, but the second
> one (with LC_CTYPE=C or no locale set at all) runs for minutes (I did not
> bother to check whether it finishes at all).
>
> Is this a bug in grep?
>
> Maybe there is something odd in the data file (loading the pattern is not
> slower with LC_CTYPE=C, it takes 0.8 seconds on my system), but this is a
> problem that was observed with "real" data, not a specifically constructed
> worst case.
>
> Any ideas what's causing this behavior?
>
> I'm currently setting the UTF-8 locale as in the first invocation above
> to make grep run in reasonable time, but I'd expect it to be faster in
> the C locale ...
>
> Regards, STefan

Hmm... what does `grep -V` look like, just to confirm?
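
To make results from two systems easier to compare, something like the
following sketch may help. It is not part of the original test script:
run_in_locale and report_grep_env are hypothetical helper names, and forcing
the locale through LC_ALL (rather than LANG/LC_CTYPE) is an assumption made
here so that nothing inherited from the environment can override it.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): run a command with LC_ALL
# forced to the given locale. LC_ALL takes precedence over LANG and
# LC_CTYPE, so inherited settings cannot change the result.
run_in_locale() {
    loc=$1
    shift
    env LC_ALL="$loc" "$@"
}

# Hypothetical helper: print the details worth including in a report,
# i.e. which grep binary is found, its version, and the current locale.
report_grep_env() {
    command -v grep
    grep -V
    locale
}

# Example invocations, using the file names from the quoted script.
# Because the command is started through env(1), "time" here is the
# external time(1) utility, not a shell keyword:
#   report_grep_env
#   run_in_locale en_US.UTF-8 time grep -v -f grep-test-pattern grep-test-data
#   run_in_locale C           time grep -v -f grep-test-pattern grep-test-data
```

Posting the output of report_grep_env next to the two timings should show
whether both systems are actually running the same grep.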

These are the results on my local system:

root@viper:/tmp/grep# ./grep-test.sh
All/mpfr-3.1.7.tgz
        0.10 real         0.10 user         0.00 sys
All/mpfr-3.1.7.tgz
        0.09 real         0.08 user         0.00 sys

But I don't immediately recall if I have local modifications in
regex(3)/bsdgrep that might have affected this. =(

Thanks,

Kyle Evans
Received on Thu May 03 2018 - 12:41:47 UTC