On Thu, May 3, 2018 at 9:08 AM, Stefan Esser <se_at_freebsd.org> wrote:
> Hi all,
>
> while working on a new portmaster version, I found that bsdgrep is much
> faster in an UTF-8 locale than in the C locale, much to my surprise.
>
> I have uploaded a small shell-script with test data that can be fetched
> from:
>
> https://people.freebsd.org/~se/grep-test.txz
>
> The script uses "grep -v -f patternfile datafile" to select from datafiles
> the lines that are not matched by the contents of patternfile:
>
> #-------------------------------------------------------------------
> #!/bin/sh
>
> LANG=en_US.UTF-8
> LC_CTYPE=en_US.UTF-8
>
> export LANG LC_CTYPE
>
> time grep -v -f grep-test-pattern grep-test-data
>
> LANG=C
> LC_CTYPE=C
> #unset LANG LC_CTYPE # is an alternative leading to the same result ...
>
> time grep -v -f grep-test-pattern grep-test-data
> #-------------------------------------------------------------------
>
> The first "grep" needs 3.5 seconds to finish on my system, but the second
> one (with LC_CTYPE=C or no locale set at all) runs for minutes (I did not
> bother to check whether it finishes at all).
>
> Is this a bug in grep?
>
> Maybe there is something odd in the data file (loading the pattern is not
> slower with LC_CTYPE=C, it takes 0.8 seconds on my system), but this is a
> problem that was observed with "real" data, not a specifically constructed
> worst case.
>
> Any ideas what's causing this behavior?
>
> I'm currently setting the UTF-8 locale as in the first invocation above
> to make grep run in reasonable time, but I'd expect it to be faster in
> the C locale ...
>
> Regards, STefan

Hmm... what does `grep -V` look like, just to confirm? These are the
results on my local system:

root_at_viper:/tmp/grep# ./grep-test.sh
All/mpfr-3.1.7.tgz
        0.10 real         0.10 user         0.00 sys
All/mpfr-3.1.7.tgz
        0.09 real         0.08 user         0.00 sys

But I don't immediately recall if I have local modifications in
regex(3)/bsdgrep that might have affected this. =(

Thanks,

Kyle Evans

Received on Thu May 03 2018 - 12:41:47 UTC
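
For anyone who wants to repeat the comparison without exporting variables between runs, here is a minimal sketch. It assumes the grep-test-pattern and grep-test-data files from the tarball above are in the current directory. The helper name time_grep is made up for illustration and is not part of the original script. It sets LC_ALL for the single grep invocation (LC_ALL takes precedence over LANG and LC_CTYPE), and it measures only whole seconds via date(1), so sub-second runs will show 0s.

```shell
#!/bin/sh

# Hypothetical helper (not from the original script): run
# "grep -v -f PATTERNS DATA" under one locale and report the elapsed
# wall-clock seconds plus the number of lines that were not matched.
time_grep() {
    locale_name=$1
    pattern_file=$2
    data_file=$3

    start=$(date +%s)
    # LC_ALL is set for this one command only; nothing is exported.
    kept=$(LC_ALL="$locale_name" grep -v -f "$pattern_file" "$data_file" | wc -l | tr -d ' ')
    end=$(date +%s)

    echo "$locale_name: $((end - start))s, $kept lines kept"
}

# Only run the comparison when the test files are actually present.
if [ -r grep-test-pattern ] && [ -r grep-test-data ]; then
    for loc in en_US.UTF-8 C; do
        time_grep "$loc" grep-test-pattern grep-test-data
    done
fi
```

Both runs should report the same number of kept lines; if they do not, the locale is changing what matches and not just how long it takes. For finer timing, the grep line can be wrapped in time(1) as in the original script.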
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:15 UTC