About gnu/93629 : GNU sort(1) tool dumps core within non-regular locale settings

From: Tobias Svehagen <tobias.svehagen_at_gmail.com> Date: Sun, 2 Apr 2006 14:32:04 +0200

I saw that this issue was on the todo list for 6.1R so I decided to
take a look at it.

http://www.freebsd.org/cgi/query-pr.cgi?pr=93629

As it says in the report, you can recreate the abort by doing the following:

setenv LANG uk_UA.KOI8-U
setenv LC_CTYPE ja_JP.UTF-8
/usr/bin/sort

This is quite a weird problem. It comes from sort trying to handle the
LC_TIME values in inittables_mb() as if they were UTF-8 encoded. The
LC_TIME values for uk_UA.KOI8-U are not UTF-8 encoded; that locale
uses NONE as its encoding. Normally this wouldn't be a problem, since
the multibyte routines handle plain ASCII bytes <= 7f just fine, and
that's why sort works when LANG is set to C, for example (since
Jan-Dec contain no bytes > 7f).

The thing about uk_UA.KOI8-U (and some others) is that it uses byte
values > 7f to represent the Ukrainian alphabet. For example, Jan in
uk_UA.KOI8-U's LC_TIME is d3 a6 de 00. When you parse that string as
UTF-8, d3 says that it starts a 2-byte sequence, and that one works
fine (does not trigger the assertion). But then de also says that it
starts a 2-byte sequence, and the string ends before that sequence is
complete. That makes mbrtowc() return -2 (see man mbrtowc), and that's
what makes the assertion go off and abort.

I'm not sure what the best way to solve this is, but I think that
something should be done so that sort does not abort and dump core.
One solution is of course to make sort check that LC_CTYPE and LC_TIME
are the same (or C), but maybe some people want to have it that way
(although I don't see why).
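Another direction, sketched here only as an idea and not as a tested
patch for sort, would be to stop asserting on the mbrtowc() result and
fall back to treating the offending byte as a single character.
convert_lenient() is a hypothetical name; the real loop in
inittables_mb() is structured differently, and whether a byte-wise
fallback gives useful month matching is an open question.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical sketch: convert s to wide characters under the current
 * LC_CTYPE, but never fail. A byte that mbrtowc() rejects (-1) or
 * reports as a truncated sequence (-2) is copied through as its raw
 * value and decoding restarts at the next byte.
 * Returns the number of wide characters stored in out (at most cap). */
size_t convert_lenient(const char *s, wchar_t *out, size_t cap)
{
    mbstate_t st;
    size_t len, j = 0, k = 0;

    assert(s != NULL && out != NULL);
    len = strlen(s);
    memset(&st, 0, sizeof st);

    while (j < len && k < cap) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s + j, len - j, &st);

        if (r == (size_t)-1 || r == (size_t)-2 || r == 0) {
            /* Invalid or incomplete: keep the raw byte and reset the
             * conversion state so the next call starts clean. */
            wc = (wchar_t)(unsigned char)s[j];
            r = 1;
            memset(&st, 0, sizeof st);
        }
        out[k++] = wc;
        j += r;
    }
    return k;
}
```

A caller would run setlocale(LC_ALL, "") first and then pass each
nl_langinfo() month name through this function instead of asserting.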

Do you have any ideas on how this can be solved in a nice way, or do
you think that the fix "set LC_CTYPE and LC_TIME to the same value" is
enough?

/Tobias Svehagen