What?

Some notes on Linux, i18n and file encodings. Ken Lunde's CJK.INF file is very helpful.

verifying utf8 output

In a utf8 xterm, verifying the output of date and a utf8 file:

[chris@hive ~]$ locale -a|grep ^ja
ja_JP
ja_JP.eucjp
ja_JP.ujis
ja_JP.utf8
japanese
japanese.euc
[chris@hive ~]$ echo $LC_ALL
ja_JP.utf8
[chris@hive ~]$ cat test_utf8
日本語
[chris@hive ~]$ date
2013年 12月 23日 月曜日 22:20:21 CET

looking at eucjp output

The same should also work in an EUC-JP environment. The (incorrect) output below is from Fedora 19; on Debian I get the proper output:

[chris@hive ~]$ LC_ALL=ja_JP.eucjp luit
[chris@hive ~]$ locale charmap
EUC-JP
[chris@hive ~]$ cat test_eucjp
F|K\8l
[chris@hive ~]$ date
2013G/ 127n 23F| 7nMKF| 22:21:45 CET

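This appears to be the EUC-JP encoding of “日本語” with the high bit of every byte cleared: c6 fc cb dc b8 ec becomes 46 7c 4b 5c 38 6c, which is F|K\8l in ASCII.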
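The high-bit theory can be checked by hand. The sketch below clears bit 8 of every byte with tr and feeds it the EUC-JP bytes for 日本語; strip_high_bit is a helper name made up for this note, not an existing tool.

```shell
# Hypothetical helper: map every byte 0x80-0xff onto 0x00-0x7f (clear the high bit).
strip_high_bit() {
    LC_ALL=C tr '\200-\377' '\000-\177'
}

# EUC-JP bytes for 日本語 (c6 fc cb dc b8 ec) written as octal escapes.
printf '\306\374\313\334\270\354\n' | strip_high_bit
# prints: F|K\8l
```

The result matches what cat test_eucjp showed on Fedora, which is consistent with something on the path to the terminal dropping the eighth bit.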

fedora19/20 and rhel7 issue

Description of problem:
  xterm does not display EUCJP encoding

Version-Release number of selected component (if applicable):
  xterm-293-1.fc19.x86_64
  (not sure if this is an issue in xterm, glibc or something else)

How reproducible:
  always

Steps to Reproduce:
1. # verify eucjp and utf8 locales exist
   locale -a|grep ja_JP   
2. # ensure this is a terminal capable of displaying the chars we request later
   LC_ALL="ja_JP.utf8" echo 日本語
   LC_ALL="ja_JP.utf8" date
3. LC_ALL="ja_JP.eucjp" luit
4. date

Actual results:
2013G/ 127n 24F| 2PMKF| 17:49:59 CET

Expected results:
2013年 12月 24日 火曜日 17:49:04 CET

Additional info:
- the above works on debian stable
- gnome-terminal and xterm both show this
- tried this in "xterm -en eucjp" terminal, as well as gnome-terminal
- in the eucjp output, the high bits seem stripped off by something
- setting "stty raw" does not lead to the expected output
- also creating a utf8 textfile, converting to eucjp with "iconv" 
  and outputting this gives same result
- the output of "date +%A| xxd|md5sum" in a "LC_ALL=ja_JP.eucjp luit"
  environment is identical on the debian and the fedora system
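
To narrow down where the high bits get lost, it may help to test whether a byte stream still contains any byte >= 0x80 before it reaches luit or the terminal. A minimal sketch; has_high_bit is a hypothetical helper name, and the commented-out usage line assumes the ja_JP.eucjp locale is installed.

```shell
# Hypothetical helper: succeeds if stdin contains at least one byte in 0x80-0xff.
has_high_bit() {
    # Delete all 7-bit bytes; anything left over must have the high bit set.
    LC_ALL=C tr -d '\000-\177' | od -An -tx1 | grep -q .
}

# Self-contained check with the EUC-JP bytes for 日 (c6 fc):
if printf '\306\374\n' | has_high_bit; then
    echo "8-bit bytes present"
fi

# Against the real locale (assumes ja_JP.eucjp exists on the system):
#   LC_ALL=ja_JP.eucjp date +%A | has_high_bit && echo "date emits 8-bit bytes"
```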

debugging

[chris@hive ~]$ echo $LC_ALL
ja_JP.UTF-8
[chris@hive ~]$ date
2013年 12月 25日 水曜日 21:51:51 CET
[chris@hive ~]$ LC_ALL=ja_JP.eucjp date|iconv -f eucjp
2013年 12月 25日 水曜日 21:51:59 CET
[chris@hive ~]$ LC_ALL=ja_JP.eucjp date|xxd
0000000: 3230 3133 c7af 2031 32b7 ee20 3235 c6fc  2013.. 12.. 25..
0000010: 20bf e5cd cbc6 fc20 3231 3a35 323a 3036   ...... 21:52:06
0000020: 2043 4554 0a                              CET.

[chris@hive ~]$ for l in utf8 eucjp; do echo -e "$l\t $(LC_ALL=ja_JP.utf8 date +%A)"; LC_ALL=ja_JP.$l date +%A|xxd; echo; done
utf8     水曜日
0000000: e6b0 b4e6 9b9c e697 a50a                 ..........

eucjp    水曜日
0000000: bfe5 cdcb c6fc 0a                        .......

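# Show test_eucjp in one xterm per installed X font, to rule out a font problem.
for i in $(xlsfonts); do
        echo "current font: $i"
        # Run the command line through sh so the variable assignment and ';' are interpreted.
        xterm -fn "$i" -e sh -c 'LC_ALL=ja_JP.eucjp luit cat test_eucjp; sleep 10'
done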

[chris@hive ~]$ for l in utf8 eucjp; do echo -e "$l\t $(LC_ALL=ja_JP.utf8 date +%A)"; LC_ALL=ja_JP.$l date +%A|xxd -g1; echo; done
utf8     水曜日
0000000: e6 b0 b4 e6 9b 9c e6 97 a5 0a                    ..........

eucjp    水曜日
0000000: bf e5 cd cb c6 fc 0a                             .......
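
The two dumps can also be cross-checked without any ja_JP locale installed, by feeding the EUC-JP bytes from the dump to iconv and dumping the result again. A minimal sketch, assuming an iconv build that knows EUC-JP and a POSIX od:

```shell
# EUC-JP bytes bf e5 cd cb c6 fc 0a (taken from the xxd output), as octal escapes.
printf '\277\345\315\313\306\374\n' |
    iconv -f EUC-JP -t UTF-8 |
    od -An -tx1
```

If the conversion is intact, od shows the same ten bytes as the utf8 line (e6 b0 b4 e6 9b 9c e6 97 a5 0a).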

The following check passes on both Debian and Fedora, so date emits identical bytes on the two systems:

for l in utf8 eucjp ; do
        echo -e "$l\t $(LC_ALL=ja_JP.utf8 date +%A)";
        LC_ALL=ja_JP.$l date +%A|xxd; echo;
done | md5sum -c <(echo "31f0f83e7fe3dacb7b288c101cd1debd  -")