% beamer selects english by default. We have to select both, already here.
\documentclass[russian,english]{beamer}
\usepackage{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[encapsulated]{CJK}
% ftp://ftp.dante.de/tex-archive/macros/latex/contrib/beamer/doc/beameruserguide.pdf
% lcy ot2 t2a t2b t2c x2
% \usepackage[T2A]{fontenc}
% \usepackage[T2A]{fontenc}
\usepackage[russian,english]{babel}
\newcommand{\jptext}[1]{\begin{CJK}{UTF8}{min}#1\end{CJK}}

% shaded box for code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{listings,color}
\definecolor{verbgray}{gray}{0.9}
\lstnewenvironment{code}{%
  \lstset{backgroundcolor=\color{verbgray},
  frame=single,
  framerule=0pt,
  basicstyle=\ttfamily,
  columns=fullflexible}}{}
\definecolor{shadecolor}{rgb}{.9, .9, .9}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{graphicx}
\usepackage{wrapfig}
% now we have to switch between english and russian.
% for japanese there seems to be no babel.. I guess for japanese you can divide everywhere anyway ;)
\selectlanguage{english}
\author[chorn]{Christian Horn}
\date{\today}
\titlegraphic{
  \includegraphics[width=\textwidth]{pics/languages_titlepic/languages3.png}
  % \includegraphics[width=\textwidth]{pics/languages_titlepic/languages.png}
}
\begin{document}
% \title{adventures with i18n on Linux}
\title{i18n on Linux by example: Japanese}
\frame{\titlepage}
\frame{\frametitle{Table of contents}\tableofcontents}
% suggestions for improvement, TODO:
% Burnhard: for the 2015 version, an overview of which
% Unicode features are only partially supported by software,
% especially things like libslang (e.g. mc), ncursesw, ...
%\subsection{Lists I}
\begin{frame}
\frametitle{my background}
\begin{itemize}
\item Christian Horn
\item born in Mühlhausen, Thüringen/Germany
\item Russian in school --- I learn: I am bad at languages.
\item later English --- I confirm: I am bad at languages.
\item Munich, work.
\begin{itemize}
\item \#linuxadmin \#engineer \#architect \#typography
\end{itemize}
\item 2008 for 3 months in Japan, since then: learning Japanese.
\begin{itemize}
\item \#language \#culture \#people
\end{itemize}
\end{itemize}
\end{frame}

\begin{frame}
\frametitle{about this talk}
\begin{itemize}
% \item topic outline: using !(english,german) on linux.
\item topic outline: i18n on Linux, emphasis on Japanese
\item What are the concepts, which issues are typically hit?
\item This is me sharing what I find fascinating
\item Let me know if it's not that interesting for others, then we head directly for beer :)
\end{itemize}
\end{frame}

\frame{\frametitle{motivation for dealing with i18n/linux/Japanese}
\begin{itemize}
\item To learn a language you want to use ways interesting to you: reading Japanese literature, watching TV series, ...
\item You naturally integrate the language into your daily life: when reading news, writing mails..
\item It's fun to look at obscure things.
\begin{itemize}
\item Obscure for us in the English-speaking IT world, but using Cyrillic or Kanji is not obscure in Asia.
\end{itemize}
\end{itemize}
}

\begin{frame}
\frametitle{What is i18n?}
\begin{center}
Is this taxi occupied or free?
\fbox{\includegraphics{pics/taxi/taxi2small2.jpg}} \\
% \fbox{\includegraphics[scale=0.9]{pics/i18n/taxi2small.jpg}} \\
% \fbox{\includegraphics[scale=0.44]{pics/i18n/taxi.jpg}} \\
\end{center}
\end{frame}

\begin{frame}
\frametitle{What is i18n?}
\begin{center}
Is this taxi occupied or free?
\fbox{\includegraphics{pics/taxi/taxi2small2.jpg}} \\
% \fbox{\includegraphics[scale=0.9]{pics/i18n/taxi2small.jpg}} \\
% \fbox{\includegraphics[scale=0.44]{pics/i18n/taxi2.jpg}} \\
It's free, although Europeans mostly map 'red' to 'not free'.
\end{center}
\end{frame}

\frame{\frametitle{What is i18n?}
\begin{itemize}
\item Based on culture we make assumptions
\item Here most Europeans conclude 'colour red, car is occupied'
\item Adapting software to our reading habits and common understanding makes it easier for us to interact and less prone to errors.
\end{itemize}
}

\frame{\frametitle{What is i18n?}
\begin{itemize}
\item Can you imagine your grandparents using a computer that does not speak their native language?
\item What is the meaning of 2/6/14?\\
Probably a date.. but which part is the month, which the day? It is read differently around the world.
\item We need i18n to interact with the system in the most efficient way. We understand icons with signs from our culture fastest, read local books fastest, we are most effective in our language.
\end{itemize}
}

\frame{\frametitle{What are i18n and l10n?}
\begin{itemize}
\item {\bf internationalization/i18n:} building a system/software which can easily adapt to a variety of languages, cultures and customs, e.g. date and time formats, local calendars, number formats and numeral systems, handling of personal names, forms of address etc.
\item {\bf localization/l10n:} The actual adaptation of documents, programs etc. to local languages and cultures. Includes translations, custom icons (for easier understanding), time formats, paper type, currency, weekday start etc.
\end{itemize}
}
% good description of i18n and l10n:
% http://localization.gov.in/index.php?option=com_content&view=article&id=39&Itemid=211
% One thing that comes to mind is the use of ○ (まる, generally means
% "correct") and × (ばつ, generally means "incorrect"). In some cultures, a
% × or check-mark [icon] may be used to indicate completion of a task, while
% a Japanese-localized application might use a ○ [icon] instead. Using a ×
% [icon] might be confusing, and a check-mark might be acceptable in
% software only because check-marks are common in GUI widgets.
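\begin{frame}[fragile]
\frametitle{What are i18n and l10n? A small sketch}
A minimal sketch of the 2/6/14 ambiguity, not a definitive recipe. It assumes
GNU date and that the listed locales are installed (check with locale -a).
The helper name show\_date and the locale list are made up for illustration.
\footnotesize
\begin{code}
# print 2 June 2014 in the short date format (%x) of the given locale
show_date() {
    LC_ALL="$1" date -u -d '2014-06-02' '+%x'
}

# the order of day, month and year changes with the locale
for l in C en_US.UTF-8 en_GB.UTF-8 de_DE.UTF-8 ja_JP.UTF-8; do
    printf '%-12s %s\n' "$l" "$(show_date "$l")"
done
\end{code}
\normalsize
A locale which is not installed falls back to the format of the C locale.
\end{frame}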
% If I remember correctly, if you have a Japanese Playstation 3 (or a
% Japanese game for it), then ○ means "OK" and × means "Cancel".
% However, if you play an English-language game, × means "OK". That
% threw me off when I wanted to click "OK, OK, OK, ...".
% Here in Hong Kong, I once drew a timeline and put an X to mark a
% particular point in time. I was told that in Chinese culture, an X is
% actually unlucky. *I* thought it was ok..."X" marks the spot on
% treasure maps, right?

\section{The basics of written Japanese}
\begin{frame}
\frametitle{The basics of written Japanese}
Typical texts contain all of these:
% http://en.wikipedia.org/wiki/Ametsuchi_No_Uta
% http://en.wikipedia.org/wiki/Iroha
% http://en.wikipedia.org/wiki/List_of_pangrams#Japanese
\begin{wrapfigure}{r}{0.18\textwidth}
\fbox{\includegraphics{pics/ametsuchi_no_uta/text3312_650px.png}}
% \fbox{\includegraphics[width=0.14\textwidth]{pics/text3312.png}}
% \fbox{\includegraphics[width=0.13\textwidth]{pics/text3312.png}}
\tiny{Ametsuchi-No-Uta}
\end{wrapfigure}
\begin{enumerate}
\item {\bf hiragana:} \jptext{「にほんごでかきましょう。」}\\
children start learning here. The basic set has 46 characters: the 5 vowels, 40 consonant/vowel unions like "ma", "ku" and such, plus a standalone "n". All Japanese sentences and names can already be expressed with hiragana!
\item {\bf katakana:} \jptext{「ニホンゴデカキマショウ。」}\\
used mainly for imported words and names. Again 46 basic characters, which sound the same as their hiragana counterparts.
\item {\bf kanji:} \jptext{「日本語で書きましょう。」}\\
logograms of Chinese origin, many thousands exist.
\end{enumerate}
\end{frame}

\section{Reading files with exotic contents}
\begin{frame}[fragile]
\frametitle{Reading exotic files}
\begin{itemize}
\item So then you receive files with more exotic contents, e.g.
textfiles with Japanese Kanji:
\begin{verbatim}
$ file myfile.txt
myfile.txt: UTF-8 Unicode text
\end{verbatim}
\vspace{15px}
\item multiple topics coming up:
\begin{enumerate}
\item identifying filetype/encoding
\item displaying the file
\item modifying or processing the file
\end{enumerate}
\end{itemize}
\end{frame}

\section{character encodings}
\frame{\frametitle{character encodings: fixed width, worldwide use}
\begin{itemize}
% http://en.wikipedia.org/wiki/Character_encoding
\item encoding: recipe to translate bits into characters
\item ASCII: 7bit, specified 1963, has the English alphabet, digits, punctuation + nonprintables = 128 chars
\item EBCDIC: 8bit, 1963, from IBM, developed separately from ASCII, with 6 mutually incompatible versions
\item code page 437: the famous DOS code page from IBM. Code pages are 8bit and build on top of ASCII, e.g. extending it for Greek, Esperanto etc., but only one at a time
% http://en.wikipedia.org/wiki/Code_page_437
\item KOI8-R: 8bit, for Russian, stays readable when the 8th bit is stripped: \selectlanguage{russian} "Русский Текст" \selectlanguage{english} in KOI8-R becomes "rUSSKIJ tEKST"
% KOI8-R got known as different "code pages" by IBM and Microsoft
% http://en.wikipedia.org/wiki/KOI8-R
\item UTF-32: fixed width Unicode encoding
\end{itemize}
}

\begin{frame}[fragile]
% \frametitle{character encodings: fixed width, japanese relevant}
\frametitle{character sets, japanese relevant}
JIS (Japanese Industrial Standards) defines the coded character sets (CCS) and encodings used in Japan.
\begin{itemize}
\item JIS X 0201: character set with 7 and 8 bit encodings, first version in 1969, ASCII + katakana.
Allows phonetic writing, but has no hiragana or Kanji
\item JIS X 0208: character set from 1978, 16bit encoding, \textasciitilde 6800 characters, including the most common Kanji
\begin{itemize}
\item widely used on Japanese mobiles nowadays, not unicode
\item also the standard of Aozora Bunko
\end{itemize}
% http://en.wikipedia.org/wiki/JIS_X_0208
% https://www.debian.org/doc/manuals/intro-i18n/ch-languages.en.html#s-japanese
\item Unicode: all of these (apart from JIS X 0213) are part of Unicode 3.0.1.
\end{itemize}
\end{frame}
% still unclear why unicode did not take over JIS X 0208 scenarios

\begin{frame}[fragile]
\frametitle{character encodings: variable width}
% http://en.wikipedia.org/wiki/Variable-width_encoding
\begin{itemize}
\item Shift-JIS: supports several JIS X character sets
\begin{itemize}
\item famous on 2chan board since 1999, 4chan ancestor
\item well supported on Windows since 3.1j (CP932 flavour)
\item many nonstandard custom variants, e.g. with emoticons for mobile
\end{itemize}
\item EUC-JP: includes several JIS X character sets, updated many times
\begin{itemize}
\item widely supported in Unix/Linux
\item yet Internet Explorer supports only a subset
\end{itemize}
% http://en.wikipedia.org/wiki/Extended_Unix_Code#EUC-JP
\item ISO-2022-JP: mails with: \\
\begin{verbatim}
Content-Type: text/plain; charset="iso-2022-jp"
\end{verbatim}
\item UTF-8, UTF-16: encodings of the Unicode standard, widely used in web (html4.0 recommendation) and mail (MIME), Han unification happened
% http://en.wikipedia.org/wiki/Unicode http://en.wikipedia.org/wiki/UTF-8
% http://www.joelonsoftware.com/articles/Unicode.html
\end{itemize}
\end{frame}
% $ perl -CSDL -le 'print "\x{4e21}"'
% 両

\begin{frame}
\frametitle{unicode encodings}
\setlength\fboxsep{10pt}
\setlength\fboxrule{0.5pt}
\fbox{\includegraphics[width=\linewidth]{pics/unicode_encodings.png}}
\label{fig:unicode encodings}
% In UTF-8, every code point from 0-127 is stored in a single byte.
% Only code points 128 and above are stored using 2, 3 or 4 bytes
% (RFC 3629; the original design allowed up to 6 bytes).
\end{frame}
% TODO: add stories about other unicode encodings, e.g. the one which does not touch 8th bit

\begin{frame}[fragile]
\frametitle{Looking deeper into fixed width encodings..}
\footnotesize
\begin{code}
$ echo "foobar123" >file_ascii
$ echo "foobar123" | iconv -f ascii -t utf32 >file_utf32
$ ls -al file*
-rw-rw-r--. 1 chris chris 10 Jan 5 22:18 file_ascii
-rw-rw-r--. 1 chris chris 44 Jan 5 22:18 file_utf32
$ file *
file_ascii: ASCII text
file_utf32: Unicode text, UTF-32, little-endian
$ xxd file_ascii
0000000: 666f 6f62 6172 3132 330a                 foobar123.
$ xxd file_utf32
0000000: fffe 0000 6600 0000 6f00 0000 6f00 0000  ....f...o...o...
0000010: 6200 0000 6100 0000 7200 0000 3100 0000  b...a...r...1...
0000020: 3200 0000 3300 0000 0a00 0000            2...3.......
\end{code}
\normalsize
\end{frame}

\begin{frame}[fragile]
\frametitle{fixed vs. variable width encoding}
% \jptext{\$ echo "foo日本" \textgreater f\_utf8}
\footnotesize
\begin{code}
$ echo -e '\x66\x6f\x6f\xe6\x97\xa5\xe6\x9c\xac' >f_utf8
$ iconv -f utf8 -t utf32 f_utf8 >f_utf32
$ file f*
f_utf32: Unicode text, UTF-32, little-endian
f_utf8:  UTF-8 Unicode text
$ ls -al f*
-rw-rw-r--. 1 chris chris 28 Jan 5 22:50 f_utf32
-rw-rw-r--. 1 chris chris 10 Jan 5 22:50 f_utf8
$ xxd f_utf8
0000000: 666f 6fe6 97a5 e69c ac0a                 foo.......
$ xxd f_utf32
0000000: fffe 0000 6600 0000 6f00 0000 6f00 0000  ....f...o...o...
0000010: e565 0000 2c67 0000 0a00 0000            .e..,g......
\end{code}
\normalsize
\$ LC\_ALL=en\_US.UTF-8 cat f\_utf8\\
\jptext{foo日本}
% TODOMAYBE: insert picture of variable width, singleton, lead units, trail units
\end{frame}

\begin{frame}[fragile]
\frametitle{variable width encoding examples}
\begin{itemize}
\item Different characters can take a different number of bytes in the same encoding.
\end{itemize}
\jptext{\$ echo -n "日" \textgreater f\_hi} \\
\selectlanguage{russian}
\$ echo -n "я" \textgreater f\_ya
\selectlanguage{english}
\footnotesize
\begin{code}
$ file f_*
f_hi: UTF-8 Unicode text, with no line terminators
f_ya: UTF-8 Unicode text, with no line terminators
$ ls -al f*
-rw-rw-r--. 1 chris chris 3 Jan 5 23:31 f_hi
-rw-rw-r--. 1 chris chris 2 Jan 5 23:31 f_ya
$ xxd f_hi
0000000: e697 a5                                  ...
$ xxd f_ya
0000000: d18f                                     ..
\end{code}
\normalsize
\end{frame}

\section{Displaying UTF8 characters on Linux}
\begin{frame}[fragile]
\frametitle{Displaying UTF8 characters on Linux}
\begin{enumerate}
\item Which utf8 locales are available?
\begin{verbatim}
locale -a|grep utf8
\end{verbatim}
\item Tell the system in which language to communicate:
\begin{verbatim}
for i in LC_ALL LANG LANGUAGE; do
  export $i=ru_RU.utf8; done
\end{verbatim}
\item Run a terminal emulator capable of utf8: xterm -en utf-8
\item Try to output utf8 character U+263a:
\begin{verbatim}
perl -CSDL -le 'print "\x{263a}"'
\end{verbatim}
Note: your font also has to support the character!
\end{enumerate}
% Also famous:
% wget -O- http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt | less
\end{frame}

\section{Converting between encodings}
\begin{frame}[fragile]
\frametitle{How to convert between encodings?}
Let's convert kanji into hiragana. The kakasi program only speaks Shift-JIS. So we
So we \begin{itemize} \item convert utf8 to shift-jis with iconv \item run kakasi \item convert the result to utf8 \item display it \end{itemize} % we can not mix UTF8 and verbatim, so hacking something up :/ \jptext{echo "私は馬鹿です。"} $\vert$ \textbackslash \begin{verbatim} iconv -f utf8 -t sjis | kakasi -JH | \ iconv -f sjis -t utf8 \end{verbatim} \jptext{わたしはばかです。} % echo "私は馬鹿です。" | iconv -f utf8 -t sjis | kakasi -JH | iconv -f sjis -t utf8 % わたしはばかです。 % second usecase: conversion for preparing .mobi files for kindle paperwhite \end{frame} \begin{frame} \frametitle{kakasi update} kakasi 2.3.5 and later supports UTF-8 natively via "-i" and "-o" for input/output encodings \end{frame} \section{Input of Russian or Japanese characters} \begin{frame} \frametitle{Input of Russian or Japanese characters} \begin{center} So, do I need a keyboard with 5.000keys for Japanese? % \fbox{\includegraphics[scale=0.8]{pics/jap_old_keyboard2b.jpg}} \\ % \tiny{source: \url{http://www.flickr.com/photos/e-z/17023214/}} \fbox{\includegraphics{pics/jap_bigkeyboard/20140202_DSC00849_jap_keyboard_chorn_1400px.jpg}} \\ % \fbox{\includegraphics[scale=0.8]{pics/jap_bigkeyboard/20140202_DSC00849_jap_keyboard_chornb.jpg}} \\ \tiny{IBM keyboard for Japanese characters, 1980, German museum} \end{center} \end{frame} % see also: http://twentytwowords.com/2013/01/04/old-japanese-keyboard-has-216-keys-that-can-make-12-characters-each/ \begin{frame} \frametitle{Input of Russian or Japanese characters} No 5000 keys keyboard required.\\ Cut'n'pasting text from websites does not scale either. \begin{itemize} \item the input method gets changed, switching i.e. between Japanese, Russian and English input (i.e. using strg+space ) \item For Japanese Hiragana/Katakana can then directly be entered \item For Kanji we input the sound of the Kanji in Hiragana and have the input system complete to the desired Kanji \item romaji (i.e. 
"kan") $\rightarrow$ kana ("\jptext{かん}") $\rightarrow$ kanji ("\jptext{館}") \item many homophones, i.e. 61 Kanji readings for "kan" % http://en.wikipedia.org/wiki/Input_method_editor http://en.wikipedia.org/wiki/Japanese_input_methods % http://en.wikipedia.org/wiki/List_of_input_methods_for_UNIX_platforms \end{itemize} \end{frame} \begin{frame} \frametitle{Input of Russian or Japanese characters} \begin{itemize} \item Input method frameworks: \begin{itemize} \item ibus (Fedora, Ubuntu, FreeBSD, ..) \item scim (BSDs) \item fcitx (newcomer, very active development, many features) \end{itemize} \item Input methods: \begin{itemize} \item Anthy (old, not much movement), \item mozc (more modern, handwriting tool, dictionary editor) \item 'Simple Kana Kanji' (ibus-skk) \item 'Kana Kanji engine' (ibus-kkc) \end{itemize} \end{itemize} \end{frame} \section{Locales: having programs act localized} \frame{\frametitle{Locales: having programs act localized} \begin{itemize} \item LOCALE concept introduced in ISO C (ISO/IEC 9899:1990), enhanced 1995 \item also POSIX contains i18n standards \item locale categories: \begin{itemize} \item LC\_CTYPE: are multibyte chars used? \item LC\_COLLATE: sorting related \item LC\_MESSAGES: selects language of software output \item LC\_MONETARY: comma or period as separator, currency mark \item LC\_NUMERIC: numbers, character for dec. point etc. \item LC\_TIME: time, names of weekdays, date order etc. 
\end{itemize}
\end{itemize}
}

\frame{\frametitle{Locales: example}
So programs themselves support multiple languages, let's use them: \\
\vspace{30px}
% \$ for i in en\_US ru\_RU ja\_JP; do LC\_ALL=\$i.utf8 date; done\\
\$ for i in en\_US aa\_ER et\_EE ru\_RU uk\_UA zh\_CN ja\_JP; do \textbackslash \\
\textgreater LC\_ALL=\$i.utf8 date; done\\
Wed Jun 18 21:35:13 CEST 2014 \\
Arbaqa, Qasa Dirri 18, 9:35:13 carra CEST 2014 \\
K juuni 18 21:35:13 CEST 2014 \\
\selectlanguage{russian}
Ср июн 18 21:35:13 CEST 2014 \\
середа, 18 червня 2014 21:35:13 +0200 \\
\selectlanguage{english}
\jptext{2014年 06月 18日 星期三 21:35:13 CEST\\}
\jptext{2014年 6月 18日 水曜日 21:35:13 CEST\\}
}
% Малко текст на български.Какой-то текст на русском языке.

\section{i18n for webpages and email}
\begin{frame}[fragile]
\frametitle{i18n for webpages and email}
\begin{itemize}
\item At the start of a transfer, webpages (HTTP) and mails (SMTP) are still plain ASCII
\item as part of the communication we inform the other side that we want not only ASCII, but e.g. utf8 or other encodings
\item in mail header: Content-Type: text/plain; charset="UTF-8"
\item in web header:
\end{itemize}
\begin{verbatim}