How to Use Tcl 8.1 Internationalization Features

Tcl's new internationalization facilities allow you to create Tcl applications that support any multi-byte language, including Chinese and Japanese. Tcl also now includes support for message catalogs, which makes it easier to create localized versions of applications and packages. Tcl is the first cross-platform scripting language to help developers deploy both commercial and enterprise network applications on a global scale.

This document provides a quick overview of the internationalization features introduced in Tcl 8.1. Topics include:

Character Encoding Overview
Character Encodings and the Operating System
General String Manipulation
Channel Input/Output
Sourcing Scripts in Different Encodings
Converting Strings to Different Encodings
Fonts, Encodings, and Tk Widgets
Message Catalogs
Internationalization and the Tcl C APIs
Summary: Tcl Internationalization Support at a Glance

Character Encoding Overview

A character encoding is simply a mapping of characters and symbols used in written language into a binary format used by computers. For example, in the standard ASCII encoding, the upper-case "A" character from the Latin character set is represented by the byte value 0x41 in hexadecimal. Other widely used character encodings include ISO 8859-1, used by many European languages, Shift-JIS and EUC-JP for Japanese characters, and Big5 for Chinese characters.

The Unicode Standard is a fixed-width, uniform encoding scheme for virtually all characters used in the world's major written languages. Unicode uses a 16-bit encoding for all text elements. These text elements include letters such as "w" or "M", characters such as those used in Japanese Hiragana to represent syllables, or ideographs such as those used in Chinese to represent full words or concepts. The Unicode Standard does not specify the visual representation of a character, which is known as a glyph. For more information on the Unicode Standard, visit the Unicode web site at http://www.unicode.org .

UTF-8 is a standard transformation format for Unicode characters. It is a method of transforming all Unicode characters into a variable length encoding of bytes; a single Unicode character can be represented by one, two, or three bytes. The advantage of the UTF-8 standard is that it and the Unicode standard were designed so that Unicode characters corresponding to the standard ASCII set (up to ASCII value 0x7F in hexadecimal) have the same byte values in both UTF-8 and ASCII encoding. In other words, an upper-case "A" character is represented by the single-byte value 0x41 in both UTF-8 and ASCII encoding.
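
The variable-length property can be checked from a script. The following is a minimal sketch, assuming the utf-8 encoding name reported by encoding names; the sample characters are chosen only for illustration. It converts three characters to their UTF-8 byte sequences and counts the bytes in each.

```tcl
# Compare character counts with UTF-8 byte counts for three sample
# characters: ASCII "A", e-acute (\u00e9), and a Chinese ideograph (\u592a).
set byteCounts {}
foreach ch [list "A" "\u00e9" "\u592a"] {
    # encoding convertto returns the encoded bytes, one byte per element
    set bytes [encoding convertto utf-8 $ch]
    lappend byteCounts [string length $bytes]
}
puts $byteCounts   ;# prints: 1 2 3
```

Each input is a single character, yet the encoded forms occupy one, two, and three bytes respectively.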

Beginning in Tcl 8.1, Tcl represents all strings internally as Unicode characters in UTF-8 format. Tcl 8.1 also ships with built-in support for approximately 30 common character encoding standards, and can convert strings from one encoding to another. The encoding names command displays a list of all known encodings. You can create additional encodings as described in the Tcl_GetEncoding.3 reference page.
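
As a brief sketch of how a script might check for an encoding before relying on it, the following lists the known encodings and tests for a few names. The three names checked are examples only; whether they are present depends on the encoding files installed with your copy of Tcl.

```tcl
# List every encoding this interpreter knows about, sorted for readability.
set known [lsort [encoding names]]
puts "[llength $known] encodings available"

# Check for specific encodings before using them on a channel.
foreach name {shiftjis euc-jp big5} {
    if {[lsearch -exact $known $name] >= 0} {
        puts "$name: available"
    } else {
        puts "$name: not installed"
    }
}
```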

Tip: Because 7-bit ASCII characters have the same encoding in UTF-8 format, legacy Tcl scripts that use only 7-bit ASCII characters function the same in Tcl 8.1 as they did in Tcl 8.0. Furthermore, because the use of Unicode/UTF-8 encoding is internal to Tcl, most string handling in legacy Tcl scripts works the same in Tcl 8.1 as it did in Tcl 8.0. Most problems in converting from Tcl 8.0 to 8.1 occur in: 1) using non-Latin characters, 2) reading and writing strings from a channel, and 3) writing code that assumes that each character in a string is a fixed byte width (for example, one byte per character).

Character Encodings and the Operating System

The system encoding is the character encoding used by the operating system for items such as file names and environment variables. Text files used by text editors and other applications are usually encoded in the system encoding as well, unless the application that produced them explicitly saves them in another format (for example, if you use a Shift-JIS text editor on an ISO 8859-1 system).

Tcl automatically converts strings from UTF-8 format to the system encoding and vice versa whenever it communicates with the operating system. For example, Tcl automatically handles any encoding conversion needed if you execute commands such as:

% glob *

% set fd [open "Español.txt" w]

The Tcl source command also reads files using the system encoding, and strings passed to and from the Tcl exec command are converted to and from the system encoding.

Tcl attempts to determine the system encoding during initialization based on the platform and locale settings. Tcl usually can determine a reasonable default system encoding based on these settings, but if for some reason it cannot, it uses ISO 8859-1 as the default system encoding.

You can override the default system encoding with the encoding system command. Tcl Developer Xchange recommends that you avoid using this command if at all possible. If you set the default system encoding to anything other than the actual encoding used by your operating system, Tcl will likely find it impossible to communicate properly with your operating system.

Note: For reading and writing files in an encoding other than the system encoding, you need to use the fconfigure -encoding command (not the encoding system command) as described in the "Channel Input/Output" section of this document. Also see the "Sourcing Scripts in Different Encodings" section of this document for special instructions for sourcing files in formats other than the system encoding.
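
Calling encoding system with no argument only queries the current setting, which is a safe way to see what Tcl detected at startup. A minimal sketch:

```tcl
# Query (do not change) the system encoding that Tcl detected.
set sysenc [encoding system]
puts "System encoding: $sysenc"
```

The value printed depends on your platform and locale, so no particular result should be expected.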

General String Manipulation

Beginning in Tcl 8.1, all Tcl string manipulation functions expect and return Unicode strings encoded in UTF-8 format. Because the use of Unicode/UTF-8 encoding is internal to Tcl, you should see no difference between Tcl 8.0 and 8.1 string handling in your scripts.

The Tcl string functions properly handle multi-byte UTF-8 characters as single characters. For example, in the following commands, Tcl treats the string "Café" as a four-character string, even though the internal representation in UTF-8 format requires five bytes. (As with previous versions of Tcl, string indexes start with "0"; that is, the first character is index "0", the second character is index "1", etc.)

% set unistr "Café"
Café
% string length $unistr
4
% string index $unistr 3
é

Furthermore, the new regular expression implementation introduced in Tcl 8.1 handles the full range of Unicode characters.
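
The following sketch illustrates both points with a word containing one non-ASCII character. It writes the accented letter as a "\u" escape so that the script itself stays in 7-bit ASCII; the variable names are illustrative only.

```tcl
# "caf\u00e9" is the word "cafe" with an acute accent on the final letter.
set word "caf\u00e9"

# Case conversion treats the accented letter as a single character.
set upper [string toupper $word]

# \w in a regular expression matches Unicode letters, not just ASCII ones.
set isWord [regexp {^\w+$} $word]

puts "length: [string length $word]"   ;# prints: length: 4
puts "all word characters: $isWord"    ;# prints: all word characters: 1
```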

The "\uxxxx" escape sequence allows you to specify a Unicode character by its four-digit, hexadecimal Unicode code value. For example, the following assigns to a variable two ideograph characters corresponding to the Chinese transliteration of "Tcl" (TAI-KU):

set tclstr "\u592a\u9177"
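
To confirm that each escape produces exactly one character, a script can split the string and print each character's code value. This is a small sketch using the standard scan and format commands:

```tcl
set tclstr "\u592a\u9177"

# Collect the hexadecimal code value of each character in the string.
set codes {}
foreach ch [split $tclstr ""] {
    scan $ch %c code
    lappend codes [format %04x $code]
}

puts "characters: [string length $tclstr]"   ;# prints: characters: 2
puts "code values: $codes"                   ;# prints: code values: 592a 9177
```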

Channel Input/Output

When reading and writing data on a channel, you need to ensure that Tcl uses the proper character encoding for that channel. The default encoding for newly opened channels (both files and sockets) is the same as the platform- and locale-dependent system encoding used for interfacing with the operating system. (See the "Character Encodings and the Operating System" section of this document for more information.) In most cases, you don't need to do anything special to read or write data because most text files are created in the system encoding. You need to take special steps only when accessing files in an encoding other than the system encoding (for example, reading a file encoded in Shift-JIS format when your system encoding is ISO 8859-1).

The fconfigure -encoding option allows you to specify the encoding for a channel. Thus, to read from a file encoded in Shift-JIS format, you should execute the following commands:

set fd [open $file r]
fconfigure $fd -encoding shiftjis

Tcl then automatically converts any text you read from the file into standard UTF-8 format.

Similarly, if you are writing to a channel, you can use fconfigure -encoding to specify the target character encoding and Tcl automatically converts strings from UTF-8 to that encoding on output.
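
A round trip makes the conversion visible. The sketch below assumes that the shiftjis encoding is installed and uses a hypothetical scratch file name in the current directory. It writes one Hiragana character in Shift-JIS, checks the size of the file on disk, and reads the character back.

```tcl
# Hypothetical scratch file; any writable path would do.
set file "i18n_demo_sjis.txt"
set ha "\u306f"   ;# Hiragana letter HA

# Write the character; Tcl converts from its internal form to Shift-JIS.
set fd [open $file w]
fconfigure $fd -encoding shiftjis
puts -nonewline $fd $ha
close $fd

# Shift-JIS stores this character in two bytes.
set onDisk [file size $file]

# Read it back; Tcl converts from Shift-JIS to its internal form.
set fd [open $file r]
fconfigure $fd -encoding shiftjis
set text [read $fd]
close $fd
file delete $file

puts "bytes on disk: $onDisk, characters read: [string length $text]"
```

If the channel encoding were left at the system default on an ISO 8859-1 system, the same file would instead be read as two unrelated characters.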

Note: The Tcl source command always reads files using the system encoding. For a tip on sourcing files in different encodings, see the "Sourcing Scripts in Different Encodings" section of this document.

Sourcing Scripts in Different Encodings

The Tcl source command always reads files using the system encoding. Therefore, Tcl Developer Xchange recommends that whenever possible, you author scripts in the native system encoding.

A difficulty arises when distributing scripts internationally, as you don't necessarily know what the system encoding will be. Fortunately, most common character encodings include the standard 7-bit ASCII characters as a subset. Therefore, you are usually safe if your script contains only 7-bit ASCII characters.

If you need to use an extended character set for your scripts that you distribute, you can provide a small "bootstrap" script written in 7-bit ASCII. The bootstrap script can then load and execute scripts in any encoding that you choose.

You can execute a script written in an encoding other than the system encoding by opening the file, setting the proper encoding using the fconfigure -encoding command, reading the file into a variable, and then evaluating the string with the eval command. For example, the following reads and executes a Tcl script encoded in EUC-JP:

set fd [open "app.tcl" r]
fconfigure $fd -encoding euc-jp
set jpscript [read $fd]
close $fd
eval $jpscript

Note: This technique works only if the file contains actual EUC-JP encoded characters (for example, you created the file with an EUC-JP text editor). This technique doesn't work if you build the EUC-JP encoded characters using the "\x" or octal digit escape sequences. Tcl 8.1 interprets each "\x" or octal digit escape sequence as a single Unicode character with the upper bits set to 0. For example, if the script app.tcl above contained the line:

set ha "\xA4\xCF"

then the variable ha would contain two characters, "¤Ï" (Unicode characters "CURRENCY SIGN" and "LATIN CAPITAL LETTER I WITH DIAERESIS"), not the Unicode HA character.
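
The open, fconfigure, read, and eval steps can be wrapped in a small helper for use in a bootstrap script. The following is a sketch only: the procedure name sourceEncoded and the demonstration file name are hypothetical, and the example creates its own EUC-JP file so that it can run on its own. It uses uplevel rather than eval so that the script runs in the caller's scope.

```tcl
# Hypothetical helper: read a script in the given encoding and evaluate it
# in the caller's scope.
proc sourceEncoded {path encodingName} {
    set fd [open $path r]
    fconfigure $fd -encoding $encodingName
    set script [read $fd]
    close $fd
    uplevel 1 $script
}

# Demonstration: create a one-line script stored in EUC-JP, then run it.
set demo "i18n_demo_eucjp.tcl"
set fd [open $demo w]
fconfigure $fd -encoding euc-jp
puts $fd "set greeting \"\u3053\u3093\u306b\u3061\u306f\""
close $fd

sourceEncoded $demo euc-jp
file delete $demo

puts "greeting has [string length $greeting] characters"
```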

Converting Strings to Different Encodings

You can convert a string to a different encoding using the encoding convertfrom and encoding convertto commands. The encoding convertfrom command converts a string from a specified encoding into UTF-8 Unicode characters; the encoding convertto command converts a string from UTF-8 Unicode into a specified encoding. In either case, if you omit the encoding argument, the command uses the current system encoding.

As an example, the following command converts a string representing the Hiragana letter HA from EUC-JP encoding into a Unicode string:

set ha [encoding convertfrom euc-jp "\xA4\xCF"]

(In Tcl 8.1, the "\x" and octal digit escape sequences specify the lower 8 bits of a Unicode character with the upper 8 bits set to 0. Thus the string "\xA4\xCF" still specifies two characters in Tcl 8.1, just as it did in Tcl 8.0; however, Tcl 8.1 stores those characters in four bytes, whereas Tcl 8.0 stored them in two bytes.)
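
The two commands are inverses of each other, which a short round trip can show. This sketch assumes the euc-jp encoding is installed; it converts the two EUC-JP bytes into one Unicode character and then back into the original bytes.

```tcl
# Two bytes of EUC-JP data become one Unicode character.
set ha [encoding convertfrom euc-jp "\xA4\xCF"]
scan $ha %c code
puts "characters: [string length $ha], code value: [format %04x $code]"
# prints: characters: 1, code value: 306f

# Converting back yields the original two bytes.
set raw [encoding convertto euc-jp $ha]
binary scan $raw H* hex
puts "EUC-JP bytes: $hex"
# prints: EUC-JP bytes: a4cf
```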

Fonts, Encodings, and Tk Widgets

Tk widgets that display text now require text strings in Unicode/UTF-8 encoding. Tk automatically handles any encoding conversion necessary to display the characters in a particular font.

If the master font that you set for a widget doesn't contain a glyph for a particular Unicode character that you want to display, Tk attempts to locate a font that does. Where possible, Tk attempts to locate a font that matches as many characteristics of the widget's master font as possible (for example, weight, slant, etc.). Once Tk finds a suitable font, it displays the character in that font. In other words, the widget uses the master font for all characters it is capable of displaying, and alternative fonts only as needed.

In some cases, Tk is unable to identify a suitable font, in which case the widget cannot display the characters. (Instead, the widget displays a system-dependent fallback character such as "?") The process of identifying suitable fonts is complex, and Tk's algorithms don't always find a font even if one is actually installed on the system. Therefore, for best results, you should try to select as a widget's master font one that is capable of handling the characters you expect to display. For example, "Times" is likely to be a poor choice if you know that you need to display Japanese or Arabic characters in a widget.

If you work with text in a variety of character sets, you may need to search out fonts to represent them. Markus Kuhn has developed a free 6x13 font that supports essentially all the Unicode characters that can be displayed in a 6x13 glyph. This does not include Japanese, Chinese, and other Asian languages, but it does cover many others. The font is available at http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html . His site also contains many useful links to other sources of fonts and font information.

Message Catalogs

The new msgcat package provides a set of functions for managing multilingual user interfaces. It allows you to define strings in a message catalog, which is independent from your application or package and which you can edit or localize without modifying the application source code. The msgcat package is optional, but Tcl Developer Xchange recommends using it for all multilingual applications and packages.

The basic principle of the msgcat package is that you create a set of message files, one for each supported language, containing localized versions of all the strings your application or package can display. Then in your application or package, instead of using a string directly, you call the ::msgcat::mc command to return a localized version of the string you want.

This document provides only a brief introduction to message catalogs. The msgcat package provides additional features such as namespace support and "best match" handling of sublocales. See the msgcat.n reference page for more information.

Using Message Catalogs

Using message catalogs from within your application or package requires the following steps:

Optionally set the locale using the ::msgcat::mclocale command. If you don't call mclocale, the locale defaults to the value of the env(LANG) environment variable at the time the msgcat package is loaded. If env(LANG) isn't defined, then the locale defaults to "C".
Call ::msgcat::mcload to load the appropriate message files. The mcload command requires as an argument a directory containing your message files.
Anywhere in your script that you would typically specify a string to display, use the ::msgcat::mc command instead. The mc command takes as an argument a source string and returns the translation of that string in the current locale.

The following code fragment demonstrates how you could use the msgcat package in a script:

# Use the default locale as specified by env(LANG).
# You could explicitly set the locale with a command such as
# ::msgcat::mclocale "en_GB"

# Load the messages files.  In this example, they are stored
# in a subdirectory named "msgs" which is in the same directory
# as this script.

::msgcat::mcload [file join [file dirname [info script]] msgs]

# Display a welcome message

puts [::msgcat::mc "Welcome to Tcl!"]

In this example, instead of directly displaying the message "Welcome to Tcl!", the application calls mc to retrieve a localized version of the string. The string returned by mc depends on the current locale. For example, in the "es" locale mc could return the Spanish-language greeting "¡Bienvenido a Tcl!"

If no translation for a string is defined in the current locale (for example, because no message file exists for that locale), mc executes the procedure ::msgcat::mcunknown. The default behavior of mcunknown is to return the original string ("Welcome to Tcl!" in this case), but you can redefine it to perform any action you want.
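
The lookup and the fallback can both be seen in a short script that defines a translation directly with ::msgcat::mcset instead of loading a message file. This is a sketch for experimentation; the "Goodbye!" string is a hypothetical source string that deliberately has no translation, and the inverted exclamation mark is written as a "\u" escape to keep the script in 7-bit ASCII.

```tcl
package require msgcat

# Select the Spanish locale and define one translation in the script itself.
::msgcat::mclocale es
::msgcat::mcset es "Welcome to Tcl!" "\u00a1Bienvenido a Tcl!"

# A string with a translation returns the localized text.
set translated [::msgcat::mc "Welcome to Tcl!"]

# A string without one falls through to mcunknown, which by default
# returns the source string unchanged.
set untranslated [::msgcat::mc "Goodbye!"]

puts $translated
puts $untranslated   ;# prints: Goodbye!
```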

Creating Localized Message Files

To use the msgcat package, you need to prepare a set of message files for your package or application, all contained within the same directory. The name of each message file is a locale specifier followed by the extension ".msg" (for example, es.msg for a Spanish message file or en_GB.msg for a UK English message file).

Each message file contains a series of calls to ::msgcat::mcset to set the translation strings for that language. The format of the mcset command is:

::msgcat::mcset locale src-string ?translation-string?

The mcset command defines a locale-specific translation for the given src-string. If no translation-string argument is present, then the value of src-string is also used as the locale-specific translation string.

So, if American English is the "source language" for your application, an en_GB.msg file might contain commands such as:

::msgcat::mcset en_GB "Welcome to Tcl!"
::msgcat::mcset en_GB "Select a color:" "Select a colour:"

Note that no translation string is provided for the first line, so the resulting "translation" for the en_GB locale is the same as the American source string, "Welcome to Tcl!" If you omitted this entry in the message file, then calling mc with the source string "Welcome to Tcl!" in the en_GB locale would result in mcunknown being called. Although the default behavior of mcunknown would produce the desired results (returning "Welcome to Tcl!"), you could run into problems if you override the behavior of mcunknown. Therefore, it is always safest to include an mcset mapping for every source string in your application, even if a particular locale doesn't require a "translation" for that string.

An equivalent Spanish-language message file, es.msg, would contain:

::msgcat::mcset es "Welcome to Tcl!" "¡Bienvenido a Tcl!"
::msgcat::mcset es "Select a color:" "Elige un color:"
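
To try the whole cycle end to end, the following sketch creates a message directory, writes a Spanish message file into it, and loads it with ::msgcat::mcload. The directory name i18n_demo_msgs is hypothetical, and the message file writes the inverted exclamation mark as a "\u" escape so that the file contains only 7-bit ASCII and loads the same way under any system encoding.

```tcl
package require msgcat

# Hypothetical message directory, created in the current directory.
set msgdir "i18n_demo_msgs"
file mkdir $msgdir

# Write es.msg. The braces keep \u00a1 literal in the file; Tcl expands
# the escape when the message file is loaded.
set fd [open [file join $msgdir es.msg] w]
puts $fd {::msgcat::mcset es "Welcome to Tcl!" "\u00a1Bienvenido a Tcl!"}
puts $fd {::msgcat::mcset es "Select a color:" "Elige un color:"}
close $fd

# Select the locale first, then load the matching message files.
::msgcat::mclocale es
::msgcat::mcload $msgdir

set greeting [::msgcat::mc "Welcome to Tcl!"]
set prompt [::msgcat::mc "Select a color:"]
file delete -force $msgdir

puts $greeting
puts $prompt   ;# prints: Elige un color:
```

Setting the locale before calling mcload matters, because mcload chooses which message files to read based on the locale in effect at that moment.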

Internationalization and the Tcl C APIs

Tcl 8.1 introduces new C APIs to support all new internationalization features. Tcl 8.1 also introduces new convenience functions for manipulating Unicode/UTF-8 strings. By using the new APIs in your applications, you can easily add full Unicode support to your application. Coupled with Tk's powerful font and layout support, you can quickly create fully internationalized applications.

When programming with the Tcl C APIs, you should be aware of the following issues, in addition to the Tcl scripting language internationalization features:

The Tcl C APIs now require all strings to be passed to functions as Unicode characters in UTF-8 format. You must convert strings in native system encodings to UTF-8 before passing them to Tcl C functions. Similarly, you must convert Tcl UTF-8 strings to the native system encoding before passing them to system functions. Tcl provides functions for handling encodings and converting strings from one encoding to another. See the Tcl_GetEncoding.3 reference page for details.
Because 7-bit ASCII characters have the same encoding in UTF-8 format, legacy code that uses only 7-bit ASCII characters functions the same in Tcl 8.1 as it did in Tcl 8.0. Therefore, if you're certain that your strings contain only 7-bit ASCII characters, no conversion is required.
Because strings in Tcl are now stored as Unicode characters in UTF-8 format, the number of characters in a string is not necessarily equal to the number of bytes in a string. In particular, you should no longer use the standard C string functions such as strlen to count characters in a string. Similarly, other standard C string functions such as toupper don't work with Unicode characters. Tcl provides a set of equivalent Unicode string functions, such as Tcl_NumUtfChars and Tcl_UtfToUpper, as well as other convenience functions for manipulating Unicode strings. See the Utf.3 and UtfToUpper.3 reference pages for details.

Summary: Tcl Internationalization Support at a Glance

The following list is a quick summary of the issues you should be aware of concerning the new internationalization support introduced in Tcl 8.1:

Tcl encodes all strings internally as Unicode characters in UTF-8 format.
The introduction of Unicode/UTF-8 encoding requires no changes to legacy Tcl scripts that use only 7-bit ASCII characters, because UTF-8 characters corresponding to the standard 7-bit ASCII set (up to ASCII value 0x7F in hexadecimal) have the same byte values in both UTF-8 and ASCII encoding. Furthermore, because the use of Unicode/UTF-8 encoding is internal to Tcl, most string handling in legacy Tcl scripts works the same in Tcl 8.1 as it did in Tcl 8.0.
You can specify a Unicode character by its four-digit, hexadecimal Unicode code value with the "\uxxxx" escape sequence.
All Tcl string functions properly handle multi-byte UTF-8 characters as single characters.
Tk widgets that display text accept text string arguments in standard Unicode/UTF-8 encoding. Tk automatically handles any encoding conversion necessary to display the characters in a particular font. If the master font that you set for a widget doesn't contain a glyph (a visual representation) for a particular Unicode character, Tk attempts to locate a font that does. Where possible, Tk attempts to locate a font that matches as many characteristics of the widget's master font as possible (for example, weight, slant, etc.). In some cases, Tk is unable to identify a suitable font, even if one is actually installed on the system. Therefore, for best results, you should try to select as a widget's master font one that is capable of handling the characters you expect to display.
The system encoding is the character encoding used by the operating system. Tcl automatically handles conversions between UTF-8 and the system encoding when interacting with the operating system.
Tcl usually can determine a reasonable default system encoding based on the platform and locale settings, but if for some reason it cannot, it uses ISO 8859-1 as the default system encoding. You can explicitly set the system encoding used by Tcl with the encoding system command.
By default, Tcl uses the system encoding when reading from and writing to channels, and converts the text to UTF-8 format. You can change the character encoding for a channel using the fconfigure -encoding command.
The source command always reads files using the system encoding. Therefore, Tcl Developer Xchange recommends that whenever possible, you author scripts in the native system encoding. Furthermore, most common character encodings include the standard 7-bit ASCII characters as a subset, so you are usually safe writing scripts using only 7-bit ASCII characters. You can execute a script written in a different encoding by opening the file, setting the proper encoding using the fconfigure -encoding command, reading the file into a variable, and then evaluating the string with the eval command.
You can convert a string to a different encoding using the encoding convertfrom and encoding convertto commands.
Tcl has built-in knowledge of approximately 30 common character encodings. The encoding names command displays a list of all known encodings. You can create additional encodings as described in the Tcl_GetEncoding.3 reference page.
The new msgcat package provides a set of functions for managing multilingual user interfaces. It allows you to define strings in a message catalog, which is independent from your application and which you can edit or localize without modifying the application source code. See the msgcat.n reference page for more information.

You should also read the "Internationalization and the Tcl C APIs" section of this document if you use the Tcl APIs in C programs.
