GNU gettext utilities: Aspects

1.3 Aspects in Native Language Support

For a totally multi-lingual distribution, there are many things to translate beyond output messages.

As of today, GNU gettext offers a complete toolset for translating messages output by C programs. Perl scripts and shell scripts will also need to be translated. Although some hooks exist today by which this can be done, they are not integrated as well as they should be.
Some programs, like autoconf or bison, are able to produce other programs (or scripts). Even if the generating programs themselves are internationalized, the generated programs they produce may need internationalization on their own, and this indirect internationalization could be automated right from the generating program. In fact, quite usually, generating and generated programs could be internationalized independently, as the effort needed is fairly orthogonal.
A few programs include textual tables which might need translation themselves, independently of the strings contained in the program itself. For example, RFC 1345 gives an English description for each character which the recode program is able to reconstruct at execution. Since these descriptions are extracted from the RFC by mechanical means, translating them properly would require a prior translation of the RFC itself.
Almost all programs accept options, which are often worded so as to be descriptive for English readers; one might want to consider offering translated versions of program options as well.
Many programs read, interpret, compile, or are somewhat driven by input files which are texts containing keywords, identifiers, or replies which are inherently translatable. For example, one may want gcc to allow diacriticized characters in identifiers or use translated keywords; ‘rm -i’ might accept something other than ‘y’ or ‘n’ for replies, etc. Even if the program will eventually produce most of its output in foreign languages, one has to decide whether the input syntax, option values, etc., are to be localized or not.
The manual accompanying a package, as well as all documentation files in the distribution, could surely be translated, too. Translating a manual, with the intent of later keeping up with updates, is generally a major undertaking in itself.
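
To make the remark about ‘rm -i’ replies more concrete, here is a minimal sketch of how a program might accept localized affirmative replies. It assumes a POSIX system that provides nl_langinfo with the YESEXPR item; the helper name reply_is_yes is hypothetical and not part of any library.

```c
#include <assert.h>
#include <langinfo.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `reply` is affirmative according to
   the current locale, 0 if it is not, and -1 if the pattern is unusable. */
int reply_is_yes(const char *reply)
{
    regex_t re;
    int rc;

    assert(reply != NULL);

    /* YESEXPR is an extended regular expression supplied by the locale. */
    if (regcomp(&re, nl_langinfo(YESEXPR), REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, reply, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

A program would normally call setlocale (LC_ALL, "") at startup, so that the pattern reflects the user's locale rather than the default "C" locale, in which only replies starting with ‘y’ or ‘Y’ are expected to match.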

As we already stressed, translation is only one aspect of locales. Other internationalization aspects are system services and are handled in GNU libc. There are many attributes that are needed to define a country’s cultural conventions. These attributes include, besides the country’s native language, the formatting of the date and time, the representation of numbers, the symbols for currency, etc. These local rules are termed the country’s locale. The locale represents the knowledge needed to support the country’s native attributes.

There are a few major areas which may vary between countries and hence define what a locale must describe. The following list helps put multi-lingual messages into the proper context of other tasks related to locales. See the GNU libc manual for details.

Characters and Codesets

The codeset most commonly used throughout the USA and most English-speaking parts of the world is the ASCII codeset. However, there are many characters needed by various locales that are not found within this codeset. The 8-bit ISO 8859-1 codeset has most of the special characters needed to handle the major European languages. In many cases, however, choosing ISO 8859-1 is still not adequate: it doesn’t even handle the major European currency. Hence each locale will need to specify which codeset it needs to use and will need to have the appropriate character handling routines to cope with the codeset.

Currency

The symbols used vary from country to country, as does the position of the symbol relative to the amount. Software needs to be able to transparently display currency figures in the native mode for each locale.

Dates

The format of dates varies between locales. For example, Christmas day in 1994 is written as 12/25/94 in the USA and as 25/12/94 in Australia. Other countries might use ISO 8601 dates, etc.

Time of the day may be noted as hh:mm, hh.mm, or otherwise. Some locales require time to be specified in 24-hour mode rather than as AM or PM. Further, the nature and yearly extent of the Daylight Saving correction vary widely between countries.
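
As an illustration of the date formats discussed above, the following sketch formats one date twice with the standard strftime function: once with the locale-dependent "%x" conversion and once with an explicit ISO 8601 pattern. The helper name format_date and its parameters are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: writes the date in the locale's preferred form to
   local_buf and as an ISO 8601 date to iso_buf.  Both buffers hold `size`
   bytes.  Returns 1 on success, 0 if a buffer is too small. */
int format_date(int year, int month, int day,
                char *local_buf, char *iso_buf, size_t size)
{
    struct tm tm;

    assert(local_buf != NULL && iso_buf != NULL);
    memset(&tm, 0, sizeof tm);
    tm.tm_year = year - 1900;   /* tm_year counts years since 1900 */
    tm.tm_mon = month - 1;      /* tm_mon is zero-based */
    tm.tm_mday = day;

    /* "%x" selects the date representation of the LC_TIME category. */
    if (strftime(local_buf, size, "%x", &tm) == 0)
        return 0;

    /* This explicit pattern gives the same result in every locale. */
    return strftime(iso_buf, size, "%Y-%m-%d", &tm) != 0;
}
```

In the default "C" locale, calling format_date (1994, 12, 25, ...) should give 12/25/94 for the first buffer; after setlocale (LC_ALL, ""), the first buffer follows the user's conventions, while the ISO 8601 form stays unchanged.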

Numbers

Numbers can be represented differently in different locales. For example, the following numbers are all written correctly for their respective locales:

12,345.67       English
12.345,67       German
 12345,67       French
1,2345.67       Asia

Some programs could go further and use different unit systems, like English units or Metric units, or even take into account variants about how numbers are spelled in full.
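
The first three rows of the table above can be reproduced with a small sketch. The helper format_grouped is hypothetical and takes the separators as arguments; format_for_locale obtains them from the standard localeconv function instead. As a simplifying assumption, digits are always grouped by three, so the grouping field of struct lconv is ignored and the fourth row of the table is not covered.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats whole.frac (frac in hundredths) with the
   given thousands separator and decimal point, in groups of three digits.
   Returns 1 on success, 0 if buf is too small. */
int format_grouped(unsigned long whole, unsigned frac,
                   const char *sep, const char *point,
                   char *buf, size_t size)
{
    char digits[32];
    size_t n, i, used = 0;
    int len;

    assert(sep != NULL && point != NULL && buf != NULL && frac < 100);
    if (size == 0)
        return 0;
    n = (size_t) snprintf(digits, sizeof digits, "%lu", whole);
    buf[0] = '\0';

    for (i = 0; i < n; i++) {
        /* Insert a separator in front of each complete group of three. */
        if (i > 0 && (n - i) % 3 == 0) {
            if (used + strlen(sep) >= size)
                return 0;
            strcpy(buf + used, sep);
            used += strlen(sep);
        }
        if (used + 1 >= size)
            return 0;
        buf[used++] = digits[i];
        buf[used] = '\0';
    }

    len = snprintf(buf + used, size - used, "%s%02u", point, frac);
    return len >= 0 && (size_t) len < size - used;
}

/* Same, but with the separators of the current LC_NUMERIC category. */
int format_for_locale(unsigned long whole, unsigned frac,
                      char *buf, size_t size)
{
    const struct lconv *lc = localeconv();

    return format_grouped(whole, frac, lc->thousands_sep,
                          lc->decimal_point, buf, size);
}
```

For instance, format_grouped (12345, 67, ".", ",", buf, sizeof buf) produces the German form 12.345,67. A real program would more likely rely on the locale-aware formatting functions of the C library than on a hand-written helper like this one.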

Messages

The most obvious area is the language support within a locale. This is where GNU gettext provides the means for developers and users to easily change the language that the software uses to communicate with the user.
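
For orientation, a minimal sketch of how a C program typically obtains translated messages follows. It uses the setlocale, bindtextdomain, textdomain and gettext functions declared in locale.h and libintl.h; the function names init_messages and greeting, as well as any domain name or directory passed to them, are assumptions made for this example.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Conventional shorthand for marking and translating a string. */
#define _(msgid) gettext(msgid)

/* Hypothetical helper: selects the user's locale and the message catalog
   of the given text domain, looked up below `localedir`. */
void init_messages(const char *domain, const char *localedir)
{
    assert(domain != NULL && localedir != NULL);
    setlocale(LC_ALL, "");              /* honor LANG, LC_MESSAGES, ... */
    bindtextdomain(domain, localedir);  /* where the catalogs are installed */
    textdomain(domain);                 /* default domain for gettext */
}

/* Returns the greeting in the user's language, or the English original
   if no translation is available. */
const char *greeting(void)
{
    return _("Hello, world!");
}
```

When no catalog is found for the selected locale, gettext returns its argument unchanged, so the program keeps working in English.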

These areas of cultural conventions are called locale categories. It is an unfortunate term; locale aspects or locale feature categories would be better terms, because each “locale category” describes an area or task that requires localization. The concrete data that describes the cultural conventions for such an area and for a particular culture is also called a locale category. In this sense, a locale is composed of several locale categories: the locale category describing the codeset, the locale category describing the formatting of numbers, the locale category containing the translated messages, and so on.
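
In C, each locale category corresponds to a macro such as LC_CTYPE, LC_NUMERIC, LC_TIME, LC_MONETARY or LC_MESSAGES, and setlocale can query or set each one separately. The sketch below illustrates this; the helper names category_is and select_locales are hypothetical, and keeping LC_NUMERIC in the "C" locale is only one possible design choice, useful for instance when a program must read or write data files with a fixed number format.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `category` is currently set to the
   locale called `name`, 0 otherwise. */
int category_is(int category, const char *name)
{
    /* A NULL second argument queries the category without changing it. */
    const char *current = setlocale(category, NULL);

    assert(name != NULL);
    return current != NULL && strcmp(current, name) == 0;
}

/* Hypothetical helper: takes all categories from the environment, then
   overrides the number formatting alone.  Returns 1 if both calls
   succeeded, 0 otherwise. */
int select_locales(void)
{
    int ok = setlocale(LC_ALL, "") != NULL;

    if (setlocale(LC_NUMERIC, "C") == NULL)
        ok = 0;
    return ok;
}
```

After select_locales, messages, dates and character handling follow the user's environment, while numbers keep the "C" locale conventions.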

Components of locale outside of message handling are standardized in the ISO C standard and the POSIX:2001 standard (also known as the SUSv3 specification). GNU libc fully implements this, and most other modern systems provide more or less reasonable support for at least some of the missing components.