Gettext/Locale-Names

2.3.1 Locale Names

A locale name usually has the form ‘ll_CC’. Here ‘ll’ is an ISO 639 two-letter language code, and ‘CC’ is an ISO 3166 two-letter country code. For example, for German in Germany, ll is de, and CC is DE. You can find a list of the language codes in the appendix Language Codes and a list of the country codes in the appendix Country Codes.

You might think that the country code specification is redundant. But in fact, some languages have dialects in different countries. For example, ‘de_AT’ is used for Austria, and ‘pt_BR’ for Brazil. The country code serves to distinguish the dialects.

Many locale names have an extended syntax ‘ll_CC.encoding’ that also specifies the character encoding. These names came into use because, between 2000 and 2005, most users switched to locales in UTF-8 encoding. For example, the German locale on glibc systems is nowadays ‘de_DE.UTF-8’. The older name ‘de_DE’ still refers to the German locale as of 2000, which stores characters in ISO-8859-1 encoding, a text encoding that cannot even accommodate the Euro currency sign.
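The encoding limitation mentioned above can be checked directly. The following is a minimal sketch in Python; the helper name can_encode is hypothetical and only serves the illustration.

```python
# The Euro currency sign is the Unicode character U+20AC.
EURO = "\u20ac"

def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text can be stored in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode(EURO, "iso-8859-1"))  # False: ISO-8859-1 has no Euro sign
print(can_encode(EURO, "utf-8"))       # True
print(EURO.encode("utf-8"))            # b'\xe2\x82\xac'
```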

Some locale names use ‘ll_CC@variant’ instead of ‘ll_CC’. The ‘@variant’ can denote any kind of characteristic that is not already implied by the language ll and the country CC. It can denote a particular monetary unit. For example, on glibc systems, ‘de_DE@euro’ denotes the locale that uses the Euro currency, in contrast to the older locale ‘de_DE’, which implies the use of the currency in effect before 2002. It can also denote a dialect of the language, or the script used to write text (for example, ‘sr_RS@latin’ uses the Latin script, whereas ‘sr_RS’ uses the Cyrillic script to write Serbian), or the orthography rules, or similar.
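The forms described so far (‘ll_CC’, ‘ll_CC.encoding’ and ‘ll_CC@variant’) can be split into their parts with a small parser. This is only a sketch, not the algorithm used by gettext or glibc. The names LocaleName and parse_locale_name are hypothetical, and the pattern makes assumptions that go beyond the text: it accepts a bare ‘ll’, three-letter language codes, and an encoding and a variant in the same name.

```python
import re
from typing import NamedTuple, Optional

class LocaleName(NamedTuple):
    language: str              # 'll', e.g. 'de'
    territory: Optional[str]   # 'CC', e.g. 'DE'
    encoding: Optional[str]    # e.g. 'UTF-8'
    variant: Optional[str]     # e.g. 'euro' or 'latin'

# Assumption: language codes of two or three letters, an optional
# two-letter country code, then an optional '.encoding' and '@variant'.
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}))?"
    r"(?:\.(?P<encoding>[A-Za-z0-9_-]+))?"
    r"(?:@(?P<variant>[A-Za-z0-9]+))?$"
)

def parse_locale_name(name: str) -> Optional[LocaleName]:
    """Split a locale name into its parts, or return None if it does not fit."""
    match = _LOCALE_RE.match(name)
    if match is None:
        return None
    return LocaleName(**match.groupdict())

print(parse_locale_name("de_DE.UTF-8"))
print(parse_locale_name("sr_RS@latin"))
```

Names that do not follow the scheme, such as the special locale ‘C’, yield None in this sketch.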

On other systems, some variations of this scheme are used, such as ‘ll’. You can get the list of locales supported by your system for your language by running the command ‘locale -a | grep '^ll'’.
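The same query can be made from a program. The sketch below assumes a POSIX system where the ‘locale’ command is available; the function names system_locales and filter_by_language are hypothetical.

```python
import subprocess
from typing import Iterable, List

def filter_by_language(names: Iterable[str], language: str) -> List[str]:
    """Keep the names that start with the language code, like grep '^ll'."""
    return [name for name in names if name.startswith(language)]

def system_locales() -> List[str]:
    """Run 'locale -a' and return one locale name per output line."""
    result = subprocess.run(
        ["locale", "-a"],
        capture_output=True,
        text=True,
        errors="replace",  # some systems list names that are not valid UTF-8
        check=True,
    )
    return result.stdout.splitlines()
```

Calling filter_by_language(system_locales(), "de") would then list the German locales installed on the system; the result depends on the machine.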

There is also a special locale, called ‘C’. When it is used, it disables all localization: in this locale, all programs standardized by POSIX use English messages and an unspecified character encoding (often US-ASCII, but sometimes also ISO-8859-1 or UTF-8, depending on the operating system).
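As a small illustration, a program can select the ‘C’ locale explicitly. The sketch uses Python's standard locale module; the variable names are arbitrary.

```python
import locale

# Select the special 'C' locale for all categories.
# setlocale returns the name of the locale now in effect.
current = locale.setlocale(locale.LC_ALL, "C")

# Query the numeric formatting conventions of the active locale.
conventions = locale.localeconv()

print(current)
print(conventions["decimal_point"])
```

In the ‘C’ locale the decimal point is ‘.’, regardless of the user's language settings.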