Gettext/Preparing-Strings

4.3 Preparing Translatable Strings

Before strings can be marked for translations, they sometimes need to be adjusted. Usually preparing a string for translation is done right before marking it, during the marking phase which is described in the next sections. What you have to keep in mind while doing that is the following.

Decent English style.
Entire sentences.
Split at paragraphs.
Use format strings instead of string concatenation.
Use placeholders in format strings instead of embedded URLs.
Avoid unusual markup and unusual control characters.

Let’s look at some examples of these guidelines.

Decent English style

Translatable strings should be in good English style. If slang language with abbreviations and shortcuts is used, often translators will not understand the message and will produce very inappropriate translations.

"%s: is parameter\n"

This is nearly untranslatable: Is the displayed item a parameter or the parameter?

"No match"

The ambiguity in this message makes it unintelligible: Is the program attempting to set something on fire? Does it mean "The given object does not match the template"? Does it mean "The template does not fit for any of the objects"?

In both cases, adding more words to the message will help both the translator and the English speaking user.

Entire sentences

Translatable strings should be entire sentences. It is often not possible to translate single verbs or adjectives in a substitutable way.

printf ("File %s is %s protected", filename, rw ? "write" : "read");

Most translators will not look at the source and will thus only see the string "File %s is %s protected", which is unintelligible. Change this to

printf (rw ? "File %s is write protected" : "File %s is read protected",
        filename);

This way the translator will not only understand the message, she will also be able to find the appropriate grammatical construction. A French translator for example translates "write protected" like "protected against writing".

Entire sentences are also important because in many languages, the declination of some word in a sentence depends on the gender or the number (singular/plural) of another part of the sentence. There are usually more interdependencies between words than in English. The consequence is that asking a translator to translate two half-sentences and then combining these two half-sentences through dumb string concatenation will not work, for many languages, even though it would work for English. That’s why translators need to handle entire sentences.

Often sentences don’t fit into a single line. If a sentence is output using two subsequent printf statements, like this

printf ("Locale charset \"%s\" is different from\n", lcharset);
printf ("input file charset \"%s\".\n", fcharset);

the translator would have to translate two half sentences, but nothing in the POT file would tell her that the two half sentences belong together. It is necessary to merge the two printf statements so that the translator can handle the entire sentence at once and decide at which place to insert a line break in the translation (if at all):

printf ("Locale charset \"%s\" is different from\n\
input file charset \"%s\".\n", lcharset, fcharset);

You may now ask: how about two or more adjacent sentences? Like in this case:

puts ("Apollo 13 scenario: Stack overflow handling failed.");
puts ("On the next stack overflow we will crash!!!");

Should these two statements merged into a single one? I would recommend to merge them if the two sentences are related to each other, because then it makes it easier for the translator to understand and translate both. On the other hand, if one of the two messages is a stereotypic one, occurring in other places as well, you will do a favour to the translator by not merging the two. (Identical messages occurring in several places are combined by xgettext, so the translator has to handle them once only.)

Split at paragraphs

Translatable strings should be limited to one paragraph; don’t let a single message be longer than ten lines. The reason is that when the translatable string changes, the translator is faced with the task of updating the entire translated string. Maybe only a single word will have changed in the English string, but the translator doesn’t see that (with the current translation tools), therefore she has to proofread the entire message.

Many GNU programs have a ‘--help’ output that extends over several screen pages. It is a courtesy towards the translators to split such a message into several ones of five to ten lines each. While doing that, you can also attempt to split the documented options into groups, such as the input options, the output options, and the informative output options. This will help every user to find the option he is looking for.

No string concatenation

Hardcoded string concatenation is sometimes used to construct English strings:

strcpy (s, "Replace ");
strcat (s, object1);
strcat (s, " with ");
strcat (s, object2);
strcat (s, "?");

In order to present to the translator only entire sentences, and also because in some languages the translator might want to swap the order of object1 and object2, it is necessary to change this to use a format string:

sprintf (s, "Replace %s with %s?", object1, object2);

A similar case is compile time concatenation of strings. The ISO C 99 include file <inttypes.h> contains a macro PRId64 that can be used as a formatting directive for outputting an ‘int64_t’ integer through printf. It expands to a constant string, usually "d" or "ld" or "lld" or something like this, depending on the platform. Assume you have code like

printf ("The amount is %0" PRId64 "\n", number);

The gettext tools and library have special support for these <inttypes.h> macros. You can therefore simply write

printf (gettext ("The amount is %0" PRId64 "\n"), number);

The PO file will contain the string "The amount is %0<PRId64>\n". The translators will provide a translation containing "%0<PRId64>" as well, and at runtime the gettext function’s result will contain the appropriate constant string, "d" or "ld" or "lld".

This works only for the predefined <inttypes.h> macros. If you have defined your own similar macros, let’s say ‘MYPRId64’, that are not known to xgettext, the solution for this problem is to change the code like this:

char buf1[100];
sprintf (buf1, "%0" MYPRId64, number);
printf (gettext ("The amount is %s\n"), buf1);

This means, you put the platform dependent code in one statement, and the internationalization code in a different statement. Note that a buffer length of 100 is safe, because all available hardware integer types are limited to 128 bits, and to print a 128 bit integer one needs at most 54 characters, regardless whether in decimal, octal or hexadecimal.

All this applies to other programming languages as well. For example, in Java and C#, string concatenation is very frequently used, because it is a compiler built-in operator. Like in C, in Java, you would change

System.out.println("Replace "+object1+" with "+object2+"?");

into a statement involving a format string:

System.out.println(
    MessageFormat.format("Replace {0} with {1}?",
                         new Object[] { object1, object2 }));

Similarly, in C#, you would change

Console.WriteLine("Replace "+object1+" with "+object2+"?");

into a statement involving a format string:

Console.WriteLine(
    String.Format("Replace {0} with {1}?", object1, object2));

No embedded URLs

It is good to not embed URLs in translatable strings, for several reasons:

It avoids possible mistakes during copy and paste.
Translators cannot translate the URLs or, by mistake, use the URLs from other packages that are present in their compendium.
When the URLs change, translators don’t need to revisit the translation of the string.

The same holds for email addresses.

So, you would change

fputs (_("GNU GPL version 3 <https://gnu.org/licenses/gpl.html>\n"),
       stream);

to

fprintf (stream, _("GNU GPL version 3 <%s>\n"),
         "https://gnu.org/licenses/gpl.html");

No unusual markup

Unusual markup or control characters should not be used in translatable strings. Translators will likely not understand the particular meaning of the markup or control characters.

For example, if you have a convention that ‘|’ delimits the left-hand and right-hand part of some GUI elements, translators will often not understand it without specific comments. It might be better to have the translator translate the left-hand and right-hand part separately.

Another example is the ‘argp’ convention to use a single ‘\v’ (vertical tab) control character to delimit two sections inside a string. This is flawed. Some translators may convert it to a simple newline, some to blank lines. With some PO file editors it may not be easy to even enter a vertical tab control character. So, you cannot be sure that the translation will contain a ‘\v’ character, at the corresponding position. The solution is, again, to let the translator translate two separate strings and combine at run-time the two translated strings with the ‘\v’ required by the convention.

HTML markup, however, is common enough that it’s probably ok to use in translatable strings. But please bear in mind that the GNU gettext tools don’t verify that the translations are well-formed HTML.