Sed/Locale-Considerations

5.9 Multibyte characters and Locale Considerations

GNU sed processes valid multibyte characters in multibyte locales (e.g. UTF-8). (7)

The following example uses the Greek letter Capital Sigma (Σ, Unicode code point 0x03A3). In a UTF-8 locale, sed correctly processes the Sigma as one character despite it being 2 octets (bytes):

$ locale | grep LANG
LANG=en_US.UTF-8

$ printf 'a\u03A3b'
aΣb

$ printf 'a\u03A3b' | sed 's/./X/g'
XXX

$ printf 'a\u03A3b' | od -tx1 -An
 61 ce a3 62

To force sed to process octets separately, use the C locale (also known as the POSIX locale):

$ printf 'a\u03A3b' | LC_ALL=C sed 's/./X/g'
XXXX
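
The difference between the two locales can be wrapped in a small helper. The following is a minimal sketch, not part of the original manual: the function name mask_chars is hypothetical, and it assumes a locale named C.UTF-8 is installed (check with locale -a). The input is written with octal escapes because \u and \x are not understood by every printf.

```shell
#!/bin/sh
# Replace every character sed recognizes with X, so the number of X's shows
# how many characters sed saw under the given locale.
# Assumption: the UTF-8 locale is called C.UTF-8 on this system.
mask_chars() {
    # $1 = locale name, $2 = input text
    printf '%s' "$2" | LC_ALL="$1" sed 's/./X/g'
}

sigma=$(printf 'a\316\243b')      # "aΣb"; \316\243 are the octets 0xCE 0xA3

mask_chars C.UTF-8 "$sigma"; echo # three characters if the UTF-8 locale exists
mask_chars C "$sigma"; echo       # XXXX: four octets in the C locale
```

If the requested locale is not installed, sed typically behaves as in the C locale, so both calls would print four X's.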

5.9.1 Invalid multibyte characters

sed’s regular expressions do not match invalid multibyte sequences in a multibyte locale.

In the following examples, the octet 0xCE is an incomplete multibyte character (shown here as �). The regular expression ‘.’ does not match it:

$ printf 'a\xCEb\n'
a�b

$ printf 'a\xCEb\n' | sed 's/./X/g'
X�X

$ printf 'a\xCEc\n' | sed 's/./X/g' | od -tx1c -An
  58  ce  58  0a
   X      X   \n

Similarly, the ‘catch-all’ regular expression ‘.*’ does not match the entire line:

$ printf 'a\xCEc\n' | sed 's/.*//' | od -tx1c -An
  ce  63  0a
       c  \n

GNU sed offers the special z command to clear the current pattern space regardless of invalid multibyte characters (i.e. it works like s/.*// but also removes invalid multibyte characters):

$ printf 'a\xCEc\n' | sed 'z' | od -tx1c -An
   0a
   \n

Alternatively, force the C locale to process each octet separately (every octet is a valid character in the C locale):

$ printf 'a\xCEc\n' | LC_ALL=C sed 's/.*//' | od -tx1c -An
  0a
  \n
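
The two ways of emptying the pattern space can be compared by counting the octets that survive. This is a rough sketch under stated assumptions: bytes_left is a hypothetical helper name, and C.UTF-8 is assumed to be the name of an installed UTF-8 locale.

```shell
#!/bin/sh
# Count the octets left after running a sed script on a line that contains
# the invalid octet \316 (0xCE).
bytes_left() {
    # $1 = locale name, $2 = sed script
    # tr removes the padding that some wc implementations print.
    printf 'a\316c\n' | LC_ALL="$1" sed "$2" | wc -c | tr -d ' '
}

bytes_left C.UTF-8 'z'        # 1: z leaves only the newline, whatever the locale
bytes_left C 's/.*//'         # 1: every octet is a character in the C locale
bytes_left C.UTF-8 's/.*//'   # more than 1 when the UTF-8 locale is in effect
```

The z command is the more robust choice when the script must keep running in a multibyte locale; forcing the C locale changes how every other regular expression in the script is matched.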

sed’s inability to process invalid multibyte characters can be used to detect such invalid sequences in a file. In the following examples, \xCE\xCE is an invalid multibyte sequence, while \xCE\xA3 is a valid multibyte sequence (the Greek Sigma character).

The following sed program removes all valid characters using s/.//g. Any content left in the pattern space (the invalid characters) is appended to the hold space using the H command. On the last line ($), the hold space is retrieved (x), newlines are removed (s/\n//g), and any remaining octets are printed unambiguously (l). Thus, any invalid multibyte sequences are printed as octal values:

$ printf 'ab\nc\n\xCE\xCEde\n\xCE\xA3f\n' > invalid.txt

$ cat invalid.txt
ab
c
��de
Σf

$ sed -n 's/.//g ; H ; ${x;s/\n//g;l}' invalid.txt
\316\316$

With a few more commands, sed can print the exact line number corresponding to each invalid character (line 3 in this file). These characters can then be removed by forcing the C locale and using octal escape sequences:

$ sed -n 's/.//g;=;l' invalid.txt | paste - -  | awk '$2!="$"'
3       \316\316$

$ LC_ALL=C sed '3s/\o316\o316//' invalid.txt > fixed.txt
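
The detection step can also be done in sed alone. The following is a sketch of a variation on the pipeline above, not the manual's own method: invalid_lines is a hypothetical function name, and it assumes a UTF-8 locale named C.UTF-8 is installed.

```shell
#!/bin/sh
# List the line numbers of lines that contain octets which are not valid
# in the multibyte encoding.
# Assumption: the UTF-8 locale is called C.UTF-8; if it is missing, sed
# typically behaves as in the C locale and nothing is reported.
invalid_lines() {
    # s/.//g deletes every valid character. If the pattern space is not
    # empty afterwards, the line held invalid octets, so '=' prints its number.
    LC_ALL=C.UTF-8 sed -n 's/.//g; /^$/!=' "$1"
}

# Rebuild the sample file with portable octal escapes (\316\243 is Sigma).
printf 'ab\nc\n\316\316de\n\316\243f\n' > invalid.txt

invalid_lines invalid.txt
```

On a glibc system with the assumed locale, this should report line 3, matching the output of the paste and awk pipeline.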

5.9.2 Upper/Lower case conversion

GNU sed’s substitute command (s) supports upper/lower case conversions using the \U and \L codes. These conversions support multibyte characters:

$ printf 'ABC\u03a3\n'
ABCΣ

$ printf 'ABC\u03a3\n' | sed 's/.*/\L&/'
abcσ

See The "s" Command.
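
The related GNU codes \u, \l and \E can be combined with \U and \L. The sketch below uses ASCII input only, so it does not depend on the locale; the function names are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# \u and \l convert only the next character of the replacement;
# \U and \L stay in effect until \E or the end of the replacement.

capitalize_first() {
    # Upper-case the first character, leave the rest untouched.
    printf '%s\n' "$1" | sed 's/\(.\)\(.*\)/\u\1\2/'
}

shout_first_word() {
    # Upper-case the first word only; \E switches the conversion off again.
    printf '%s\n' "$1" | sed 's/\([^ ]*\) \(.*\)/\U\1\E \2/'
}

capitalize_first 'hello world'   # Hello world
shout_first_word 'hello world'   # HELLO world
```

These codes are GNU extensions, so the functions are not expected to work with other sed implementations.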

5.9.3 Multibyte regexp character classes

In locales other than the C (POSIX) locale, the sorting sequence is not specified, and ‘[a-d]’ might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to match any character, or the set of characters that it matches might even be erratic. To obtain the traditional interpretation of bracket expressions, you can use the ‘C’ locale by setting the LC_ALL environment variable to the value ‘C’.

# TODO: is there any real-world system/locale where 'A'
#       is replaced by '-' ?
$ echo A | sed 's/[a-z]/-/'
A
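
Forcing the C locale for a single command gives the traditional range behaviour without changing the rest of the environment. A minimal sketch follows; traditional_range is a hypothetical helper name.

```shell
#!/bin/sh
# Replace the characters matched by [a-d] with '-', using the C locale so
# that the range means exactly the octets for a, b, c and d.
traditional_range() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[a-d]/-/g'
}

traditional_range 'aBbCcDd'   # -B-C-D-
```

Setting LC_ALL on the sed invocation only, rather than exporting it, keeps other programs in the same script running under the user's locale.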

The interpretation of named character classes depends on the LC_CTYPE locale; for example, ‘[:alnum:]’ means the character class of numbers and letters in the current locale.
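
Named classes are written inside a bracket expression, which gives the doubled brackets seen below. This sketch uses ASCII input so the result is the same in the C locale and in UTF-8 locales; mask_alnum is a hypothetical helper name.

```shell
#!/bin/sh
# Replace every letter and digit with X. [:alnum:] is only valid inside a
# bracket expression, hence [[:alnum:]].
mask_alnum() {
    printf '%s\n' "$1" | sed 's/[[:alnum:]]/X/g'
}

mask_alnum 'Sed 4.9!'   # XXX X.X!
```

For non-ASCII input the result depends on LC_CTYPE: a letter such as Σ is expected to match in a UTF-8 locale, while in the C locale its octets are not members of the class.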

TODO: show example of collation

# TODO: this works on glibc systems, not on musl-libc/freebsd/macosx.
$ printf 'cliché\n' | LC_ALL=fr_FR.utf8 sed 's/[[=e=]]/X/g'
clichX

Footnotes

(7)

Some regexp edge cases depend on the operating system and libc implementation. The examples shown are known to work as expected on GNU/Linux systems using glibc.