Character sets (GNU Coreutils 9.0)
Next: Translating, Up: tr invocation [Contents][Index]
9.1.1 Specifying sets of characters
The format of the set1 and set2 arguments resembles the format of regular expressions; however, they are not regular expressions, only lists of characters. Most characters simply represent themselves in these strings, but the strings can contain the shorthands listed below, for convenience. Some of them can be used only in set1 or set2, as noted below.
- Backslash escapes
The following backslash escape sequences are recognized:
- ‘
\a’ Control-G.
- ‘
\b’ Control-H.
- ‘
\f’ Control-L.
- ‘
\n’ Control-J.
- ‘
\r’ Control-M.
- ‘
\t’ Control-I.
- ‘
\v’ Control-K.
- ‘
\ooo’ The 8-bit character with the value given by
ooo, which is 1 to 3 octal digits. Note that ‘\400’ is interpreted as the two-byte sequence, ‘\040’ ‘0’.- ‘
\\’ A backslash.
While a backslash followed by a character not listed above is interpreted as that character, the backslash also effectively removes any special significance, so it is useful to escape ‘
[’, ‘]’, ‘*’, and ‘-’.- ‘
- Ranges
The notation ‘
m-n’ expands to all of the characters frommthroughn, in ascending order.mshould collate beforen; if it doesn’t, an error results. As an example, ‘0-9’ is the same as ‘0123456789’.GNU
trdoes not support the System V syntax that uses square brackets to enclose ranges. Translations specified in that format sometimes work as expected, since the brackets are often transliterated to themselves. However, they should be avoided because they sometimes behave unexpectedly. For example, ‘tr -d '[0-9]'’ deletes brackets as well as digits.Many historically common and even accepted uses of ranges are not portable. For example, on EBCDIC hosts using the ‘
A-Z’ range will not do what most would expect because ‘A’ through ‘Z’ are not contiguous as they are in ASCII. If you can rely on a POSIX compliant version oftr, then the best way to work around this is to use character classes (see below). Otherwise, it is most portable (and most ugly) to enumerate the members of the ranges.- Repeated characters
The notation ‘
[c*n]’ inset2expands toncopies of characterc. Thus, ‘[y*6]’ is the same as ‘yyyyyy’. The notation ‘[c*]’ instring2expands to as many copies ofcas are needed to makeset2as long asset1. Ifnbegins with ‘0’, it is interpreted in octal, otherwise in decimal.- Character classes
The notation ‘
[:class:]’ expands to all of the characters in the (predefined) classclass. The characters expand in no particular order, except for theupperandlowerclasses, which expand in ascending order. When the--delete(-d) and--squeeze-repeats(-s) options are both given, any character class can be used inset2. Otherwise, only the character classeslowerandupperare accepted inset2, and then only if the corresponding character class (upperandlower, respectively) is specified in the same relative position inset1. Doing this specifies case conversion. The class names are given below; an error results when an invalid class name is given.alnumLetters and digits.
alphaLetters.
blankHorizontal whitespace.
cntrlControl characters.
digitDigits.
graphPrintable characters, not including space.
lowerLowercase letters.
printPrintable characters, including space.
punctPunctuation characters.
spaceHorizontal or vertical whitespace.
upperUppercase letters.
xdigitHexadecimal digits.
- Equivalence classes
The syntax ‘
[=c=]’ expands to all of the characters that are equivalent toc, in no particular order. Equivalence classes are a relatively recent invention intended to support non-English alphabets. But there seems to be no standard way to define them or determine their contents. Therefore, they are not fully implemented in GNUtr; each character’s equivalence class consists only of that character, which is of no particular use.
Next: Translating, Up: tr invocation [Contents][Index]