3.4 Using Bracket Expressions

As mentioned earlier, a bracket expression matches any character among those listed between the opening and closing square brackets.

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, based upon the system’s native character set. For example, ‘[0-9]’ is equivalent to ‘[0123456789]’. (See Regexp Ranges and Locales: A Long Sad Story for an explanation of how the POSIX standard and gawk have changed over time. This is mainly of historical interest.)
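As a minimal sketch of a range expression in use, the following commands filter three sample lines. It assumes that `awk` on your system is gawk or another POSIX-compatible awk; the sample input and the shell variable names are made up for illustration. Setting `LC_ALL=C` pins the locale so that the range covers exactly the ASCII digits.

```sh
# Keep only the lines that contain at least one digit.
digits=$(printf 'abc\nr2d2\n42\n' | LC_ALL=C awk '/[0-9]/')

# The same filter with every digit listed explicitly.
listed=$(printf 'abc\nr2d2\n42\n' | LC_ALL=C awk '/[0123456789]/')

printf '%s\n' "$digits"    # prints the lines "r2d2" and "42"
```

Both variables end up holding the same two lines, which is what the equivalence of ‘[0-9]’ and ‘[0123456789]’ predicts.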

With the increasing popularity of the Unicode character standard, there is an additional wrinkle to consider. Octal and hexadecimal escape sequences inside bracket expressions are taken to represent only single-byte characters (characters whose values fit within the range 0–255). To match a range of characters where the endpoints of the range are larger than 255, enter the multibyte encodings of the characters directly.

To include one of the characters ‘\’, ‘]’, ‘-’, or ‘^’ in a bracket expression, put a ‘\’ in front of it. For example:

[d\]]

matches either ‘d’ or ‘]’. Additionally, if you place ‘]’ right after the opening ‘[’, the closing bracket is treated as one of the characters to be matched.
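The two ways of including ‘]’ described above can be tried out as follows. This is only a sketch: the one-character sample lines and the variable names are invented, and it assumes `awk` is gawk (other awk implementations may treat the leading ‘]’ form differently).

```sh
# Three one-character input lines: "d", "]" and "x".
input='d
]
x'

# Form 1: escape the closing bracket with a backslash.
escaped=$(printf '%s\n' "$input" | awk '/[d\]]/')

# Form 2: put the closing bracket right after the opening bracket.
leading=$(printf '%s\n' "$input" | awk '/[]d]/')

printf '%s\n' "$escaped"   # prints the lines "d" and "]"
```

With gawk, both forms select the same two lines and reject ‘x’.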

The treatment of ‘\’ in bracket expressions is compatible with other awk implementations and is also mandated by POSIX. The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.

Character classes are a feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but the actual characters can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs between the United States and France.

A character class is only valid in a regexp inside the brackets of a bracket expression. Character classes consist of ‘[:’, a keyword denoting the class, and ‘:]’. Table 3.1 lists the character classes defined by the POSIX standard.

Class	Meaning
`[:alnum:]`	Alphanumeric characters
`[:alpha:]`	Alphabetic characters
`[:blank:]`	Space and TAB characters
`[:cntrl:]`	Control characters
`[:digit:]`	Numeric characters
`[:graph:]`	Characters that are both printable and visible (a space is printable but not visible, whereas an ‘`a`’ is both)
`[:lower:]`	Lowercase alphabetic characters
`[:print:]`	Printable characters (characters that are not control characters)
`[:punct:]`	Punctuation characters (characters that are not letters, digits, control characters, or space characters)
`[:space:]`	Space characters (these are: space, TAB, newline, carriage return, formfeed and vertical tab)
`[:upper:]`	Uppercase alphabetic characters
`[:xdigit:]`	Characters that are hexadecimal digits

Table 3.1: POSIX character classes

For example, before the POSIX standard, you had to write /[A-Za-z0-9]/ to match alphanumeric characters. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write /[[:alnum:]]/ to match the alphabetic and numeric characters in your character set.
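The following sketch shows a character class inside a bracket expression, first alone and then combined with an ordinary character. It assumes an awk that supports POSIX character classes (gawk does); the sample lines and variable names are illustrative only, and `LC_ALL=C` is set so that the result does not depend on your locale.

```sh
# Lines made up entirely of alphanumeric characters.
words=$(printf 'abc123\n---\nfoo bar\n' | LC_ALL=C awk '/^[[:alnum:]]+$/')

# A class can share its bracket expression with other characters:
# here, alphanumerics plus the underscore.
idents=$(printf 'abc_123\n---\n' | LC_ALL=C awk '/^[[:alnum:]_]+$/')

printf '%s\n' "$words"     # prints "abc123"
printf '%s\n' "$idents"    # prints "abc_123"
```

Note the doubled brackets: the outer pair delimits the bracket expression, and the inner ‘[:alnum:]’ names the class.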

Some utilities that match regular expressions provide a nonstandard ‘[:ascii:]’ character class; awk does not. However, you can simulate such a construct using ‘[\x00-\x7F]’. This matches all values numerically between zero and 127, which is the defined range of the ASCII character set. Use a complemented character list (‘[^\x00-\x7F]’) to match any single-byte characters that are not in the ASCII range.

NOTE: Some older versions of Unix awk treat [:blank:] like [:space:], incorrectly matching more characters than they should. Caveat Emptor.
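To check how your own awk distinguishes the two classes, a small test along these lines may help. It is a sketch with a made-up input record containing one space, one TAB, and one carriage return; it relies on `gsub()` returning the number of substitutions it made.

```sh
# The record is: a, space, b, TAB, c, carriage return, d.
blank=$(printf 'a b\tc\rd\n' | LC_ALL=C awk '{ print gsub(/[[:blank:]]/, "") }')
space=$(printf 'a b\tc\rd\n' | LC_ALL=C awk '{ print gsub(/[[:space:]]/, "") }')

printf 'blank=%s space=%s\n' "$blank" "$space"
```

A conforming awk reports 2 for ‘[:blank:]’ (the space and the TAB) and 3 for ‘[:space:]’ (which also counts the carriage return). If both numbers come out as 3, your awk has the problem described in the note.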

Two additional special sequences can appear in bracket expressions. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character. They can also have several characters that are equivalent for collating, or sorting, purposes. (For example, in French, a plain “e” and a grave-accented “è” are equivalent.) These sequences are:

Collating symbols

Multicharacter collating elements enclosed between ‘[.’ and ‘.]’. For example, if ‘ch’ is a collating element, then ‘[[.ch.]]’ is a regexp that matches this collating element, whereas ‘[ch]’ is a regexp that matches either ‘c’ or ‘h’.

Equivalence classes

Locale-specific names for a list of characters that are equal. The name is enclosed between ‘[=’ and ‘=]’. For example, the name ‘e’ might be used to represent all of “e,” “ê,” “è,” and “é.” In this case, ‘[[=e=]]’ is a regexp that matches any of ‘e’, ‘ê’, ‘é’, or ‘è’.

These features are very valuable in non-English-speaking locales.

CAUTION: The library functions that gawk uses for regular expression matching currently recognize only POSIX character classes; they do not recognize collating symbols or equivalence classes.

Inside a bracket expression, an opening bracket (‘[’) that does not start a character class, collating element or equivalence class is taken literally. This is also true of ‘.’ and ‘*’.
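The literal treatment of ‘[’, ‘.’, and ‘*’ can be seen in a short sketch such as the following. The one-character input lines and the variable name are invented for the example, and `awk` is assumed to be gawk or a compatible implementation. The ‘[’ is placed last in the list so that it cannot be mistaken for the start of a ‘[.’, ‘[:’, or ‘[=’ sequence.

```sh
# Five one-character lines: a . * [ b
literal=$(printf 'a\n.\n*\n[\nb\n' | awk '/[.*[]/')

printf '%s\n' "$literal"   # prints the lines ".", "*" and "["
```

If ‘.’ kept its usual meaning of "any character" inside the brackets, the lines ‘a’ and ‘b’ would be selected as well; they are not.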