GNU Regexp Operators (The GNU Awk User’s Guide)
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
- \s
Matches any space character as defined by the current locale. Think of it as shorthand for ‘[[:space:]]’.
- \S
Matches any character that is not a space, as defined by the current locale. Think of it as shorthand for ‘[^[:space:]]’.
- \w
Matches any word-constituent character, that is, any letter, digit, or underscore. Think of it as shorthand for ‘[[:alnum:]_]’.
- \W
Matches any character that is not word-constituent. Think of it as shorthand for ‘[^[:alnum:]_]’.
- \<
Matches the empty string at the beginning of a word. For example, /\<away/ matches ‘away’ but not ‘stowaway’.
- \>
Matches the empty string at the end of a word. For example, /stow\>/ matches ‘stow’ but not ‘stowaway’.
- \y
Matches the empty string at either the beginning or the end of a word (i.e., the word boundary). For example, ‘\yballs?\y’ matches either ‘ball’ or ‘balls’, as a separate word.
- \B
Matches the empty string that occurs between two word-constituent characters. For example, /\Brat\B/ matches ‘crate’, but it does not match ‘dirty rat’. ‘\B’ is essentially the opposite of ‘\y’.
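The operators above can be tried directly from a shell. The following is a minimal sketch, assuming gawk is installed and on the PATH; the function names (count_word_chars, squeeze_spaces, word_boundary_demo) and the sample strings are illustrative and not part of the manual.

```shell
# Illustrative helpers (names are hypothetical); requires gawk.

# Count word-constituent characters by deleting everything that \W matches.
count_word_chars() {
    printf '%s\n' "$1" | gawk '{ s = $0; gsub(/\W/, "", s); print length(s) }'
}

# Replace each run of space characters (\s+) with a single underscore.
squeeze_spaces() {
    printf '%s\n' "$1" | gawk '{ gsub(/\s+/, "_"); print }'
}

# Print 1 (match) or 0 (no match) for the word-boundary examples in the text.
word_boundary_demo() {
    gawk 'BEGIN {
        print ("away"      ~ /\<away/)     # 1: "away" begins a word
        print ("stowaway"  ~ /\<away/)     # 0: "away" is preceded by a letter
        print ("stow"      ~ /stow\>/)     # 1: "stow" ends a word
        print ("stowaway"  ~ /stow\>/)     # 0: "stow" is followed by a letter
        print ("crate"     ~ /\Brat\B/)    # 1: "rat" sits inside a word
        print ("dirty rat" ~ /\Brat\B/)    # 0: "rat" touches word boundaries
        s = "ball balls football"
        n = gsub(/\yballs?\y/, "X", s)     # replaces only the separate words
        print n, s                         # 2 X X football
    }'
}

count_word_chars 'foo_bar, baz 42!'   # prints 12
squeeze_spaces 'a  b   c'             # prints a_b_c
word_boundary_demo
```

Regexp constants are used here rather than strings so that each backslash is written only once; the same operators inside a string used as a dynamic regexp would need their backslashes doubled.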
There are two other operators that work on buffers. In Emacs, a buffer is, naturally, an Emacs buffer. Other GNU programs, including gawk, consider the entire string to match as the buffer. The operators are:
- \`
Matches the empty string at the beginning of a buffer (string)
- \'
Matches the empty string at the end of a buffer (string)
Because ‘^’ and ‘$’ always work in terms of the beginning and end of strings, these operators don’t add any new capabilities for awk. They are provided for compatibility with other GNU software.
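A short sketch can show that the buffer operators behave like ‘^’ and ‘$’ in gawk, even when the string contains an embedded newline. It assumes gawk is installed and that /dev/stdin is available (as on most Unix-like systems); the function name buffer_anchor_demo and the sample string are illustrative only.

```shell
# Illustrative sketch (function name is hypothetical); requires gawk.
# The program is fed through a quoted here-document so that the backquote
# and apostrophe in the regexps need no shell escaping.
buffer_anchor_demo() {
    gawk -f /dev/stdin <<'EOF'
BEGIN {
    s = "line1\nLINE 2"        # one string with an embedded newline
    caret    = (s ~ /^L/)      # 0: ^ anchors at the start of the string only
    backtick = (s ~ /\`L/)     # 0: same result as ^
    dollar   = (s ~ /2$/)      # 1: $ anchors at the end of the string
    quote    = (s ~ /2\'/)     # 1: same result as $
    print caret, backtick, dollar, quote
}
EOF
}

buffer_anchor_demo   # prints 0 0 1 1
```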
In other GNU software, the word-boundary operator is ‘\b’. However, that conflicts with the awk language’s definition of ‘\b’ as backspace, so gawk uses a different letter. An alternative method would have been to require two backslashes in the GNU operators, but this was deemed too confusing. The current method of using ‘\y’ for the GNU ‘\b’ appears to be the lesser of two evils.
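The difference between the two escapes can be observed with a small sketch like the one below. It assumes gawk is installed and run without options; the function name backspace_vs_boundary and the test strings are illustrative only.

```shell
# Illustrative sketch (function name is hypothetical); requires gawk.
# Prints 1 for a match and 0 for no match, one result per line.
backspace_vs_boundary() {
    gawk 'BEGIN {
        print ("a ball"  ~ /\yball/)   # 1: \y matches at the word boundary
        print ("a ball"  ~ /\bball/)   # 0: \b means a backspace character
        print ("a\bball" ~ /\bball/)   # 1: this string contains a backspace
    }'
}

backspace_vs_boundary
```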
The various command-line options (see section Command-Line Options) control how gawk interprets characters in regexps:
- No options
In the default case, gawk provides all the facilities of POSIX regexps (see section Regular Expression Operators) and the previously described GNU regexp operators.
- --posix
Match only POSIX regexps; the GNU operators are not special (e.g., ‘\w’ matches a literal ‘w’). Interval expressions are allowed.
- --traditional
Match traditional Unix awk regexps. The GNU operators are not special, and interval expressions are not available. Because BWK awk supports them, the POSIX character classes (‘[[:alnum:]]’, etc.) are available. Characters described by octal and hexadecimal escape sequences are treated literally, even if they represent regexp metacharacters.
- --re-interval
Allow interval expressions in regexps, if --traditional has been provided. Otherwise, interval expressions are available by default.
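The effect of --posix on a GNU operator can be checked with a sketch such as the following. It assumes gawk is installed; the function name backslash_w_on_x is illustrative only, and stderr is discarded in case the installed gawk version prints a warning about the escape sequence.

```shell
# Illustrative sketch (function name is hypothetical); requires gawk.
# Any arguments are passed to gawk as options ahead of the program text.
backslash_w_on_x() {
    gawk "$@" 'BEGIN { r = ("x" ~ /\w/) ? "match" : "no match"; print r }' 2>/dev/null
}

backslash_w_on_x            # prints match: \w is a GNU operator by default
backslash_w_on_x --posix    # prints no match: \w is only a literal w
```

Only the default and --posix cases are shown, because they follow directly from the descriptions above and are easy to observe with a single-character test string.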