GNU Regexp Operators (The GNU Awk User’s Guide)
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\s
Matches any space character as defined by the current locale. Think of it as shorthand for ‘[[:space:]]’.
\S
Matches any character that is not a space, as defined by the current locale. Think of it as shorthand for ‘[^[:space:]]’.
\w
Matches any word-constituent character; that is, it matches any letter, digit, or underscore. Think of it as shorthand for ‘[[:alnum:]_]’.
\W
Matches any character that is not word-constituent. Think of it as shorthand for ‘[^[:alnum:]_]’.
\<
Matches the empty string at the beginning of a word. For example, /\<away/ matches ‘away’ but not ‘stowaway’.
\>
Matches the empty string at the end of a word. For example, /stow\>/ matches ‘stow’ but not ‘stowaway’.
\y
Matches the empty string at either the beginning or the end of a word (i.e., the word boundary). For example, ‘\yballs?\y’ matches either ‘ball’ or ‘balls’, as a separate word.
\B
Matches the empty string that occurs between two word-constituent characters. For example, /\Brat\B/ matches ‘crate’, but it does not match ‘dirty rat’. ‘\B’ is essentially the opposite of ‘\y’.
There are two other operators that work on buffers. In Emacs, a buffer is, naturally, an Emacs buffer. Other GNU programs, including gawk, consider the entire string to match as the buffer. The operators are:
\`
Matches the empty string at the beginning of a buffer (string)
\'
Matches the empty string at the end of a buffer (string)
Because ‘^’ and ‘$’ always work in terms of the beginning and end of strings, these operators don’t add any new capabilities for awk. They are provided for compatibility with other GNU software.
In other GNU software, the word-boundary operator is ‘\b’. However, that conflicts with the awk language’s definition of ‘\b’ as backspace, so gawk uses a different letter. An alternative method would have been to require two backslashes in the GNU operators, but this was deemed too confusing. The current method of using ‘\y’ for the GNU ‘\b’ appears to be the lesser of two evils.
The various command-line options (see section Command-Line Options) control how gawk interprets characters in regexps:
- No options
In the default case, gawk provides all the facilities of POSIX regexps and the previously described GNU regexp operators. See also Regular Expression Operators.
- --posix
Match only POSIX regexps; the GNU operators are not special (e.g., ‘\w’ matches a literal ‘w’). Interval expressions are allowed.
- --traditional
Match traditional Unix awk regexps. The GNU operators are not special, and interval expressions are not available. Because BWK awk supports them, the POSIX character classes (‘[[:alnum:]]’, etc.) are available. Characters described by octal and hexadecimal escape sequences are treated literally, even if they represent regexp metacharacters.
- --re-interval
Allow interval expressions in regexps, if --traditional has been provided. Otherwise, interval expressions are available by default.