Gawk/GNU-Regexp-Operators
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of
additional regexp operators. These operators are described in this
section and are specific to gawk; they are not available in other awk
implementations.
Most of the additional operators deal with word matching.
For our purposes, a word is a sequence of one or more letters, digits,
or underscores (‘_’):
\s
Matches any space character as defined by the current locale.
Think of it as shorthand for ‘[[:space:]]’.
\S
Matches any character that is not a space, as defined by the current locale.
Think of it as shorthand for ‘[^[:space:]]’.
\w
Matches any word-constituent character; that is, it matches any
letter, digit, or underscore. Think of it as shorthand for ‘[[:alnum:]_]’.
\W
Matches any character that is not word-constituent.
Think of it as shorthand for ‘[^[:alnum:]_]’.
\<
Matches the empty string at the beginning of a word.
For example, /\<away/ matches ‘away’ but not ‘stowaway’.
\>
Matches the empty string at the end of a word.
For example, /stow\>/ matches ‘stow’ but not ‘stowaway’.
\y
Matches the empty string at either the beginning or the
end of a word (i.e., the word boundary). For example, ‘\yballs?\y’
matches either ‘ball’ or ‘balls’, as a separate word.
\B
Matches the empty string that occurs between two
word-constituent characters. For example, /\Brat\B/ matches ‘crate’,
but it does not match ‘dirty rat’.
‘\B’ is essentially the opposite of ‘\y’.
There are two other operators that work on buffers. In Emacs, a
buffer is, naturally, an Emacs buffer.
Other GNU programs, including gawk, consider the entire string to
match as the buffer.
The operators are:
\`
Matches the empty string at the beginning of a buffer (string)
\'
Matches the empty string at the end of a buffer (string)
Because ‘^’ and ‘$’ always work in terms of the beginning
and end of strings, these operators don’t add any new capabilities
for awk. They are provided for compatibility with other
GNU software.
In other GNU software, the word-boundary operator is ‘\b’. However,
that conflicts with the awk language’s definition of ‘\b’
as backspace, so gawk uses a different letter.
An alternative method would have been to require two backslashes in the
GNU operators, but this was deemed too confusing. The current
method of using ‘\y’ for the GNU ‘\b’ appears to be the
lesser of two evils.
The various command-line options
(see section Command-Line Options)
control how gawk interprets characters in regexps:
No options
In the default case, gawk provides all the facilities of POSIX regexps
and the previously described GNU regexp operators.
--posix
Match only POSIX regexps; the GNU operators are not special (e.g., ‘\w’
matches a literal ‘w’). Interval expressions are allowed.
--traditional
Match traditional Unix awk regexps. The GNU operators are not special,
and interval expressions are not available. Because BWK awk supports
them, the POSIX character classes (‘[[:alnum:]]’, etc.) are available.
Characters described by octal and hexadecimal escape sequences are
treated literally, even if they represent regexp metacharacters.
--re-interval
Allow interval expressions in regexps, if --traditional has been
provided. Otherwise, interval expressions are available by default.