Regexp Operator Details (The GNU Awk User’s Guide)
3.3.1 Regexp Operators in awk
The escape sequences described earlier in Escape Sequences are valid inside a regexp. They are introduced by a ‘\’ and are recognized and converted into corresponding real characters as the very first step in processing regexps.
Here is a list of metacharacters. All characters that are not escape sequences and that are not listed here stand for themselves:
\
This suppresses the special meaning of a character when matching. For example, ‘\$’ matches the character ‘$’.
^
This matches the beginning of a string. For example, ‘^@chapter’ matches ‘@chapter’ at the beginning of a string and can be used to identify chapter beginnings in Texinfo source files. The ‘^’ is known as an anchor, because it anchors the pattern to match only at the beginning of the string.
It is important to realize that ‘^’ does not match the beginning of a line (the point right after a ‘\n’ newline character) embedded in a string. The condition is not true in the following example:
if ("line1\nLINE 2" ~ /^L/) …
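The behavior described above can be checked from the shell. This is a minimal sketch, assuming a POSIX-compatible awk is installed as `awk`; the shell variable names `embedded` and `leading` are illustrative only.

```sh
# The "L" follows an embedded newline, so it is not at the start of the string.
embedded=$(awk 'BEGIN { if ("line1\nLINE 2" ~ /^L/) print "match"; else print "no match" }')
# Here the "L" really is the first character of the string.
leading=$(awk 'BEGIN { if ("LINE 2\nline1" ~ /^L/) print "match"; else print "no match" }')
echo "embedded: $embedded"
echo "leading: $leading"
```

Only the second test should report a match, because ‘^’ anchors to the start of the whole string and not to the start of each embedded line.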
$
This is similar to ‘^’, but it matches only at the end of a string. For example, ‘p$’ matches a record that ends with a ‘p’. The ‘$’ is an anchor and does not match the end of a line (the point right before a ‘\n’ newline character) embedded in a string. The condition in the following example is not true:
if ("line1\nLINE 2" ~ /1$/) …
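As a rough illustration of both points, the following sketch filters a few records with ‘p$’ and then repeats the embedded-newline test. It assumes a POSIX-compatible awk is installed as `awk`; the sample words and the shell variable names are made up for this example.

```sh
# Keep only the records whose last character is "p".
ends_with_p=$(printf 'stop\nstops\nplan\n' | awk '/p$/')
# The "1" sits right before an embedded newline, not at the end of the string.
inner=$(awk 'BEGIN { if ("line1\nLINE 2" ~ /1$/) print "match"; else print "no match" }')
# The "2" is the last character of the string.
final=$(awk 'BEGIN { if ("line1\nLINE 2" ~ /2$/) print "match"; else print "no match" }')
echo "records ending in p: $ends_with_p"
echo "inner: $inner"
echo "final: $final"
```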
. (period)
This matches any single character, including the newline character. For example, ‘.P’ matches any single character followed by a ‘P’ in a string. Using concatenation, we can make a regular expression such as ‘U.A’, which matches any three-character sequence that begins with ‘U’ and ends with ‘A’.
In strict POSIX mode (see section Command-Line Options), ‘.’ does not match the NUL character, which is a character with all bits equal to zero. Otherwise, NUL is just another character. Other versions of awk may not be able to match the NUL character.
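The following sketch tries ‘U.A’ against three strings to show that the period consumes exactly one character, and that the character may be a newline. It assumes a POSIX-compatible awk is installed as `awk`; the ‘^’ and ‘$’ anchors are added here only so that the whole string has to match, and the variable names are illustrative.

```sh
# An ordinary character between "U" and "A".
with_letter=$(awk 'BEGIN { if ("UxA" ~ /^U.A$/) print "match"; else print "no match" }')
# A newline between "U" and "A" is also matched by ".".
with_newline=$(awk 'BEGIN { if ("U\nA" ~ /^U.A$/) print "match"; else print "no match" }')
# Nothing between "U" and "A": "." must consume one character, so this fails.
too_short=$(awk 'BEGIN { if ("UA" ~ /^U.A$/) print "match"; else print "no match" }')
echo "$with_letter / $with_newline / $too_short"
```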
[…]
This is called a bracket expression (16). It matches any one of the characters that are enclosed in the square brackets. For example, ‘[MVX]’ matches any one of the characters ‘M’, ‘V’, or ‘X’ in a string. A full discussion of what can be inside the square brackets of a bracket expression is given in Using Bracket Expressions.
[^…]
This is a complemented bracket expression. The first character after the ‘[’ must be a ‘^’. It matches any single character except those in the square brackets. For example, ‘[^awk]’ matches any character that is not an ‘a’, ‘w’, or ‘k’.
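One way to see which characters a bracket expression selects is to count or delete its matches with `gsub()`. This is only a sketch, assuming a POSIX-compatible awk is installed as `awk`; the sample strings and shell variable names are invented for the illustration.

```sh
# gsub() returns the number of substitutions, i.e. how many characters
# of the string are "M", "V", or "X".
count=$(awk 'BEGIN { s = "MCMXCV"; n = gsub(/[MVX]/, "", s); print n }')
# Deleting everything matched by the complemented expression leaves
# only the characters "a", "w", and "k".
kept=$(awk 'BEGIN { s = "gawk works"; gsub(/[^awk]/, "", s); print s }')
echo "count: $count"
echo "kept: $kept"
```

Note that the complemented expression also removes the space, because a space is simply another character that is not in the list.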
|
This is the alternation operator and it is used to specify alternatives. The ‘|’ has the lowest precedence of all the regular expression operators. For example, ‘^P|[aeiouy]’ matches any string that matches either ‘^P’ or ‘[aeiouy]’. This means it matches any string that starts with ‘P’ or contains (anywhere within it) a lowercase English vowel.
The alternation applies to the largest possible regexps on either side.
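The example regexp can be tried on a few records as follows. This is a minimal sketch, assuming a POSIX-compatible awk is installed as `awk`; the input words are arbitrary.

```sh
# "Paris" starts with "P"; "stone" contains lowercase vowels.
# "SKY" and "BRR" neither start with "P" nor contain a lowercase vowel.
matched=$(printf 'Paris\nstone\nSKY\nBRR\n' | awk '/^P|[aeiouy]/')
echo "$matched"
```

Because ‘|’ binds most loosely, the ‘^’ belongs only to the first alternative; the second alternative may match anywhere in the record.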
(…)
Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, ‘|’. For example, ‘@(samp|code)\{[^}]+\}’ matches both ‘@code{foo}’ and ‘@samp{bar}’. (These are Texinfo formatting control sequences. The ‘+’ is explained further on in this list.)
The left or opening parenthesis is always a metacharacter; to match one literally, precede it with a backslash. However, the right or closing parenthesis is only special when paired with a left parenthesis; an unpaired right parenthesis is (silently) treated as a regular character.
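The Texinfo example can be exercised like this. The sketch assumes a POSIX-compatible awk is installed as `awk`; the input lines other than ‘@code{foo}’ and ‘@samp{bar}’ are made up to show what is rejected.

```sh
# "@var{baz}" is rejected because "var" is neither alternative, and
# "@code{}" is rejected because "[^}]+" needs at least one character.
grouped=$(printf '@code{foo}\n@samp{bar}\n@var{baz}\n@code{}\n' |
    awk '/@(samp|code)\{[^}]+\}/')
echo "$grouped"
```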
*
This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ‘ph*’ applies the ‘*’ symbol to the preceding ‘h’ and looks for matches of one ‘p’ followed by any number of ‘h’s. This also matches just ‘p’ if no ‘h’s are present.
There are two subtle points to understand about how ‘*’ works. First, the ‘*’ applies only to the single preceding regular expression component (e.g., in ‘ph*’, it applies just to the ‘h’). To cause ‘*’ to apply to a larger subexpression, use parentheses: ‘(ph)*’ matches ‘ph’, ‘phph’, ‘phphph’, and so on.
Second, ‘*’ finds as many repetitions as possible. If the text to be matched is ‘phhhhhhhhhhhhhhooey’, ‘ph*’ matches all of the ‘h’s.
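Both points can be observed with the `match()` function, which sets `RSTART` and `RLENGTH` to the position and length of the matched text. This is a sketch only, assuming a POSIX-compatible awk is installed as `awk`; the shorter sample strings and the shell variable names are chosen for this illustration.

```sh
# "*" takes as many "h"s as it can: the match is "phhh", not just "p".
longest=$(awk 'BEGIN { s = "xphhhooey"; match(s, /ph*/); print substr(s, RSTART, RLENGTH) }')
# With no "h" present, "ph*" still matches the bare "p".
bare=$(awk 'BEGIN { s = "soup"; match(s, /ph*/); print substr(s, RSTART, RLENGTH) }')
# Without parentheses the "*" covers only the "h"; with them it covers "ph".
plain=$(awk 'BEGIN { if ("phph" ~ /^ph*$/) print "match"; else print "no match" }')
group=$(awk 'BEGIN { if ("phph" ~ /^(ph)*$/) print "match"; else print "no match" }')
echo "$longest / $bare / $plain / $group"
```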
+
This symbol is similar to ‘*’, except that the preceding expression must be matched at least once. This means that ‘wh+y’ would match ‘why’ and ‘whhy’, but not ‘wy’, whereas ‘wh*y’ would match all three.
?
This symbol is similar to ‘*’, except that the preceding expression can be matched either once or not at all. For example, ‘fe?d’ matches ‘fed’ and ‘fd’, but nothing else.
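The three repetition operators ‘*’, ‘+’, and ‘?’ can be compared side by side on the same input. The sketch below assumes a POSIX-compatible awk is installed as `awk`. The helper `list_matches` is hypothetical, not part of awk; it passes the regexp as a string in the variable `re` and joins the matching records with spaces.

```sh
# Hypothetical helper: print, on one line, the input records that match
# the regexp given as the first argument.
list_matches() {
    printf 'wy\nwhy\nwhhy\n' |
        awk -v re="$1" '$0 ~ re { out = out sep $0; sep = " " } END { print out }'
}

star=$(list_matches '^wh*y$')   # zero or more "h"s
plus=$(list_matches '^wh+y$')   # one or more "h"s
opt=$(list_matches '^wh?y$')    # zero or one "h"
echo "*: $star"
echo "+: $plus"
echo "?: $opt"
```

The anchors ‘^’ and ‘$’ are added so that each record must match in full; without them ‘wh?y’ would still have to find its whole match inside the record, so the result here would be the same, but the intent is clearer.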
{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times:
wh{3}y
Matches ‘whhhy’, but not ‘why’ or ‘whhhhy’.
wh{3,5}y
Matches ‘whhhy’, ‘whhhhy’, or ‘whhhhhy’ only.
wh{2,}y
Matches ‘whhy’, ‘whhhy’, and so on.
In regular expressions, the ‘*’, ‘+’, and ‘?’ operators, as well as the braces ‘{’ and ‘}’, have the highest precedence, followed by concatenation, and finally by ‘|’. As in arithmetic, parentheses can change how operators are grouped.
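These precedence rules can be seen by comparing regexps with and without parentheses. This is a minimal sketch, assuming a POSIX-compatible awk is installed as `awk`. The shell function `check` is a hypothetical helper for this example; it passes the string and the regexp to awk as the variables `s` and `re`.

```sh
# Hypothetical helper: report whether string $1 matches regexp $2.
check() {
    awk -v s="$1" -v re="$2" 'BEGIN { if (s ~ re) print "match"; else print "no match" }'
}

# Repetition binds tighter than concatenation: "+" applies to "b" alone.
r1=$(check 'abb'  '^ab+$')
r2=$(check 'abab' '^ab+$')
# Parentheses make "+" apply to the whole of "ab".
r3=$(check 'abab' '^(ab)+$')
# Concatenation binds tighter than "|": this reads as (^ab)|(cd$).
r4=$(check 'abXX' '^ab|cd$')
# Parentheses limit the alternation, so both anchors apply to each branch.
r5=$(check 'abXX' '^(ab|cd)$')
echo "$r1 / $r2 / $r3 / $r4 / $r5"
```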
In POSIX awk and gawk, the ‘*’, ‘+’, and ‘?’ operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.
Footnotes
(16)
In other literature, you may see a bracket expression referred to as either a character set, a character class, or a character list.