Gawk/Regexp-Operator-Details
3.3.1 Regexp Operators in awk
The escape sequences described earlier in Escape Sequences are valid inside a regexp. They are introduced by a ‘\’ and are recognized and converted into corresponding real characters as the very first step in processing regexps.
Here is a list of metacharacters. All characters that are not escape sequences and that are not listed here stand for themselves:
\
This suppresses the special meaning of a character when matching. For example, ‘\$’ matches the character ‘$’.
^
This matches the beginning of a string. ‘^@chapter’ matches ‘@chapter’ at the beginning of a string, for example, and can be used to identify chapter beginnings in Texinfo source files. The ‘^’ is known as an anchor, because it anchors the pattern to match only at the beginning of the string.
It is important to realize that ‘^’ does not match the beginning of a line (the point right after a ‘\n’ newline character) embedded in a string. The condition is not true in the following example:
if ("line1\nLINE 2" ~ /^L/) …
$
This is similar to ‘^’, but it matches only at the end of a string. For example, ‘p$’ matches a record that ends with a ‘p’. The ‘$’ is an anchor and does not match the end of a line (the point right before a ‘\n’ newline character) embedded in a string. The condition in the following example is not true:
if ("line1\nLINE 2" ~ /1$/) …
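The two anchor conditions above can be checked directly. The following is a minimal sketch, assuming a POSIX-compatible awk is available on PATH as awk; the shell variable name anchors and the awk variable names are illustrative only.

```shell
# Assumes a POSIX-compatible awk is available on PATH as "awk".
anchors=$(awk 'BEGIN {
    s = "line1\nLINE 2"    # one string containing an embedded newline
    a = (s ~ /^L/)         # "L" follows the newline, not the start of the string
    b = (s ~ /^l/)         # "l" is the first character of the string
    c = (s ~ /1$/)         # "1" precedes the newline, not the end of the string
    d = (s ~ /2$/)         # "2" is the last character of the string
    print a, b, c, d
}')
printf '%s\n' "$anchors"   # 0 1 0 1
```

Each value is 1 when the match succeeds and 0 when it fails, so only the tests against the true start and end of the string succeed.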
. (period)
This matches any single character, including the newline character. For example, ‘.P’ matches any single character followed by a ‘P’ in a string. Using concatenation, we can make a regular expression such as ‘U.A’, which matches any three-character sequence that begins with ‘U’ and ends with ‘A’.
In strict POSIX mode (see section Command-Line Options), ‘.’ does not match the NUL character, which is a character with all bits equal to zero. Otherwise, NUL is just another character. Other versions of awk may not be able to match the NUL character.
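As an illustration of the period operator described above, here is a small sketch, assuming a POSIX-compatible awk on PATH as awk. The test strings and variable names are made up for the example, and NUL handling is deliberately left out because it varies between awk versions.

```shell
# Assumes a POSIX-compatible awk is available on PATH as "awk".
dot=$(awk 'BEGIN {
    a = ("USA" ~ /U.A/)     # "S" fills the single-character slot
    b = ("UA" ~ /U.A/)      # nothing between "U" and "A", so no match
    c = ("U\nA" ~ /U.A/)    # a newline counts as a single character
    print a, b, c
}')
printf '%s\n' "$dot"        # 1 0 1
```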
[…]
This is called a bracket expression.(16) It matches any one of the characters that are enclosed in the square brackets. For example, ‘[MVX]’ matches any one of the characters ‘M’, ‘V’, or ‘X’ in a string. A full discussion of what can be inside the square brackets of a bracket expression is given in Using Bracket Expressions.
[^…]
This is a complemented bracket expression. The first character after the ‘[’ must be a ‘^’. It matches any single character except those in the square brackets. For example, ‘[^awk]’ matches any character that is not an ‘a’, ‘w’, or ‘k’.
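The bracket expressions ‘[MVX]’ and ‘[^awk]’ from the two entries above can be tried out as follows. This is only a sketch, assuming a POSIX-compatible awk on PATH as awk; the sample strings are invented for illustration.

```shell
# Assumes a POSIX-compatible awk is available on PATH as "awk".
brackets=$(awk 'BEGIN {
    a = ("MIX" ~ /[MVX]/)     # contains "M" and "X"
    b = ("abc" ~ /[MVX]/)     # contains none of "M", "V", "X"
    c = ("awk" ~ /[^awk]/)    # every character is "a", "w", or "k"
    d = ("gawk" ~ /[^awk]/)   # "g" is outside the excluded set
    print a, b, c, d
}')
printf '%s\n' "$brackets"     # 1 0 0 1
```

Note that a complemented bracket expression still has to match one character, which is why the string "awk" fails against ‘[^awk]’.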
|
This is the alternation operator and it is used to specify alternatives. The ‘|’ has the lowest precedence of all the regular expression operators. For example, ‘^P|[aeiouy]’ matches any string that matches either ‘^P’ or ‘[aeiouy]’. This means it matches any string that starts with ‘P’ or contains (anywhere within it) a lowercase English vowel.
The alternation applies to the largest possible regexps on either side.
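To see that the ‘^’ in ‘^P|[aeiouy]’ belongs only to the left-hand alternative, the regexp can be run against a few strings. This is a minimal sketch, assuming a POSIX-compatible awk on PATH as awk; the sample strings are illustrative.

```shell
# Assumes a POSIX-compatible awk is available on PATH as "awk".
alt=$(awk 'BEGIN {
    a = ("Print" ~ /^P|[aeiouy]/)   # starts with "P"
    b = ("PX" ~ /^P|[aeiouy]/)      # starts with "P", no vowel needed
    c = ("XPZ" ~ /^P|[aeiouy]/)     # "P" is not first, and there is no vowel
    d = ("xyz" ~ /^P|[aeiouy]/)     # contains "y", anywhere in the string
    print a, b, c, d
}')
printf '%s\n' "$alt"                # 1 1 0 1
```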
(…)
Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, ‘|’. For example, ‘@(samp|code)\{[^}]+\}’ matches both ‘@code{foo}’ and ‘@samp{bar}’. (These are Texinfo formatting control sequences. The ‘+’ is explained further on in this list.)
The left or opening parenthesis is always a metacharacter; to match one literally, precede it with a backslash. However, the right or closing parenthesis is only special when paired with a left parenthesis; an unpaired right parenthesis is (silently) treated as a regular character.
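The Texinfo regexp from this entry can be exercised as shown here. It is a sketch under the assumption that a POSIX-compatible awk is on PATH as awk; the inputs ‘@var{x}’ and ‘@code{}’ are added only to show non-matching cases.

```shell
# Assumes a POSIX-compatible awk is available on PATH as "awk".
group=$(awk 'BEGIN {
    a = ("@code{foo}" ~ /@(samp|code)\{[^}]+\}/)   # "code" alternative
    b = ("@samp{bar}" ~ /@(samp|code)\{[^}]+\}/)   # "samp" alternative
    c = ("@var{x}" ~ /@(samp|code)\{[^}]+\}/)      # "var" is not an alternative
    d = ("@code{}" ~ /@(samp|code)\{[^}]+\}/)      # braces must not be empty
    print a, b, c, d
}')
printf '%s\n' "$group"                             # 1 1 0 0
```

The parentheses confine the alternation to ‘samp’ or ‘code’, so the leading ‘@’ and the braced argument apply to both alternatives.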
*
This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ‘ph*’ applies the ‘*’ symbol to the preceding ‘h’ and looks for matches of one ‘p’ followed by any number of ‘h’s. This also matches just ‘p’ if no ‘h’s are present.
There are two subtle points to understand about how ‘*’ works. First, the ‘*’ applies only to the single preceding regular expression component (e.g., in ‘ph*’, it applies just to the ‘h’). To cause ‘*’ to apply to a larger subexpression, use parentheses: ‘(ph)*’ matches ‘ph’, ‘phph’, ‘phphph’, and so on.
Second, ‘*’ finds as many repetitions as possible. If the text to be matched is ‘phhhhhhhhhhhhhhooey’, ‘ph*’ matches all of the ‘h’s.
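Both points about ‘*’ can be observed with the standard match() function, which sets RSTART and RLENGTH to the position and length of the matched text. The sketch below assumes a POSIX-compatible awk on PATH as awk and uses shortened sample strings chosen for the example.

```shell
# Assumes a POSIX-compatible awk is available on PATH as "awk".
star=$(awk 'BEGIN {
    match("phhhooey", /ph*/)       # "p" plus all three "h" characters
    a = RSTART ":" RLENGTH
    match("pooey", /ph*/)          # zero "h" characters is still a match
    b = RSTART ":" RLENGTH
    match("phphphx", /(ph)*/)      # the group repeats as a unit, three times
    c = RSTART ":" RLENGTH
    print a, b, c
}')
printf '%s\n' "$star"              # 1:4 1:1 1:6
```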
+
This symbol is similar to ‘*’, except that the preceding expression must be matched at least once. This means that ‘wh+y’ would match ‘why’ and ‘whhy’, but not ‘wy’, whereas ‘wh*y’ would match all three.
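The difference between ‘wh+y’ and ‘wh*y’ described above can be checked with a short sketch like this one, assuming a POSIX-compatible awk on PATH as awk; the variable names are illustrative.

```shell
# Assumes a POSIX-compatible awk is available on PATH as "awk".
plus=$(awk 'BEGIN {
    a = ("why" ~ /wh+y/)     # one "h"
    b = ("whhy" ~ /wh+y/)    # two "h" characters
    c = ("wy" ~ /wh+y/)      # no "h", but "+" needs at least one
    d = ("wy" ~ /wh*y/)      # "*" accepts zero repetitions
    print a, b, c, d
}')
printf '%s\n' "$plus"        # 1 1 0 1
```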
?
This symbol is similar to ‘*’, except that the preceding expression can be matched either once or not at all. For example, ‘fe?d’ matches ‘fed’ and ‘fd’, but nothing else.
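A quick check of ‘fe?d’, again as a sketch that assumes a POSIX-compatible awk on PATH as awk. The string ‘feed’ is an added example of text the regexp does not match.

```shell
# Assumes a POSIX-compatible awk is available on PATH as "awk".
opt=$(awk 'BEGIN {
    a = ("fed" ~ /fe?d/)     # the "e" is present once
    b = ("fd" ~ /fe?d/)      # the "e" is absent
    c = ("feed" ~ /fe?d/)    # two "e" characters are one too many
    print a, b, c
}')
printf '%s\n' "$opt"         # 1 1 0
```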
{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times:
wh{3}y
Matches ‘whhhy’, but not ‘why’ or ‘whhhhy’.
wh{3,5}y
Matches ‘whhhy’, ‘whhhhy’, or ‘whhhhhy’ only.
wh{2,}y
Matches ‘whhy’, ‘whhhy’, and so on.
In regular expressions, the ‘*’, ‘+’, and ‘?’ operators, as well as the braces ‘{’ and ‘}’, have the highest precedence, followed by concatenation, and finally by ‘|’. As in arithmetic, parentheses can change how operators are grouped.
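The precedence rules in the preceding paragraph can be illustrated with a small sketch, assuming a POSIX-compatible awk on PATH as awk. The regexps and sample strings here are invented for the example and do not come from the manual.

```shell
# Assumes a POSIX-compatible awk is available on PATH as "awk".
prec=$(awk 'BEGIN {
    # "|" binds loosest: this reads as (^gra) or (ey$)
    a = ("grab" ~ /^gra|ey$/)
    b = ("hey" ~ /^gra|ey$/)
    # parentheses restrict the alternation to the single letter
    c = ("grey" ~ /^gr(a|e)y$/)
    d = ("grab" ~ /^gr(a|e)y$/)
    # repetition binds tighter than concatenation
    match("ababab", /ab+/);   e = RLENGTH   # "+" repeats only the "b"
    match("ababab", /(ab)+/); f = RLENGTH   # "+" repeats the whole group
    print a, b, c, d, e, f
}')
printf '%s\n' "$prec"        # 1 1 1 0 2 6
```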
In POSIX awk and gawk, the ‘*’, ‘+’, and ‘?’ operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.
Footnotes
(16)
In other literature, you may see a bracket expression referred to as either a character set, a character class, or a character list.