More about ‘\’ and ‘&’ with sub(), gsub(), and gensub()
CAUTION: This subsubsection has been reported to cause headaches.
You might want to skip it upon first reading.
When using sub(), gsub(), or gensub(), and trying to get literal
backslashes and ampersands into the replacement text, you need to remember
that there are several levels of escape processing going on.
First, there is the lexical level, which is when awk reads your program
and builds an internal copy of it to execute.
Then there is the runtime level, which is when
awk actually scans the
replacement string to determine what to generate.
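The lexical level can be observed on its own, without any substitution function involved. The following is a minimal sketch, assuming a POSIX shell and any awk on the PATH; the variable names are illustrative only. The shell's single quotes pass the backslashes to awk untouched, so everything shown is awk's own processing.

```sh
# Two backslashes typed in an awk string constant become one backslash
# in the string that awk builds when it reads the program.
len=$(awk 'BEGIN { print length("a\\b") }')
str=$(awk 'BEGIN { print "a\\b" }')
printf '%s %s\n' "$len" "$str"
```

This should print ‘3 a\b’: the constant typed as "a\\b" holds three characters, one of which is a single backslash.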
At both levels,
awk looks for a defined set of characters that
can come after a backslash. At the lexical level, it looks for the
escape sequences listed in Escape Sequences.
Thus, for every ‘\’ that awk processes at the runtime
level, you must type two backslashes at the lexical level.
When a character that is not valid for an escape sequence follows the
‘\’, BWK awk and gawk both simply remove the initial
‘\’ and put the next character into the string. Thus, for
example, "a\qb" is treated as "aqb".
At the runtime level, the various functions handle sequences of
‘\’ and ‘&’ differently. The situation is (sadly) somewhat complex.
Historically, the sub() and gsub() functions treated the
two-character sequence ‘\&’ specially; this sequence was replaced in
the generated text with a single ‘&’. Any other ‘\’ within the
replacement string that did not precede an ‘&’ was passed
through unchanged. This is illustrated in Table 9.1.
    You type    sub() sees    sub() generates
    --------    ----------    ---------------
    \&          &             The matched text
    \\&         \&            A literal ‘&’
    \\\&        \&            A literal ‘&’
    \\\\&       \\&           A literal ‘\&’
    \\\\\&      \\&           A literal ‘\&’
    \\\\\\&     \\\&          A literal ‘\\&’
    \\q         \q            A literal ‘\q’

Table 9.1: Historical escape sequence processing for sub() and gsub()
This table shows the lexical-level processing, where
an odd number of backslashes becomes an even number at the runtime level,
as well as the runtime processing done by sub().
(For the sake of simplicity, the rest of the following tables only show the
case of even numbers of backslashes entered at the lexical level.)
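The two simplest runtime cases, a bare ‘&’ and the sequence that sub() sees as ‘\&’, can be tried directly. This is a small sketch, assuming a POSIX shell and any awk on the PATH; the sample input and variable names are illustrative only.

```sh
# "[&]"   : sub() sees &  and inserts the matched text.
# "[\\&]" : sub() sees \& and inserts a literal ampersand.
matched=$(echo "hello" | awk '{ sub(/l+/, "[&]"); print }')
literal=$(echo "hello" | awk '{ sub(/l+/, "[\\&]"); print }')
printf '%s\n%s\n' "$matched" "$literal"
```

This should print ‘he[ll]o’ followed by ‘he[&]o’.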
The problem with the historical approach is that there is no way to get
a literal ‘\’ followed by the matched text.
Several editions of the POSIX standard attempted to fix this problem but weren’t successful. The details are irrelevant at this point in time.
At one point, the
gawk maintainer submitted
proposed text for a revised standard that
reverts to rules that correspond more closely to the original existing
practice. The proposed rules have special cases that make it possible
to produce a ‘\’ preceding the matched text.
This is shown in Table 9.2.
    You type    sub() sees    sub() generates
    --------    ----------    ---------------
    \\\\\\&     \\\&          A literal ‘\&’
    \\\\&       \\&           A literal ‘\’, followed by the matched text
    \\&         \&            A literal ‘&’
    \\q         \q            A literal ‘\q’
    \\\\        \\            \\

Table 9.2: gawk rules for sub() and backslash
In a nutshell, at the runtime level, there are now three special sequences
of characters (‘\\\&’, ‘\\&’, and ‘\&’) whereas historically
there was only one. However, as in the historical case, any ‘\’ that
is not part of one of these three sequences is not special and appears
in the output literally.
gawk 3.0 and 3.1 follow these rules for sub() and
gsub(). The POSIX standard took much longer to be revised than
was expected. In addition, the
gawk maintainer’s proposal was
lost during the standardization process. The final rules are
somewhat simpler. The results are similar except for one case.
The POSIX rules state that ‘\&’ in the replacement string produces
a literal ‘&’, ‘\\’ produces a literal ‘\’, and ‘\’ followed
by anything else is not special; the ‘\’ is placed straight into the output.
These rules are presented in Table 9.3.
    You type    sub() sees    sub() generates
    --------    ----------    ---------------
    \\\\\\&     \\\&          A literal ‘\&’
    \\\\&       \\&           A literal ‘\’, followed by the matched text
    \\&         \&            A literal ‘&’
    \\q         \q            A literal ‘\q’
    \\\\        \\            \

Table 9.3: POSIX rules for sub() and gsub()
The only case where the difference is noticeable is the last one: ‘\\\\’
is seen as ‘\\’ and produces ‘\’ instead of ‘\\’.
Starting with version 3.1.4,
gawk followed the POSIX rules when
--posix was specified (see section Command-Line Options). Otherwise,
it continued to follow the proposed rules, as
that had been its behavior for many years.
When version 4.0.0 was released, the gawk maintainer
made the POSIX rules the default, breaking well over a decade’s worth
of backward compatibility.(50) Needless to say, this was a bad idea,
and as of version 4.0.1,
gawk resumed its historical
behavior, and only follows the POSIX rules when
--posix is given.
The rules for gensub() are considerably simpler. At the runtime
level, whenever gawk sees a ‘\’, if the following character
is a digit, then the text that matched the corresponding parenthesized
subexpression is placed in the generated output. Otherwise,
no matter what character follows the ‘\’, it
appears in the generated text and the ‘\’ does not,
as shown in Table 9.4.
    You type    gensub() sees    gensub() generates
    --------    -------------    ------------------
    &           &                The matched text
    \\&         \&               A literal ‘&’
    \\\\        \\               A literal ‘\’
    \\\\&       \\&              A literal ‘\’, then the matched text
    \\\\\\&     \\\&             A literal ‘\&’
    \\q         \q               A literal ‘q’

Table 9.4: Escape sequence processing for gensub()
Because of the complexity of the lexical- and runtime-level processing
and the special cases for sub() and gsub(),
we recommend the use of gawk and
gensub() when you have
to do substitutions.
(50) This was rather naive of him, despite there being a note in this section indicating that the next major version would move to the POSIX rules.