Regexp Field Splitting (The GNU Awk User’s Guide)
4.5.2 Using Regular Expressions to Separate Fields
The previous subsection discussed the use of single characters or simple strings as the value of FS. More generally, the value of FS may be a string containing any regular expression. In this case, each match in the record for the regular expression separates fields. For example, the assignment:
FS = ", \t"
makes every area of an input line that consists of a comma followed by a space and a TAB into a field separator.
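As a minimal sketch of this assignment in action, the following pipeline feeds awk a line whose fields are separated by that exact three-character sequence. The sample words and the shell variable name out are made up for illustration; printf is used so that \t in the input becomes a real TAB.

```shell
# Split on the sequence: comma, space, TAB.
# The input text is illustrative; printf expands \t into a real TAB.
out=$(printf 'alpha, \tbeta, \tgamma\n' |
      awk 'BEGIN { FS = ", \t" } { print NF; print $2 }')
printf '%s\n' "$out"
# prints:
# 3
# beta
```

A comma followed by only a space, or only a TAB, would not match and so would not separate fields.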
For a less trivial example of a regular expression, try using single spaces to separate fields the way single commas are used. FS can be set to "[ ]" (left bracket, space, right bracket). This regular expression matches a single space and nothing else (see section Regular Expressions).
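The effect is easiest to see on input that contains two adjacent spaces. The sketch below counts fields for the same line under FS = "[ ]" and under the default FS; the sample line and the shell variable names are illustrative assumptions, not part of the original text.

```shell
# The sample line has TWO spaces between "a" and "b".
line='a  b c'

# With FS = "[ ]", every single space is a separator, so the two
# adjacent spaces delimit an empty field.
regex_nf=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[ ]" } { print NF }')

# With the default FS, a run of spaces counts as one separator.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

printf 'FS="[ ]": %s fields; default FS: %s fields\n' "$regex_nf" "$default_nf"
# prints: FS="[ ]": 4 fields; default FS: 3 fields
```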
There is an important difference between the two cases of ‘FS = " "’ (a single space) and ‘FS = "[ \t\n]+"’ (a regular expression matching one or more spaces, TABs, or newlines). For both values of FS, fields are separated by runs (multiple adjacent occurrences) of spaces, TABs, and/or newlines. However, when the value of FS is " ", awk first strips leading and trailing whitespace from the record and then decides where the fields are. For example, the following pipeline prints ‘b’:
$ echo ' a b c d ' | awk '{ print $2 }'
-| b
However, this pipeline prints ‘a’ (note the extra spaces around each letter):
$ echo ' a b c d ' | awk 'BEGIN { FS = "[ \t\n]+" }
>                                  { print $2 }'
-| a
In this case, the first field is null, or empty.
The stripping of leading and trailing whitespace also comes into play whenever $0 is recomputed. For instance, study this pipeline:
$ echo ' a b c d' | awk '{ print; $2 = $2; print }'
-|  a b c d
-| a b c d
The first print statement prints the record as it was read, with leading whitespace intact. The assignment to $2 rebuilds $0 by concatenating $1 through $NF together, separated by the value of OFS (which is a space by default). Because the leading whitespace was ignored when finding $1, it is not part of the new $0. Finally, the last print statement prints the new $0.
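The role of OFS in the rebuilt record is easier to see when OFS is something other than a space. The following sketch sets it to a hyphen before forcing $0 to be recomputed; the choice of "-" and the shell variable name out are arbitrary assumptions made for illustration.

```shell
# Assigning $2 to itself forces awk to rebuild $0 from $1 through $NF,
# joined by OFS. The leading space in the input is not part of any
# field, so it does not appear in the rebuilt record.
out=$(echo ' a b c d' | awk 'BEGIN { OFS = "-" } { $2 = $2; print }')
printf '%s\n' "$out"
# prints: a-b-c-d
```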
There is an additional subtlety to be aware of when using regular expressions for field splitting. It is not well specified in the POSIX standard, or anywhere else, what ‘^’ means when splitting fields. Does the ‘^’ match only at the beginning of the entire record? Or is each field separator a new string? It turns out that different awk versions answer this question differently, and you should not rely on any specific behavior in your programs. (d.c.)
As a point of information, BWK awk allows ‘^’ to match only at the beginning of the record. gawk also works this way. For example:
$ echo 'xxAA xxBxx C' |
> gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
>                             printf "-->%s<--\n", $i }'
-| --><--
-| -->AA<--
-| -->xxBxx<--
-| -->C<--