String Functions (The GNU Awk User’s Guide)
9.1.3 String-Manipulation Functions
The functions in this section look at or change the text of one or more strings.
gawk understands locales (see section Where You Are Makes a Difference) and does all string processing in terms of characters, not bytes. This distinction is particularly important to understand for locales where one character may be represented by multiple bytes. Thus, for example, length() returns the number of characters in a string, and not the number of bytes used to represent those characters. Similarly, index() works with character indices, and not byte indices.
CAUTION: A number of functions deal with indices into strings. For these functions, the first character of a string is at position (index) one. This is different from C and the languages descended from it, where the first character is at position zero. You need to remember this when doing index calculations, particularly if you are used to C.
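As a minimal sketch of the one-based convention described in the caution above (the variable names s, pos, first, and rest are illustrative only), the following command uses index() and substr() together:

```shell
out=$(awk 'BEGIN {
    s = "peanut"
    pos = index(s, "an")        # 3: the match starts at the third character
    first = substr(s, 1, 1)     # "p": position one is the first character
    rest = substr(s, pos)       # "anut": the suffix starting at position 3
    print pos, first, rest
}')
echo "$out"
```

A C programmer might expect the match to be reported at position 2; in awk it is position 3, and that value can be passed straight to substr() without adjustment.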
In the following list, optional parameters are enclosed in square brackets ([ ]). Several functions perform string substitution; the full discussion is provided in the description of the sub() function, which comes toward the end, because the list is presented alphabetically.
Those functions that are specific to gawk are marked with a pound sign (‘#’). They are not available in compatibility mode (see section Command-Line Options):
asort(source[, dest[, how] ]) #
asorti(source[, dest[, how] ]) #
These two functions are similar in behavior, so they are described together.

NOTE: The following description ignores the third argument, how, as it requires understanding features that we have not discussed yet. Thus, the discussion here is a deliberate simplification. (We do provide all the details later on; see Sorting Array Values and Indices with gawk for the full story.)

Both functions return the number of elements in the array source. For asort(), gawk sorts the values of source and replaces the indices of the sorted values of source with sequential integers starting with one. If the optional array dest is specified, then source is duplicated into dest. dest is then sorted, leaving the indices of source unchanged.

When comparing strings, IGNORECASE affects the sorting (see section Sorting Array Values and Indices with gawk). If the source array contains subarrays as values (see section Arrays of Arrays), they will come last, after all scalar values. Subarrays are not recursively sorted.

For example, if the contents of a are as follows:

    a["last"] = "de"
    a["first"] = "sac"
    a["middle"] = "cul"

A call to asort():

    asort(a)

results in the following contents of a:

    a[1] = "cul"
    a[2] = "de"
    a[3] = "sac"

The asorti() function works similarly to asort(); however, the indices are sorted, instead of the values. Thus, in the previous example, starting with the same initial set of indices and values in a, calling ‘asorti(a)’ would yield:

    a[1] = "first"
    a[2] = "last"
    a[3] = "middle"
NOTE: Due to implementation limitations, you may not use either SYMTAB or FUNCTAB as arguments to these functions, even if providing a second array to use for the actual sorting. Attempting to do so produces a fatal error. This restriction may be lifted in the future.

gensub(regexp, replacement, how[, target]) #
Search the target string target for matches of the regular expression regexp. If how is a string beginning with ‘g’ or ‘G’ (short for “global”), then replace all matches of regexp with replacement. Otherwise, treat how as a number indicating which match of regexp to replace. Treat numeric values less than one as if they were one. If no target is supplied, use $0. Return the modified string as the result of the function. The original target string is not changed.

gensub() is a general substitution function. Its purpose is to provide more features than the standard sub() and gsub() functions.

gensub() provides an additional feature that is not available in sub() or gsub(): the ability to specify components of a regexp in the replacement text. This is done by using parentheses in the regexp to mark the components and then specifying ‘\N’ in the replacement text, where N is a digit from 1 to 9. For example:

    $ gawk '
    > BEGIN {
    >     a = "abc def"
    >     b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
    >     print b
    > }'
    -| def abc

As with sub(), you must type two backslashes in order to get one into the string. In the replacement text, the sequence ‘\0’ represents the entire matched text, as does the character ‘&’.

The following example shows how you can use the third argument to control which match of the regexp should be changed:

    $ echo a b c a b c |
    > gawk '{ print gensub(/a/, "AA", 2) }'
    -| a b c AA b c

In this case, $0 is the default target string. gensub() returns the new string as its result, which is passed directly to print for printing.

If the how argument is a string that does not begin with ‘g’ or ‘G’, or if it is a number that is less than or equal to zero, only one substitution is performed. If how is zero, gawk issues a warning message.

If regexp does not match target, gensub()’s return value is the original unchanged value of target.

gsub(regexp, replacement[, target])
Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global,” which means replace everywhere. For example:

    { gsub(/Britain/, "United Kingdom"); print }

replaces all occurrences of the string ‘Britain’ with ‘United Kingdom’ for all input records.

The gsub() function returns the number of substitutions made. If the variable to search and alter (target) is omitted, then the entire input record ($0) is used. As in sub(), the characters ‘&’ and ‘\’ are special, and the third argument must be assignable.

index(in, find)
Search the string in for the first occurrence of the string find, and return the position in characters where that occurrence begins in the string in. Consider the following example:

    $ awk 'BEGIN { print index("peanut", "an") }'
    -| 3

If find is not found, index() returns zero.

With BWK awk and gawk, it is a fatal error to use a regexp constant for find. Other implementations allow it, simply treating the regexp constant as an expression meaning ‘$0 ~ /regexp/’. (d.c.)

length([string])
Return the number of characters in string. If string is a number, the length of the digit string representing that number is returned. For example, length("abcde") is five. By contrast, length(15 * 35) works out to three. In this example, 15 * 35 = 525, and 525 is then converted to the string "525", which has three characters.

If no argument is supplied, length() returns the length of $0.

NOTE: In older versions of awk, the length() function could be called without any parentheses. Doing so is considered poor practice, although the 2008 POSIX standard explicitly allows it, to support historical practice. For programs to be maximally portable, always supply the parentheses.

If length() is called with a variable that has not been used, gawk forces the variable to be a scalar. Other implementations of awk leave the variable without a type. (d.c.) Consider:

    $ gawk 'BEGIN { print length(x) ; x[1] = 1 }'
    -| 0
    error→ gawk: fatal: attempt to use scalar `x' as array

    $ nawk 'BEGIN { print length(x) ; x[1] = 1 }'
    -| 0

If --lint has been specified on the command line, gawk issues a warning about this.

With gawk and several other awk implementations, when given an array argument, the length() function returns the number of elements in the array. (c.e.) This is less useful than it might seem at first, as the array is not guaranteed to be indexed from one to the number of elements in it. If --lint is provided on the command line (see section Command-Line Options), gawk warns that passing an array argument is not portable. If --posix is supplied, using an array argument is a fatal error (see section Arrays in awk).

match(string, regexp[, array])
Search string for the longest, leftmost substring matched by the regular expression regexp and return the character position (index) at which that substring begins (one, if it starts at the beginning of string). If no match is found, return zero.

The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See section Using Dynamic Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.

The order of the first two arguments is the opposite of most other string functions that work with regular expressions, such as sub() and gsub(). It might help to remember that for match(), the order is the same as for the ‘~’ operator: ‘string ~ regexp’.

The match() function sets the predefined variable RSTART to the index. It also sets the predefined variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to zero, and RLENGTH to -1.

For example:

    {
        if ($1 == "FIND")
            regex = $2
        else {
            where = match($0, regex)
            if (where != 0)
                print "Match of", regex, "found at", where, "in", $0
        }
    }

This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is ‘FIND’, regex is changed to be the second word on that line. Therefore, if given:

    FIND ru+n
    My program runs
    but not very quickly
    FIND Melvin
    JF+KM
    This line is property of Reality Engineering Co.
    Melvin was here.

awk prints:

    Match of ru+n found at 12 in My program runs
    Match of Melvin found at 1 in Melvin was here.
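RSTART and RLENGTH are convenient to pair with substr() when the matched text itself is wanted. The following is a small sketch of that idiom, reusing two of the sample input lines; the "no match:" label is illustrative only.

```shell
out=$(printf 'My program runs\nMelvin was here.\n' | awk '{
    if (match($0, /ru+n/))
        # RSTART and RLENGTH delimit the matched text
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    else
        # on failure RSTART is 0 and RLENGTH is -1
        print "no match:", RSTART, RLENGTH
}')
echo "$out"
```

The first line matches at position 12 with a length of 3, so substr() extracts ‘run’. The second line does not match, which leaves RSTART at zero and RLENGTH at -1.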
If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp. If regexp contains parentheses, the integer-indexed elements of array are set to contain the portion of string matching the corresponding parenthesized subexpression. For example:

    $ echo foooobazbarrrrr |
    > gawk '{ match($0, /(fo+).+(bar*)/, arr)
    >         print arr[1], arr[2] }'
    -| foooo barrrrr

In addition, multidimensional subscripts are available providing the start index and length of each matched subexpression:

    $ echo foooobazbarrrrr |
    > gawk '{ match($0, /(fo+).+(bar*)/, arr)
    >         print arr[1], arr[2]
    >         print arr[1, "start"], arr[1, "length"]
    >         print arr[2, "start"], arr[2, "length"]
    > }'
    -| foooo barrrrr
    -| 1 5
    -| 9 7

There may not be subscripts for the start and length for every parenthesized subexpression, because they may not all have matched text; thus, they should be tested for with the in operator (see section Referring to an Array Element).

The array argument to match() is a gawk extension. In compatibility mode (see section Command-Line Options), using a third argument is a fatal error.

patsplit(string, array[, fieldpat[, seps] ]) #
Divide string into pieces (or “fields”) defined by fieldpat and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The third argument, fieldpat, is a regexp describing the fields in string (just as FPAT is a regexp describing the fields in input records). It may be either a regexp constant or a string. If fieldpat is omitted, the value of FPAT is used. patsplit() returns the number of elements created. seps[i] is the possibly null separator string after array[i]. The possibly null leading separator will be in seps[0]. So a non-null string with n fields will have n+1 separators. A null string will have neither fields nor separators.

The patsplit() function splits strings into pieces in a manner similar to the way input lines are split into fields using FPAT (see section Defining Fields by Content).

Before splitting the string, patsplit() deletes any previously existing elements in the arrays array and seps.

split(string, array[, fieldsep[, seps] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If fieldsep is omitted, the value of FS is used. split() returns the number of elements created. seps is a gawk extension, with seps[i] being the separator string between array[i] and array[i+1]. If fieldsep is a single space, then any leading whitespace goes into seps[0] and any trailing whitespace goes into seps[n], where n is the return value of split() (i.e., the number of elements in array).

The split() function splits strings into pieces in the same way that input lines are split into fields. For example:

    split("cul-de-sac", a, "-", seps)

splits the string "cul-de-sac" into three fields using ‘-’ as the separator. It sets the contents of the array a as follows:

    a[1] = "cul"
    a[2] = "de"
    a[3] = "sac"

and sets the contents of the array seps as follows:

    seps[1] = "-"
    seps[2] = "-"

The value returned by this call to split() is three.

As with input field-splitting, when the value of fieldsep is " ", leading and trailing whitespace is ignored in values assigned to the elements of array but not in seps, and the elements are separated by runs of whitespace. Also, as with input field splitting, if fieldsep is the null string, each individual character in the string is split into its own array element. (c.e.) Additionally, if fieldsep is a single-character string, that string acts as the separator, even if its value is a regular expression metacharacter.

Note, however, that RS has no effect on the way split() works. Even though ‘RS = ""’ causes the newline character to also be an input field separator, this does not affect how split() splits strings.

Modern implementations of awk, including gawk, allow the third argument to be a regexp constant (/…/) as well as a string. (d.c.) The POSIX standard allows this as well. See section Using Dynamic Regexps for a discussion of the difference between using a string constant or a regexp constant, and the implications for writing your program correctly.

Before splitting the string, split() deletes any previously existing elements in the arrays array and seps.

If string is null, the array has no elements. (So this is a portable way to delete an entire array with one statement. See section The delete Statement.)

If string does not match fieldsep at all (but is not null), array has one element only. The value of that element is the original string.

In POSIX mode (see section Command-Line Options), the fourth argument is not allowed.
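Several of the split() rules described above can be checked with one short command. This sketch uses only the portable three-argument form; the array names a, b, and c and the variable mid are illustrative only.

```shell
out=$(awk 'BEGIN {
    n1 = split("  cul   de sac  ", a, " ")   # runs of whitespace separate fields
    n2 = split("a.b.c", b, ".")              # a single "." is a literal separator
    mid = b[2]                               # save a value before b is cleared
    n3 = split("", b)                        # null string: b now has no elements
    n4 = split("nosep", c, "-")              # no separator found: one element
    print n1, a[1], a[3]
    print n2, mid
    print n3
    print n4, c[1]
}')
echo "$out"
```

The second call shows that a single-character separator is taken literally: as a regexp, ‘.’ would match every character, yet the string still splits into three pieces.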
sprintf(format, expression1, …)
Return (without printing) the string that printf would have printed out with the same arguments (see section Using printf Statements for Fancier Printing). For example:

    pival = sprintf("pi = %.2f (approx.)", 22/7)

assigns the string ‘pi = 3.14 (approx.)’ to the variable pival.

strtonum(str) #
Examine str and return its numeric value. If str begins with a leading ‘0’, strtonum() assumes that str is an octal number. If str begins with a leading ‘0x’ or ‘0X’, strtonum() assumes that str is a hexadecimal number. For example:

    $ echo 0x11 |
    > gawk '{ printf "%d\n", strtonum($1) }'
    -| 17

Using the strtonum() function is not the same as adding zero to a string value; the automatic coercion of strings to numbers works only for decimal data, not for octal or hexadecimal.(47)

Note also that strtonum() uses the current locale’s decimal point for recognizing numbers (see section Where You Are Makes a Difference).

sub(regexp, replacement[, target])
Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).

The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See section Using Dynamic Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.

This function is peculiar because target is not simply used to compute a value, and not just any expression will do: it must be a variable, field, or array element so that sub() can store a modified value there. If this argument is omitted, then the default is to use and alter $0.(48) For example:

    str = "water, water, everywhere"
    sub(/at/, "ith", str)

sets str to ‘wither, water, everywhere’, by replacing the leftmost longest occurrence of ‘at’ with ‘ith’.

If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:

    { sub(/candidate/, "& and his wife"); print }

changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line. Here is another example:

    $ awk 'BEGIN {
    >         str = "daabaaa"
    >         sub(/a+/, "C&C", str)
    >         print str
    > }'
    -| dCaaCbaaa

This shows how ‘&’ can represent a nonconstant string and also illustrates the “leftmost, longest” rule in regexp matching (see section How Much Text Matches?).

The effect of this special character (‘&’) can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write ‘\\&’ in a string constant to include a literal ‘&’ in the replacement. For example, the following shows how to replace the first ‘|’ on each line with an ‘&’:

    { sub(/\|/, "\\&"); print }

As mentioned, the third argument to sub() must be a variable, field, or array element. Some versions of awk allow the third argument to be an expression that is not an lvalue. In such a case, sub() still searches for the pattern and returns zero or one, but the result of the substitution (if any) is thrown away because there is no place to put it. Such versions of awk accept expressions like the following:

    sub(/USA/, "United States", "the USA and Canada")

For historical compatibility, gawk accepts such erroneous code. However, using any other nonchangeable object as the third parameter causes a fatal error and your program will not run.

Finally, if the regexp is not a regexp constant, it is converted into a string, and then the value of that string is treated as the regexp to match.

substr(string, start[, length])
Return a length-character-long substring of string, starting at character number start. The first character of a string is character number one.(49) For example, substr("washington", 5, 3) returns "ing".

If length is not present, substr() returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". The whole suffix is also returned if length is greater than the number of characters remaining in the string, counting from character start.

If start is less than one, substr() treats it as if it was one. (POSIX doesn’t specify what to do in this case: BWK awk acts this way, and therefore gawk does too.) If start is greater than the number of characters in the string, substr() returns the null string. Similarly, if length is present but less than or equal to zero, the null string is returned.

The string returned by substr() cannot be assigned. Thus, it is a mistake to attempt to change a portion of a string, as shown in the following example:

    string = "abcdef"
    # try to get "abCDEf", won't work
    substr(string, 3, 3) = "CDE"

It is also a mistake to use substr() as the third argument of sub() or gsub():

    gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG

(Some commercial versions of awk treat substr() as assignable, but doing so is not portable.)

If you need to replace bits and pieces of a string, combine substr() with string concatenation, in the following manner:

    string = "abcdef"
    …
    string = substr(string, 1, 2) "CDE" substr(string, 6)
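The concatenation idiom can be wrapped in a small user-defined function. The sketch below is one possible way to do it; replace_at() is a hypothetical helper written for this illustration, not a built-in function.

```shell
out=$(awk '
# replace_at() is a hypothetical helper, not a built-in function.
# It returns s with len characters, starting at position start,
# replaced by the string new.
function replace_at(s, start, len, new) {
    return substr(s, 1, start - 1) new substr(s, start + len)
}
BEGIN {
    print substr("washington", 5, 3)
    print substr("washington", 5)
    print replace_at("abcdef", 3, 3, "CDE")
}')
echo "$out"
```

The helper returns a new string and leaves its argument unchanged, so the caller must assign the result, as in ‘string = replace_at(string, 3, 3, "CDE")’.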
tolower(string)
Return a copy of string, with each uppercase character in the string replaced with its corresponding lowercase character. Nonalphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".

toupper(string)
Return a copy of string, with each lowercase character in the string replaced with its corresponding uppercase character. Nonalphabetic characters are left unchanged. For example, toupper("MiXeD cAsE 123") returns "MIXED CASE 123".
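A common use of these two functions is comparing strings without regard to case. The following sketch converts both operands before comparing them; the "same" message is illustrative only.

```shell
out=$(awk 'BEGIN {
    print tolower("MiXeD cAsE 123")
    print toupper("MiXeD cAsE 123")
    # Convert both sides to one case before comparing
    if (tolower("GAWK") == tolower("gawk"))
        print "same"
}')
echo "$out"
```

Both functions return a copy, so the original strings keep their case.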
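Matching the Null String

In awk, the ‘*’ operator can match the null string. This is particularly important for the sub(), gsub(), and gensub() functions. For example:

    $ echo abc | awk '{ gsub(/m*/, "X"); print }'
    -| XaXbXcX

Although this makes a certain amount of sense, it can be surprising.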
Footnotes
(47)
Unless you use the --non-decimal-data option, which isn’t recommended. See section Allowing Nondecimal Input Data for more information.
(48)
Note that this means that the record will first be regenerated using the value of OFS if any fields have been changed, and that the fields will be updated after the substitution, even if the operation is a “no-op” such as ‘sub(/^/, "")’.
(49)
This is different from C and C++, in which the first character is number zero.