Gawk/Variable-Typing

6.3.2.1 String Type versus Numeric Type

Scalar objects in awk (variables, array elements, and fields) are dynamically typed. This means their type can change as the program runs, from untyped before any use (33), to string or number, and then from string to number or number to string, as the program progresses. (gawk also provides regexp-typed scalars, but let’s ignore that for now; see section Strongly Typed Regexp Constants.)

You can’t do much with untyped variables, other than tell that they are untyped. The following program tests a against "" and 0; the test succeeds when a has never been assigned a value. It also uses the built-in typeof() function (not presented yet; see section Getting Type Information) to show a’s type:

$ gawk 'BEGIN { print (a == "" && a == 0 ?
> "a is untyped" : "a has a type!") ; print typeof(a) }'
-| a is untyped
-| unassigned

A scalar has numeric type when assigned a numeric value, such as from a numeric constant, or from another scalar with numeric type:

$ gawk 'BEGIN { a = 42 ; print typeof(a)
> b = a ; print typeof(b) }'
-| number
-| number

Similarly, a scalar has string type when assigned a string value, such as from a string constant, or from another scalar with string type:

$ gawk 'BEGIN { a = "forty two" ; print typeof(a)
> b = a ; print typeof(b) }'
-| string
-| string

So far, this is all simple and straightforward. What happens, though, when awk has to process data from a user? Let’s start with field data. What should the following command produce as output?

echo hello | awk '{ printf("%s %s < 42\n", $1,
                           ($1 < 42 ? "is" : "is not")) }'

Since ‘hello’ is alphabetic data, awk can only do a string comparison. Internally, it converts 42 into "42" and compares the two string values "hello" and "42". Here’s the result:

$ echo hello | awk '{ printf("%s %s < 42\n", $1,
>                            ($1 < 42 ? "is" : "is not")) }'
-| hello is not < 42

However, what happens when data from a user looks like a number? On the one hand, in reality, the input data consists of characters, not binary numeric values. But, on the other hand, the data looks numeric, and awk really ought to treat it as such. And indeed, it does:

$ echo 37 | awk '{ printf("%s %s < 42\n", $1,
>                         ($1 < 42 ? "is" : "is not")) }'
-| 37 is < 42

Here are the rules for when awk treats data as a number, and for when it treats data as a string.

The POSIX standard uses the term numeric string for input data that looks numeric. The ‘37’ in the previous example is a numeric string. So what is the type of a numeric string? Answer: numeric.

The type of a variable is important because the types of two variables determine how they are compared. Variable typing follows these definitions and rules:

  • A numeric constant or the result of a numeric operation has the numeric attribute.
  • A string constant or the result of a string operation has the string attribute.
  • Fields, getline input, FILENAME, ARGV elements, ENVIRON elements, and the elements of an array created by match(), split(), and patsplit() that are numeric strings have the strnum attribute (34). Otherwise, they have the string attribute. Uninitialized variables also have the strnum attribute.
  • Attributes propagate across assignments but are not changed by any use.
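As an illustration of the third rule, the sketch below compares two array elements produced by split(). The function name split_compare is hypothetical and exists only for this example. If the elements were plain strings, "10" would sort before "9"; because both look numeric, they should receive the strnum attribute and be compared as numbers.

```shell
# split_compare is a hypothetical helper name used only for this sketch.
split_compare() {
    awk 'BEGIN {
        # Both elements look numeric, so each gets the strnum attribute.
        split("10 9", parts, " ")
        # strnum vs. strnum: numeric comparison, so 10 < 9 is false (0).
        printf("%d\n", (parts[1] < parts[2]))
    }'
}
split_compare
```

Under the rules listed above, this prints 0. A result of 1 would mean the elements had been compared as strings.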

The last rule is particularly important. In the following program, a has numeric type, even though it is later used in a string operation:

BEGIN {
     a = 12.345
     b = a " is a cute number"
     print b
}
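One way to check that a string operation leaves the attribute alone is to compare the variable afterward. The sketch below is a variation on the program above, with different values chosen so that string and numeric comparison give opposite answers; the function name stays_numeric is hypothetical.

```shell
# stays_numeric is a hypothetical helper name used only for this sketch.
stays_numeric() {
    awk 'BEGIN {
        a = 10
        b = a " apples"   # string operation: uses a, does not retype it
        # a is still numeric, so a < 9 is a numeric comparison (false).
        # A string comparison of "10" and "9" would have been true.
        printf("%s\n", (a < 9 ? "string comparison" : "numeric comparison"))
    }'
}
stays_numeric
```

This prints "numeric comparison", consistent with the rule that attributes are not changed by any use.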

When two operands are compared, either string comparison or numeric comparison may be used. This depends upon the attributes of the operands, according to the following symmetric matrix:

        +----------------------------------------------
        |       STRING          NUMERIC         STRNUM
--------+----------------------------------------------
        |
STRING  |       string          string          string
        |
NUMERIC |       string          numeric         numeric
        |
STRNUM  |       string          numeric         numeric
--------+----------------------------------------------
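The matrix can be exercised with a short sketch that tests a strnum field against each of the three operand kinds. The input values 10 and 9 are chosen so that string and numeric comparison disagree; the function name matrix_demo is hypothetical.

```shell
# matrix_demo is a hypothetical helper name used only for this sketch.
matrix_demo() {
    # $1 is "10" and $2 is "9"; both arrive as input, so both are strnum.
    #   $1 < 9    STRNUM vs. NUMERIC: numeric comparison, 10 < 9 is false
    #   $1 < "9"  STRNUM vs. STRING:  string comparison, "10" < "9" is true
    #   $1 < $2   STRNUM vs. STRNUM:  numeric comparison, 10 < 9 is false
    echo 10 9 | awk '{
        printf("%d %d %d\n", ($1 < 9), ($1 < "9"), ($1 < $2))
    }'
}
matrix_demo
```

The output is "0 1 0": only the comparison against the string constant is done as a string comparison.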

The basic idea is that user input that looks numeric—and only user input—should be treated as numeric, even though it is actually made of characters and is therefore also a string. Thus, for example, the string constant " +3.14", when it appears in program source code, is a string—even though it looks numeric—and is never treated as a number for comparison purposes.

In short, when one operand is a “pure” string, such as a string constant, then a string comparison is performed. Otherwise, a numeric comparison is performed. (The primary difference between a number and a strnum is that for strnums gawk preserves the original string value that the scalar had when it came in.)

This point bears additional emphasis: Input that looks numeric is numeric. All other input is treated as strings.

Thus, the six-character input string ‘ +3.14’ receives the strnum attribute. In contrast, the eight characters " +3.14" appearing in program text comprise a string constant. The following examples print ‘1’ when the comparison between the input and the constant is true, and ‘0’ otherwise:

$ echo ' +3.14' | awk '{ print($0 == " +3.14") }'    # True
-| 1
$ echo ' +3.14' | awk '{ print($0 == "+3.14") }'     # False
-| 0
$ echo ' +3.14' | awk '{ print($0 == "3.14") }'      # False
-| 0
$ echo ' +3.14' | awk '{ print($0 == 3.14) }'        # True
-| 1
$ echo ' +3.14' | awk '{ print($1 == " +3.14") }'    # False
-| 0
$ echo ' +3.14' | awk '{ print($1 == "+3.14") }'     # True
-| 1
$ echo ' +3.14' | awk '{ print($1 == "3.14") }'      # False
-| 0
$ echo ' +3.14' | awk '{ print($1 == 3.14) }'        # True
-| 1
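For contrast with the input cases, the following sketch puts the same six characters into a variable from a string constant in the program text and compares it with the number 3.14, then repeats the comparison with the characters arriving as input. The function name constant_vs_input is hypothetical.

```shell
# constant_vs_input is a hypothetical helper name used only for this sketch.
constant_vs_input() {
    # From a string constant: x has the string attribute, so 3.14 is
    # converted to "3.14" and compared as a string with " +3.14" (false).
    awk 'BEGIN { x = " +3.14"; printf("%d\n", (x == 3.14)) }'
    # From input: $0 has the strnum attribute, so the comparison is numeric.
    echo ' +3.14' | awk '{ printf("%d\n", ($0 == 3.14)) }'
}
constant_vs_input
```

The first line of output is 0 and the second is 1: the characters are identical, and only their origin changes the kind of comparison.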

You can see the type of an input field (or other user input) using typeof():

$ echo hello 37 | gawk '{ print typeof($1), typeof($2) }'
-| string strnum

Footnotes

(33)

gawk calls this unassigned, as the typeof() example that follows the footnote reference shows.

(34)

Thus, a POSIX numeric string and gawk’s strnum are the same thing.