Case-sensitivity (The GNU Awk User’s Guide)

3.8 Case Sensitivity in Matching

Case is normally significant in regular expressions, both when matching ordinary characters (i.e., not metacharacters) and inside bracket expressions. Thus, a ‘w’ in a regular expression matches only a lowercase ‘w’ and not an uppercase ‘W’.

The simplest way to do a case-independent match is to use a bracket expression—for example, ‘[Ww]’. However, this can be cumbersome if you need to use it often, and it can make the regular expressions harder to read. There are two alternatives that you might prefer.

One way to perform a case-insensitive match at a particular point in the program is to convert the data to a single case, using the tolower() or toupper() built-in string functions (which we haven’t discussed yet; see section String-Manipulation Functions). For example:

tolower($1) ~ /foo/  { … }

converts the first field to lowercase before matching against it. This works in any POSIX-compliant awk.
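
The two portable techniques can be compared side by side. The following is a minimal sketch, assuming a POSIX shell with an awk on the PATH; the sample input and the shell variable names (input, bracket_out, tolower_out) are made up for illustration:

```shell
# Hypothetical sample input: three spellings of the same word in field 1.
input='foo 1
Foo 2
FOO 3
bar 4'

# Bracket expressions: every letter lists both cases explicitly.
bracket_out=$(printf '%s\n' "$input" | awk '$1 ~ /[Ff][Oo][Oo]/ { print $2 }')

# tolower(): fold the field to lowercase, then match a lowercase regexp.
tolower_out=$(printf '%s\n' "$input" | awk 'tolower($1) ~ /foo/ { print $2 }')

printf '%s\n' "$bracket_out"
printf '%s\n' "$tolower_out"
```

Both pipelines select the same three records (1, 2, and 3); the tolower() form keeps the regexp itself readable.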

Another method, specific to gawk, is to set the variable IGNORECASE to a nonzero value (see section Predefined Variables). When IGNORECASE is not zero, all regexp and string operations ignore case.

Changing the value of IGNORECASE dynamically controls the case sensitivity of the program as it runs. Case is significant by default because IGNORECASE (like most variables) is initialized to zero:

x = "aB"
if (x ~ /ab/) …   # this test will fail

IGNORECASE = 1
if (x ~ /ab/) …   # now it will succeed

In general, you cannot use IGNORECASE to make certain rules case insensitive and other rules case sensitive, as there is no straightforward way to set IGNORECASE just for the pattern of a particular rule (see footnote 18). To mix case-sensitive and case-insensitive rules, use either bracket expressions or tolower(). What IGNORECASE alone can do, however, is turn case sensitivity on or off dynamically for all the rules at once.

IGNORECASE can be set on the command line or in a BEGIN rule (see section Other Command-Line Arguments; also see section Startup and Cleanup Actions). Setting IGNORECASE from the command line is a way to make a program case insensitive without having to edit it.

In multibyte locales, the equivalences between upper- and lowercase characters are tested based on the wide-character values of the locale’s character set. Prior to version 5.0, single-byte characters were tested based on the ISO-8859-1 (ISO Latin-1) character set. However, as of version 5.0, single-byte characters are also tested based on the values of the locale’s character set (see footnote 19).

The value of IGNORECASE has no effect if gawk is in compatibility mode (see section Command-Line Options). Case is always significant in compatibility mode.



Footnotes

(18)

Experienced C and C++ programmers will note that it is possible, using something like ‘(IGNORECASE = 1) && /foObAr/ { … }’ and ‘(IGNORECASE = 0) || /foobar/ { … }’. The parentheses are needed because assignment binds less tightly than ‘&&’ and ‘||’. However, this is somewhat obscure and we don’t recommend it.

(19)

If you don’t understand this, don’t worry about it; it just means that gawk does the right thing.