Getopt Function (The GNU Awk User’s Guide)



10.4 Processing Command-Line Options

Most utilities on POSIX-compatible systems take options on the command line that can be used to change the way a program behaves. awk is an example of such a program (see section Command-Line Options). Often, options take arguments (i.e., data that the program needs to correctly obey the command-line option). For example, awk’s -F option requires a string to use as the field separator. The first occurrence on the command line of either -- or a string that does not begin with ‘-’ ends the options.

Modern Unix systems provide a C function named getopt() for processing command-line arguments. The programmer provides a string describing the one-letter options. If an option requires an argument, it is followed in the string with a colon. getopt() is also passed the count and values of the command-line arguments and is called in a loop. getopt() processes the command-line arguments for option letters. Each time around the loop, it returns a single character representing the next option letter that it finds, or ‘?’ if it finds an invalid option. When it returns -1, there are no options left on the command line.

When using getopt(), options that do not take arguments can be grouped together. Furthermore, options that take arguments require that the argument be present. The argument can immediately follow the option letter, or it can be a separate command-line argument.

Given a hypothetical program that takes three command-line options, -a, -b, and -c, where -b requires an argument, all of the following are valid ways of invoking the program:

prog -a -b foo -c data1 data2 data3
prog -ac -bfoo -- data1 data2 data3
prog -acbfoo data1 data2 data3

Notice that when the argument is grouped with its option, the rest of the argument is considered to be the option’s argument. In this example, -acbfoo indicates that all of the -a, -b, and -c options were supplied, and that ‘foo’ is the argument to the -b option.
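To make the grouping rule concrete, here is a minimal awk sketch that breaks one grouped argument into its parts. The helper name explode_group() is hypothetical and is not part of getopt.awk; it looks at a single argument only and assumes every letter in it is a valid option.

```awk
# explode_group --- hypothetical helper, for illustration only
# Describe how one grouped argument such as "-acbfoo" breaks down,
# given an option string in which a colon means "takes an argument".
function explode_group(arg, options,    pos, letter, result, sep)
{
    result = ""
    sep = ""
    for (pos = 2; pos <= length(arg); pos++) {   # position 1 is the "-"
        letter = substr(arg, pos, 1)
        if (substr(options, index(options, letter) + 1, 1) == ":") {
            # the rest of the argument belongs to this option
            result = result sep letter "=" substr(arg, pos + 1)
            break
        }
        result = result sep letter
        sep = " "
    }
    return result
}

BEGIN {
    print explode_group("-acbfoo", "ab:c")    # prints: a c b=foo
}
```

Once the loop reaches a letter that is followed by a colon in the option string, it stops scanning, which is why ‘foo’ is never mistaken for three more option letters.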

getopt() provides four external variables that the programmer can use:

optind
The index in the argument value array (argv) where the first nonoption command-line argument can be found.
optarg
The string value of the argument to an option.
opterr
Usually getopt() prints an error message when it finds an invalid option. Setting opterr to zero disables this feature. (An application might want to print its own error message.)
optopt
The letter representing the command-line option.

The following C fragment shows how getopt() might process command-line arguments for awk:

int
main(int argc, char *argv[])
{
    …
    /* print our own message */
    opterr = 0;
    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) {
        switch (c) {
        case 'f':    /* file */
            …
            break;
        case 'F':    /* field separator */
            …
            break;
        case 'v':    /* variable assignment */
            …
            break;
        case 'W':    /* extension */
            …
            break;
        case '?':
        default:
            usage();
            break;
        }
    }
    …
}

The GNU project’s versions of the original Unix utilities popularized the use of long command-line options; for example, --help in addition to -h. Arguments to long options are either provided as separate command-line arguments (‘--source 'program-text'’) or separated from the option with an ‘=’ sign (‘--source='program-text'’).

As a side point, gawk actually uses the GNU getopt_long() function to process both normal and GNU-style long options (see section Command-Line Options).

The abstraction provided by getopt() is very useful, and it is handy in awk programs as well. Following is an awk version of getopt() that accepts both short and long options.

This function highlights one of the greatest weaknesses in awk, which is that it is very poor at manipulating single characters. The function needs repeated calls to substr() in order to access individual characters (see section String-Manipulation Functions).(73)
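As a small illustration of that idiom, the following sketch visits each character of a string with substr(). The function name spell() is made up for this example and does not appear in getopt.awk.

```awk
# spell --- hypothetical helper, for illustration only
# Wrap every character of str in brackets, one substr() call per character.
function spell(str,    i, out)
{
    out = ""
    for (i = 1; i <= length(str); i++)
        out = out "[" substr(str, i, 1) "]"
    return out
}

BEGIN {
    print spell("-abc")    # prints: [-][a][b][c]
}
```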

The discussion that follows walks through the code a bit at a time:

# getopt.awk --- Do C library getopt(3) function in awk
#                Also supports long options.

# External variables:
#    Optind -- index in ARGV of first nonoption argument
#    Optarg -- string value of argument to current option
#    Opterr -- if nonzero, print our own diagnostic
#    Optopt -- current option letter

# Returns:
#    -1     at end of options
#    "?"    for unrecognized option
#    <s>    a string representing the current option

# Private Data:
#    _opti  -- index in multiflag option, e.g., -abc

The function starts out with comments presenting a list of the global variables it uses, what the return values are, what they mean, and any global variables that are “private” to this library function. Such documentation is essential for any program, and particularly for library functions.

The getopt() function first checks that it was indeed called with a string of options (the options parameter). If both options and longopts have a zero length, getopt() immediately returns -1:

function getopt(argc, argv, options, longopts,    thisopt, i, j)
{
    if (length(options) == 0 && length(longopts) == 0)
        return -1                # no options given
    if (argv[Optind] == "--") {  # all done
        Optind++
        _opti = 0
        return -1
    } else if (argv[Optind] !~ /^-[^:[:space:]]/) {
        _opti = 0
        return -1
    }

The next thing to check for is the end of the options. A -- ends the command-line options, as does any command-line argument that does not begin with a ‘-’ (unless it is an argument to a preceding option). Optind steps through the array of command-line arguments; it retains its value across calls to getopt(), because it is a global variable.
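The pattern test in the else-if branch above can be tried on its own. The sketch below applies the same regular expression to a few sample strings; the helper name looks_like_option() and the sample values are assumptions made for illustration, and the bracket expression relies on an awk that supports POSIX character classes such as [:space:].

```awk
# looks_like_option --- hypothetical helper, for illustration only
# Apply the same pattern that getopt() uses to decide whether an
# argument can still be an option.
function looks_like_option(arg)
{
    return (arg ~ /^-[^:[:space:]]/) ? "yes" : "no"
}

BEGIN {
    n = split("-a,--longa,-,-:,- x,data1", samples, ",")
    for (k = 1; k <= n; k++)
        printf("<%s> %s\n", samples[k], looks_like_option(samples[k]))
}
```

Only ‘-a’ and ‘--longa’ are reported as options here. A lone ‘-’, a ‘-’ followed by a colon or a space, and a plain word all fail the test and would end option processing.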

The regular expression /^-[^:[:space:]]/ checks for a ‘-’ followed by anything that is not whitespace and not a colon. If the current command-line argument does not match this pattern, it is not an option, and it ends option processing. Now, we check to see if we are processing a short (single letter) option, or a long option (indicated by two dashes, e.g., ‘--filename’). If it is a short option, we continue on:

    if (argv[Optind] !~ /^--/) {        # if this is a short option
        if (_opti == 0)
            _opti = 2
        thisopt = substr(argv[Optind], _opti, 1)
        Optopt = thisopt
        i = index(options, thisopt)
        if (i == 0) {
            if (Opterr)
                printf("%c -- invalid option\n", thisopt) > "/dev/stderr"
            if (_opti >= length(argv[Optind])) {
                Optind++
                _opti = 0
            } else
                _opti++
            return "?"
        }

The _opti variable tracks the position in the current command-line argument (argv[Optind]). If multiple options are grouped together with one ‘-’ (e.g., -abx), it is necessary to return them to the user one at a time.

If _opti is equal to zero, it is set to two, which is the index in the string of the next character to look at (we skip the ‘-’, which is at position one). The variable thisopt holds the character, obtained with substr(). It is saved in Optopt for the main program to use.

If thisopt is not in the options string, then it is an invalid option. If Opterr is nonzero, getopt() prints an error message on the standard error that is similar to the message from the C version of getopt().

Because the option is invalid, it is necessary to skip it and move on to the next option character. If _opti is greater than or equal to the length of the current command-line argument, it is necessary to move on to the next argument, so Optind is incremented and _opti is reset to zero. Otherwise, Optind is left alone and _opti is merely incremented.

In any case, because the option is invalid, getopt() returns "?". The main program can examine Optopt if it needs to know what the invalid option letter actually is. Continuing on:

        if (substr(options, i + 1, 1) == ":") {
            # get option argument
            if (length(substr(argv[Optind], _opti + 1)) > 0)
                Optarg = substr(argv[Optind], _opti + 1)
            else
                Optarg = argv[++Optind]
            _opti = 0
        } else
            Optarg = ""

If the option requires an argument, the option letter is followed by a colon in the options string. If there are remaining characters in the current command-line argument (argv[Optind]), then the rest of that string is assigned to Optarg. Otherwise, the next command-line argument is used (‘-xFOO’ versus ‘-x FOO’). In either case, _opti is reset to zero, because there are no more characters left to examine in the current command-line argument. Continuing:

        if (_opti == 0 || _opti >= length(argv[Optind])) {
            Optind++
            _opti = 0
        } else
            _opti++
        return thisopt

Finally, for a short option, if _opti is either zero or greater than or equal to the length of the current command-line argument, it means this element in argv is through being processed, so Optind is incremented to point to the next element in argv. If neither condition is true, then only _opti is incremented, so that the next option letter can be processed on the next call to getopt().

On the other hand, if the earlier test found that this was a long option, we take a different branch:

    } else {
        j = index(argv[Optind], "=")
        if (j > 0)
            thisopt = substr(argv[Optind], 3, j - 3)
        else
            thisopt = substr(argv[Optind], 3)
        Optopt = thisopt

First, we search this option for a possible embedded equal sign, as the specification of long options allows an argument to an option ‘--someopt:’ to be specified as ‘--someopt=answer’ as well as ‘--someopt answer’.
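The effect of the index() and substr() calls above can be seen in isolation with a short sketch. The helper split_long() and the array keys "name" and "value" are hypothetical names chosen for this example.

```awk
# split_long --- hypothetical helper, for illustration only
# Separate "--name=value" into its name and value at the first "=".
function split_long(arg, parts,    j)
{
    j = index(arg, "=")
    if (j > 0) {
        parts["name"] = substr(arg, 3, j - 3)   # between "--" and "="
        parts["value"] = substr(arg, j + 1)     # all text after the first "="
    } else {
        parts["name"] = substr(arg, 3)          # everything after "--"
        parts["value"] = ""
    }
}

BEGIN {
    split_long("--longb=foo=bar", p)
    printf("name = <%s>, value = <%s>\n", p["name"], p["value"])
    split_long("--otherd", p)
    printf("name = <%s>, value = <%s>\n", p["name"], p["value"])
}
```

Because index() finds the first ‘=’ only, any later equal signs stay in the value, so ‘--longb=foo=bar’ yields the name ‘longb’ and the value ‘foo=bar’.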

        i = match(longopts, "(^|,)" thisopt "($|[,:])")
        if (i == 0) {
            if (Opterr)
                 printf("%s -- invalid option\n", thisopt) > "/dev/stderr"
            Optind++
            return "?"
        }

Next, we try to find the current option in longopts. The regular expression given to match(), "(^|,)" thisopt "($|[,:])", matches this option at the beginning of longopts, or at the beginning of a subsequent long option (the previous long option would have been terminated by a comma), and, in any case, either at the end of the longopts string (‘$’), or followed by a comma (separating this option from a subsequent option) or a colon (indicating that this long option takes an argument); the last two cases are covered by ‘[,:]’.
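A quick way to get a feel for this pattern is to run it against a sample long-option string. In the sketch below, find_long() is a hypothetical wrapper around the same match() call, and the option names are sample values chosen for illustration.

```awk
# find_long --- hypothetical helper, for illustration only
# Return the position at which name is found in longopts, or 0.
function find_long(longopts, name)
{
    return match(longopts, "(^|,)" name "($|[,:])")
}

BEGIN {
    opts = "longa,longb:,otherc,otherd"
    print find_long(opts, "longa")    # 1: matched at the start of the string
    print find_long(opts, "longb")    # 6: matched at the preceding comma
    print find_long(opts, "long")     # 0: a prefix alone does not match
    print find_long(opts, "therd")    # 0: neither does a suffix
}
```

Note that the reported position is that of the comma for every option except the first one in the string, because the comma is part of the matched text.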

Using this regular expression, we check to see if the current option might possibly be in longopts (if longopts is not specified, this test will also fail). In case of an error, we possibly print an error message and then return "?". Continuing on:

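        # The match begins at a comma unless the option is first in longopts,
        # so test the last matched character instead of a fixed offset.
        if (substr(longopts, i + RLENGTH - 1, 1) == ":") {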
            if (j > 0)
                Optarg = substr(argv[Optind], j + 1)
            else
                Optarg = argv[++Optind]
        } else
            Optarg = ""

We now check to see if this option takes an argument and, if so, we set Optarg to the value of that argument (either a value after an equal sign specified on the command line, immediately adjoining the long option string, or as the next argument on the command line).

        Optind++
        return thisopt
    }
}

We increase Optind (which we already increased once if a required argument was supplied as a separate command-line argument rather than after an equal sign), and return the long option (minus its leading dashes).

The BEGIN rule initializes both Opterr and Optind to one. Opterr is set to one, because the default behavior is for getopt() to print a diagnostic message upon seeing an invalid option. Optind is set to one, because there’s no reason to look at the program name, which is in ARGV[0]:

BEGIN {
    Opterr = 1    # default is to diagnose
    Optind = 1    # skip ARGV[0]

    # test program
    if (_getopt_test) {
        _myshortopts = "ab:cd"
        _mylongopts = "longa,longb:,otherc,otherd"

        while ((_go_c = getopt(ARGC, ARGV, _myshortopts, _mylongopts)) != -1)
            printf("c = <%s>, Optarg = <%s>\n", _go_c, Optarg)
        printf("non-option arguments:\n")
        for (; Optind < ARGC; Optind++)
            printf("\tARGV[%d] = <%s>\n", Optind, ARGV[Optind])
    }
}

The rest of the BEGIN rule is a simple test program. Here are the results of some sample runs of the test program:

$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
-| c = <a>, Optarg = <>
-| c = <c>, Optarg = <>
-| c = <b>, Optarg = <ARG>
-| non-option arguments:
-|         ARGV[3] = <bax>
-|         ARGV[4] = <-x>

$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
-| c = <a>, Optarg = <>
error→ x -- invalid option
-| c = <?>, Optarg = <>
-| non-option arguments:
-|         ARGV[4] = <xyz>
-|         ARGV[5] = <abc>

$ awk -f getopt.awk -v _getopt_test=1 -- -a \
> --longa -b xx --longb=foo=bar --otherd --otherc arg1 arg2
-| c = <a>, Optarg = <>
-| c = <longa>, Optarg = <>
-| c = <b>, Optarg = <xx>
-| c = <longb>, Optarg = <foo=bar>
-| c = <otherd>, Optarg = <>
-| c = <otherc>, Optarg = <>
-| non-option arguments:
-|         ARGV[8] = <arg1>
-|         ARGV[9] = <arg2>

In all the runs, the first -- terminates the arguments to awk, so that it does not try to interpret the -a, etc., as its own options.

NOTE: After getopt() is through, user-level code must clear out all the elements of ARGV from 1 up to, but not including, Optind, so that awk does not try to process the command-line options as file names.
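One way to do that cleanup is a short loop that blanks each consumed element, stopping just short of Optind, which indexes the first nonoption argument. The sketch below uses a hypothetical helper, clear_options(), and a made-up stand-in array rather than the real ARGV, so that it can be run without getopt.awk.

```awk
# clear_options --- hypothetical helper, for illustration only
# Blank out argv[1] through argv[optind - 1]; awk skips ARGV elements
# that are empty strings when it looks for input files.
function clear_options(argv, optind,    i)
{
    for (i = 1; i < optind; i++)
        argv[i] = ""
}

BEGIN {
    # Stand-in for ARGV after option parsing; the values are made up.
    demo[0] = "awk"; demo[1] = "-a"; demo[2] = "-b"; demo[3] = "foo"
    demo[4] = "data1"
    clear_options(demo, 4)      # pretend getopt() left Optind at 4
    for (i = 0; i <= 4; i++)
        printf("demo[%d] = <%s>\n", i, demo[i])
}
```

In a real program the call would be clear_options(ARGV, Optind), placed after the option-processing loop.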

Using ‘#!’ with the -E option may help avoid conflicts between your program’s options and gawk’s options, as -E causes gawk to abandon processing of further options (see section Executable awk Programs and see section Command-Line Options).

Several of the sample programs presented in Practical awk Programs use getopt() to process their arguments.



Footnotes

(73)

This function was written before gawk acquired the ability to split strings into single characters using "" as the separator. We have left it alone, as using substr() is more portable.