Igawk Program (The GNU Awk User’s Guide)
11.3.9 An Easy Way to Use Library Functions
In Including Other Files into Your Program, we saw how gawk provides a built-in file-inclusion capability. However, this is a gawk extension. This section provides the motivation for making file inclusion available for standard awk, and shows how to do it using a combination of shell and awk programming.
Using library functions in awk can be very beneficial. It encourages code reuse and the writing of general functions. Programs are smaller and therefore clearer. However, using library functions is only easy when writing awk programs; it is painful when running them, requiring multiple -f options. If gawk is unavailable, then so too is the AWKPATH environment variable and the ability to put awk functions into a library directory (see section Command-Line Options). It would be nice to be able to write programs in the following manner:
# library functions
@include getopt.awk
@include join.awk
…
# main program
BEGIN {
    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
        …
    …
}
The following program, igawk.sh, provides this service. It simulates gawk’s searching of the AWKPATH variable and also allows nested includes (i.e., a file that is included with @include can contain further @include statements). igawk makes an effort to only include files once, so that nested includes don’t accidentally include a library function twice.
igawk should behave just like gawk externally. This means it should accept all of gawk’s command-line arguments, including the ability to have multiple source files specified via -f and the ability to mix command-line and library source files.
The program is written using the POSIX Shell (sh) command language.(81) It works as follows:
- Loop through the arguments, saving anything that doesn’t represent awk source code for later, when the expanded program is run.
- For any arguments that do represent awk text, put the arguments into a shell variable that will be expanded. There are two cases:
  - Literal text, provided with -e or --source. This text is just appended directly.
  - Source file names, provided with -f. We use a neat trick and append ‘@include filename’ to the shell variable’s contents. Because the file-inclusion program works the way gawk does, this gets the text of the file included in the program at the correct point.
- Run an awk program (naturally) over the shell variable’s contents to expand @include statements. The expanded program is placed in a second shell variable.
- Run the expanded program with gawk and any other original command-line arguments that the user supplied (such as the data file names).
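The ‘@include filename’ trick from the list above can be tried in isolation. The following is a minimal sketch, not the igawk script itself: build_program is a hypothetical helper that handles only -f and --source, and the file names are made up for illustration.

```shell
# A literal newline, used to separate pieces of program text
n='
'

# build_program is a hypothetical helper, not part of igawk.
# It collects program text from -f and --source into $program.
build_program() {
    program=
    while [ $# -ne 0 ]
    do
        case $1 in
        -f)       program="$program$n@include $2"   # file: defer to @include
                  shift ;;
        --source) program="$program$n$2"            # literal text: append as-is
                  shift ;;
        *)        break ;;                          # first non-option argument
        esac
        shift
    done
}

build_program -f getopt.awk --source 'BEGIN { print "hi" }' data.txt
printf '%s\n' "$program"
```

The file named with -f is never opened here. It only becomes an @include line, so its text lands at the right point when the include expander runs later.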
This program uses shell variables extensively: for storing command-line arguments and the text of the awk program that will expand the user’s program, for the user’s original program, and for the expanded program. Doing so removes some potential problems that might arise were we to use temporary files instead, at the cost of making the script somewhat more complicated.
The initial part of the program turns on shell tracing if the first argument is ‘debug’.
The next part loops through all the command-line arguments. There are several cases of interest:
--
    This ends the arguments to igawk. Anything else should be passed on to the user’s awk program without being evaluated.
-W
    This indicates that the next option is specific to gawk. To make argument processing easier, the -W is prepended to the remaining arguments and the loop continues. (This is an sh programming trick. Don’t worry about it if you are not familiar with sh.)
-v, -F
    These are saved and passed on to gawk.
-f, --file, --file=, -Wfile=
    The file name is appended to the shell variable program with an @include statement. The expr utility is used to remove the leading option part of the argument (e.g., ‘--file=’). (Typical sh usage would be to use the echo and sed utilities to do this work. Unfortunately, some versions of echo evaluate escape sequences in their arguments, possibly mangling the program text. Using expr avoids this problem.)
--source, --source=, -Wsource=
    The source text is appended to program.
--version, -Wversion
    igawk prints its version number, runs ‘gawk --version’ to get the gawk version information, and then exits.
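The -W trick and the use of expr described above are easier to follow when run on a concrete argument list. This is a small sketch under assumed inputs: lib.awk and data.txt are hypothetical names, and the positional parameters are set by hand rather than taken from a real command line.

```shell
# Simulate the command-line arguments: -W file=lib.awk data.txt
set -- -W file=lib.awk data.txt

shift               # drop the bare -W
set -- -W"$@"       # glue -W onto the front of the next argument only
first=$1            # the option and its operand are now one word
count=$#            # the remaining arguments are untouched

# Strip the leading option part, as is done for --file= and -Wfile=
f=$(expr "$first" : '-.file=\(.*\)')

printf '%s\n' "$first" "$count" "$f"
```

After the `set --` line, the argument list looks as if the user had typed -Wfile=lib.awk, so a single case pattern can handle both spellings.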
If none of the -f, --file, -Wfile, --source, or -Wsource arguments are supplied, then the first nonoption argument should be the awk program. If there are no command-line arguments left, igawk prints an error message and exits. Otherwise, the first argument is appended to program. In any case, after the arguments have been processed, the shell variable program contains the complete text of the original awk program.
The program is as follows:
#! /bin/sh
# igawk --- like gawk but do @include processing
if [ "$1" = debug ]
then
    set -x
    shift
fi
# A literal newline, so that program text is formatted correctly
n='
'
# Initialize variables to empty
program=
opts=
while [ $# -ne 0 ] # loop over arguments
do
    case $1 in
    --)     shift
            break ;;
    -W)     shift
            # The ${x?'message here'} construct prints a
            # diagnostic if $x is unset
            set -- -W"${@?'missing operand'}"
            continue ;;
    -[vF])  opts="$opts $1 '${2?'missing operand'}'"
            shift ;;
    -[vF]*) opts="$opts '$1'" ;;
    -f)     program="$program$n@include ${2?'missing operand'}"
            shift ;;
    -f*)    f=$(expr "$1" : '-f\(.*\)')
            program="$program$n@include $f" ;;
    -[W-]file=*)
            f=$(expr "$1" : '-.file=\(.*\)')
            program="$program$n@include $f" ;;
    -[W-]file)
            program="$program$n@include ${2?'missing operand'}"
            shift ;;
    -[W-]source=*)
            t=$(expr "$1" : '-.source=\(.*\)')
            program="$program$n$t" ;;
    -[W-]source)
            program="$program$n${2?'missing operand'}"
            shift ;;
    -[W-]version)
            echo igawk: version 3.0 1>&2
            gawk --version
            exit 0 ;;
    -[W-]*) opts="$opts '$1'" ;;
    *)      break ;;
    esac
    shift
done
if [ -z "$program" ]
then
    program=${1?'missing program'}
    shift
fi
# At this point, `program' has the program.
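The script leans on the ‘${2?'missing operand'}’ form of parameter expansion to catch options that lack an operand. The sketch below shows how that expansion behaves in two cases. The subshells and the variable names unset_case and null_case are illustrative additions, not part of igawk.

```shell
# Case 1: -f with no operand at all, so $2 is unset.
# The subshell keeps the failed expansion from ending this script.
if (set -- -f; : "${2?missing operand}") 2>/dev/null
then unset_case=accepted
else unset_case=rejected
fi

# Case 2: -f followed by an empty string, so $2 is set but null.
if (set -- -f ''; : "${2?missing operand}") 2>/dev/null
then null_case=accepted
else null_case=rejected
fi

printf '%s\n' "unset: $unset_case" "null: $null_case"
```

Only the unset case is rejected. The ‘${x:?message}’ form, with a colon, would reject the empty string as well.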
The awk program to process @include directives is stored in the shell variable expand_prog. Doing this keeps the shell script readable. The awk program reads through the user’s program, one line at a time, using getline (see section Explicit Input with getline). The input file names and @include statements are managed using a stack. As each @include is encountered, the current file name is “pushed” onto the stack and the file named in the @include directive becomes the current file name. As each file is finished, the stack is “popped,” and the previous input file becomes the current input file again. The process is started by making the original file the first one on the stack.
The pathto() function does the work of finding the full path to a file. It simulates gawk’s behavior when searching the AWKPATH environment variable (see section The AWKPATH Environment Variable). If a file name has a ‘/’ in it, no path search is done. Similarly, if the file name is "-", then that string is used as-is. Otherwise, the file name is concatenated with the name of each directory in the path, and an attempt is made to open the generated file name. The only way to test if a file can be read in awk is to go ahead and try to read it with getline; this is what pathto() does.(82) If the file can be read, it is closed and the file name is returned:
expand_prog='
function pathto(file,    i, t, junk)
{
    if (index(file, "/") != 0)
        return file
    if (file == "-")
        return file
    for (i = 1; i <= ndirs; i++) {
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) {
            # found it
            close(t)
            return t
        }
    }
    return ""
}
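To watch pathto() work outside of igawk, the function can be dropped into a small test harness. This is only a sketch: the scratch directory, the file hello.awk, and the search path are hypothetical, and plain awk is assumed to be on the PATH.

```shell
# Hypothetical scratch directory and library file, for illustration only
libdir=$(mktemp -d)
echo 'function hello() { print "hi" }' > "$libdir/hello.awk"

found=$(AWKPATH="/nonexistent:$libdir" awk '
function pathto(file,    i, t, junk)
{
    if (index(file, "/") != 0)
        return file
    if (file == "-")
        return file
    for (i = 1; i <= ndirs; i++) {
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) {
            close(t)        # readable, so this is the one
            return t
        }
    }
    return ""
}
BEGIN {
    ndirs = split(ENVIRON["AWKPATH"], pathlist, ":")
    print pathto("hello.awk")               # located via the search path
    print pathto("./local.awk")             # has a slash: returned untouched
    print "[" pathto("missing.awk") "]"     # not found: empty string
}')
printf '%s\n' "$found"
rm -rf "$libdir"
```

Note that a name containing a slash is returned without checking that the file exists. Any failure for such a name shows up later, when the main loop tries to read it.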
The main program is contained inside one BEGIN rule. The first thing it does is set up the pathlist array that pathto() uses. After splitting the path on ‘:’, null elements are replaced with ".", which represents the current directory:
BEGIN {
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) {
        if (pathlist[i] == "")
            pathlist[i] = "."
    }
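The effect of replacing null path elements can be checked with a short experiment. The AWKPATH value below is an assumed example with a leading empty element and an empty element in the middle. The directories named in it do not need to exist.

```shell
dirs=$(AWKPATH=':/usr/local/share/awk::/opt/awk' awk '
BEGIN {
    ndirs = split(ENVIRON["AWKPATH"], pathlist, ":")
    for (i = 1; i <= ndirs; i++) {
        if (pathlist[i] == "")
            pathlist[i] = "."       # null element means current directory
        printf "%d=%s\n", i, pathlist[i]
    }
}')
printf '%s\n' "$dirs"
```

Elements 1 and 3 come out as ".", so a later search tries the current directory at those positions in the path.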
The stack is initialized with ARGV[1], which will be "/dev/stdin". The main loop comes next. Input lines are read in succession. Lines that do not start with @include are printed verbatim. If the line does start with @include, the file name is in $2. pathto() is called to generate the full path. If it cannot, then the program prints an error message and continues.
The next thing to check is if the file is included already. The processed array is indexed by the full file name of each included file and it tracks this information for us. If the file is seen again, a warning message is printed. Otherwise, the new file name is pushed onto the stack and processing continues.
Finally, when getline encounters the end of the input file, the file is closed and the stack is popped. When stackptr is less than zero, the program is done:
    stackptr = 0
    input[stackptr] = ARGV[1] # ARGV[1] is first file
    for (; stackptr >= 0; stackptr--) {
        while ((getline < input[stackptr]) > 0) {
            if (tolower($1) != "@include") {
                print
                continue
            }
            fpath = pathto($2)
            if (fpath == "") {
                printf("igawk: %s:%d: cannot find %s\n",
                    input[stackptr], FNR, $2) > "/dev/stderr"
                continue
            }
            if (! (fpath in processed)) {
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath # push onto stack
            } else
                print $2, "included in", input[stackptr],
                    "already included in",
                    processed[fpath] > "/dev/stderr"
        }
        close(input[stackptr])
    }
}' # close quote ends `expand_prog' variable
processed_program=$(gawk -- "$expand_prog" /dev/stdin << EOF
$program
EOF
)
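The stack discipline is easiest to see on a tiny set of files. The following is a cut-down sketch, not the expander above: it skips pathto() and the warning messages, the three files are hypothetical, and every @include names its file by full path. It also assumes the scratch directory name contains no whitespace, because the file name is taken from $2.

```shell
work=$(mktemp -d)       # hypothetical scratch directory
printf '%s\n' '# inner library' \
    'function inner() { return 1 }' > "$work/inner.awk"
printf '%s\n' "@include $work/inner.awk" \
    'function outer() { return inner() }' > "$work/outer.awk"
printf '%s\n' "@include $work/outer.awk" \
    "@include $work/inner.awk" \
    'BEGIN { print outer() }' > "$work/main.awk"

expanded=$(awk '
BEGIN {
    stackptr = 0
    input[stackptr] = ARGV[1]
    for (; stackptr >= 0; stackptr--) {
        while ((getline < input[stackptr]) > 0) {
            if (tolower($1) != "@include") {
                print
                continue
            }
            if (! ($2 in processed)) {
                processed[$2] = input[stackptr]
                input[++stackptr] = $2      # push onto stack
            }                               # else: silently skip repeats
        }
        close(input[stackptr])              # pop happens in the for loop
    }
}' "$work/main.awk")

printf '%s\n' "$expanded"
result=$(awk "$expanded")
rm -rf "$work"
```

The text of inner.awk appears once, ahead of the function from outer.awk that needs it. The second request for inner.awk in main.awk is dropped because the file is already in the processed array.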
The shell construct ‘command << marker’ is called a here document. Everything in the shell script up to the marker is fed to command as input. The shell processes the contents of the here document for variable and command substitution (and possibly other things as well, depending upon the shell).
The shell construct ‘$(…)’ is called command substitution. The output of the command inside the parentheses is substituted into the command line. Because the result is used in a variable assignment, it is saved as a single string, even if the results contain whitespace.
The expanded program is saved in the variable processed_program. It’s done in these steps:
- Run gawk with the @include-processing program (the value of the expand_prog shell variable) reading standard input.
- Standard input is the contents of the user’s program, from the shell variable program. Feed its contents to gawk via a here document.
- Save the results of this processing in the shell variable processed_program by using command substitution.
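The combination of a here document and command substitution described in these steps can be tried with any command that copies its input. In this sketch cat stands in for gawk, and the variable names text and copy are made up for the example.

```shell
# Program text containing dollar signs, as awk programs usually do
text='{ print $1, $2 }
END { print NR }'

# Feed the variable to a command through a here document and
# capture whatever the command writes
copy=$(cat << EOF
$text
EOF
)
printf '%s\n' "$copy"
```

The shell expands $text once when it builds the here document. The ‘$1’ and ‘$2’ inside the value are not expanded again, so the program text arrives intact.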
The last step is to call gawk with the expanded program, along with the original options and command-line arguments that the user supplied:
eval gawk $opts -- '"$processed_program"' '"$@"'
The eval command is a shell construct that reruns the shell’s parsing process. This keeps things properly quoted.
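A short experiment shows what eval does here. This sketch assumes a saved option string in the same single-quoted form that the argument loop builds. It uses plain awk instead of gawk so that it runs wherever a POSIX awk is installed, and the variable name greeting is hypothetical.

```shell
# An option saved with embedded single quotes, as the argument loop does
opts="-v 'greeting=hello world'"
prog='BEGIN { print greeting }'

# eval reparses the line, so the saved quotes group "hello world" into
# one argument, and "$prog" is expanded only during that second pass
result=$(eval awk $opts -- '"$prog"')
printf '%s\n' "$result"
```

Without eval, the unquoted $opts would be split at every space and the quote characters would be passed to awk literally.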
This is the fifth version of igawk. Four key simplifications make the program work better:
- Using @include even for the files named with -f makes building the initial collected awk program much simpler; all the @include processing can be done once.
- Not trying to save the line read with getline in the pathto() function when testing for the file’s accessibility for use with the main program simplifies things considerably.
- Using a getline loop in the BEGIN rule does it all in one place. It is not necessary to call out to a separate loop for processing nested @include statements.
- Instead of saving the expanded program in a temporary file, putting it in a shell variable avoids some potential security problems. This has the disadvantage that the script relies upon more features of the sh language, making it harder to follow for those who aren’t familiar with sh.
Also, this program illustrates that it is often worthwhile to combine sh and awk programming. You can usually accomplish quite a lot without resorting to low-level programming in C or C++, and certain kinds of string and argument manipulation are frequently easier in the shell than in awk.
Finally, igawk shows that it is not always necessary to add new features to a program; they can often be layered on top.(83)
Footnotes
(81)
Fully explaining the sh language is beyond the scope of this book. We provide some minimal explanations, but see a good shell programming book if you wish to understand things in more depth.
(82)
On some very old versions of awk, the test ‘getline junk < t’ can loop forever if the file exists but is empty.
(83)
gawk does @include processing itself in order to support the use of awk programs as Web CGI scripts.