
17.4.5.4 Customized Input Parsers

By default, gawk reads text files as its input. It uses the value of RS to find the end of the record, and then uses FS (or FIELDWIDTHS or FPAT) to split it into fields (see section Reading Input Files). Additionally, it sets the value of RT (see section Predefined Variables).

If you want, you can provide your own custom input parser. An input parser’s job is to return a record to the gawk record-processing code, along with indicators for the value and length of the data to be used for RT, if any.

To provide an input parser, you must first provide two functions (where XXX is a prefix name for your extension):

awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);: This function examines the information available in iobuf (which we discuss shortly). Based on the information there, it decides if the input parser should be used for this file. If so, it should return true. Otherwise, it should return false. It should not change any state (variable values, etc.) within gawk.
awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);: When gawk decides to hand control of the file over to the input parser, it calls this function. This function in turn must fill in certain fields in the awk_input_buf_t structure and ensure that certain conditions are true. It should then return true. If an error of some kind occurs, it should not fill in any fields and should return false; then gawk will not use the input parser. The details are presented shortly.

Your extension should package these functions inside an awk_input_parser_t, which looks like this:

typedef struct awk_input_parser {
    const char *name;   /* name of parser */
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;   /* for gawk */
} awk_input_parser_t;

The fields are:

const char *name;: The name of the input parser. This is a regular C string.
awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);: A pointer to your XXX_can_take_file() function.
awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);: A pointer to your XXX_take_control_of() function.
awk_const struct awk_input_parser *awk_const next;: This is for use by gawk; therefore it is marked awk_const so that the extension cannot modify it.

The steps are as follows:

Create a static awk_input_parser_t variable and initialize it appropriately.
When your extension is loaded, register your input parser with gawk using the register_input_parser() API function (described later in this section).
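The two steps above can be sketched roughly as follows. This is only an illustration: the typedefs are simplified stand-ins so that the fragment compiles by itself (a real extension would include gawkapi.h instead), register_input_parser() is replaced by a local stub, and the demo_ names and init_demo() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins so the sketch compiles by itself. A real extension
 * would #include "gawkapi.h" and use the definitions it provides. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_input awk_input_buf_t;          /* left incomplete here */

typedef struct awk_input_parser {
    const char *name;
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    const struct awk_input_parser *next;           /* for gawk */
} awk_input_parser_t;

/* Stand-in for the API call: it only remembers the most recent parser. */
static awk_input_parser_t *registered_parser = NULL;

static void register_input_parser(awk_input_parser_t *input_parser)
{
    assert(input_parser != NULL);
    registered_parser = input_parser;
}

/* Hypothetical parser functions; both decline every file. */
static awk_bool_t demo_can_take_file(const awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

/* Step 1: a static, initialized awk_input_parser_t. */
static awk_input_parser_t demo_parser = {
    "demo",
    demo_can_take_file,
    demo_take_control_of,
    NULL
};

/* Step 2: register it from the extension's initialization routine. */
static awk_bool_t init_demo(void)
{
    register_input_parser(&demo_parser);
    return awk_true;
}
```

Keeping the awk_input_parser_t static matters because gawk keeps a pointer to it after registration, so it must outlive the call.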

An awk_input_buf_t looks like this:

typedef struct awk_input {
    const char *name;       /* filename */
    int fd;                 /* file descriptor */
#define INVALID_HANDLE (-1)
    void *opaque;           /* private data for input parsers */
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)();
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;       /* stat buf */
} awk_input_buf_t;

The fields can be divided into two categories: those for use (initially, at least) by XXX_can_take_file(), and those for use by XXX_take_control_of(). The first group of fields and their uses are as follows:

const char *name;: The name of the file.
int fd;: A file descriptor for the file. If gawk was able to open the file, then fd will not be equal to INVALID_HANDLE. Otherwise, it will.
struct stat sbuf;: If the file descriptor is valid, then gawk will have filled in this structure via a call to the fstat() system call.

The XXX_can_take_file() function should examine these fields and decide if the input parser should be used for the file. The decision can be made based upon gawk state (the value of a variable defined previously by the extension and set by awk code), the name of the file, whether or not the file descriptor is valid, the information in the struct stat, or any combination of these factors.

Once XXX_can_take_file() has returned true, and gawk has decided to use your input parser, it calls XXX_take_control_of(). That function then fills either the get_record field or the read_func field in the awk_input_buf_t. It must also ensure that fd is not set to INVALID_HANDLE. The following list describes the fields that may be filled by XXX_take_control_of():

void *opaque;

This is used to hold any state information needed by the input parser for this file. It is “opaque” to gawk. The input parser is not required to use this pointer.

int (*get_record)(char **out,
                  struct awk_input *iobuf,
                  int *errcode,
                  char **rt_start,
                  size_t *rt_len,
                  const awk_fieldwidth_info_t **field_width);

This function pointer should point to a function that creates the input records. Said function is the core of the input parser. Its behavior is described in the text following this list.

ssize_t (*read_func)();

This function pointer should point to a function that has the same behavior as the standard POSIX read() system call. It is an alternative to the get_record pointer. Its behavior is also described in the text following this list.

void (*close_func)(struct awk_input *iobuf);

This function pointer should point to a function that does the “teardown.” It should release any resources allocated by XXX_take_control_of(). It may also close the file. If it does so, it should set the fd field to INVALID_HANDLE.

If fd is still not INVALID_HANDLE after the call to this function, gawk calls the regular close() system call.

Having a “teardown” function is optional. If your input parser does not need it, do not set this field. In that case, gawk calls the regular close() system call on the file descriptor, so the descriptor should be valid.
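The following sketch shows one way XXX_take_control_of() and a matching teardown function might fit together. It is not taken from gawk: the structure is a simplified stand-in for the one in gawkapi.h (with read_func written with a full prototype), demo_state_t and the demo_ functions are hypothetical, and the get_record function is a placeholder that always reports end-of-file.

```c
#include <assert.h>
#include <stdio.h>      /* EOF */
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified stand-ins; a real extension would #include "gawkapi.h". */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t;   /* incomplete */
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)(int fd, void *buf, size_t count);
    void (*close_func)(struct awk_input *iobuf);
    struct stat sbuf;
} awk_input_buf_t;

/* Hypothetical private state kept in iobuf->opaque. */
typedef struct {
    char *buf;
    size_t cap;
} demo_state_t;

/* Placeholder record reader: always reports end-of-file. */
static int demo_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
                           char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    (void) out; (void) iobuf; (void) errcode;
    (void) rt_start; (void) field_width;
    *rt_len = 0;
    return EOF;
}

/* Teardown: release what take_control_of allocated and close the file. */
static void demo_close(awk_input_buf_t *iobuf)
{
    demo_state_t *st;

    assert(iobuf != NULL);
    st = iobuf->opaque;
    if (st != NULL) {
        free(st->buf);
        free(st);
        iobuf->opaque = NULL;
    }
    if (iobuf->fd != INVALID_HANDLE) {
        close(iobuf->fd);
        iobuf->fd = INVALID_HANDLE;    /* tells gawk not to close it again */
    }
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    demo_state_t *st;

    assert(iobuf != NULL);
    if (iobuf->fd == INVALID_HANDLE)
        return awk_false;              /* nothing filled in on failure */
    st = malloc(sizeof *st);
    if (st == NULL)
        return awk_false;
    st->cap = 4096;
    st->buf = malloc(st->cap);
    if (st->buf == NULL) {
        free(st);
        return awk_false;
    }
    /* Fill in the fields only once every allocation has succeeded. */
    iobuf->opaque = st;
    iobuf->get_record = demo_get_record;
    iobuf->close_func = demo_close;
    return awk_true;
}
```

Assigning the fields last keeps the awk_input_buf_t untouched on every failure path, which is what the description of XXX_take_control_of() asks for.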

The XXX_get_record() function does the work of creating input records. The parameters are as follows:

char **out: This is a pointer to a char * variable that is set to point to the record. gawk makes its own copy of the data, so the extension must manage this storage.
struct awk_input *iobuf: This is the awk_input_buf_t for the file. The fields should be used for reading data (fd) and for managing private state (opaque), if any.
int *errcode: If an error occurs, *errcode should be set to an appropriate code from <errno.h>.
char **rt_start, size_t *rt_len: If the concept of a “record terminator” makes sense, then *rt_start should be set to point to the data to be used for RT, and *rt_len should be set to the length of the data. Otherwise, *rt_len should be set to zero. gawk makes its own copy of this data, so the extension must manage this storage.
const awk_fieldwidth_info_t **field_width: If field_width is not NULL, then *field_width will be initialized to NULL, and the function may set it to point to a structure supplying field width information to override the default field parsing mechanism. Note that this structure will not be copied by gawk; it must persist at least until the next call to get_record or close_func. Note also that field_width is NULL when getline is assigning the results to a variable, thus field parsing is not needed. If the parser does set *field_width, then gawk uses this layout to parse the input record, and the PROCINFO["FS"] value will be "API" while this record is active in $0. The awk_fieldwidth_info_t data structure is described below.

The return value is the length of the buffer pointed to by *out, or EOF if end-of-file was reached or an error occurred.

It is guaranteed that errcode is a valid pointer, so there is no need to test for a NULL value. gawk sets *errcode to zero, so there is no need to set it unless an error occurs.

If an error does occur, the function should return EOF and set *errcode to a value greater than zero. gawk then automatically updates the ERRNO variable based on the value of *errcode. (In general, setting ‘*errcode = errno’ should do the right thing.)
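To make the calling convention concrete, here is a minimal sketch of a get_record function that returns newline-terminated lines and reports the newline as RT. It is an illustration under several assumptions, not gawk's own code: the types are cut-down stand-ins for those in gawkapi.h, line_state_t and line_get_record() are hypothetical names, the buffer is a fixed 4096 bytes (an overlong line is returned in pieces), and interrupted reads are not retried.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>      /* EOF */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified stand-ins; a real extension would #include "gawkapi.h". */
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t;   /* incomplete */
typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
} awk_input_buf_t;

/* Hypothetical per-file state, reached through iobuf->opaque. */
typedef struct {
    char   buf[4096];
    size_t start;       /* first unconsumed byte */
    size_t end;         /* one past the last valid byte */
} line_state_t;

static int line_get_record(char **out, awk_input_buf_t *iobuf, int *errcode,
                           char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    line_state_t *st;
    char *nl;
    ssize_t n;
    int len;

    assert(iobuf != NULL && iobuf->opaque != NULL);
    st = iobuf->opaque;
    (void) field_width;         /* untouched: gawk splits fields as usual */
    *rt_len = 0;                /* default: no record terminator */

    for (;;) {
        nl = memchr(st->buf + st->start, '\n', st->end - st->start);
        if (nl != NULL) {
            *out = st->buf + st->start;
            *rt_start = nl;     /* RT is the newline itself */
            *rt_len = 1;
            len = (int) (nl - *out);
            st->start = (size_t) (nl - st->buf) + 1;
            return len;
        }
        /* No terminator buffered: slide pending bytes to the front. */
        if (st->start > 0) {
            memmove(st->buf, st->buf + st->start, st->end - st->start);
            st->end -= st->start;
            st->start = 0;
        }
        if (st->end == sizeof st->buf) {
            /* Simplification: an overlong line is returned in pieces. */
            *out = st->buf;
            st->start = st->end;
            return (int) st->end;
        }
        n = read(iobuf->fd, st->buf + st->end, sizeof st->buf - st->end);
        if (n < 0) {
            *errcode = errno;   /* gawk uses this to update ERRNO */
            return EOF;
        }
        if (n == 0) {
            if (st->end == 0)
                return EOF;     /* clean end-of-file; *errcode stays zero */
            *out = st->buf;     /* final record without a terminator */
            len = (int) st->end;
            st->start = st->end = 0;
            return len;
        }
        st->end += (size_t) n;
    }
}
```

The record and RT pointers refer into the parser's own buffer. That is sufficient because gawk copies both before the next call, and it is why the buffer is only rearranged at the start of a later call.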

As an alternative to supplying a function that returns an input record, you may instead supply a function that simply reads bytes, and let gawk parse the data into records. If you do so, the data should be returned in the multibyte encoding of the current locale. Such a function should follow the same behavior as the read() system call, and you fill in the read_func pointer with its address in the awk_input_buf_t structure.

By default, gawk sets the read_func pointer to point to the read() system call. So your extension need not set this field explicitly.

NOTE: You must choose one method or the other: either a function that returns a record, or one that returns raw data. In particular, if you supply a function to get a record, gawk will call it, and will never call the raw read function.
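For the raw-data alternative, a sketch along the following lines may help. It installs a read()-compatible function that turns NUL bytes into spaces and leaves get_record unset, so that gawk would do the record splitting itself. The filtering policy and the filter_ names are hypothetical, the structure is a simplified stand-in for the one in gawkapi.h, and read_func is given a full prototype here so that the compiler can check the assignment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified stand-ins; a real extension would #include "gawkapi.h". */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t;   /* incomplete */
#define INVALID_HANDLE (-1)

typedef struct awk_input {
    const char *name;
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf,
                      int *errcode, char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    ssize_t (*read_func)(int fd, void *buf, size_t count);
} awk_input_buf_t;

/* Same contract as read(): returns the byte count, 0 at end-of-file,
 * or -1 with errno set. Hypothetical filter: NUL bytes become spaces. */
static ssize_t filter_read(int fd, void *buf, size_t count)
{
    char *p = buf;
    ssize_t n, i;

    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);     /* retry interrupted reads */

    for (i = 0; i < n; i++) {
        if (p[i] == '\0')
            p[i] = ' ';
    }
    return n;
}

static awk_bool_t filter_take_control_of(awk_input_buf_t *iobuf)
{
    assert(iobuf != NULL);
    if (iobuf->fd == INVALID_HANDLE)
        return awk_false;
    /* Only read_func is set; get_record is deliberately left alone. */
    iobuf->read_func = filter_read;
    return awk_true;
}
```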

gawk ships with a sample extension that reads directories, returning records for each entry in a directory (see section Reading Directories). You may wish to use that code as a guide for writing your own input parser.

When writing an input parser, you should think about (and document) how it is expected to interact with awk code. You may want it to always be called, and to take effect as appropriate (as the readdir extension does). Or you may want it to take effect based upon the value of an awk variable, as the XML extension from the gawkextlib project does (see section The gawkextlib Project). In the latter case, code in a BEGINFILE rule can look at FILENAME and ERRNO to decide whether or not to activate an input parser (see section The BEGINFILE and ENDFILE Special Patterns).

You register your input parser with the following function:

void register_input_parser(awk_input_parser_t *input_parser);: Register the input parser pointed to by input_parser with gawk.

If you would like to override the default field parsing mechanism for a given record, then you must populate an awk_fieldwidth_info_t structure, which looks like this:

typedef struct {
        awk_bool_t     use_chars; /* false ==> use bytes */
        size_t         nf;        /* number of fields in record (NF) */
        struct awk_field_info {
                size_t skip;      /* amount to skip before field starts */
                size_t len;       /* length of field */
        } fields[1];              /* actual dimension should be nf */
} awk_fieldwidth_info_t;

The fields are:

awk_bool_t use_chars;: Set this to awk_true if the field lengths are specified in terms of potentially multi-byte characters, and set it to awk_false if the lengths are in terms of bytes. Performance will be better if the values are supplied in terms of bytes.
size_t nf;: Set this to the number of fields in the input record, i.e. NF.
struct awk_field_info fields[nf];: This is a variable-length array whose actual dimension should be nf. For each field, the skip element should be set to the number of characters or bytes, as controlled by the use_chars flag, to skip before the start of this field. The len element provides the length of the field. The values in fields[0] provide the information for $1, and so on through the fields[nf-1] element containing the information for $NF.

A convenience macro awk_fieldwidth_info_size(numfields) is provided to calculate the appropriate size of a variable-length awk_fieldwidth_info_t structure containing numfields fields. This can be used as an argument to malloc() or in a union to allocate space statically. Please refer to the readdir_test sample extension for an example.
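A sketch of allocating and filling such a structure with malloc() might look like the following. The structure is copied from the definition above so that the fragment compiles by itself; the macro body is written on the assumption that it matches the one in gawkapi.h, and make_fixed_layout() is a hypothetical helper that describes nf fields of equal byte width separated by a single byte.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins mirroring the definitions shown above; a real extension
 * would #include "gawkapi.h" instead. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;

typedef struct {
    awk_bool_t     use_chars; /* false ==> use bytes */
    size_t         nf;        /* number of fields in record (NF) */
    struct awk_field_info {
        size_t skip;          /* amount to skip before field starts */
        size_t len;           /* length of field */
    } fields[1];              /* actual dimension should be nf */
} awk_fieldwidth_info_t;

/* Assumed to match the convenience macro provided by gawkapi.h. */
#define awk_fieldwidth_info_size(NF) \
    (sizeof(awk_fieldwidth_info_t) + \
     (((NF) - 1) * sizeof(struct awk_field_info)))

/* Hypothetical helper: nf fields of `width` bytes each, with one
 * separator byte skipped before every field except the first.
 * The caller owns the result and releases it with free(). */
static awk_fieldwidth_info_t *make_fixed_layout(size_t nf, size_t width)
{
    awk_fieldwidth_info_t *fw;
    size_t i;

    assert(nf >= 1);
    fw = malloc(awk_fieldwidth_info_size(nf));
    if (fw == NULL)
        return NULL;

    fw->use_chars = awk_false;      /* lengths are in bytes */
    fw->nf = nf;
    for (i = 0; i < nf; i++) {
        fw->fields[i].skip = (i == 0) ? 0 : 1;
        fw->fields[i].len = width;  /* fields[i] describes $(i+1) */
    }
    return fw;
}
```

A get_record function could store such a pointer through *field_width when field_width is not NULL. Since gawk does not copy the structure, it would have to stay allocated at least until the next call to get_record or close_func.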