Gawk/Input-Parsers
17.4.5.4 Customized Input Parsers
By default, gawk reads text files as its input. It uses the value of RS to find the end of the record, and then uses FS (or FIELDWIDTHS or FPAT) to split it into fields (see section Reading Input Files). Additionally, it sets the value of RT (see section Predefined Variables).
If you want, you can provide your own custom input parser. An input parser’s job is to return a record to the gawk record-processing code, along with indicators for the value and length of the data to be used for RT, if any.
To provide an input parser, you must first provide two functions (where XXX is a prefix name for your extension):
awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);
- This function examines the information available in iobuf (which we discuss shortly). Based on the information there, it decides if the input parser should be used for this file. If so, it should return true. Otherwise, it should return false. It should not change any state (variable values, etc.) within gawk.
awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);
- When gawk decides to hand control of the file over to the input parser, it calls this function. This function in turn must fill in certain fields in the awk_input_buf_t structure and ensure that certain conditions are true. It should then return true. If an error of some kind occurs, it should not fill in any fields and should return false; then gawk will not use the input parser. The details are presented shortly.
Your extension should package these functions inside an awk_input_parser_t, which looks like this:
    typedef struct awk_input_parser {
        const char *name;   /* name of parser */
        awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
        awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
        awk_const struct awk_input_parser *awk_const next;   /* for gawk */
    } awk_input_parser_t;
The fields are:
const char *name;
- The name of the input parser. This is a regular C string.
awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
- A pointer to your XXX_can_take_file() function.
awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
- A pointer to your XXX_take_control_of() function.
awk_const struct awk_input_parser *awk_const next;
- This is for use by gawk; therefore it is marked awk_const so that the extension cannot modify it.
The steps are as follows:
- Create a static awk_input_parser_t variable and initialize it appropriately.
- When your extension is loaded, register your input parser with gawk using the register_input_parser() API function (described next).
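As a rough illustration of the first step, the following sketch packages two placeholder callbacks into a static awk_input_parser_t. The typedefs at the top are local stand-ins that mirror the declarations in this section so that the fragment compiles on its own; a real extension would include gawkapi.h instead. The prefix demo and both callback bodies are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins mirroring this section's declarations; a real
   extension includes gawkapi.h instead of defining these. */
typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define awk_const const
typedef struct awk_input awk_input_buf_t;   /* left opaque in this sketch */

typedef struct awk_input_parser {
    const char *name;
    awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);
    awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);
    awk_const struct awk_input_parser *awk_const next;   /* for gawk */
} awk_input_parser_t;

/* Hypothetical callbacks: this placeholder parser declines every file. */
static awk_bool_t demo_can_take_file(const awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    (void) iobuf;
    return awk_false;
}

/* Step 1: a static, fully initialized parser description. */
static awk_input_parser_t demo_parser = {
    "demo",
    demo_can_take_file,
    demo_take_control_of,
    NULL    /* next: reserved for gawk */
};
```

For the second step, the extension's load-time code would pass &demo_parser to register_input_parser(); that call is omitted here because it needs the real API.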
An awk_input_buf_t looks like this:
    typedef struct awk_input {
        const char *name;       /* filename */
        int fd;                 /* file descriptor */
    #define INVALID_HANDLE (-1)
        void *opaque;           /* private data for input parsers */
        int (*get_record)(char **out, struct awk_input *iobuf,
                          int *errcode, char **rt_start, size_t *rt_len,
                          const awk_fieldwidth_info_t **field_width);
        ssize_t (*read_func)();
        void (*close_func)(struct awk_input *iobuf);
        struct stat sbuf;       /* stat buf */
    } awk_input_buf_t;
The fields can be divided into two categories: those for use (initially, at least) by XXX_can_take_file(), and those for use by XXX_take_control_of(). The first group of fields and their uses are as follows:
const char *name;
- The name of the file.
int fd;
- A file descriptor for the file. If gawk was able to open the file, then fd will not be equal to INVALID_HANDLE. Otherwise, it will.
struct stat sbuf;
- If the file descriptor is valid, then gawk will have filled in this structure via a call to the fstat() system call.
The XXX_can_take_file() function should examine these fields and decide if the input parser should be used for the file. The decision can be made based upon gawk state (the value of a variable defined previously by the extension and set by awk code), the name of the file, whether or not the file descriptor is valid, the information in the struct stat, or any combination of these factors.
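One way such a decision might be written is sketched below. The policy (accept only regular files whose name ends in ".rec") and the prefix rec are hypothetical, and the structure is a local stand-in that reproduces only the first group of fields; a real extension would use the definition from gawkapi.h.

```c
#define _XOPEN_SOURCE 700   /* expose S_ISREG and friends in strict modes */
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)

/* Stand-in: only the first group of fields is reproduced here. */
typedef struct awk_input {
    const char *name;   /* filename */
    int fd;             /* file descriptor */
    struct stat sbuf;   /* stat buf */
} awk_input_buf_t;

/* Hypothetical policy: accept regular files whose name ends in ".rec".
   The function only inspects iobuf; it changes no state. */
static awk_bool_t rec_can_take_file(const awk_input_buf_t *iobuf)
{
    static const char suffix[] = ".rec";
    size_t len, slen = sizeof(suffix) - 1;

    if (iobuf == NULL || iobuf->name == NULL)
        return awk_false;
    if (iobuf->fd == INVALID_HANDLE)      /* gawk could not open the file */
        return awk_false;
    if (!S_ISREG(iobuf->sbuf.st_mode))    /* sbuf is filled in when fd is valid */
        return awk_false;

    len = strlen(iobuf->name);
    if (len < slen || strcmp(iobuf->name + len - slen, suffix) != 0)
        return awk_false;
    return awk_true;
}
```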
Once XXX_can_take_file() has returned true, and gawk has decided to use your input parser, it calls XXX_take_control_of(). That function then fills either the get_record field or the read_func field in the awk_input_buf_t. It must also ensure that fd is not set to INVALID_HANDLE. The following list describes the fields that may be filled by XXX_take_control_of():
void *opaque;
- This is used to hold any state information needed by the input parser for this file. It is “opaque” to gawk. The input parser is not required to use this pointer.
int (*get_record)(char **out,
                  struct awk_input *iobuf,
                  int *errcode,
                  char **rt_start,
                  size_t *rt_len,
                  const awk_fieldwidth_info_t **field_width);
- This function pointer should point to a function that creates the input records. Said function is the core of the input parser. Its behavior is described in the text following this list.
ssize_t (*read_func)();
- This function pointer should point to a function that has the same behavior as the standard POSIX read() system call. It is an alternative to the get_record pointer. Its behavior is also described in the text following this list.
void (*close_func)(struct awk_input *iobuf);
- This function pointer should point to a function that does the “teardown.” It should release any resources allocated by XXX_take_control_of(). It may also close the file. If it does so, it should set the fd field to INVALID_HANDLE.
  If fd is still not INVALID_HANDLE after the call to this function, gawk calls the regular close() system call.
  Having a “teardown” function is optional. If your input parser does not need it, do not set this field. Then, gawk calls the regular close() system call on the file descriptor, so it should be valid.
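The following sketch shows how a take-control function and its matching teardown function might fit together. It is an illustration under stated assumptions, not the way any shipped extension does it: the structures are local stand-ins with only the members touched here, awk_fieldwidth_info_t is left opaque, and the prefix demo, the demo_state_t type, and the 4096-byte buffer are all hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>     /* EOF */
#include <stdlib.h>
#include <unistd.h>    /* close() */

typedef enum { awk_false = 0, awk_true } awk_bool_t;
#define INVALID_HANDLE (-1)
typedef struct awk_fieldwidth_info awk_fieldwidth_info_t;  /* opaque here */

/* Stand-in with only the members this sketch touches. */
typedef struct awk_input {
    int fd;
    void *opaque;
    int (*get_record)(char **out, struct awk_input *iobuf, int *errcode,
                      char **rt_start, size_t *rt_len,
                      const awk_fieldwidth_info_t **field_width);
    void (*close_func)(struct awk_input *iobuf);
} awk_input_buf_t;

/* Hypothetical per-file state kept behind the opaque pointer. */
typedef struct {
    char *buf;
    size_t size;
} demo_state_t;

/* Placeholder record function: reports end-of-file immediately. */
static int demo_get_record(char **out, struct awk_input *iobuf, int *errcode,
                           char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    (void) out; (void) iobuf; (void) errcode;
    (void) rt_start; (void) field_width;
    *rt_len = 0;
    return EOF;
}

/* Teardown: release what demo_take_control_of() allocated, close the file. */
static void demo_close(struct awk_input *iobuf)
{
    demo_state_t *st = iobuf->opaque;

    if (st != NULL) {
        free(st->buf);
        free(st);
        iobuf->opaque = NULL;
    }
    if (iobuf->fd != INVALID_HANDLE) {
        close(iobuf->fd);
        iobuf->fd = INVALID_HANDLE;   /* so gawk does not close it again */
    }
}

static awk_bool_t demo_take_control_of(awk_input_buf_t *iobuf)
{
    demo_state_t *st;

    if (iobuf->fd == INVALID_HANDLE)
        return awk_false;             /* this sketch requires an open file */

    st = malloc(sizeof *st);
    if (st == NULL)
        return awk_false;             /* on error, no fields are filled in */
    st->size = 4096;
    st->buf = malloc(st->size);
    if (st->buf == NULL) {
        free(st);
        return awk_false;
    }

    iobuf->opaque = st;
    iobuf->get_record = demo_get_record;
    iobuf->close_func = demo_close;
    return awk_true;
}
```

All allocation happens before any field of iobuf is written, so a failure leaves the structure untouched, as the description of XXX_take_control_of() requires.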
The XXX_get_record() function does the work of creating input records. The parameters are as follows:
char **out
- This is a pointer to a char * variable that is set to point to the record. gawk makes its own copy of the data, so the extension must manage this storage.
struct awk_input *iobuf
- This is the awk_input_buf_t for the file. The fields should be used for reading data (fd) and for managing private state (opaque), if any.
int *errcode
- If an error occurs, *errcode should be set to an appropriate code from <errno.h>.
char **rt_start
size_t *rt_len
- If the concept of a “record terminator” makes sense, then *rt_start should be set to point to the data to be used for RT, and *rt_len should be set to the length of the data. Otherwise, *rt_len should be set to zero. gawk makes its own copy of this data, so the extension must manage this storage.
const awk_fieldwidth_info_t **field_width
- If field_width is not NULL, then *field_width will be initialized to NULL, and the function may set it to point to a structure supplying field width information to override the default field parsing mechanism. Note that this structure will not be copied by gawk; it must persist at least until the next call to get_record or close_func. Note also that field_width is NULL when getline is assigning the results to a variable, thus field parsing is not needed. If the parser does set *field_width, then gawk uses this layout to parse the input record, and the PROCINFO["FS"] value will be "API" while this record is active in $0. The awk_fieldwidth_info_t data structure is described below.
The return value is the length of the buffer pointed to by *out, or EOF if end-of-file was reached or an error occurred.
It is guaranteed that errcode is a valid pointer, so there is no need to test for a NULL value. gawk sets *errcode to zero, so there is no need to set it unless an error occurs.
If an error does occur, the function should return EOF and set *errcode to a value greater than zero. In that case, if *errcode does not equal zero, gawk automatically updates the ERRNO variable based on the value of *errcode. (In general, setting ‘*errcode = errno’ should do the right thing.)
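The calling convention described above can be illustrated with a small record function. This is a sketch under simplifying assumptions: the whole input is taken to be in memory already (a real parser would read from iobuf->fd), records are terminated by a semicolon, and the names semi_state_t and semi_get_record are hypothetical. The structure is a local stand-in with only the members used here.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>    /* EOF */
#include <string.h>

typedef struct awk_fieldwidth_info awk_fieldwidth_info_t;  /* opaque here */

/* Stand-in with only the members used below. */
typedef struct awk_input {
    int fd;
    void *opaque;
} awk_input_buf_t;

/* Hypothetical state: the whole input already sits in memory. */
typedef struct {
    char *data;
    size_t len;
    size_t pos;
} semi_state_t;

/* Return one ';'-terminated record per call. */
static int semi_get_record(char **out, struct awk_input *iobuf, int *errcode,
                           char **rt_start, size_t *rt_len,
                           const awk_fieldwidth_info_t **field_width)
{
    semi_state_t *st = iobuf->opaque;
    char *start, *end;

    (void) field_width;              /* keep the default field splitting */
    *rt_len = 0;                     /* no terminator unless found below */

    if (st == NULL) {
        *errcode = EINVAL;           /* error: EOF plus a positive errcode */
        return EOF;
    }
    if (st->pos >= st->len)
        return EOF;                  /* plain end-of-file: errcode stays 0 */

    start = st->data + st->pos;
    end = memchr(start, ';', st->len - st->pos);
    if (end != NULL) {
        *rt_start = end;             /* RT is the single ';' */
        *rt_len = 1;
        st->pos = (size_t)(end - st->data) + 1;
    } else {
        end = st->data + st->len;    /* final record has no terminator */
        st->pos = st->len;
    }
    *out = start;                    /* storage stays owned by the parser */
    return (int)(end - start);
}
```

Because gawk copies both the record and the terminator, the function can hand back pointers into its own buffer without allocating anything per record.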
As an alternative to supplying a function that returns an input record, you may instead supply a function that simply reads bytes, and let gawk parse the data into records. If you do so, the data should be returned in the multibyte encoding of the current locale. Such a function should follow the same behavior as the read() system call, and you fill in the read_func pointer with its address in the awk_input_buf_t structure.
By default, gawk sets the read_func pointer to point to the read() system call. So your extension need not set this field explicitly.
NOTE: You must choose one method or the other: either a function that returns a record, or one that returns raw data. In particular, if you supply a function to get a record, gawk will call it, and will never call the raw read function.
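The raw-read alternative might look like the sketch below: a function with the same contract as read() that filters the bytes before gawk sees them. The filter itself (turning NUL bytes into spaces) and the names are hypothetical. Note one deliberate deviation: the declaration shown earlier leaves the parameter list of read_func empty, whereas this stand-in assumes an explicit read()-style prototype so that the fragment also compiles under newer C standards.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */
#include <unistd.h>      /* read() */

typedef enum { awk_false = 0, awk_true } awk_bool_t;

/* Stand-in with only the members used here; the prototype on read_func
   is an assumption of this sketch (see the paragraph above). */
typedef struct awk_input {
    int fd;
    ssize_t (*read_func)(int fd, void *buf, size_t count);
} awk_input_buf_t;

/* Same contract as read(), but NUL bytes reach gawk as spaces.
   The byte count is passed through unchanged. */
static ssize_t nul_to_space_read(int fd, void *buf, size_t count)
{
    ssize_t n = read(fd, buf, count);
    char *p = buf;
    ssize_t i;

    for (i = 0; i < n; i++)     /* skipped when n is 0 (EOF) or -1 (error) */
        if (p[i] == '\0')
            p[i] = ' ';
    return n;
}

static awk_bool_t nul_take_control_of(awk_input_buf_t *iobuf)
{
    iobuf->read_func = nul_to_space_read;   /* get_record is left unset */
    return awk_true;
}
```

With only read_func set, gawk continues to split the byte stream into records and fields itself, using RS and FS as usual.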
gawk ships with a sample extension that reads directories, returning records for each entry in a directory (see section Reading Directories). You may wish to use that code as a guide for writing your own input parser.
When writing an input parser, you should think about (and document) how it is expected to interact with awk code. You may want it to always be called, and to take effect as appropriate (as the readdir extension does). Or you may want it to take effect based upon the value of an awk variable, as the XML extension from the gawkextlib project does (see section The gawkextlib Project). In the latter case, code in a BEGINFILE rule can look at FILENAME and ERRNO to decide whether or not to activate an input parser (see section The BEGINFILE and ENDFILE Special Patterns).
You register your input parser with the following function:
void register_input_parser(awk_input_parser_t *input_parser);
- Register the input parser pointed to by input_parser with gawk.
If you would like to override the default field parsing mechanism for a given record, then you must populate an awk_fieldwidth_info_t structure, which looks like this:
    typedef struct {
        awk_bool_t use_chars;   /* false ==> use bytes */
        size_t nf;              /* number of fields in record (NF) */
        struct awk_field_info {
            size_t skip;        /* amount to skip before field starts */
            size_t len;         /* length of field */
        } fields[1];            /* actual dimension should be nf */
    } awk_fieldwidth_info_t;
The fields are:
awk_bool_t use_chars;
- Set this to awk_true if the field lengths are specified in terms of potentially multi-byte characters, and set it to awk_false if the lengths are in terms of bytes. Performance will be better if the values are supplied in terms of bytes.
size_t nf;
- Set this to the number of fields in the input record, i.e. NF.
struct awk_field_info fields[nf];
- This is a variable-length array whose actual dimension should be nf. For each field, the skip element should be set to the number of characters or bytes, as controlled by the use_chars flag, to skip before the start of this field. The len element provides the length of the field. The values in fields[0] provide the information for $1, and so on through the fields[nf-1] element containing the information for $NF.
A convenience macro awk_fieldwidth_info_size(numfields) is provided to calculate the appropriate size of a variable-length awk_fieldwidth_info_t structure containing numfields fields. This can be used as an argument to malloc() or in a union to allocate space statically. Please refer to the readdir_test sample extension for an example.
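As a rough sketch of the malloc() case, the helper below builds a layout for records made of equal-width fields separated by one byte each. The helper name make_fixed_layout and its layout policy are hypothetical. The macro definition is a local stand-in written from the description above (room for the structure plus the fields beyond the one it already declares); a real extension would take both the type and the macro from gawkapi.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { awk_false = 0, awk_true } awk_bool_t;

typedef struct {
    awk_bool_t use_chars;   /* false ==> use bytes */
    size_t nf;              /* number of fields in record (NF) */
    struct awk_field_info {
        size_t skip;        /* amount to skip before field starts */
        size_t len;         /* length of field */
    } fields[1];            /* actual dimension should be nf */
} awk_fieldwidth_info_t;

/* Stand-in for the convenience macro; assumed, not copied from gawkapi.h. */
#define awk_fieldwidth_info_size(numfields) \
    (sizeof(awk_fieldwidth_info_t) + \
     ((numfields) - 1) * sizeof(struct awk_field_info))

/* Hypothetical helper: nf fields of `width` bytes each, with one
   separator byte skipped before every field except the first. */
static awk_fieldwidth_info_t *make_fixed_layout(size_t nf, size_t width)
{
    awk_fieldwidth_info_t *fw;
    size_t i;

    if (nf == 0)
        return NULL;
    fw = malloc(awk_fieldwidth_info_size(nf));
    if (fw == NULL)
        return NULL;

    fw->use_chars = awk_false;          /* lengths are in bytes */
    fw->nf = nf;
    for (i = 0; i < nf; i++) {
        fw->fields[i].skip = (i == 0) ? 0 : 1;
        fw->fields[i].len = width;      /* fields[0] describes $1 */
    }
    return fw;
}
```

A get_record function would hand such a structure to gawk with something like "if (field_width != NULL) *field_width = fw;" and keep it alive until its next call or until close_func runs, since gawk does not copy it.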