More CSV (The GNU Awk User’s Guide)
From Get docs
Gawk/docs/latest/More-CSV
Up: Splitting By Content [Contents][Index]
4.7.1 More on CSV Files
Manuel Collado notes that in addition to commas, a CSV field can also contains quotes, that have to be escaped by doubling them. The previously described regexps fail to accept quoted fields with both commas and quotes inside. He suggests that the simplest FPAT
expression that recognizes this kind of fields is /([^,]*)|("([^"]|"")+")/
. He provides the following input data to test these variants:
p,"q,r",s p,"q""r",s p,"q,""r",s p,"",s p,,s
And here is his test program:
BEGIN { fp[0] = "([^,]+)|(\"[^\"]+\")" fp[1] = "([^,]*)|(\"[^\"]+\")" fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")" FPAT = fp[fpat+0] }
{ print "<" $0 ">" printf("NF = %s ", NF) for (i = 1; i <= NF; i++) { printf("<%s>", $i) } print "" }
When run on the third variant, it produces:
$ gawk -v fpat=2 -f test-csv.awk sample.csv -| <p,"q,r",s> -| NF = 3 <p><"q,r"><s> -| <p,"q""r",s> -| NF = 3 <p><"q""r"><s> -| <p,"q,""r",s> -| NF = 3 <p><"q,""r"><s> -| <p,"",s> -| NF = 3 <p><""><s> -| <p,,s> -| NF = 3 <p><><s>