Character Encoding (GNU Grep 3.7)
3.7 Character Encoding
The LC_CTYPE locale specifies the encoding of characters in patterns and data, that is, whether text is encoded in UTF-8, ASCII, or some other encoding. See Environment Variables.
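As a rough illustration of the locale's effect, the sketch below feeds grep the two bytes that encode ‘é’ in UTF-8 and counts lines that match a one-character and a two-character pattern. It sets LC_ALL rather than LC_CTYPE only because LC_ALL overrides LC_CTYPE when both are set. The variable names are hypothetical, and the locale name en_US.UTF-8 is an assumption: locale names vary between systems, so that part runs only if ‘locale -a’ lists it.

```shell
# "é" is the two bytes 0xC3 0xA9 in UTF-8, written here as octal escapes.
# In the C locale each byte is one character, so the line holds two of them.
# "|| true" keeps the script going when grep -c counts 0 and exits with 1.
one_char_in_c=$(printf '\303\251\n' | LC_ALL=C grep -c '^.$' || true)
two_chars_in_c=$(printf '\303\251\n' | LC_ALL=C grep -c '^..$' || true)
echo "C locale: one-char pattern matched $one_char_in_c line(s)"
echo "C locale: two-char pattern matched $two_chars_in_c line(s)"

# Assumption: a UTF-8 locale named en_US.UTF-8 (or en_US.utf8) is installed.
if locale -a 2>/dev/null | grep -qiE '^en_US\.utf-?8$'; then
  one_char_in_utf8=$(printf '\303\251\n' | LC_ALL=en_US.UTF-8 grep -c '^.$' || true)
  echo "UTF-8 locale: one-char pattern matched $one_char_in_utf8 line(s)"
fi
```

With GNU grep, the C-locale counts should be 0 and 1 respectively. In a UTF-8 locale the same two bytes form a single character, so the one-character pattern is expected to match there.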
In the ‘C’ or ‘POSIX’ locale, every character is encoded as a single byte and every byte is a valid character. In more-complex encodings such as UTF-8, a sequence of multiple bytes may be needed to represent a character, and some bytes may be encoding errors that do not contribute to the representation of any character. POSIX does not specify the behavior of grep when patterns or input data contain encoding errors or null characters, so portable scripts should avoid such usage. As an extension to POSIX, GNU grep treats null characters like any other character. However, unless the -a (--binary-files=text) option is used, the presence of null characters in input or of encoding errors in output causes GNU grep to treat the file as binary and suppress details about matches. See File and Directory Selection.
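The binary-file behavior can be observed with a short experiment. The sketch below assumes GNU grep and uses hypothetical variable names. It pipes in a line that contains a null character and compares a plain search with one that uses -a. The exact wording of the "binary file matches" notice, and whether it goes to standard output or standard error, differs between grep versions, so the sketch checks only whether the matching line itself is printed.

```shell
# Input: one line with a null byte between "foo" and "bar".
# grep -c only counts lines, so it reports the match with or without -a.
count=$(printf 'foo\0bar\n' | grep -c 'foo' || true)

# Without -a, the input is treated as binary and the matching line is withheld.
# Standard error is discarded because newer versions print their notice there.
plain=$(printf 'foo\0bar\n' | grep 'foo' 2>/dev/null | tr -d '\0' || true)

# With -a, the line is printed as text; tr removes the null byte so that the
# result can be stored in a shell variable.
as_text=$(printf 'foo\0bar\n' | grep -a 'foo' | tr -d '\0')

echo "matching lines: $count"
echo "without -a: $plain"
echo "with -a:    $as_text"
```

The match is found in both cases. Only the -a run prints the line's contents.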
Regardless of locale, the 103 characters in the POSIX Portable Character Set (a subset of ASCII) are always encoded as a single byte, and the 128 ASCII characters have their usual single-byte encodings on all but oddball platforms.