Performance (GNU Grep 3.7)

From Get docs
Grep/docs/latest/Performance

Next: Reporting bugs, Previous: Usage, Up: grep   [Contents][Index]



5 Performance

Typically grep is an efficient way to search text. However, it can be quite slow in some cases, and it can search large files where even minor performance tweaking can help significantly. Although the algorithm used by grep is an implementation detail that can change from release to release, understanding its basic strengths and weaknesses can help you improve its performance.

The grep command operates partly via a set of automata that are designed for efficiency, and partly via a slower matcher that takes over when the fast matchers run into unusual features like back-references. When feasible, the Boyer–Moore fast string searching algorithm is used to match a single fixed pattern, and the Aho–Corasick algorithm is used to match multiple fixed patterns.

Generally speaking grep operates more efficiently in single-byte locales, since it can avoid the special processing needed for multi-byte characters. If your patterns will work just as well that way, setting LC_ALL to a single-byte locale can help performance considerably. Setting ‘LC_ALL='C'’ can be particularly efficient, as grep is tuned for that locale.

Outside the ‘C’ locale, case-insensitive search, and search for bracket expressions like ‘[a-z]’ and ‘[[=a=]b]’, can be surprisingly inefficient due to difficulties in fast portable access to concepts like multi-character collating elements.

A back-reference such as ‘\1’ can hurt performance significantly in some cases, since back-references cannot in general be implemented via a finite state automaton, and instead trigger a backtracking algorithm that can be quite inefficient. For example, although the pattern ‘^(.*)\1{14}(.*)\2{13}$’ matches only lines whose lengths can be written as a sum 15x + 14y for nonnegative integers x and y, the pattern matcher does not perform linear Diophantine analysis and instead backtracks through all possible matching strings, using an algorithm that is exponential in the worst case.

On some operating systems that support files with holes—large regions of zeros that are not physically present on secondary storage—grep can skip over the holes efficiently without needing to read the zeros. This optimization is not available if the -a (--binary-files=text) option is used (see File and Directory Selection), unless the -z (--null-data) option is also used (see Other Options).

For more about the algorithms used by grep and about related string matching algorithms, see:

  • Aho AV. Algorithms for finding patterns in strings. In: van Leeuwen J. Handbook of Theoretical Computer Science, vol. A. New York: Elsevier; 1990. p. 255–300. This surveys classic string matching algorithms, some of which are used by grep.
  • Aho AV, Corasick MJ. Efficient string matching: an aid to bibliographic search. CACM. 1975;18(6):333–40. https://dx.doi.org/10.1145/360825.360855. This introduces the Aho–Corasick algorithm.
  • Boyer RS, Moore JS. A fast string searching algorithm. CACM. 1977;20(10):762–72. https://dx.doi.org/10.1145/359842.359859. This introduces the Boyer–Moore algorithm.
  • Faro S, Lecroq T. The exact online string matching problem: a review of the most recent results. ACM Comput Surv. 2013;45(2):13. https://dx.doi.org/10.1145/2431211.2431212. This surveys string matching algorithms that might help improve the performance of grep in the future.



Next: Reporting bugs, Previous: Usage, Up: grep   [Contents][Index]