Sed/Text-search-across-multiple-lines
Next: Line length adjustment, Previous: Reverse chars of lines, Up: Examples [Contents][Index]
7.7 Text search across multiple lines
This section uses the N and D commands to search for consecutive words spanning multiple lines. See Multiline techniques.
These examples deal with finding doubled occurrences of words in a document.
Finding doubled words in a single line is easy using GNU grep and similarly with GNU sed:
    $ cat two-cities-dup1.txt
    It was the best of times,
    it was the worst of times,
    it was the the age of wisdom,
    it was the age of foolishness,

    $ grep -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
    it was the the age of wisdom,

    $ grep -n -E '\b(\w+)\s+\1\b' two-cities-dup1.txt
    3:it was the the age of wisdom,

    $ sed -En '/\b(\w+)\s+\1\b/p' two-cities-dup1.txt
    it was the the age of wisdom,

    $ sed -En '/\b(\w+)\s+\1\b/{=;p}' two-cities-dup1.txt
    3
    it was the the age of wisdom,
- The regular expression ‘\b\w+\s+’ searches for a word boundary (‘\b’), followed by one or more word characters (‘\w+’), followed by whitespace (‘\s+’). See regexp extensions.
- Adding parentheses around the ‘(\w+)’ expression creates a subexpression. The regular expression pattern ‘(PATTERN)\s+\1’ defines a subexpression (in the parentheses) followed by a back-reference, separated by whitespace. A successful match means the PATTERN appeared twice in succession. See Back-references and Subexpressions.
- The word-boundary expression (‘\b’) at both ends ensures partial words are not matched (e.g. ‘the then’ is not a desired match).
- The -E option enables extended regular expression syntax, removing the need to add backslashes before the parentheses. See ERE syntax.
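The effect of the word-boundary anchors can be checked with a small experiment. The following is a minimal sketch, assuming GNU sed; the two-line sample text and the variable names are made up for illustration and do not come from the example files above.

```shell
# Hypothetical sample: line 1 has a partial repeat ("the then"),
# line 2 has a genuine doubled word ("the the").
sample='the then current
it was the the age'

# Without \b the back-reference may match the start of a longer word,
# so both lines are reported.
without_b=$(printf '%s\n' "$sample" | sed -En '/(\w+)\s+\1/p')

# With \b at both ends only whole-word repeats are reported.
with_b=$(printf '%s\n' "$sample" | sed -En '/\b(\w+)\s+\1\b/p')

printf 'without word boundaries:\n%s\n\n' "$without_b"
printf 'with word boundaries:\n%s\n' "$with_b"
```

With this input the unanchored expression reports both lines, while the anchored one reports only ‘it was the the age’.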
When the doubled word spans two lines, the above regular expression will not find it, as grep and sed operate line-by-line.
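A quick way to see this limitation is to feed the single-line expression an input where the repeated word is split by a newline. This is only a sketch, assuming GNU sed; the two-line input and the variable names are hypothetical.

```shell
# Hypothetical input: "the" ends the first line and starts the second.
split='it was the
the age of wisdom,'

# Each line is matched on its own, so the doubled word goes unnoticed
# and per_line stays empty.
per_line=$(printf '%s\n' "$split" | sed -En '/\b(\w+)\s+\1\b/p')

if [ -z "$per_line" ]; then
  echo "no doubled word reported"
else
  printf '%s\n' "$per_line"
fi
```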
By using the N and D commands, sed can apply regular expressions to multiple lines (that is, multiple lines are stored in the pattern space, and the regular expression is matched against it):
    $ cat two-cities-dup2.txt
    It was the best of times, it was the
    worst of times, it was the
    the age of wisdom,
    it was the age of foolishness,

    $ sed -En '{N; /\b(\w+)\s+\1\b/{=;p} ; D}' two-cities-dup2.txt
    3
    worst of times, it was the
    the age of wisdom,
- The N command appends the next line to the pattern space (thus ensuring it contains two consecutive lines in every cycle).
- The regular expression uses ‘\s+’ as the word separator, which matches both spaces and newlines.
- If the regular expression matches, the entire pattern space is printed with p, preceded by the line number printed by = (the number of the last line read, here the second line of the pair). No lines are printed by default due to the -n option.
- The D command removes the first line from the pattern space (up to and including the first newline), readying it for the next cycle.
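To see how N and D keep two consecutive lines in the pattern space, the matching step can be replaced with the l command, which prints the pattern space unambiguously (an embedded newline appears as ‘\n’ and the end of the pattern space as ‘$’). This is a minimal sketch, assuming GNU sed; the four-word input and the variable name are made up for illustration.

```shell
# Hypothetical four-line input; 'l' shows the pattern space in each cycle.
window=$(printf '%s\n' one two three four | sed -n 'N;l;D')

printf '%s\n' "$window"
# Prints:
#   one\ntwo$
#   two\nthree$
#   three\nfour$
```

Each input line appears once as the second line of a pair and (except for the last) once as the first, which is why a word repeated across any line break is seen by the regular expression. On the final cycle N finds no further input and sed exits; because of -n, the leftover line is not printed.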
See the GNU coreutils manual for an alternative solution using tr -s and uniq at https://gnu.org/s/coreutils/manual/html_node/Squeezing-and-deleting.html.
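For comparison, the tr and uniq approach can be sketched roughly as follows. This is an approximation written for this section, assuming GNU tr and uniq, and is not a verbatim copy of the coreutils manual's script; the sample text and the variable name are hypothetical.

```shell
# Put every word on its own line (punctuation and blanks become newlines,
# runs of newlines are squeezed), fold to lower case, then report words
# that appear on adjacent lines.
dups=$(printf '%s\n' 'worst of times, it was the' 'the age of wisdom,' \
  | tr -s '[:punct:][:blank:]' '[\n*]' \
  | tr '[:upper:]' '[:lower:]' \
  | uniq -d)

printf '%s\n' "$dups"
# Prints: the
```

Unlike the sed solution, this pipeline reports only the repeated word itself, not the line number or the surrounding text, and it ignores case.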