#]
#] *********************
#] "$d_SysMaint""Linux/regular expression notes.txt"
www.BillHowell.ca ?date? initial - waaayyy back <<<20Oct2021

$*.[,\] -> special characters must normally be escaped to match as literal characters, but NOT inside a [list] :

opl_regexpr IS host 'grep "^#] " "$d_SysMaint""Linux/regular expression notes.txt" | sed "s/^#\]/ /" '
opl_find IS host 'grep "^#] " "$d_SysMaint""Linux/find notes.txt" | sed "s/^#\]/ /" '
opl_grep IS host 'grep "^#] " "$d_SysMaint""Linux/grep summary.txt" | sed "s/^#\]/ /" '
opl_sed IS host 'grep "^#] " "$d_SysMaint""Linux/sed summary.txt" | sed "s/^#\]/ /" '
opl_diff IS host 'grep "^#] " "$d_SysMaint""Linux/diff notes.txt" | sed "s/^#\]/ /" '
opl_geany IS host 'grep "^#] " "$d_SysMaint""text processors/geany notes.txt" | sed "s/^#\]/ /" '

24************************24
#************
# Table of Contents, generated with :
# $ grep "^#]" "$d_SysMaint""Linux/regular expression notes.txt" | sed "s/^#\]/ /"

 *********************
 "$d_SysMaint""Linux/regular expression notes.txt"
 [* =>0,\+ =>1, \? =[0,1]] [\{i\}, \{i,j\}, \{i,\}] regexp1[,\|]regexp2 search matches
 [^$] [begin,end] of line
 ^[] to negate list contents (matches NOT in list), start of line if @ beginning
 \'$'\n how-to-insert-a-newline-in-front-of-a-pattern
 \(.\)\{12\} ignore a character sequence of specified length
 [A-Za-z0-9._-] character ranges
 's/^\([0-9]\+\):.*/\1/' first occurrence of a letter
 \+ \? \{i\} \{i,j\} \{i,\} multiple occurrences of [chr, [expr]]
 ^\(\) don't match
 regular expression info for other [systems, software]
    LibreOffice - see "$d_SysMaint"'LibreOffice/LibreOffice notes.txt'
    kwrite - see "$d_SysMaint"'kwrite/0_kwrite notes.txt'
    emacs - see "$d_SysMaint"'text processors/0_emacs codes, reference.txt'

24************************24

#] [* =>0,\+ =>1, \? =[0,1]] [\{i\}, \{i,j\}, \{i,\}] regexp1[,\|]regexp2 search matches
https://www.computerhope.com/unix/used.htm

*
   Matches a sequence of zero or more instances of matches for the preceding regular expression, which must be an ordinary character, a special character preceded by "\", a ".", a grouped regexp (see below), or a bracket expression. As a GNU extension, a postfixed regular expression can also be followed by "*"; for example, a** is equivalent to a*. POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many non-GNU implementations do not support this, and portable scripts should instead use "\*" in these contexts.
#] \+   Like *, but matches one or more. It is a GNU extension.
#] \?   Like *, but only matches zero or one. It is a GNU extension.
   >> 27Dec2019 Howell : great example of plucking out something
   >> cat "$d_invest""PayPal/amount raw.txt" | sed 's/\([^0-9]\+\)\([0-9]\+\)\(\.\)\([0-9]\+\)\(.*\)/\2\3\4/'
   >> This was hard to nail down
#] \{i\}   Like *, but matches exactly i sequences (i is a decimal integer;
#]   for compatibility, you should keep it between 0 and 255, inclusive).
#] \{i,j\}   Matches between i and j, inclusive, sequences.
#] \{i,\}   Matches more than or equal to i sequences.
regexp1\|regexp2
   Matches either regexp1 or regexp2. Use parentheses to use complex alternative regular expressions. The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. This option is a GNU extension.
regexp1regexp2
   Matches the concatenation of regexp1 and regexp2. Concatenation binds more tightly than \|, ^, and $, but less tightly than the other regular expression operators.
\digit
   Matches the digit-th \(...\) parenthesized subexpression in the regular expression. This option is called a back reference. Subexpressions are implicitly numbered by counting occurrences of \( left-to-right.
\n
   Matches the newline character.
\char
   Matches char, where char is one of $, *, ., [, \, or ^. Note that the only C-like backslash sequences that you can portably assume to be interpreted are \n and \\; in particular \t is not portable, and matches a ‘t’ under most implementations of sed, rather than a tab character.

+-----+
#] [^$] [begin,end] of line - good instructions from webpage are posted at end of this file

^ matches start of line
$ matches end-of-line

cat "/media/bill/PROJECTS/a_INNS Lexicom email server/0_email Post-sendout Email responses.txt" | sed 's/\(user unknown.*$\)/\1999\,/' >"/media/bill/ramdisk/0_email Post-sendout Email responses.txt"
>> works well

+-----+
#] ^[] to negate list contents (matches NOT in list), start of line if @ beginning

>> 27Dec2019 Howell : great example of plucking out something
>> cat "$d_invest""PayPal/amount raw.txt" | sed 's/\([^0-9]\+\)\([0-9]\+\)\(\.\)\([0-9]\+\)\(.*\)/\2\3\4/'
>> This was hard to nail down

+-----+
#] \'$'\n how-to-insert-a-newline-in-front-of-a-pattern
# https://stackoverflow.com/questions/723157/how-to-insert-a-newline-in-front-of-a-pattern

sed - insert newline as in
cat "$paper_tbk" | grep "|>" | sed 's/\(.*\)\(<|\)\(.*\)\(|>\)\(.*\)/\3\'$'\n/'

sed wildcard ".*" not "?*"

+-----+
#] \(.\)\{12\} ignore a character sequence of specified length (could use cut here?)
$ cat "/media/bill/PROJECTS/2019 IJCNN Budapest/Publications/CrossCheck/190112 CrossCheck similarities for papers on hand.txt" | sed 's/^\(.\)\{12\}\(.*\)/\2/' >"/media/bill/ramdisk/190118 CrossCheck nodate.txt"

+-----+
#] [A-Za-z0-9._-] character ranges

's/\([^0-9]\+\)\([0-9]\+\)\(\.\)\([0-9]\+\)\(.*\)/\2\3\4/'

from "$d_bin""email - remove emails from text file.sh"
d_remove_emails() {
   if [ -d "$d_withEml" ]; then
      if [ -d "$d_noEmail" ]; then
         find "$d_withEml" -maxdepth 1 -type f -name "*" | sed "s#$d_withEml##" >"$d_temp""d_remove_emails files.txt"
         while read -u 9 line; do
            cat "$d_withEml""$line" | sed 's/[A-Za-z0-9._-]*@[A-Za-z0-9._-]*//g' >"$d_noEmail""$line"
         done 9< "$d_temp""d_remove_emails files.txt"
      else
         echo "directory doesn't exist : $d_noEmail"
      fi
   else
      echo "directory doesn't exist : $d_withEml"
   fi
}

+-----+
#] 's/^\([0-9]\+\):.*/\1/' first occurrence of a letter
23Oct2019 search "Linux grep and how do I locate the first occurrence of a letter?"
I used sed -
let "lineNum=$( echo "$line" | sed 's/^\([0-9]\+\):.*/\1/' )"
>> the sed expression keeps only the leading digits before the first ':' (the line number, for input shaped like 'NN:text')

+-----+
#] \+ \? \{i\} \{i,j\} \{i,\} multiple occurrences of [chr, [expr]]
https://www.computerhope.com/unix/used.htm
\+   Like *, but matches one or more. It is a GNU extension.
\?   Like *, but only matches zero or one. It is a GNU extension.
\{i\}   Like *, but matches exactly i sequences (i is a decimal integer; for compatibility, you should keep it between 0 and 255, inclusive).
\{i,j\}   Matches between i and j, inclusive, sequences.
\{i,\}   Matches more than or equal to i sequences.

08********08
20Oct2021
#] ^\(\) don't match

08********08
11Sep2021 regexp option for words (\b boundaries)
https://stackoverflow.com/questions/1032023/sed-whole-word-search-and-replace

\b in regular expressions matches word boundaries (i.e. the position between a word character and a non-word character):

$ echo "bar embarassment" | sed "s/\bbar\b/no bar/g"
no bar embarassment

edited Jun 23 '09 at 11:54, answered Jun 23 '09 at 11:41, Joakim Lundborg

***********
http://www.computerhope.com/unix/used.htm

Overview Of Regular Expression Syntax

To know how to use sed, you should understand regular expressions ("regexp" for short). A regular expression is a pattern that is matched against a subject string from left to right. Most characters are ordinary: they stand for themselves in a pattern, and match the corresponding characters in the subject. As a simple example, the pattern

   The quick brown fox

matches a portion of a subject string that is identical to itself. The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of special characters, which do not stand for themselves but instead are interpreted in some special way. Here is a brief description of regular expression syntax as used in sed:

char
   A single ordinary character matches itself.
*
   Matches a sequence of zero or more instances of matches for the preceding regular expression, which must be an ordinary character, a special character preceded by "\", a ".", a grouped regexp (see below), or a bracket expression. As a GNU extension, a postfixed regular expression can also be followed by "*"; for example, a** is equivalent to a*. POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many non-GNU implementations do not support this, and portable scripts should instead use "\*" in these contexts.
\+
   Like *, but matches one or more. It is a GNU extension.
\?
   Like *, but only matches zero or one. It is a GNU extension.
\{i\}
   Like *, but matches exactly i sequences (i is a decimal integer; for compatibility, you should keep it between 0 and 255, inclusive).
\{i,j\}
   Matches between i and j, inclusive, sequences.
\{i,\}
   Matches more than or equal to i sequences.
\(regexp\)
   Groups the inner regexp as a whole; this is used to:
      - Apply postfix operators, like \(abcd\)*: this will search for zero or more whole sequences of ‘abcd’, while abcd* would search for ‘abc’ followed by zero or more occurrences of ‘d’. Note that support for \(abcd\)* is required by POSIX 1003.1-2001, but many non-GNU implementations do not support it and hence it is not universally portable.
      - Use back references (see below).
.
   Matches any character, including a newline.
^
   Matches the null string at beginning of the pattern space, i.e. what appears after the ^ must appear at the beginning of the pattern space.
   In most scripts, pattern space is initialized to the content of each line. So, it is a useful simplification to think of ^#include as matching only lines where ‘#include’ is the first thing on the line; if there are spaces before, for example, the match fails. This simplification is valid as long as the original content of pattern space is not modified, for example with an s command.
   ^ acts as a special character only at the beginning of the regular expression or subexpression (that is, after \( or \|). Portable scripts should avoid ^ at the beginning of a subexpression, though, as POSIX allows implementations that treat ^ as an ordinary character in that context.
$
   It is the same as ^, but refers to end of pattern space. $ also acts as a special character only at the end of the regular expression or subexpression (that is, before \) or \|), and its use at the end of a subexpression is not portable.
[list]
[^list]
   Matches any single character in list: for example, [aeiou] matches all vowels. A list may include sequences like char1-char2, which matches any character between char1 and char2. For example, [b-e] matches any of the characters b, c, d, or e.
   A leading ^ reverses the meaning of list, so that it matches any single character not in list.
   To include ] in the list, make it the first character (after the ^ if needed); to include - in the list, make it the first or last; to include ^ put it after the first character.
   The characters $, *, ., [, and \ are normally not special within list. For example, [\*] matches either ‘\’ or ‘*’, because the \ is not special here. However, strings like [.ch.], [=a=], and [:space:] are special within list and represent collating symbols, equivalence classes, and character classes, respectively, and [ is therefore special within list when it is followed by ., =, or :. Also, when not in POSIXLY_CORRECT mode, special escapes like \n and \t are recognized within list. See Escapes for more information.
regexp1\|regexp2
   Matches either regexp1 or regexp2. Use parentheses to use complex alternative regular expressions. The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. This is a GNU extension.
regexp1regexp2
   Matches the concatenation of regexp1 and regexp2. Concatenation binds more tightly than \|, ^, and $, but less tightly than the other regular expression operators.
\digit
   Matches the digit-th \(...\) parenthesized subexpression in the regular expression. This is called a back reference. Subexpressions are implicitly numbered by counting occurrences of \( left-to-right.
\n
   Matches the newline character.
\char
   Matches char, where char is one of $, *, ., [, \, or ^. Note that the only C-like backslash sequences that you can portably assume to be interpreted are \n and \\; in particular \t is not portable, and matches a ‘t’ under most implementations of sed, rather than a tab character.

Note that the regular expression matcher is greedy, i.e., matches are attempted from left to right and, if two or more matches are possible starting at the same character, it selects the longest.

For example:

abcdef
   Matches "abcdef".
a*b
   Matches zero or more "a" characters, followed by a single "b". For example, "b" or "aaaaaaab".
a\?b
   Matches "b" or "ab".
a\+b\+
   Matches one or more "a" characters followed by one or more "b"s. "ab" is the shortest possible match, but other examples are "aaaaab", "abbbbbb", or "aaaaaabbbbbbb".
.* or .\+
   Either of these expressions will match all of the characters in a non-empty string, but only .* will match the empty string.
^main.*(.*)
   This matches a string starting with "main", followed by an opening and closing parenthesis. The "n", "(" and ")" need not be adjacent.
^#
   This matches a string beginning with "#".
\\$
   This matches a string ending with a single backslash. The regexp contains two backslashes for escaping.
\$
   This matches a string consisting of a single dollar sign.
[a-zA-Z0-9]
   In the C locale, this matches any ASCII letters or digits.
[^ tab]\+
   (Here tab stands for a single tab character.) This matches a string of one or more characters, none of which is a space or a tab. Usually this means a word.
^\(.*\)\n\1$
   This matches a string consisting of two equal substrings separated by a newline.
.\{9\}A$
   This matches nine characters followed by an ‘A’.
^.\{15\}A
   This matches the start of a string that contains 16 characters, the last of which is an ‘A’.

*******
#] regular expression info for other [systems, software]
#]    LibreOffice - see "$d_SysMaint"'LibreOffice/LibreOffice notes.txt'
#]    kwrite - see "$d_SysMaint"'kwrite/0_kwrite notes.txt'
#]    emacs - see "$d_SysMaint"'text processors/0_emacs codes, reference.txt'
      'M-x replace-regexp $ ' put str at end of line : select area to do, then invoke code
      ‘C-M-% REGEXP NEWSTRING ’
         Replace some matches for REGEXP with NEWSTRING.
      '$' end of line

# enddoc
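
+-----+
Quick self-check of several patterns recorded in these notes. This is only a minimal sketch: it assumes GNU sed (\+ and \b are GNU extensions), and every sample input string and variable name below is invented for illustration, not taken from the real files mentioned above.

```shell
#!/bin/bash
# Sketch only: sample inputs and variable names are hypothetical; GNU sed is assumed.

# Pluck a decimal amount out of surrounding text (negated list, \+, back references).
amount=$(echo 'Total paid: USD 123.45 (fee incl.)' \
   | sed 's/\([^0-9]\+\)\([0-9]\+\)\(\.\)\([0-9]\+\)\(.*\)/\2\3\4/')

# Keep only the leading digits before the first ':' (input shaped like 'NN:text').
lineNum=$(echo '42:some matching text' | sed 's/^\([0-9]\+\):.*/\1/')

# Drop the first 12 characters of a line, whatever they are (\{12\} repetition).
nodate=$(echo '2019-01-12  paper title' | sed 's/^\(.\)\{12\}\(.*\)/\2/')

# Remove anything that looks like an email address (character ranges in a list).
noEmail=$(echo 'contact jane.doe@example.org today' \
   | sed 's/[A-Za-z0-9._-]*@[A-Za-z0-9._-]*//g')

# Replace a whole word only (\b word boundaries).
wholeWord=$(echo "bar embarassment" | sed "s/\bbar\b/no bar/g")

printf '%s\n' "$amount" "$lineNum" "$nodate" "$noEmail" "$wholeWord"
```

The email removal leaves the spaces on both sides of the address in place, so two adjacent spaces remain in the output. On a non-GNU sed the \+ and \b forms may not work; \{1,\} is the portable spelling of \+.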