| 1 | # main topics: |
|---|
| 2 | UP arb.hlp |
|---|
| 3 | UP glossary.hlp |
|---|
| 4 | |
|---|
| 5 | # sub topics: |
|---|
| 6 | SUB srt.hlp |
|---|
| 7 | SUB aci.hlp |
|---|
| 8 | |
|---|
| 9 | # format described in ../help.readme |
|---|
| 10 | |
|---|
| 11 | |
|---|
| 12 | TITLE Regular Expressions (REG) |
|---|
| 13 | |
|---|
| 14 | OCCURRENCE Many places |
|---|
| 15 | |
|---|
| 16 | SECTION Ways to use regular expressions |
|---|
| 17 | |
|---|
| 18 | There are two ways to use regular expressions: |
|---|
| 19 | |
|---|
| 20 | [1] /Search Regexpr/Replace String/ |
|---|
| 21 | [2] /Search Regexpr/ |
|---|
| 22 | |
|---|
| 23 | [1] searches the input for occurrences of 'Search Regexpr' and |
|---|
| 24 | replaces every occurrence with 'Replace String'. |
|---|
| 25 | |
|---|
| 26 | [2] searches the input for the FIRST occurrence of 'Search |
|---|
| 27 | Regexpr' and returns the found match. |
|---|
| 28 | If nothing matches, it returns an empty string. |
|---|
| 29 | |
|---|
| 30 | Notes: |
|---|
| 31 | |
|---|
| 32 | * You can use regular expressions wherever you can use |
|---|
| 33 | ACI and SRT expressions. |
|---|
| 34 | * In some places only [2] is available (e.g. in Search&Query). |
|---|
| 35 | * By default, regular expressions are case sensitive. To make them |
|---|
| 36 | case insensitive, append an 'i' to the |
|---|
| 37 | expression (i.e. '/expr/i' or '/expr/repl/i'). |
|---|
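The two forms described above can be mimicked outside ARB. The following is a minimal sketch in Python, not ARB's implementation: the helper names `arb_replace` and `arb_match` are hypothetical, and Python's `re` module is not a POSIX engine, although it behaves the same way for simple patterns like these.

```python
import re

def arb_replace(regexpr, repl, text, ignore_case=False):
    """Form [1]: replace every occurrence of regexpr in text by repl."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(regexpr, repl, text, flags=flags)

def arb_match(regexpr, text, ignore_case=False):
    """Form [2]: return the first match, or '' if nothing matches."""
    flags = re.IGNORECASE if ignore_case else 0
    found = re.search(regexpr, text, flags=flags)
    return found.group(0) if found else ""

name = "Pseudomonas pseudoalcaligenes"
replaced = arb_replace("pseu", "PSEU", name, ignore_case=True)  # like '/pseu/PSEU/i'
first = arb_match("pseu", name)                                 # like '/pseu/'
nothing = arb_match("xyz", name)                                # no match at all
```

Here `ignore_case=True` plays the role of the trailing 'i', and `nothing` is the empty string because no part of the input matches.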
| 38 | |
|---|
| 39 | SECTION Syntax of POSIX extended regular expressions as used in ARB |
|---|
| 40 | |
|---|
| 41 | A regular expression specifies a set of character strings, |
|---|
| 42 | e.g. the expression '/pseu/i' specifies all strings containing |
|---|
| 43 | "pseu", "Pseu" or "pSeu" and so on. We say the expression "matches" |
|---|
| 44 | (a part of) these strings. |
|---|
| 45 | |
|---|
| 46 | Several characters have special meanings in regular expressions. |
|---|
| 47 | All other characters just match against themselves. |
|---|
| 48 | |
|---|
| 49 | Special characters: |
|---|
| 50 | |
|---|
| 51 | '.' matches any character (e.g. '/h.s/' matches "has" and "his") |
|---|
| 52 | '[xyz]' matches 'x', 'y' or 'z' |
|---|
| 53 | '[a-z]' matches any lower case letter |
|---|
| 54 | '^' matches the beginning of the string |
|---|
| 55 | (e.g. '/^pseu/i' matches all strings starting with "pseu") |
|---|
| 56 | '$' matches the end of the string |
|---|
| 57 | (e.g. '/cens$/i' matches all strings ending in "cens") |
|---|
| 58 | |
|---|
| 59 | '*' matches the preceding element zero or more times |
|---|
| 60 | (e.g. '/th*is/' matches "tis", "this", "thhhhhhiss", ..) |
|---|
| 61 | '?' matches the preceding element zero or one time |
|---|
| 62 | (e.g. '/th?is/' matches "tis" or "this", but not "thhis") |
|---|
| 63 | '+' matches the preceding element one or more times |
|---|
| 64 | (e.g. '/th+is/' matches "this" or "thhhis", but not "tis") |
|---|
| 65 | '{mi,ma}' matches the preceding element 'mi' to 'ma' times |
|---|
| 66 | (e.g. '/th{2,4}is/' matches "thhis", "thhhis" or "thhhhis") |
|---|
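The repetition operators listed above can be checked with a small script. This is only a sketch using Python's `re` module (an assumption for illustration; ARB itself uses POSIX extended regular expressions), which treats '*', '?', '+' and '{mi,ma}' the same way. The helper `matches` is a hypothetical name.

```python
import re

def matches(regexpr, text):
    """True if regexpr matches (a part of) text."""
    return re.search(regexpr, text) is not None

# '*': zero or more 'h'
star = [matches("th*is", s) for s in ("tis", "this", "thhhhhhis")]
# '?': zero or one 'h'
opt = [matches("th?is", s) for s in ("tis", "this", "thhis")]
# '+': one or more 'h'
plus = [matches("th+is", s) for s in ("this", "thhhis", "tis")]
# '{2,4}': two to four 'h'
rng = [matches("th{2,4}is", s) for s in ("this", "thhis", "thhhhis", "thhhhhis")]
```

Each list holds one truth value per test string, in the order given.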
| 67 | |
|---|
| 68 | '|' marks an alternative |
|---|
| 69 | |
|---|
| 70 | Example: '/bacter|spiri/i' matches all strings containing |
|---|
| 71 | either "bacter" or "spiri". |
|---|
| 72 | |
|---|
| 73 | '()' marks a subexpression. |
|---|
| 74 | |
|---|
| 75 | Subexpressions can be used to separate alternatives or to mark parts |
|---|
| 76 | for reference in the replace expression (see section about |
|---|
| 77 | replacement below). |
|---|
| 78 | |
|---|
| 79 | Examples: |
|---|
| 80 | * '/bact|spiri.*cens/' |
|---|
| 81 | |
|---|
| 82 | matches '/bact/' or '/spiri.*cens/'. |
|---|
| 83 | |
|---|
| 84 | * whereas '/(bact|spiri).*cens/' |
|---|
| 85 | |
|---|
| 86 | matches '/bact.*cens/' or '/spiri.*cens/'. |
|---|
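The effect of the parentheses on the alternative can be seen by running both expressions over the same names. The sketch below uses Python's `re` module as a stand-in for ARB's POSIX engine, and the species-like names are made up for illustration.

```python
import re

flat = "bact|spiri.*cens"       # 'bact' OR 'spiri.*cens'
grouped = "(bact|spiri).*cens"  # ('bact' OR 'spiri') followed by '.*cens'

names = ["bacteria", "spirillum lucens", "bacterium crescens"]

flat_hits = [n for n in names if re.search(flat, n)]
grouped_hits = [n for n in names if re.search(grouped, n)]
```

Without the parentheses every name containing "bact" is accepted; with them, "bacteria" is rejected because no "cens" follows.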
| 87 | |
|---|
| 88 | To match against special characters themselves, escape them |
|---|
| 89 | using a '\' (e.g. '/\*/' matches the character "*", '/\\/' matches "\") |
|---|
| 90 | |
|---|
| 91 | |
|---|
| 92 | Character classes: |
|---|
| 93 | |
|---|
| 94 | [...] is called a character class. It matches against any of the characters |
|---|
| 95 | listed in between the brackets. |
|---|
| 96 | [^...] If the character class starts with '^' it matches against any character |
|---|
| 97 | NOT listed (e.g. '[^78]' matches any character except '7' and '8') |
|---|
| 98 | [5-9] When the character class contains a '-', it will be interpreted as |
|---|
| 99 | "range of characters". Here '5-9' is equivalent to '56789'. |
|---|
| 100 | You may mix ranges and single characters, |
|---|
| 101 | e.g. '14-79' is the same as '145679', '7-91-3' is the same as '789123'. |
|---|
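To see which characters a class with ranges actually covers, one can test every digit against it. This is a sketch in Python (chosen for illustration only; the helper `class_members` is a hypothetical name), whose bracket expressions handle ranges and '^' like the POSIX ones described above.

```python
import re
import string

def class_members(char_class):
    """Return the digits 0-9 that are matched by the given character class."""
    return "".join(d for d in string.digits if re.fullmatch(char_class, d))

plain_range = class_members("[5-9]")
mixed = class_members("[14-79]")
two_ranges = class_members("[7-91-3]")
negated = class_members("[^78]")
```

The digits are reported in ascending order, so `two_ranges` lists the same set as '789123', just sorted.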
| 102 | |
|---|
| 103 | To add special characters to a character class, escape them using '\'. |
|---|
| 104 | |
|---|
| 105 | There are several special predefined character classes like |
|---|
| 106 | * [:alpha:] = [a-zA-Z] |
|---|
| 107 | * [:digit:] = [0-9] |
|---|
| 108 | * [:alnum:] = [[:alpha:][:digit:]] |
|---|
| 109 | * [:punct:] = Punctuation characters |
|---|
| 110 | * [:print:] = Visible characters and the space character |
|---|
| 111 | * [:blank:] = Space and tab |
|---|
| 112 | * [:space:] = Whitespace characters (including newlines) |
|---|
| 113 | * [:cntrl:] = Control characters |
|---|
| 114 | |
|---|
| 115 | Use these inside brackets (e.g. '/[[:cntrl:]]//' will remove all control characters). |
|---|
| 116 | See links below for details. |
|---|
| 117 | |
|---|
| 118 | |
|---|
| 119 | Links: |
|---|
| 120 | |
|---|
| 121 | * A more in-depth explanation of POSIX extended regular expressions can be |
|---|
| 122 | found at LINK{http://en.wikipedia.org/wiki/Regular_expression#POSIX}. |
|---|
| 123 | * Many examples are given in this guide: LINK{http://www.digitalamit.com/article/regular_expression.phtml} |
|---|
| 124 | |
|---|
| 125 | Notes: |
|---|
| 126 | |
|---|
| 127 | * if an expression matches one string at several positions, the leftmost |
|---|
| 128 | match is used, and of the matches starting there the longest one (e.g. '/a+e+/' matches 'aaeee' at position 3 of the |
|---|
| 129 | string 'bbaaeeeffaegg', not 'ae' at position 10). |
|---|
| 130 | |
|---|
| 131 | |
|---|
| 132 | SECTION Special syntax for search and replace |
|---|
| 133 | |
|---|
| 134 | Syntax: '/regexp/replace/' |
|---|
| 135 | |
|---|
| 136 | Every part of the input string matched by 'regexp' gets replaced by 'replace'. |
|---|
| 137 | |
|---|
| 138 | Simple example: |
|---|
| 139 | |
|---|
| 140 | Input string: 'The quick brown fox jumps over the lazy dog' |
|---|
| 141 | Search&replace: '/fox|dog/cat/' |
|---|
| 142 | Result: 'The quick brown cat jumps over the lazy cat' |
|---|
| 143 | |
|---|
| 144 | Additionally the match (or parts of it) can be referenced in the replace string: |
|---|
| 145 | |
|---|
| 146 | \0 refers to the whole match |
|---|
| 147 | \1 refers to the first subexpression |
|---|
| 148 | \2 refers to the second subexpression |
|---|
| 149 | ... |
|---|
| 150 | \9 refers to the ninth subexpression |
|---|
| 151 | |
|---|
| 152 | Example using refs: |
|---|
| 153 | |
|---|
| 154 | Input string: 'The quick brown fox jumps over the lazy dog' |
|---|
| 155 | Search&replace: '/(brown|lazy)\s+(fox|dog)/\2 \1/' |
|---|
| 156 | Result: 'The quick fox brown jumps over the dog lazy' |
|---|
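Both examples above can be reproduced with Python's `re.sub`, shown here as a sketch rather than as ARB's own mechanism. One difference is an assumption to keep in mind: Python accepts '\1' to '\9' in the replacement, but the whole match is written '\g<0>' instead of '\0'.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Simple replacement, like '/fox|dog/cat/'
simple = re.sub("fox|dog", "cat", text)

# Swapping the two subexpressions, like '/(brown|lazy)\s+(fox|dog)/\2 \1/'
swapped = re.sub(r"(brown|lazy)\s+(fox|dog)", r"\2 \1", text)

# Referring to the whole match (ARB: \0, Python: \g<0>)
marked = re.sub("fox|dog", r"<\g<0>>", text)
```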
| 157 | |
|---|
| 158 | WARNINGS The '*' operator also accepts zero repetitions, i.e. an expression |
|---|
| 159 | like '_*' does normally match an empty string (if used w/o context). |
|---|
| 160 | |
|---|
| 161 | This makes some replacements difficult, e.g. if you have data containing |
|---|
| 162 | runs of consecutive underscores and you'd like to replace each run by a single underscore. |
|---|
| 163 | The expression "/_*/_/" does not work as expected and reports |
|---|
| 164 | an error: "regular expression '_*' matched an empty string". |
|---|
| 165 | |
|---|
| 166 | A workaround is the following expression: |
|---|
| 167 | "/(_+)([^_]|$)/_\2/" |
|---|
| 168 | |
|---|
| 169 | Other, simpler workarounds use the BOL/EOL operators ('^'/'$'), |
|---|
| 170 | e.g. to remove all trailing underscores: |
|---|
| 171 | "/_*$//" |
|---|
| 172 | |
|---|
| 173 | Or all leading underscores: |
|---|
| 174 | "/^_*//" |
|---|
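The empty-match problem and the workarounds above can be tried out in Python. This is a sketch only: Python's `re.sub` does not raise ARB's "matched an empty string" error, so the first line merely shows that '_*' finds an empty match at the start of the input. The sample strings are made up.

```python
import re

# '_*' alone matches the empty string before the first character
empty = re.search("_*", "a__b").group(0)

# Collapse every run of underscores, like "/(_+)([^_]|$)/_\2/"
collapsed = re.sub(r"(_+)([^_]|$)", r"_\2", "a__b___c_")

# Remove trailing underscores, like "/_*$//"
no_trailing = re.sub(r"_*$", "", "abc___")

# Remove leading underscores, like "/^_*//"
no_leading = re.sub(r"^_*", "", "__abc")
```

In the collapsing expression the second subexpression captures the character that ends the run (or nothing at the end of the string), and '\2' puts it back after the single underscore.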
| 175 | |
|---|
| 176 | BUGS No bugs known |
|---|
| 177 | |
|---|