1 | #Please insert up references in the next lines (line starts with keyword UP) |
---|
2 | UP arb.hlp |
---|
3 | UP glossary.hlp |
---|
4 | |
---|
5 | #Please insert subtopic references (line starts with keyword SUB) |
---|
6 | SUB srt.hlp |
---|
7 | SUB aci.hlp |
---|
8 | |
---|
9 | # Hypertext links in helptext can be added like this: LINK{ref.hlp|http://add|bla@domain} |
---|
10 | |
---|
11 | #************* Title of helpfile !! and start of real helpfile ******** |
---|
12 | TITLE Regular Expressions (REG) |
---|
13 | |
---|
14 | OCCURRENCE Many places |
---|
15 | |
---|
16 | SECTION Ways to use regular expressions |
---|
17 | |
---|
18 | There are two ways to use regular expressions: |
---|
19 | |
---|
20 | [1] /Search Regexpr/Replace String/ |
---|
21 | [2] /Search Regexpr/ |
---|
22 | |
---|
23 | [1] searches the input for occurrences of 'Search Regexpr' and |
---|
24 | replaces every occurrence with 'Replace String'. |
---|
25 | |
---|
26 | [2] searches the input for the FIRST occurrence of 'Search |
---|
27 | Regexpr' and returns the found match. |
---|
28 | If nothing matches, it returns an empty string. |
---|
29 | |
---|
30 | Notes: |
---|
31 | |
---|
32 | * You can use regular expressions wherever you can use |
---|
33 | ACI and SRT expressions. |
---|
34 | * In some places only [2] is available (e.g. in Search&Query). |
---|
35 | * By default regular expressions are case-sensitive. To make them |
---|
36 | case-insensitive, simply append an 'i' to the |
---|
37 | expression (i.e. '/expr/i' or '/expr/repl/i'). |
---|
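The two forms and the 'i' suffix can be imitated outside ARB. The following is a minimal sketch in Python, assuming Python's 're' module as a stand-in for the POSIX engine used by ARB (the two differ in some details). The helper 'apply_arb_regex' is hypothetical and not part of ARB; it does not handle escaped '/' characters.

```python
import re

def apply_arb_regex(command, text):
    """Apply an ARB-style '/search/' or '/search/replace/' command to text.

    Hypothetical helper for illustration; escaped slashes are not supported.
    """
    if not command.startswith("/"):
        raise ValueError("command must start with '/'")
    parts = command[1:].split("/")
    # The last part is whatever follows the final '/': '' or the 'i' flag.
    tail = parts.pop()
    if tail not in ("", "i"):
        raise ValueError("unknown suffix: " + tail)
    flags = re.IGNORECASE if tail == "i" else 0
    if len(parts) == 1:
        # Form [2]: return the first match, or '' if nothing matches.
        found = re.search(parts[0], text, flags)
        return found.group(0) if found else ""
    if len(parts) == 2:
        # Form [1]: replace every occurrence.
        return re.sub(parts[0], parts[1], text, flags=flags)
    raise ValueError("expected '/search/' or '/search/replace/'")

print(apply_arb_regex("/pseu/i", "Pseudomonas putida"))  # form [2], ignoring case
print(apply_arb_regex("/fox|dog/cat/", "fox and dog"))   # form [1]
```

Returning an empty string for "no match" mirrors the behaviour described for form [2].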
38 | |
---|
39 | SECTION Syntax of POSIX extended regular expressions as used in ARB |
---|
40 | |
---|
41 | A regular expression specifies a set of character strings, |
---|
42 | e.g. the expression '/pseu/i' specifies all strings containing |
---|
43 | "pseu", "Pseu" or "pSeu" and so on. We say the expression "matches" |
---|
44 | (a part of) these strings. |
---|
45 | |
---|
46 | Several characters have special meanings in regular expressions. |
---|
47 | All other characters just match against themselves. |
---|
48 | |
---|
49 | Special characters: |
---|
50 | |
---|
51 | '.' matches any character (e.g. '/h.s/' matches "has" and "his") |
---|
52 | '[xyz]' matches 'x', 'y' or 'z' |
---|
53 | '[a-z]' matches all lower case letters |
---|
54 | '^' matches the beginning of the string |
---|
55 | (e.g. '/^pseu/i' matches all strings starting with "pseu") |
---|
56 | '$' matches the end of the string |
---|
57 | (e.g. '/cens$/i' matches all strings ending in "cens") |
---|
58 | |
---|
59 | '*' matches the preceding element zero or more times |
---|
60 | (e.g. '/th*is/' matches "tis", "this", "thhhhhhis", ...) |
---|
61 | '?' matches the preceding element zero or one time |
---|
62 | (e.g. '/th?is/' matches "tis" or "this", but not "thhis") |
---|
63 | '+' matches the preceding element one or more times |
---|
64 | (e.g. '/th+is/' matches "this" or "thhhis", but not "tis") |
---|
65 | '{mi,ma}' matches the preceding element between mi and ma times |
---|
66 | (e.g. '/th{2,4}is/' matches "thhis", "thhhis" or "thhhhis") |
---|
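The quantifier examples above can be checked mechanically. This is a small sketch in Python, assuming Python's 're' module behaves like ARB's engine for these simple patterns. The patterns are anchored with '^' and '$' so that the whole string is tested; 'matches' and 'EXAMPLES' are names made up for this illustration.

```python
import re

def matches(pattern, text):
    """Return True if pattern matches (a part of) text."""
    return re.search(pattern, text) is not None

# Each entry: pattern, strings that should match, strings that should not.
EXAMPLES = [
    (r"^h.s$",       ["has", "his"],                 ["hs"]),
    (r"^th*is$",     ["tis", "this", "thhhhhhis"],   ["thus"]),
    (r"^th?is$",     ["tis", "this"],                ["thhis"]),
    (r"^th+is$",     ["this", "thhhis"],             ["tis"]),
    (r"^th{2,4}is$", ["thhis", "thhhis", "thhhhis"], ["this", "thhhhhis"]),
]

# Collect every (pattern, string) pair that behaves differently than listed.
failures = []
for pattern, good, bad in EXAMPLES:
    failures += [(pattern, s) for s in good if not matches(pattern, s)]
    failures += [(pattern, s) for s in bad if matches(pattern, s)]

print(failures)
```

An empty list means every example behaved as described.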
67 | |
---|
68 | '|' marks an alternative |
---|
69 | (e.g. '/bacter|spiri/i' matches all strings containing "bacter" or "spiri") |
---|
70 | |
---|
71 | '()' marks a subexpression. Subexpressions can be used to separate alternatives |
---|
72 | or to mark parts for use in the replace expression (see below). |
---|
73 | |
---|
74 | (e.g. '/bact|spiri.*cens/' matches '/bact/' or '/spiri.*cens/', |
---|
75 | whereas '/(bact|spiri).*cens/' matches '/bact.*cens/' or '/spiri.*cens/') |
---|
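The effect of the parentheses can be seen by running both expressions over the same strings. A short sketch in Python follows, assuming Python's 're' module as a stand-in for ARB's engine; the test strings are invented for this illustration.

```python
import re

# Made-up test strings, not real ARB data.
names = ["bact", "spiri-x-cens", "bact-x-cens", "spiri"]

# Without parentheses the alternatives are 'bact' and 'spiri.*cens'.
ungrouped = [n for n in names if re.search(r"bact|spiri.*cens", n)]

# With parentheses '.*cens' has to follow either alternative.
grouped = [n for n in names if re.search(r"(bact|spiri).*cens", n)]

print(ungrouped)
print(grouped)
```

The string "bact" alone is accepted only by the ungrouped form, because there 'bact' is a complete alternative.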
76 | |
---|
77 | To match against special characters themselves, escape them |
---|
78 | using a '\' (e.g. '/\*/' matches the character "*", '/\\/' matches "\") |
---|
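Escaping can be tried out as follows. This is a sketch in Python, assuming Python's 're' module, which uses the same backslash convention for these cases. 're.escape' is a Python convenience function and has no counterpart in ARB.

```python
import re

# A backslash makes the special character '*' literal.
star = re.search(r"\*", "2*3")

# Two backslashes in the pattern match one backslash in the text.
backslash = re.search(r"\\", "a\\b")

# re.escape() adds the needed backslashes to an arbitrary literal string.
escaped = re.escape("1.5*x")
literal_hit = re.search(escaped, "y = 1.5*x")
literal_miss = re.search(escaped, "y = 1x5*x")
```

Here 'literal_miss' finds nothing because the escaped '.' no longer matches an arbitrary character.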
79 | |
---|
80 | |
---|
81 | Character classes: |
---|
82 | |
---|
83 | [...] is called a character class. It matches against any of the characters |
---|
84 | listed in between the brackets. |
---|
85 | [^...] If the character class starts with '^' it matches against any character |
---|
86 | NOT listed (e.g. '[^78]' matches all but '7' or '8') |
---|
87 | [5-9] If the character class contains a '-' it is interpreted as a "range of characters". |
---|
88 | Here '5-9' is equivalent to '56789'. |
---|
89 | You may mix ranges and single characters, e.g. '14-79' is the same as '145679', |
---|
90 | '7-91-3' is the same as '789123'. |
---|
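The range and negation rules can be verified by listing which digits a character class accepts. A minimal sketch in Python, assuming Python's 're' module treats these simple classes like ARB does; 'chars_matched' is a hypothetical helper.

```python
import re

def chars_matched(char_class, alphabet="0123456789"):
    """Return the characters of alphabet that the character class accepts."""
    return "".join(c for c in alphabet if re.fullmatch(char_class, c))

print(chars_matched("[5-9]"))
print(chars_matched("[14-79]"))
print(chars_matched("[7-91-3]"))
print(chars_matched("[^78]"))
```

The helper reports the accepted characters in the order of the alphabet, so '[7-91-3]' is listed as the digits 1 to 3 followed by 7 to 9, which is the same set as '789123'.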
91 | |
---|
92 | To add special characters to a character class, escape them using '\'. |
---|
93 | |
---|
94 | There are several special predefined character classes like |
---|
95 | * [:alpha:] = [a-zA-Z] |
---|
96 | * [:digit:] = [0-9] |
---|
97 | * [:alnum:] = [[:alpha:][:digit:]] |
---|
98 | * [:punct:] = Punctuation characters |
---|
99 | * [:print:] = Visible characters and the space character |
---|
100 | * [:blank:] = Space and tab |
---|
101 | * [:space:] = Whitespace characters (including newlines) |
---|
102 | * [:cntrl:] = Control characters |
---|
103 | |
---|
104 | Use these inside brackets (e.g. '/[[:cntrl:]]//' will remove all control characters). |
---|
105 | See links below for details. |
---|
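Not every regular expression engine knows the predefined classes. Python's 're' module, for example, does not, so the sketch below expands them into explicit ASCII ranges first. The table 'POSIX_CLASSES' and the helper 'expand_posix_classes' are assumptions made for this illustration; the ranges are ASCII approximations and ignore locale settings.

```python
import re
import string

# Approximate ASCII equivalents, written as the inside of a bracket expression.
POSIX_CLASSES = {
    "[:alpha:]": "a-zA-Z",
    "[:digit:]": "0-9",
    "[:alnum:]": "a-zA-Z0-9",
    "[:punct:]": re.escape(string.punctuation),
    "[:print:]": r"\x20-\x7e",
    "[:blank:]": r" \t",
    "[:space:]": r" \t\n\r\f\v",
    "[:cntrl:]": r"\x00-\x1f\x7f",
}

def expand_posix_classes(pattern):
    """Replace POSIX class names by explicit ASCII ranges (hypothetical helper)."""
    for name, chars in POSIX_CLASSES.items():
        pattern = pattern.replace(name, chars)
    return pattern

# Equivalent of '/[[:cntrl:]]//': remove all control characters.
cleaned = re.sub(expand_posix_classes("[[:cntrl:]]"), "", "a\tb\x07c")
print(cleaned)
```

Because the class names are replaced by plain text, they still have to be used inside brackets, exactly as described above.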
106 | |
---|
107 | |
---|
108 | Links: |
---|
109 | |
---|
110 | * A more in-depth explanation of POSIX extended regular expressions can be |
---|
111 | found at LINK{http://en.wikipedia.org/wiki/Regular_expression#POSIX}. |
---|
112 | * Many examples are given in this guide: LINK{http://www.digitalamit.com/article/regular_expression.phtml} |
---|
113 | |
---|
114 | Notes: |
---|
115 | |
---|
116 | * if an expression matches one string multiple times, the leftmost match is |
---|
117 | used and, at that position, the longest one (e.g. '/a+e+/' matches 'aaeee' at |
---|
118 | position 3 of the string 'bbaaeeeffaegg', not 'ae' at position 10). |
---|
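The rule about which match is used can be illustrated with the string 'bbaaeeeffaegg' and the expression '/a+e+/'. The sketch below uses Python's 're' module, which is an assumption: for this pattern its greedy quantifiers give the same result as the POSIX rule, but for alternatives such as 'a|ab' Python takes the first alternative that fits instead of the longest one.

```python
import re

text = "bbaaeeeffaegg"
found = re.search(r"a+e+", text)

# Python counts positions from 0, so position 3 in the text is index 2 here.
position = found.start() + 1
matched = found.group(0)

# All non-overlapping matches, from left to right.
all_matches = re.findall(r"a+e+", text)

print(position, matched, all_matches)
```

The second match 'ae' exists, but a plain search never reports it because a match further left is available.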
119 | |
---|
120 | |
---|
121 | SECTION Special syntax for search and replace |
---|
122 | |
---|
123 | Syntax: '/regexp/replace/' |
---|
124 | |
---|
125 | The part of the input string matched by 'regexp' gets replaced by 'replace'. |
---|
126 | |
---|
127 | Simple example: |
---|
128 | |
---|
129 | Input string: 'The quick brown fox jumps over the lazy dog' |
---|
130 | Search&replace: '/fox|dog/cat/' |
---|
131 | Result: 'The quick brown cat jumps over the lazy cat' |
---|
132 | |
---|
133 | Additionally the match (or parts of it) can be referenced in the replace string: |
---|
134 | |
---|
135 | \0 refers to the whole match |
---|
136 | \1 refers to the first subexpression |
---|
137 | \2 refers to the second subexpression |
---|
138 | ... |
---|
139 | \9 refers to the ninth subexpression |
---|
140 | |
---|
141 | Example using refs: |
---|
142 | |
---|
143 | Input string: 'The quick brown fox jumps over the lazy dog' |
---|
144 | Search&replace: '/(brown|lazy)\s+(fox|dog)/\2 \1/' |
---|
145 | Result: 'The quick fox brown jumps over the dog lazy' |
---|
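Both examples can be reproduced outside ARB. A minimal sketch in Python follows, assuming Python's 're' module as a stand-in for ARB's engine. Note one difference: Python writes the whole match as '\g<0>', because a plain '\0' is not a backreference in Python's replacement strings.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Simple replacement of every match.
simple = re.sub(r"fox|dog", "cat", text)

# \1 and \2 refer to the first and second subexpression.
swapped = re.sub(r"(brown|lazy)\s+(fox|dog)", r"\2 \1", text)

# The whole match, spelled \g<0> in Python (\0 in ARB).
bracketed = re.sub(r"fox|dog", r"[\g<0>]", text)

print(simple)
print(swapped)
print(bracketed)
```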
146 | |
---|
147 | WARNINGS POSIX extended regular expressions always use the leftmost match, even if it is empty, |
---|
148 | i.e. an expression like '_*' normally matches an empty string (if used w/o context). |
---|
149 | |
---|
150 | This makes some replacements difficult, e.g. if you have data containing |
---|
151 | runs of consecutive underscores and you'd like to replace each run by a single underscore. |
---|
152 | The expression "/_*/_/" does not work as expected and reports |
---|
153 | an error: "regular expression '_*' matched an empty string". |
---|
154 | |
---|
155 | A workaround is the following expression: |
---|
156 | "/(_+)([^_]|$)/_\2/" |
---|
157 | |
---|
158 | Other, simpler workarounds use the BOL/EOL operators ('^'/'$'), |
---|
159 | e.g. to remove all trailing underscores: |
---|
160 | "/_*$//" |
---|
161 | |
---|
162 | Or all leading underscores: |
---|
163 | "/^_*//" |
---|
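The empty match and the three workarounds can be tried out as follows. This is a sketch in Python, assuming Python's 're' module as a stand-in for ARB's engine. Unlike ARB, Python does not report an error for empty matches, so only the workarounds are wrapped into (hypothetically named) helper functions.

```python
import re

# '_*' is already satisfied by an empty match at the very start of the text.
empty = re.search(r"_*", "ab__c")

def squeeze_underscores(text):
    """Collapse each run of underscores into a single one."""
    return re.sub(r"(_+)([^_]|$)", r"_\2", text)

def strip_trailing_underscores(text):
    """Remove all underscores at the end of the text."""
    return re.sub(r"_*$", "", text)

def strip_leading_underscores(text):
    """Remove all underscores at the beginning of the text."""
    return re.sub(r"^_*", "", text)

print(repr(empty.group(0)), empty.start())
print(squeeze_underscores("a__b___c_"))
print(strip_trailing_underscores("abc___"))
print(strip_leading_underscores("__abc"))
```

In the first workaround the second subexpression captures the character following the run (or nothing at the end of the text), and '\2' puts it back after the single underscore.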
164 | |
---|
165 | BUGS No bugs known |
---|
166 | |
---|