Difference between revisions of "Regular expressions"

From TED Notepad

Revision as of 13:23, 30 September 2013

This section is up to date for TED Notepad version 6.3.1.0.
Basics and escape sequences

The following constructs match specific characters at the positions where they are encountered.

.
Matches any single character (except for newline).


\w
Matches any word character.
\W
Matches any non-word character.
\s
Matches any white-space character.
\S
Matches any non-white-space character.
\d
Matches any digit character.
\D
Matches any non-digit character.


\n
Matches one newline sequence (CR, NL or CR/NL).
\t
Matches a horizontal tab character (TAB).
\f
Matches a form feed character (FF).
\a
Matches a bell character (BEL).
\v
Matches a vertical tab (VT).
\e
Matches an escape character (ESC).
\0
Matches a null character (NUL).


\xhh
Matches a character in hex notation.
\uhhhh
Matches a character in Unicode notation (Unicode version only).
\cA
Matches a character in control notation.


\Q ... \E
Quoted string. Anything between \Q and \E is treated as plain-text string and is matched exactly as it appears in the pattern.
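
The shorthand classes above can be tried out in a short sketch. It uses Python's re module as a stand-in, which shares \w, \d and \s with the syntax described here but is not TED Notepad's engine; the sample text and variable names are made up for illustration. Python has no \Q ... \E, so re.escape() is used here as a rough equivalent of a quoted string.

```python
import re

sample = "Order 66:\tship 3 crates"  # hypothetical sample text

digits = re.findall(r"\d+", sample)   # runs of digit characters
words = re.findall(r"\w+", sample)    # runs of word characters
gaps = re.findall(r"\s", sample)      # single white-space characters

# Rough analogue of \Q...\E: escape a literal so it is matched as plain text.
literal = re.escape("a.c")
quoted_hits = [s for s in ["a.c", "abc"] if re.fullmatch(literal, s)]

print(digits)       # ['66', '3']
print(words)        # ['Order', '66', 'ship', '3', 'crates']
print(len(gaps))    # 4
print(quoted_hits)  # ['a.c']
```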
Intervals and repetitions

In order to match more than one character in a row by a single construct, regular expressions offer quantifiers.

A quantifier may appear after any supported construct or after any single character. After a construct, it quantifies the entire construct (e.g. an entire quoted string or group). After an ordinary character, it quantifies that character only.

*
Match the construct 0 or more times.
+
Match the construct 1 or more times.
?
Match the construct 0 or 1 times.
{n}
Match the construct exactly n times.
{n,}
Match the construct at least n times.
{n,m}
Match the construct at least n, but no more than m times.

Any of these quantifiers may be followed by an interval modifier. An interval modifier must appear directly after a quantifier to be recognized.

?
Use lazy matching instead of greedy matching.
  • Greedy matching is the default, does not need any modifiers, and matches as much as possible.
  • Lazy matching makes the quantified construct match as little as possible.
+
Use possessive greedy interval matching.
  • By default, when a quantified construct does not allow the rest of the pattern to match, back-tracking is used to find out how much that construct should match to allow the entire pattern to match. Possessive matching means no such back-tracking. Once a match is established by the construct, it can only be back-tracked as a whole.
  • Note: There is no lazy possessive interval matching, since that would not make much sense.
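
The difference between greedy and lazy matching is easiest to see side by side. The following is a minimal sketch in Python's re module, used only as a stand-in for the quantifier syntax described above; the sample strings are made up. Possessive quantifiers are left out because Python supports them only from version 3.11 on.

```python
import re

text = "<b>bold</b>"  # hypothetical sample text

# Greedy: .+ takes as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? takes as little as possible, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", text).group()

# {n,m}: whole words made of two or three 'a' characters.
counted = re.findall(r"\ba{2,3}\b", "a aa aaa aaaa")

print(greedy)   # <b>bold</b>
print(lazy)     # <b>
print(counted)  # ['aa', 'aaa']
```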
Zero-length assertions

The following zero-length pattern conditions do not match any specific characters; they only assert that a specific condition is met at the position at which they are encountered.

^
Matches only at line beginnings.
$
Matches only at line ends.
\p
Matches only at paragraph beginnings.
\P
Matches only at paragraph ends.
\A
Matches only at document beginning.
\Z
Matches only at document end.


\b
Matches only at a word boundary, i.e. one of the characters around the current matching position must be a word character and the other must not.
\B
Matches only inside a word, i.e. both characters around the current matching position must be word characters.
\y
Matches only at a word beginning, i.e. the second of the characters around the current matching position must be a word character and the first one must not.
\Y
Matches only at a word end, i.e. the first of the characters around the current matching position must be a word character and the second one must not.


\G
Matches only at the original starting position, i.e. the position within the document where the search started. The starting position is usually the caret position before the search, which is also indicated by the Status Bar.
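
Because assertions consume no characters, their effect shows best in the positions they select. The sketch below uses Python's re module as a stand-in and is not TED Notepad's engine: Python has no \y, \Y, \p, \P or \G, so word beginnings and word ends are emulated here with \b plus a look-around, which is an assumption of this example rather than syntax from this manual. The sample strings are made up.

```python
import re

lines = "one two\nthree four"  # hypothetical two-line document

line_starts = re.findall(r"^\w+", lines, flags=re.MULTILINE)  # first word of each line
line_ends = re.findall(r"\w+$", lines, flags=re.MULTILINE)    # last word of each line
doc_start = re.findall(r"\A\w+", lines)                       # first word of the document
doc_end = re.findall(r"\w+\Z", lines)                         # last word of the document

words = "cat concat"
whole_word = [m.start() for m in re.finditer(r"\bcat\b", words)]

# Emulated word beginning (\y) and word end (\Y): a boundary with a word
# character after it, or before it, respectively.
word_begins = [m.start() for m in re.finditer(r"\b(?=\w)", words)]
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", words)]

print(line_starts, line_ends)  # ['one', 'three'] ['two', 'four']
print(doc_start, doc_end)      # ['one'] ['four']
print(whole_word)              # [0]
print(word_begins, word_ends)  # [0, 4] [3, 10]
```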
Alternations

Alternations allow the user to combine two or more separate patterns into one bigger pattern, which matches whatever any of those separate patterns would match. Alternations thus provide a logical or in the pattern syntax: the bigger pattern matches either what one of the separate patterns would match or what another would match.

|
Divides pattern alternations. As long as any of the alternations matches, the entire pattern matches.
  • Example: abc|xy can match either abc or xy.
  • Note: Enclosing groups always demarcate borders for alternations.
    • Example: (abc|xy)2 can match either abc2 or xy2.
    • Example: (abc|xy)2|uw can match either abc2 or xy2 or uw.
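
The grouping examples above can be checked with a short sketch. It uses Python's re module as a stand-in for the alternation syntax; the list of candidate strings is made up for illustration.

```python
import re

# The group limits the first alternation to abc|xy; the top-level | then
# offers uw as a separate alternative.
pattern = re.compile(r"(abc|xy)2|uw")

candidates = ["abc2", "xy2", "uw", "abc", "xy2uw", "2"]
accepted = [s for s in candidates if pattern.fullmatch(s)]

print(accepted)  # ['abc2', 'xy2', 'uw']
```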
Character classes

Character classes can be used to define special sets of characters. While regular expressions offer some common character sets as part of the language (e.g. \w or \W, \d or \D, etc.), it is also possible to define custom sets. For example, [abz] creates a user-defined character class which includes only the letters a, b and z. This user-defined set of characters matches exactly one of these three letters, very much like \d matches one single digit.

The following syntax is used to specify a user-defined character class within patterns:

[
Opens character class definition.
  • Each opened character class definition requires a closing ] bracket as described below.
  • Everything up to the closing ] bracket is considered part of the character class definition and must comply with the character class definition syntax described in this section.
  • Note that the syntax for entering character classes differs substantially from pattern syntax. For example, a dot . has no special meaning inside a character class definition, while within a pattern it matches almost anything.
]
Closes character class definition.
  • Note: If a ] bracket appears directly after the opening [ bracket of a new character class definition (or directly after the negating ^), it is considered an ordinary ] character.
    • Pattern []] is equivalent to [\]] and can be used to match a single ]. Pattern [^]] is equivalent to [^\]] and can be used to match any single character but ].
    • Note that [] and [^] are both invalid patterns. Both are considered to contain an unclosed character class definition.
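
The bracket rules above can be tried in a short sketch. It uses Python's re module as a stand-in, which happens to treat a leading ] inside a class the same way; this is not TED Notepad's engine, and the sample strings are made up.

```python
import re

abz = re.findall(r"[abz]", "jazz band")     # only the letters a, b and z

only_bracket = re.findall(r"[]]", "a]b]")   # a leading ] is an ordinary character
not_bracket = re.findall(r"[^]]", "a]b]")   # any single character except ]

# Inside a class, a dot has no special meaning and matches only itself.
dot_literal = re.findall(r"[.]", "a.b")

print(abz)           # ['a', 'z', 'z', 'b', 'a']
print(only_bracket)  # [']', ']']
print(not_bracket)   # ['a', 'b']
print(dot_literal)   # ['.']
```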


Any of these constructs may appear in the character class.

This section is incomplete and wants to be finished later.
TODO:
 ^      character class negation (must follow class opening)
 -      character range (only when used "correctly")
 -      ordinary dash (whenever could not be treated as range)
 \t     horizontal tab character (TAB)
 \f     form feed character (FF)
 \a     bell character (BEL)
 \v     vertical tab (VT)
 \e     escape character (ESC)
 \b     backspace character (BS)
 \n     line feed newline character (NL)
 \r     carriage return character (CR)
 \0     null (NUL)
 \x     character in hex notation
 \u     character in unicode notation (unicode version only)
 \c     character in control notation
 \\     escape for \
 \^     escape for ^
 \[     escape for [
 \]     escape for ]
 \-     escape for -
 \:     escape for :
 \w     any word character
 \W     any non-word character
 \s     any white-space character
 \S     any non-white-space character
 \d     any digit character
 \D     any non-digit character
 [:alpha:]   class function: alphas
 [:alnum:]   class function: alphanums
 [:blank:]   class function: blank chars
 [:cntrl:]   class function: control chars
 [:digit:]   class function: digits
 [:graph:]   class function: graphs
 [:lower:]   class function: lowercase chars
 [:print:]   class function: printable chars
 [:punct:]   class function: punctuations
 [:space:]   class function: white-spaces
 [:upper:]   class function: uppercase chars
 [:xdigi:]   class function: hex digits
Capture groups
(
Begins a new capture group. Capture groups are useful for back-references in both search and replace patterns.
)
Ends current capture group. Note: Capture groups can be nested.


\1, \2, ..., \9
Back-reference to a specific captured group. Matches exactly the same text as was previously matched by the given capture group. Note that this does not try to repeat the sub-pattern within that capture group, but matches against the specific text matched by that group.
  • A capture group and a back-reference to it may be iterated several times during matching when an enclosing group is quantified by an interval. In such a case, the back-reference always matches against the most recent text captured by the capture group.
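
Both points, matching the captured text rather than the sub-pattern and using the most recent capture inside a repeated group, can be illustrated with a minimal sketch. It uses Python's re module as a stand-in for the \1 syntax; the sample strings are made up.

```python
import re

# \1 must repeat the exact text captured by group 1, so this finds doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it was was")

# Inside a repeated enclosing group, \1 follows the most recent capture:
# first 'a' then 'a', then 'b' then 'b'.
pairs_ok = re.fullmatch(r"(?:(a|b)\1)+", "aabb") is not None

# 'ab' is not a repetition of the captured text, so this does not match.
pairs_bad = re.fullmatch(r"(?:(a|b)\1)+", "abab") is not None

print(doubled)    # ['is', 'was']
print(pairs_ok)   # True
print(pairs_bad)  # False
```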
Cluster groups
(?:pattern)
Cluster group. Cluster groups are non-capturing groups. They act like capturing groups, but do not consume resources for capturing and do not consume capture group numbers for back-referencing.
(?|pattern)
Branch reset cluster group. Inside a branch reset group, capture groups are numbered from the same starting group number in each alternation. Thus several capture groups are assigned the same group number, and then, depending on which alternation actually matches, the number references the correct matching capture group.
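
The effect of a cluster group on group numbering can be seen in a short sketch. It uses Python's re module as a stand-in for the (?:pattern) syntax. Branch reset groups are not shown, because Python's re module does not support (?|pattern).

```python
import re

# With a capturing group, (ab) takes group number 1 and (c) becomes number 2.
capturing = re.match(r"(ab)+(c)", "ababc").groups()

# With a cluster group, (?:ab) takes no number, so (c) is group number 1.
clustered = re.match(r"(?:ab)+(c)", "ababc").groups()

print(capturing)  # ('ab', 'c')
print(clustered)  # ('c',)
```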
Other groups
(?>pattern)
Possessive independent give-nothing-back sub-pattern. This cluster group effectively prevents back-tracking upon matching. Note: Back-tracking is allowed inside the group before the group ends, but once the group matches as an independent sub-pattern, no further back-tracking inside that group is performed; if the rest of the pattern fails, the group is unmatched as a whole. It acts as a "grab all you can, then give nothing back" operator.
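
The give-nothing-back behaviour can be demonstrated with a small sketch in Python's re module, used here as a stand-in. Since (?>pattern) is available in Python only from version 3.11 on, the example emulates it with the common look-ahead plus back-reference idiom (?=(a+))\1, which is an assumption of this sketch and not syntax from this manual.

```python
import re

text = "aaab"  # hypothetical sample text

# Ordinary group: a+ first grabs 'aaa', then gives one 'a' back so that
# the trailing 'ab' can match.
backtracking = re.search(r"a+ab", text)

# Emulated (?>a+)ab: the look-ahead captures 'aaa' once and is never
# re-entered, so nothing is given back and the trailing 'ab' cannot match.
atomic = re.search(r"(?=(a+))\1ab", text)

# When nothing has to be given back, the atomic form matches normally.
atomic_ok = re.search(r"(?=(a+))\1b", text)

print(backtracking.group())  # aaab
print(atomic)                # None
print(atomic_ok.group())     # aaab
```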
Look-behind assertion
\K
Removes everything that is to the left of the current matching position from the \& replace back-reference. This effectively provides a look-behind assertion, since it can be used to verify that a match is preceded by some other pattern.
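
Python's re module has no \K, so the sketch below shows two ways to get a comparable result there: a look-behind assertion, and a capture group that is put back by the replacement. It is a stand-in to illustrate the idea of "match after a prefix, but replace only the rest"; the sample text is made up.

```python
import re

text = "price: 100, cost: 200"  # hypothetical sample text

# Look-behind: the prefix must be present, but is not part of the match.
with_lookbehind = re.sub(r"(?<=price: )\d+", "N", text)

# Capture group: the prefix is matched, then re-inserted by the replacement.
with_group = re.sub(r"(price: )\d+", r"\1N", text)

print(with_lookbehind)  # price: N, cost: 200
print(with_group)       # price: N, cost: 200
```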
More escape sequences

Since many characters have special meanings in regular expressions, escapes are provided to allow using these characters in searches.

\\
Matches character \. Note: Unescaped single \ has a special meaning.
\^
Matches character ^. Note: Unescaped single ^ has a special meaning.
\$
Matches character $. Note: Unescaped single $ has a special meaning.
\.
Matches character .. Note: Unescaped single . has a special meaning.
\|
Matches character |. Note: Unescaped single | has a special meaning.
\(
Matches character (. Note: Unescaped single ( has a special meaning.
\)
Matches character ). Note: Unescaped single ) has a special meaning.
\[
Matches character [. Note: Unescaped single [ has a special meaning.
\]
Matches character ]. Note: Unescaped single ] has a special meaning.
\*
Matches character *. Note: Unescaped single * has a special meaning.
\+
Matches character +. Note: Unescaped single + has a special meaning.
\?
Matches character ?. Note: Unescaped single ? has a special meaning.
\{
Matches character {. Note: Unescaped single { has a special meaning.
\}
Matches character }. Note: Unescaped single } has a special meaning.
\<
Matches character <. Note: Unescaped single < has a special meaning.
\>
Matches character >. Note: Unescaped single > has a special meaning.
\:
Matches character :. Note: Unescaped single : has a special meaning.
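
The difference between an escaped and an unescaped special character can be seen in a short sketch. It uses Python's re module as a stand-in for the escapes listed above; the sample text is made up.

```python
import re

text = "cost: $5.00 (approx.)"  # hypothetical sample text

dollar = re.search(r"\$\d+\.\d+", text).group()  # literal $ and literal .
parens = re.search(r"\(\w+\.\)", text).group()   # literal ( . and )

# An unescaped dot matches any character; an escaped dot matches only a dot.
unescaped_dot = re.fullmatch(r"5.00", "5x00") is not None
escaped_dot = re.fullmatch(r"5\.00", "5x00") is not None

print(dollar)         # $5.00
print(parens)         # (approx.)
print(unescaped_dot)  # True
print(escaped_dot)    # False
```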
Replace patterns

Any of these constructs may appear anywhere in the replace pattern, as long as regular expressions are turned on.

\\
Inserts a backslash.


\n
Inserts a newline sequence (CR, NL or CR/NL; depends on current document options).
\t
Inserts a horizontal tab character (TAB).
\f
Inserts a form feed character (FF).
\a
Inserts a bell character (BEL).
\v
Inserts a vertical tab (VT).
\e
Inserts an escape character (ESC).
\0
Inserts a null character (NUL).


\xhh
Inserts a character in hex notation.
\uhhhh
Inserts a character in Unicode notation (Unicode version only).
\cA
Inserts a character in control notation.


\Q ... \E
Quoted string. Anything between \Q and \E is treated as plain-text string and is inserted exactly as it appears in the pattern.


\&
Back-reference to the entire match.
\1, \2, ..., \9
Back-reference to a specific captured group.
\+
Back-reference to the last successful captured group. Consider having several alternations, each with a group inside it. Only one of the alternations will match, thus only one of those groups will be valid upon replacing. This back-reference allows referencing the correct one of those groups, based on which of the alternations matched.
  • Note: This can also be achieved by using branch reset cluster groups.
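
Replace back-references can be sketched with Python's re.sub, used here as a stand-in. The replacement syntax differs from the one above: Python writes the entire match as \g<0> instead of \&, and has no \+, so the last successful group is looked up through the match object's lastindex attribute. The helper name last_group and the sample strings are made up for illustration.

```python
import re

# Numbered back-references: swap the two captured groups.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")

# Entire match (\& above): Python spells it \g<0>.
wrapped = re.sub(r"\d+", r"[\g<0>]", "a1b22")

def last_group(match):
    # lastindex is the number of the last capture group that took part in
    # the match, which mirrors the idea behind the \+ back-reference.
    return "<" + match.group(match.lastindex) + ">"

picked = re.sub(r"(cat)|(dog)", last_group, "cat dog")

print(swapped)  # value=key
print(wrapped)  # a[1]b[22]
print(picked)   # <cat> <dog>
```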
Expression examples
Find e-mail address (simplified method, see below for more)
Find [a-z0-9.+-]+@[a-z0-9.-]+
Words
Find words beginning with KEY
Find \bKEY\w*\b
Find words beginning with letters K or R or M
Find \b[KRM]\w*\b
Find words ending with KEY
Find \b\w*KEY\b
Find words ending with letters K or R or M
Find \b\w*[KRM]\b
Find words containing KEY
Find \b\w*KEY\w*\b
Find words containing no letter K
Find \b[^K\W]+\b
Find words 3 to 5 characters long
Find \b\w{3,5}\b
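
The word patterns above can be tried on a sample sentence. The sketch uses Python's re module as a stand-in, with case-sensitive matching; the sample text is made up, and results in TED Notepad may differ depending on its search options.

```python
import re

text = "KEY KEYS MONKEY DONKEYS Rat Mark ok"  # hypothetical sample text

begins = re.findall(r"\bKEY\w*\b", text)        # words beginning with KEY
ends = re.findall(r"\b\w*KEY\b", text)          # words ending with KEY
contains = re.findall(r"\b\w*KEY\w*\b", text)   # words containing KEY
no_k = re.findall(r"\b[^K\W]+\b", text)         # words with no uppercase K
short = re.findall(r"\b\w{3,5}\b", text)        # words 3 to 5 characters long

print(begins)    # ['KEY', 'KEYS']
print(ends)      # ['KEY', 'MONKEY']
print(contains)  # ['KEY', 'KEYS', 'MONKEY', 'DONKEYS']
print(no_k)      # ['Rat', 'Mark', 'ok']
print(short)     # ['KEY', 'KEYS', 'Rat', 'Mark']
```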
Newlines and spaces
Add one extra newline after each existing line
Find $ and replace with \n
Spacify the text - Add one extra space between each two characters
Find nothing and replace with space
Join each line ending with character = with the following line
Find =\n and replace with nothing
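
Two of the newline examples above can be reproduced in a short sketch. It uses Python's re module as a stand-in, where $ matches at line ends only with the MULTILINE flag and \n is a plain line feed rather than a newline sequence chosen by document options. The sample strings are made up.

```python
import re

# Add one extra newline after each existing line.
doubled = re.sub(r"$", "\n", "a\nb", flags=re.MULTILINE)

# Join each line ending with '=' with the following line.
joined = re.sub(r"=\n", "", "total=\n42\nnext")

print(repr(doubled))  # 'a\n\nb\n'
print(repr(joined))   # 'total42\nnext'
```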
HTML/XML tags
Find and remove all html/xml tags
Find <[^>]*> and replace with nothing
Find and remove all html/xml multi-line tags
Find <([^>]*|\n)*> and replace with nothing
Find and remove html/xml tags containing word pre
Find <[^>]*pre[^>]*> and replace with nothing
Find all <h1> headline tags and surround the text with ==
Find </?h1> and replace with ==
Find all html entities, i.e. &nbsp; &amp; &#39;
Find &[^;]+;
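
The single-line tag and entity patterns above can be checked on a small fragment. The sketch uses Python's re module as a stand-in; the sample markup is made up.

```python
import re

html = "<p>Hello <b>world</b> &amp; co&nbsp;!</p>"  # hypothetical sample markup

# Find and remove all tags.
stripped = re.sub(r"<[^>]*>", "", html)

# Find all entities.
entities = re.findall(r"&[^;]+;", html)

# Replace opening and closing h1 tags with ==.
headline = re.sub(r"</?h1>", "==", "<h1>Title</h1>")

print(stripped)  # Hello world &amp; co&nbsp;!
print(entities)  # ['&amp;', '&nbsp;']
print(headline)  # ==Title==
```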