Difference between revisions of "Regular expressions"

From TED Notepad

Revision as of 16:45, 9 November 2013

This section is up to date for TED Notepad version 6.3.1.0.
Control page Control:feature:Regular expressions
Basic syntax and principles

Regular expressions offer a standardized way to enrich plain-text search patterns with wildcards, repetitions, multiple alternatives, etc. While a plain-text search for abc will always yield an occurrence of abc, a regular expression search pattern like abc|xy can search for both abc and xy at the same time and report which one was found first. And this is just the beginning. Regular expressions can be used to find any e-mail address in the document, locate every html tag, find every word that ends with ing, and much more.
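The difference between a plain-text search and the abc|xy pattern can be tried outside TED Notepad as well. The following is only a rough sketch in Python, assuming its re module treats this simple pattern the same way; the sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text; "xy" occurs before "abc".
text = "first xy, then abc"

# A plain-text search only ever finds the literal string it was given.
plain_position = text.find("abc")

# The pattern abc|xy searches for both alternatives at once
# and reports whichever occurs first in the text.
first = re.search(r"abc|xy", text)

print(plain_position)                 # position of the literal abc
print(first.group(), first.start())   # which alternative came first, and where
```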

a, b, c, ..., z
0, 1, 2, ..., 9
Ordinary alphanumeric characters have no special meaning in regular expression patterns; they always stand for themselves.
  • Example: A pattern like abc simply searches for abc and xy searches for xy.
  • Note: It is common to say that a regular expression pattern matches something instead of saying that it searches for something. Get used to it. :)


\Q ... \E
Quoted string. Whatever text appears between \Q and \E is treated as plain text and is matched exactly as it appears.
  • When unsure about special-purpose characters, \Q and \E can be used to temporarily turn the regular expression syntax off and safely match in a WYSIWYG plain-text manner.
  • Example: A pattern like \Q$%^#$&\E simply searches for $%^#$& even though ^ and $ have special purpose.
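For comparison only: Python's re module has no \Q ... \E, but its re.escape() function gives roughly the same effect by escaping every special character in a string. The sketch below assumes that equivalence; the sample text is hypothetical.

```python
import re

literal = "$%^#$&"

# re.escape() backslash-escapes the special characters, which gives
# roughly the effect that \Q...\E has in TED Notepad.
pattern = re.escape(literal)

match = re.search(pattern, "price: $%^#$& (sic)")
print(pattern)
print(match.group())
```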

So far, regular expressions sounded rather boring, eh? Never mind, basics and principles are never supposed to be something extraordinary anyway.

Special-purpose characters

Following are the most typical special-purpose characters and escapes for matching entire sets of characters:

.
Wildcard character. Matches any single character except for newline.
  • Example: Pattern abc.xy matches any string that begins with abc, continues with exactly one character of any sort, and ends with xy.
  • Note: A dot . never matches a newline and this behaviour cannot be changed via any option or setting.
\w
Matches any word character.
  • Example: Pattern \w\w\w matches any 3 consecutive word characters, such as a 3-letter word.
\W
Matches any non-word character.
  • This set is the complement of \w, except that it does not contain newline.
  • Example: Pattern \W\W\W matches any 3 consecutive non-word characters.
\s
Matches any white-space character.
  • Example: Pattern \s\s matches any 2 consecutive white-space characters.
\S
Matches any non-white-space character.
  • This set is the complement of \s, except that it does not contain newline.
\d
Matches any digit character.
  • Example: Pattern \d\d\d\d matches any 4 consecutive digits, such as a 4-digit number.
\D
Matches any non-digit character.
  • This set is the complement of \d, except that it does not contain newline.
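The character sets above can be tried in Python as a rough sketch. Note the assumption: Python's re module is not TED Notepad, and it differs on newlines. Its dot skips a newline by default, but its \W, \S and \D do match one. The sample text is made up.

```python
import re

text = "Room 2013, code abc_9!"

four_digits = re.findall(r"\d\d\d\d", text)   # four consecutive digits
three_words = re.findall(r"\w\w\w", text)     # three consecutive word characters
abc_any = re.findall(r"abc.", text)           # abc followed by any one character

# The dot does not match a newline here either.
dot_newline = re.search(r"abc.xy", "abc\nxy")

# Python-specific difference: \W does match the newline.
python_nonword = re.findall(r"\W", "a\nb")

print(four_digits, three_words, abc_any, dot_newline, python_nonword)
```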

Following are escapes for including special hard-to-type characters in a pattern:

\n
Matches one newline sequence (CR, NL or CR/NL).
\t
Matches a horizontal tab character (TAB).
\f
Matches a form feed character (FF).
\a
Matches a bell character (BEL).
\v
Matches a vertical tab (VT).
\e
Matches an escape character (ESC).
\0
Matches a null character (NUL).
\xhh
Matches a character given in hh hex notation.
\uhhhh
Matches a character given in hhhh unicode notation (unicode version only).
\cK
Matches a character given in control notation. The letter K can be any capital letter from A to Z, where A stands for \x01, B stands for \x02, C stands for \x03, etc.
Intervals and repetitions

In order to match a character more than once, regular expressions offer quantifiers. Any character, escape, or even a complex pattern construct can be repeated by adding such quantifier to that construct.

  • Note: A quoted string is one example of a complex construct. A quantifier always applies to the entire construct, so the quantifier in pattern \Qabc\E* repeats abc as a whole.
  • Note: After an ordinary character, the quantifier applies only to the single character it follows, so the quantifier in pattern abc* makes only the letter c repeatable.
?
Match the quantified construct 0 or 1 times, matching as much as possible.
  • Example: Pattern c? matches an empty string or single c.
  • Example: Pattern abc? matches ab or abc.
*
Match the quantified construct 0 or more times, matching as much as possible.
  • Example: Pattern c* matches an empty string or single c or double cc or even any cccccccc.
  • Example: Pattern abc* matches ab or abc or abcc or even abcccccccc.
+
Match the quantified construct 1 or more times, matching as much as possible.
  • Example: Pattern c+ matches a single c or double cc or even any cccccccc.
  • Example: Pattern abc+ matches abc or abcc or even abcccccccc.
{n}
Match the quantified construct exactly n times.
  • Example: Pattern c{3} matches a triple ccc.
  • Example: Pattern abc{3} matches abccc.
{,n}
Match the quantified construct up to n times, matching as much as possible.
  • Example: Pattern c{,3} matches an empty string or single c or double cc or ccc.
  • Example: Pattern abc{,3} matches ab or abc or abcc or abccc.
{n,}
Match the quantified construct at least n times, matching as much as possible.
  • Example: Pattern c{3,} matches triple ccc or cccc or even cccccccccc.
  • Example: Pattern abc{3,} matches abccc or abcccc or even abcccccccc.
{n,m}
Match the quantified construct at least n times, but no more than m times, matching as much as possible.
  • Example: Pattern c{3,5} matches triple ccc or cccc or ccccc.
  • Example: Pattern abc{3,5} matches abccc or abcccc or abccccc.
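The quantifier examples above can be checked mechanically. This is a sketch in Python, assuming its re module agrees with TED Notepad on these quantifiers. The helper name and the rejected strings are made up, and (?:abc)* is used as an assumed stand-in for \Qabc\E*, which Python does not support.

```python
import re

# pattern -> (strings it should match in full, strings it should reject)
cases = {
    r"abc?":     (["ab", "abc"], ["a", "abcc"]),
    r"abc*":     (["ab", "abc", "abcccccccc"], ["a"]),
    r"abc+":     (["abc", "abcc"], ["ab"]),
    r"abc{3}":   (["abccc"], ["abcc", "abcccc"]),
    r"abc{,3}":  (["ab", "abc", "abcc", "abccc"], ["abcccc"]),
    r"abc{3,}":  (["abccc", "abcccccccc"], ["abcc"]),
    r"abc{3,5}": (["abccc", "abcccc", "abccccc"], ["abcc", "abcccccc"]),
    # Assumed counterpart of \Qabc\E*: the group repeats abc as a whole.
    r"(?:abc)*": (["", "abc", "abcabc"], ["abcc"]),
}

def behaves_as_described(pattern, accepted, rejected):
    """True if pattern matches every accepted string in full and no rejected one."""
    return (all(re.fullmatch(pattern, s) for s in accepted)
            and not any(re.fullmatch(pattern, s) for s in rejected))

results = {p: behaves_as_described(p, ok, bad) for p, (ok, bad) in cases.items()}
for pattern, passed in results.items():
    print(pattern, passed)
```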

Any of the quantifiers above may be followed by an interval modifier. An interval modifier must appear right after a quantifier to be recognized.

?
Use lazy matching instead of greedy matching.
  • Lazy matching makes the quantified construct match as little as possible.
  • Example: Pattern c*? matches an empty string or single c or even cccccccc, but prefers to match an empty string rather than c.
  • Example: Pattern abc?? matches ab or abc, but prefers to match ab rather than abc.
  • Note: Greedy matching is the default, matches as much as possible, and does not need any modifiers.
+
Use possessive greedy matching.
  • By default, when a greedy quantified construct would not allow the rest of the pattern to match, back-tracking is used to find out how much that construct should match to allow the entire pattern to match. Possessive matching means no such back-tracking. Once a match is established by the construct, it can only be back-tracked as a whole.
  • Example: Pattern c*+\w possessively matches all available letters c, which must then be followed by one word character.
    1. This is okay in string cccca, where the c*+ would match all the letters c and the \w would then match the letter a.
    2. However, in a string ccccc, the pattern would fail to match, because all letters c would be eaten by the possessive c*+ and the \w would be left with nothing to match.
  • Note: There is no lazy possessive interval matching, since that would not make much sense. :)
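Greedy, lazy and possessive matching can be compared side by side. This is a sketch in Python, not TED Notepad. Python 3.11 and later accept c*+ directly, so to stay runnable on older versions the possessive quantifier is emulated here with a lookahead and a back-reference. That emulation is an assumption of this example, as are the sample strings.

```python
import re

tagged = "<b>bold</b>"
greedy = re.search(r"<.*>", tagged).group()   # takes as much as possible
lazy = re.search(r"<.*?>", tagged).group()    # takes as little as possible

# Emulated possessive c*+: the lookahead grabs all letters c at once and
# the back-reference \1 consumes exactly that text, so nothing can be
# given back to the following \w.
POSSESSIVE = r"(?=(c*))\1\w"

ok = re.search(POSSESSIVE, "cccca")         # \w is left with the letter a
failed = re.search(POSSESSIVE, "ccccc")     # nothing is left for \w
backtracked = re.search(r"c*\w", "ccccc")   # plain greedy c* gives one c back

print(greedy, lazy, ok, failed, backtracked)
```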
Zero-length assertions

Besides the various wildcards and repetitions described above, regular expressions also offer a way to verify that, at a given point during matching, a condition about the nearby context is true. The following constructs, which verify such context conditions, never match any characters themselves, i.e. they are always zero-length. They allow the containing pattern to match successfully only if the relevant condition is true (at the point where they appear in the pattern).

^
Matches only at line beginnings.
  • Example: Pattern ^abc matches any abc that is placed right at the beginning of a line.
    • Any abc occurrences that are preceded by other characters on a line are not matched.
$
Matches only at line ends.
  • Example: Pattern abc$ matches any abc that is placed right at the end of a line.
    • Any abc occurrences that are followed by other characters on a line are not matched.
  • Example: Pattern ^abc$ matches whole lines which contain abc and nothing else.
\p
Matches only at paragraph beginnings.
  • Example: Pattern \pabc matches any abc that is placed at the beginning of a paragraph.
\P
Matches only at paragraph ends.
  • Example: Pattern \pabc\P matches whole paragraphs which contain abc and nothing else.
\A
Matches only at document beginning.
  • Example: Pattern \Aabc matches one abc, if the current document starts with that abc.
\Z
Matches only at document end.
  • Example: Pattern abc\Z matches one abc, if the current document ends with that abc.
  • Example: Pattern \Aabc\Z would match only if the entire content of the current document would be abc.
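The line and document anchors can be sketched in Python, with some assumptions: Python needs the re.MULTILINE flag for ^ and $ to work per line, it only treats \n as a line break, and it has no counterpart of \p and \P, so those are left out. The document string and helper name are hypothetical.

```python
import re

# Hypothetical four-line document.
document = "abc\nxabc\nabcx\nabc"

def starts(pattern, flags=0):
    """Return the start offsets of all matches of pattern in the document."""
    return [m.start() for m in re.finditer(pattern, document, flags)]

line_begin = starts(r"^abc", re.MULTILINE)    # abc at the beginning of a line
line_end = starts(r"abc$", re.MULTILINE)      # abc at the end of a line
whole_line = starts(r"^abc$", re.MULTILINE)   # lines containing only abc
doc_begin = starts(r"\Aabc")                  # abc at the very start
doc_end = starts(r"abc\Z")                    # abc at the very end

print(line_begin, line_end, whole_line, doc_begin, doc_end)
```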


\b
Matches only at a word boundary, i.e. a word beginning or a word end. One of the characters around the current matching position must be a word character and the other must not be one.
  • Example: Pattern \babc matches any abc, if that abc is the beginning of a word.
  • Example: Pattern \babc\b matches any abc, if that abc is the whole word.
  • Example: Pattern \b\w*\b matches whole words.
\B
Matches only at a word non-boundary, i.e. inside of words between two word characters, or outside words, if there is no word character nearby.
  • Actually, pattern \B matches exactly at those positions where \b cannot match.
\y
Matches only at word beginning, i.e. the character after the current matching position must be a word character, and the current position must be a word boundary.
  • Example: Pattern \y. matches the first character of every word, whatever that character is.
\Y
Matches only at word end, i.e. the character before the current matching position must be a word character, and the current position must be a word boundary.
  • Example: Pattern \y\w*\Y matches whole words.
  • Example: Pattern \Y\w*\y can never match.
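The word assertions can be sketched in Python, which supports \b and \B but not \y and \Y. The two constants below are assumed equivalents built from \b plus a look-around, following the definitions above. They are not TED Notepad syntax, and the sample strings are made up.

```python
import re

text = "abc abcd xabc abc"

whole_word = [m.start() for m in re.finditer(r"\babc\b", text)]
word_begin = [m.start() for m in re.finditer(r"\babc", text)]
inside = re.findall(r"\Bb\B", "abc b")   # only the b between two word characters

# Assumed stand-ins for \y and \Y.
WORD_BEGIN = r"\b(?=\w)"    # boundary followed by a word character
WORD_END = r"\b(?<=\w)"     # boundary preceded by a word character

sample = "one, two three"
words = re.findall(WORD_BEGIN + r"\w*" + WORD_END, sample)
initials = re.findall(WORD_BEGIN + r".", sample)
never = re.findall(WORD_END + r"\w*" + WORD_BEGIN, sample)

print(whole_word, word_begin, inside, words, initials, never)
```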


\G
Matches only at the original starting position. Guarantees that this construct matches only at the position within the document where the search started. The starting position is usually the caret position before the search, also indicated in the Status Bar.
  • Example: Pattern abc\Gxy matches abcxy if the caret is preceded by abc and followed by xy upon invoking the search.
Character classes

Character classes can be used to define special sets of characters. While regular expressions offer some common character sets as part of the language (e.g. \w or \W, \d or \D, etc.), it is also possible to use any user-defined sets. For example, [abz] creates a user-defined character class which includes only letters a, b and z. This user-defined set of characters now matches only one of these three letters, very much like a \d would match one single digit.

The following syntax is used to specify user-defined character class within patterns:

[
Opens character class definition.
  • Each opened character class definition requires a closing ] bracket as described below.
  • Everything until such ] closing bracket is considered part of the character class definition and must comply with character class definition syntax as described in this section.
  • Note that syntax for entering character classes differs substantially from pattern syntax. For example, a dot . has no special meaning inside character class definition, while within a pattern it matches almost anything.
]
Closes character class definition.
  • Note: If a ] bracket appears after the opening [ bracket of a new character class definition, it is considered an ordinary ] character.
    • Pattern []] is equivalent to [\]] and can be used to match a single ]. Pattern [^]] is equivalent to [^\]] and can be used to match any single character but ].
    • Note that [] and [^] are both invalid patterns. They are both considered to contain an unclosed character class definition.


Syntax inside character class definitions
a, b, c, ..., z
0, 1, 2, ..., 9
Ordinary alphanumeric characters have no special meaning inside a character class definition; they always stand for themselves.
  • Example: Pattern [abc] matches one character, which can be either letter a, b or c.


A-Z
Range of multiple characters from A to Z, where the A and Z can be any two characters from the ASCII table.
  • Non-ASCII characters can also be used, however, the results might be somewhat unexpected, especially in Unicode versions.
  • Example: Pattern [a-c] matches one character, which can be either letter a, b or c.
  • Example: Pattern [a-z] matches one character, which can be any US-ASCII letter a...z.
The dash must be used between two actual characters to be recognized as a valid range. Otherwise it is considered an ordinary dash - character.
  • If used right after the opening [ bracket, it is considered an ordinary dash - character.
    • Example: Pattern [-a] matches either letter a or dash -.
  • If used right before the closing ] bracket, it is considered an ordinary dash - character.
    • Example: Pattern [a-] matches either letter a or dash -.
  • If chained, i.e. using two dashes around one character, the second dash is considered an ordinary dash - character.
    • Example: Pattern [a-c-z] matches either letter a, b, c, or z or dash -.
    • Example: Pattern [a-c-z] is equivalent to [a-cz-]. It may be considered bad practice to chain dashes like this.
  • Using a reversed character range, i.e. pattern [z-a], is allowed in TED Notepad, but not recommended. It is considered bad practice.
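The bracket and dash rules above can be tried out with a small sketch in Python, assuming its re module follows the same rules for these particular cases. The helper name and candidate characters are made up. Reversed ranges such as [z-a] are left out, because Python rejects them.

```python
import re

def chars_matched(pattern, candidates="abcdz-].^"):
    """Return the candidate characters that the one-character pattern matches."""
    return "".join(ch for ch in candidates if re.fullmatch(pattern, ch))

plain_set = chars_matched(r"[abc]")
simple_range = chars_matched(r"[a-c]")
leading_dash = chars_matched(r"[-a]")       # dash right after [ is ordinary
trailing_dash = chars_matched(r"[a-]")      # dash right before ] is ordinary
chained = chars_matched(r"[a-c-z]")         # second dash is ordinary
unchained = chars_matched(r"[a-cz-]")       # the clearer way to write it
bracket_only = chars_matched(r"[]]")        # ] right after [ is ordinary
all_but_bracket = chars_matched(r"[^]]")
literal_dot = chars_matched(r"[.]")         # dot has no special meaning here

print(plain_set, simple_range, leading_dash, trailing_dash, chained,
      unchained, bracket_only, all_but_bracket, literal_dot)
```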


This section is incomplete and wants to be finished later.
TODO:
 Any of these constructs may appear in the character class.
 \t     horizontal tab character (TAB)
 \f     form feed character (FF)
 \a     bell character (BEL)
 \v     vertical tab (VT)
 \e     escape character (ESC)
 \b     backspace character (BS)
 \n     line feed newline character (NL)
 \r     carriage return character (CR)
 \0     null (NUL)
 \x     character in hex notation
 \u     character in unicode notation (unicode version only)
 \c     character in control notation
 \\     escape for \
 \^     escape for ^
 \[     escape for [
 \]     escape for ]
 \-     escape for -
 \:     escape for :
 \w     any word character
 \W     any non-word character
 \s     any white-space character
 \S     any non-white-space character
 \d     any digit character
 \D     any non-digit character
 [:alpha:]   class function: alphas
 [:alnum:]   class function: alphanums
 [:blank:]   class function: blank chars
 [:cntrl:]   class function: control chars
 [:digit:]   class function: digits
 [:graph:]   class function: graphs
 [:lower:]   class function: lowercase chars
 [:print:]   class function: printable chars
 [:punct:]   class function: punctuations
 [:space:]   class function: white-spaces
 [:upper:]   class function: uppercase chars
 [:xdigi:]   class function: hex digits


[^class]
Negates the whole character class definition.
  • The ^ must appear right after the opening [ bracket to be recognized.
  • Negated character class matches all characters not mentioned in that character class definition, except for newline.
  • Example: Pattern [^abc] matches any character except letter a, b, c and newline.
    • Newline is never part of any negated character class, mostly for user convenience.
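The newline exception is specific to TED Notepad. In Python, for comparison, a negated class does match a newline, so the behaviour described above has to be spelled out by listing the newline characters in the class. A minimal sketch, assuming \r and \n are the only newline characters of interest:

```python
import re

text = "ad\nb"

# Python default: the newline is matched by the negated class.
python_default = re.findall(r"[^abc]", text)

# Closer to the behaviour described above: exclude newlines explicitly.
ted_like = re.findall(r"[^abc\r\n]", text)

print(python_default, ted_like)
```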
Capture groups
(
Begins a new capture group. Capture groups are useful for back-references in both search and replace patterns.
)
Ends current capture group. Note: Capture groups can be nested.


\1, \2, ..., \9
Back-reference to a specific captured group. Matches exactly the same text as was previously matched by the given capture group. Note that this does not try to repeat the sub-pattern within that capture group, but matches against the specific text matched by that group.
  • Iteration through a capture group and a back-reference to it may be repeated several times upon matching by an interval on an enclosing group. In such case, the back-reference always matches against the most recent text captured by the capture group.
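Both points can be illustrated with a short sketch in Python, assuming its back-references behave as described here. The sample sentences are made up.

```python
import re

# \1 must repeat the exact text captured by group 1, not just any word.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test here")

# Inside a repeated enclosing group, \1 always refers to the most recent
# capture: first "a" is captured and repeated, then "b".
pairs = re.fullmatch(r"(?:(\w)\1)+", "aabb")
mismatch = re.fullmatch(r"(?:(\w)\1)+", "aabc")

print(doubled, pairs.group(1), mismatch)
```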
Cluster groups
(?:pattern)
Cluster group. Cluster groups are non-capturing groups. They act like capturing groups, but do not consume resources for capturing and do not consume capture group numbers for back-referencing.
(?|pattern)
Branch reset cluster group. Inside a branch reset group, capture groups are numbered from the same starting group number in each alternation. Thus several capture groups are assigned the same group number, and then, depending on which alternation actually matches, the number references the correct matching capture group.
Other groups
(?>pattern)
Possessive independent give-nothing-back sub-pattern. This cluster group effectively prevents back-tracking upon matching. Note: Back-tracking is allowed inside of the group before the group ends, but once the group matches as an independent sub-pattern, further back-tracking inside of that group is not performed and the group can only be unmatched at that point as a whole. This is like a "grab all you can, and then give nothing back" operator.
Alternations

Alternations allow the user to combine two or more separate patterns into one bigger pattern, matching whatever any of those separate patterns would match. Alternations thus provide a logical or in the pattern syntax, meaning that the bigger pattern can match either what one of the separate patterns would match or what another would match.

|
Divides pattern alternations. As long as any of the alternations matches, the entire pattern matches.
  • Example: abc|xy can match either abc or xy.
  • Note: Enclosing groups always demarcate borders for alternations.
    • Example: (abc|xy)2 can match either abc2 or xy2.
    • Example: (abc|xy)2|uw can match either abc2 or xy2 or uw.
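How groups demarcate alternations can be checked with a small sketch in Python, assuming the same grouping rules apply. The ungrouped pattern is added only for contrast and is not from the examples above.

```python
import re

grouped = r"(abc|xy)2|uw"
grouped_ok = all(re.fullmatch(grouped, s) for s in ["abc2", "xy2", "uw"])

# Without the group, the alternation splits the whole pattern:
# abc|xy2 means "abc" or "xy2", so abc2 no longer matches in full.
ungrouped = r"abc|xy2"
ungrouped_abc2 = re.fullmatch(ungrouped, "abc2")
ungrouped_abc = re.fullmatch(ungrouped, "abc")

print(grouped_ok, ungrouped_abc2, ungrouped_abc)
```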
Look-behind assertion
\K
Removes everything that is to the left of the current matching position from the \& replace back-reference. This effectively provides a look-behind assertion, since it can be used to verify that a match is preceded by some other pattern.
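For readers who know other tools: Python's re module has no \K, but the effect of a hypothetical TED Notepad search for abc\Kxy replaced with a dash can be approximated in two ways. Both are assumptions of this sketch, not TED Notepad syntax.

```python
import re

text = "abcxy zxy"

# Variant 1: a look-behind verifies that abc precedes xy without
# making it part of the match (Python requires a fixed-width look-behind).
with_lookbehind = re.sub(r"(?<=abc)xy", "-", text)

# Variant 2: capture the left part and put it back in the replacement.
with_group = re.sub(r"(abc)xy", r"\1-", text)

print(with_lookbehind, with_group)
```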
More escape sequences

Since many characters have special meanings in regular expressions, escapes are provided to allow using these characters in searches.

\\
Matches character \. Note: Unescaped single \ has a special meaning.
\^
Matches character ^. Note: Unescaped single ^ has a special meaning.
\$
Matches character $. Note: Unescaped single $ has a special meaning.
\.
Matches character .. Note: Unescaped single . has a special meaning.
\|
Matches character |. Note: Unescaped single | has a special meaning.
\(
Matches character (. Note: Unescaped single ( has a special meaning.
\)
Matches character ). Note: Unescaped single ) has a special meaning.
\[
Matches character [. Note: Unescaped single [ has a special meaning.
\]
Matches character ]. Note: Unescaped single ] has a special meaning.
\*
Matches character *. Note: Unescaped single * has a special meaning.
\+
Matches character +. Note: Unescaped single + has a special meaning.
\?
Matches character ?. Note: Unescaped single ? has a special meaning.
\{
Matches character {. Note: Unescaped single { has a special meaning.
\}
Matches character }. Note: Unescaped single } has a special meaning.
\<
Matches character <. Note: Unescaped single < has a special meaning.
\>
Matches character >. Note: Unescaped single > has a special meaning.
\:
Matches character :. Note: Unescaped single : has a special meaning.
Replace patterns

Any of these constructs may appear anywhere in the replace pattern, as long as regular expressions are turned on.

\\
Inserts a backslash.


\n
Inserts a newline sequence (CR, NL or CR/NL; depends on current document options).
\t
Inserts a horizontal tab character (TAB).
\f
Inserts a form feed character (FF).
\a
Inserts a bell character (BEL).
\v
Inserts a vertical tab (VT).
\e
Inserts an escape character (ESC).
\0
Inserts a null character (NUL).


\xhh
Inserts a character in hex notation.
\uhhhh
Inserts a character in unicode notation (unicode version only).
\cA
Inserts a character in control notation.


\Q ... \E
Quoted string. Anything between \Q and \E is treated as a plain-text string and is inserted exactly as it appears in the pattern.


\&
Back-reference to the entire match.
\1, \2, ..., \9
Back-reference to a specific captured group.
\+
Back-reference to the last successfully captured group. Consider having several alternations, each with a group inside it. Only one of the alternations will match, thus only one of those groups will be valid upon replacing. This back-reference allows referencing the correct one of those groups, based on which of the alternations matched.
  • Note: This can also be achieved by using branch reset cluster groups.
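The replace back-references have rough counterparts in Python, sketched below under stated assumptions: \g<0> plays the role of \&, \1 and \2 work alike, and the last-group back-reference \+ is imitated with a small hypothetical helper built on match.lastindex. None of this is TED Notepad syntax.

```python
import re

# Whole match (like \&): wrap every number in brackets.
whole = re.sub(r"\d+", r"[\g<0>]", "a1 b22")

# Numbered groups: swap the two sides of an assignment.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")

def last_group(match):
    """Return the text of the last capture group that took part in the match."""
    return match.group(match.lastindex)

# Last captured group (like \+): each alternation has its own group,
# and only the one that matched is used.
last = re.sub(r"a(\d)|b(\d)", last_group, "a1 b2")

print(whole, swapped, last)
```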
Expression examples
Find e-mail address (simplified method, see below for more)
Find [a-z0-9.+-]+@[a-z0-9.-]+
Words
Find words beginning with KEY
Find \bKEY\w*\b
Find words beginning with letters K or R or M
Find \b[KRM]\w*\b
Find words ending with KEY
Find \b\w*KEY\b
Find words ending with letters K or R or M
Find \b\w*[KRM]\b
Find words containing KEY
Find \b\w*KEY\w*\b
Find words containing no letter K
Find \b[^K\W]+\b
Find words 3 to 5 characters long
Find \b\w{3,5}\b
Newlines and spaces
Add one extra newline after each existing line
Find $ and replace with \n
Spacify the text - Add one extra space between each two characters
Find nothing and replace with space
Join each line ending with character = with the following line
Find =\n and replace with nothing
HTML/XML tags
Find and remove all html/xml tags
Find <[^>]*> and replace with nothing
Find and remove all html/xml multi-line tags
Find <([^>]*|\n)*> and replace with nothing
Find and remove html/xml tags containing word pre
Find <[^>]*pre[^>]*> and replace with nothing
Find all <h1> headline tags and surround the text with ==
Find </?h1> and replace with ==
Find all html entities, i.e. &nbsp; &amp; &#39;
Find &[^;]+;
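Several of the patterns above also run unchanged in Python, which makes for a quick way to experiment with them. This is only a sketch: the sample strings are made up, the search is case-sensitive, and Python's re module is assumed to agree with TED Notepad on these particular patterns.

```python
import re

mail = re.findall(r"[a-z0-9.+-]+@[a-z0-9.-]+", "write to john.doe+x@example.com now")

words = "KEY KEYS monkey MONKEY KEYNOTE"
begin_key = re.findall(r"\bKEY\w*\b", words)     # words beginning with KEY
end_key = re.findall(r"\b\w*KEY\b", words)       # words ending with KEY
no_k = re.findall(r"\b[^K\W]+\b", "KEY cat MONKEY dog")
short = re.findall(r"\b\w{3,5}\b", "a abc abcdef hello")

joined = re.sub(r"=\n", "", "a =\nb\nc")         # join lines ending with =

html = "<h1>Title</h1><p>a &amp; b&nbsp;c</p>"
no_tags = re.sub(r"<[^>]*>", "", html)           # remove all tags
headline = re.sub(r"</?h1>", "==", "<h1>Title</h1>")
entities = re.findall(r"&[^;]+;", html)

print(mail, begin_key, end_key, no_k, short)
print(repr(joined), no_tags, headline, entities)
```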