Regular expressions
Latest revision as of 22:03, 25 November 2021
Basic syntax and principles
Regular expressions (also called RegExp, RegExps; or RegEx, RegExes, RegExen) offer a standardized way to enrich plain-text search patterns with wildcards, repetitions, multiple alternatives, etc. While a plain-text search for abc will always yield an occurrence of abc, a regular expression search pattern like abc|xy can search for both abc and xy at the same time and report whichever is found first. And this is just the beginning. Regular expressions can be used to find any e-mail address in the document, locate every html tag, find every word that ends with ing, and much more.
- a, b, c, ..., z
- 0, 1, 2, ..., 9
- Ordinary alpha-numeric characters have no special meaning in regular expression patterns; they always stand for themselves.
- Example: A pattern like abc simply searches for abc and xy searches for xy.
- Note: It is common to say that a regular expression pattern matches something instead of saying that it searches for something. Get used to it. :)
- \Q ... \E
- Quoted string. Whatever text appears between \Q and \E is treated as plain text and is matched exactly as it appears.
- When unsure about special-purpose characters, \Q and \E can be used to temporarily turn the regular expression syntax off and safely match in a WYSIWYG plain-text manner.
- Example: A pattern like \Q$%^#$&\E simply searches for $%^#$& even though ^ and $ have a special purpose.
So far, regular expressions sounded rather boring, eh? Never mind, basics and principles are never supposed to be something extraordinary anyway.
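As a rough illustration outside TED Notepad, the alternation abc|xy and the idea behind \Q ... \E can be tried in Python's re module. This is only a sketch under assumptions: Python does not support \Q ... \E, so re.escape is used here as a stand-in, and the sample texts are made up.

```python
import re

text = "first xy, then abc"  # made-up sample text

# abc|xy searches for both alternatives and reports whichever occurs first.
first = re.search(r"abc|xy", text)
first_found = first.group(0)

# Python has no \Q ... \E; re.escape() plays a similar role by
# turning every special-purpose character into a literal one.
literal = re.escape("$%^#$&")
quoted = re.search(literal, "price: $%^#$& end")
quoted_found = quoted.group(0)

print(first_found)
print(quoted_found)
```

The first search reports xy because it occurs earlier in the text than abc, even though abc is listed first in the pattern.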
Special-purpose characters
The following are the most typical special-purpose characters and escapes for matching entire sets of characters:
- .
- Wildcard character. Matches any single character except for newline.
- Example: Pattern abc.xy matches any string that begins with abc, then contains one character of any sort, and then continues with xy.
- Note: A dot . never matches a newline and this behaviour cannot be changed via any option or setting.
- \w
- Matches any word character.
- Example: Pattern \w\w\w matches any 3-letter-long word.
- \W
- Matches any non-word character.
- This is a complementary character set to the \w, but does not contain newline.
- Example: Pattern \W\W\W matches any 3 consecutive non-word characters.
- \s
- Matches any white-space character.
- Example: Pattern \s\s matches any 2 consecutive white-spaces.
- \S
- Matches any non-white-space character.
- This is a complementary character set to the \s, but does not contain newline.
- \d
- Matches any digit character.
- Example: Pattern \d\d\d\d matches any 4-digit-long number.
- \D
- Matches any non-digit character.
- This is a complementary character set to the \d, but does not contain newline.
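The wildcard and the shorthand sets above can be tried out in Python's re module, which is used here only as an approximation. Note one assumed difference: in Python, \W, \S and \D also match a newline, unlike the behaviour described above. The sample texts are made up.

```python
import re

text = "Room 1204, call 555 0199"  # made-up sample text

# \d\d\d\d matches runs of exactly four digits, found left to right.
four_digits = re.findall(r"\d\d\d\d", text)

# The dot stands for any one character, but not for a newline.
dot_same_line = re.search(r"abc.xy", "abc-xy") is not None
dot_across_lines = re.search(r"abc.xy", "abc\nxy") is not None

# \s\s needs two consecutive white-space characters.
double_spaces = re.findall(r"\s\s", "a  b c")

print(four_digits, dot_same_line, dot_across_lines, double_spaces)
```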
The following are escapes for including special hard-to-type characters in a pattern:
- \n
- Matches one newline sequence (CR, NL or CR/NL).
- \t
- Matches a horizontal tab character (TAB).
- \f
- Matches a form feed character (FF).
- \a
- Matches a bell character (BEL).
- \v
- Matches a vertical tab (VT).
- \e
- Matches an escape character (ESC).
- \0
- Matches a null character (NUL).
- \xhh
- Matches a character given in hh hex notation.
- \uhhhh
- Matches a character given in hhhh unicode notation (unicode version only).
- \cK
- Matches a character given in control notation. The letter K can be any capital letter from A to Z, where A stands for \x01, B stands for \x02, C stands for \x03, etc.
Intervals and repetitions
In order to match a character more than once, regular expressions offer quantifiers. Any character, escape, or even a complex pattern construct can be repeated by adding a quantifier to that construct.
- Note: A complex construct is a quoted string, for example. A quantifier always applies to the entire construct; therefore, the quantifier in pattern \Qabc\E* would repeat the entire abc as a whole.
- Note: After an ordinary character, the quantifier applies only to the one character after which it appears; therefore, the quantifier in pattern abc* would make only the letter c repeatable.
- ?
- Match the quantified construct 0 or 1 times, matching as much as possible.
- Example: Pattern c? matches an empty string or single c.
- Example: Pattern abc? matches ab or abc.
- *
- Match the quantified construct 0 or more times, matching as much as possible.
- Example: Pattern c* matches an empty string or single c or double cc or even any cccccccc.
- Example: Pattern abc* matches ab or abc or abcc or even abcccccccc.
- +
- Match the quantified construct 1 or more times, matching as much as possible.
- Example: Pattern c+ matches a single c or double cc or even any cccccccc.
- Example: Pattern abc+ matches abc or abcc or even abcccccccc.
- {n}
- Match the quantified construct exactly n times.
- Example: Pattern c{3} matches a triple ccc.
- Example: Pattern abc{3} matches abccc.
- {,n}
- Match the quantified construct up to n times, matching as much as possible.
- Example: Pattern c{,3} matches an empty string or single c or double cc or ccc.
- Example: Pattern abc{,3} matches ab or abc or abcc or abccc.
- {n,}
- Match the quantified construct at least n times, matching as much as possible.
- Example: Pattern c{3,} matches triple ccc or cccc or even cccccccccc.
- Example: Pattern abc{3,} matches abccc or abcccc or even abcccccccc.
- {n,m}
- Match the quantified construct at least n times but no more than m times, matching as much as possible.
- Example: Pattern c{3,5} matches triple ccc or cccc or ccccc.
- Example: Pattern abc{3,5} matches abccc or abcccc or even abccccc.
Any of the quantifiers above may be followed by an interval modifier. An interval modifier must appear right after a quantifier to be recognized.
- ?
- Use lazy matching instead of greedy matching.
- Lazy matching matches as little as possible with the given quantified construct.
- Example: Pattern c*? matches an empty string or single c or even cccccccc, but prefers to match an empty string rather than c.
- Example: Pattern abc?? matches ab or abc, but prefers to match ab rather than abc.
- Note: Greedy matching is the default, matches as much as possible, and does not need any modifiers.
- +
- Use possessive greedy matching.
- By default, when a greedy quantified construct would not allow the rest of the pattern to match, back-tracking is used to find out how much that construct should match to allow the entire pattern to match. Possessive matching means no such back-tracking. Once a match is established by the construct, it can only be back-tracked as a whole.
- Example: Pattern c*+\w possessively matches all available letters c which are followed by any word-character.
- This is okay in string cccca, where the c*+ would match all the letters c and the \w would then match the letter a.
- However, in a string ccccc, the pattern would fail to match, because all letters c would be eaten by the possessive c*+ and the \w would be left with nothing to match.
- Note: There is no lazy possessive interval matching, since that would not make much sense. :)
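The difference between greedy, lazy and possessive matching can be sketched in Python's re module. Since only newer Python versions (3.11 and later) accept the c*+ syntax, the possessive case is emulated here with a look-ahead plus back-reference, (?=(c*))\1, which is an assumption of this sketch and not TED Notepad syntax.

```python
import re

# Greedy: c* takes as many letters c as it can.
greedy = re.match(r"abc*", "abcccc").group(0)

# Lazy: c*? takes as few letters c as it can, here none at all.
lazy = re.match(r"abc*?", "abcccc").group(0)

# Counted repetition: at least 3 and at most 5 letters c.
counted = re.match(r"abc{3,5}", "abcccccccc").group(0)

# Possessive c*+ emulated as (?=(c*))\1, which never gives letters back.
possessive = r"(?=(c*))\1\w"
possessive_ok = re.search(possessive, "cccca")
possessive_fails = re.search(possessive, "ccccc")

# The ordinary greedy c* back-tracks and leaves one c for the \w.
greedy_backtracks = re.search(r"c*\w", "ccccc")

print(greedy, lazy, counted)
print(possessive_ok.group(0), possessive_fails, greedy_backtracks.group(0))
```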
Zero-length assertions
Besides the various wildcards and repetitions described above, regular expressions also offer a way to verify that at some given point during matching, a condition about nearby context is true. The following constructs, which verify such context conditions, never match any characters themselves, i.e. they are always zero-length. They allow the containing pattern to match successfully only if the relevant condition is true (at the point where they appear in the pattern).
- ^
- Matches only at line beginnings.
- Example: Pattern ^abc matches any abc that is placed right at the beginning of a line.
- Any abc occurrences that are preceded by other characters on a line are not matched.
- $
- Matches only at line ends.
- Example: Pattern abc$ matches any abc that is placed right at the end of a line.
- Any abc occurrences that are followed by other characters on a line are not matched.
- Example: Pattern ^abc$ matches whole lines which contain abc and nothing else.
- \p
- Matches only at paragraph beginnings.
- Example: Pattern \pabc matches any abc that is placed at the beginning of a paragraph.
- \P
- Matches only at paragraph ends.
- Example: Pattern \pabc\P matches whole paragraphs which contain abc and nothing else.
- \A
- Matches only at document beginning.
- Example: Pattern \Aabc matches one abc, if the current document starts with that abc.
- \Z
- Matches only at document end.
- Example: Pattern abc\Z matches one abc, if the current document ends with that abc.
- Example: Pattern \Aabc\Z would match only if the entire content of the current document would be abc.
- \b
- Matches only at word boundary, i.e. word beginning or word end, i.e. one of the characters around the current matching position must be a word character and the other must not.
- Example: Pattern \babc matches any abc, if that abc is the beginning of a word.
- Example: Pattern \babc\b matches any abc, if that abc is the whole word.
- Example: Pattern \b\w*\b matches whole words.
- \B
- Matches only at word no-boundary, i.e. inside of words between two word characters, or outside words, if there is no word character nearby.
- Actually, pattern \B matches exactly at those positions where \b cannot match.
- \y
- Matches only at word beginning, i.e. the character after the current matching position must be a word character, and the current position must be a word boundary.
- Example: Pattern \y. matches the first character of every word, whatever that character is.
- \Y
- Matches only at word end, i.e. the character before the current matching position must be a word character, and the current position must be a word boundary.
- Example: Pattern \y\w*\Y matches whole words.
- Example: Pattern \Y\w*\y can never match.
- \G
- Matches only at the original starting position. Guarantees that this construct matches only at the position within the document where the search started. The starting position is usually the position of the caret before the search, also indicated by the Status Bar.
- Example: Pattern abc\Gxy matches abcxy if the caret is preceded by abc and followed by xy upon invoking the search.
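A few of the assertions above have close counterparts in Python's re module and can be sketched there. Two assumptions apply: Python needs the re.MULTILINE flag before ^ and $ work per line, and the constructs \p, \P, \y, \Y and \G have no direct Python equivalent, so they are not shown. The sample texts are made up.

```python
import re

text = "abc one\nxabc\nabc"  # three made-up lines

# ^ and $ work per line only when re.MULTILINE is set in Python.
line_starts = [m.start() for m in re.finditer(r"^abc", text, re.MULTILINE)]
whole_lines = re.findall(r"^abc$", text, re.MULTILINE)

# \b requires a word character on exactly one side of the position.
whole_words = re.findall(r"\babc\b", "abc abcd xabc abc.")

# \A and \Z anchor to the beginning and the end of the whole text.
whole_text = re.search(r"\Aabc\Z", "abc") is not None
not_whole_text = re.search(r"\Aabc\Z", "abc abc") is not None

print(line_starts, whole_lines, whole_words, whole_text, not_whole_text)
```

None of these constructs consume characters: line_starts holds only positions, and the matched text is always contributed by the abc part of the pattern.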
Character classes
Character classes can be used to define special sets of characters. While regular expressions offer some common character sets as part of the language (e.g. \w or \W, \d or \D, etc.), it is also possible to use any user-defined sets. For example, [abz] creates a user-defined character class which includes only letters a, b and z. This user-defined set of characters now matches only one of these three letters, very much like a \d would match one single digit.
The following syntax is used to specify a user-defined character class within patterns:
- [
- Opens character class definition.
- Each opened character class definition requires a closing ] bracket as described below.
- Everything until such ] closing bracket is considered part of the character class definition and must comply with character class definition syntax as described in this section.
- Note that syntax for entering character classes differs substantially from pattern syntax. For example, a dot . has no special meaning inside character class definition, while within a pattern it matches almost anything.
- ]
- Closes character class definition.
- Note: If a ] bracket appears after the opening [ bracket of a new character class definition, it is considered an ordinary ] character.
- Pattern []] is equivalent to [\]] and can be used to match a single ]. Pattern [^]] is equivalent to [^\]] and can be used to match any single character but ].
- Note that [] and [^] are both invalid patterns. Both are considered to contain an unclosed character class definition.
- [^ ... ]
- Negates the whole character class definition.
- The ^ must appear right after the opening [ bracket to be recognized.
- Negated character class matches all characters not mentioned in that character class definition, except for newline.
- Example: Pattern [^abc] matches any character except letter a, b, c and newline.
- Newline is never part of any negated character class, mostly for user convenience.
- Syntax inside character class definitions
- a, b, c, ..., z
- 0, 1, 2, ..., 9
- Ordinary alpha-numeric characters have no special meaning inside character class definition; they always stand for themselves.
- Example: Pattern [abc] matches one character, which can be either letter a, b or c.
- Example: Pattern [abc]* adds the repetition quantifier * to the class [abc], and thus matches a whole sequence of characters, which can consist of any combination of letters a, b and c.
- A-Z
- Range of multiple characters from A to Z, where the A and Z can be any two characters from the ASCII table.
- Non-ASCII characters can also be used, however, the results might be somewhat unexpected, especially in Unicode versions.
- Example: Pattern [a-c] matches one character, which can be either letter a, b or c.
- Example: Pattern [a-z] matches one character, which can be any US-ASCII letter a...z.
- Example: Pattern [a-z]+ matches any word consisting of US-ASCII letters a...z.
- The dash must be used between two actual characters to be recognized as a valid range. Otherwise it is considered an ordinary dash - character.
- If used right after the opening [ bracket, it is considered an ordinary dash - character.
- Example: Pattern [-a] matches either letter a or dash -.
- If used right before the closing ] bracket, it is considered an ordinary dash - character.
- Example: Pattern [a-] matches either letter a or dash -.
- If chained, i.e. using two dashes around one character, the second dash is considered an ordinary dash - character.
- Example: Pattern [a-c-z] matches either letter a, b, c, or z or dash -.
- Example: Pattern [a-c-z] is equivalent to [a-cz-]. It may be considered bad practice to chain dashes like this.
- Using a reversed character range, i.e. pattern [z-a], is allowed in TED Notepad, but not recommended. It is considered bad practice.
- \s
- All white-space characters.
- Example: Pattern [\s_]+ matches any sequence of white-spaces and underscores.
- \S
- All non-white-space characters.
- Newline is part of neither \s nor \S. Otherwise, \S is a complement of \s.
- Example: Pattern [\s\S] matches any character except for newline.
- \t
- Adds a horizontal tab character (TAB) to the class.
- \f
- Adds a form feed character (FF) to the class.
- \a
- Adds a bell character (BEL) to the class.
- \v
- Adds a vertical tab (VT) to the class.
- \e
- Adds an escape character (ESC) to the class.
- \b
- Adds a backspace character (BS) to the class.
- \0
- Adds a null character (NUL) to the class.
- \n
- Adds a line feed character (NL) to the class.
- \r
- Adds a carriage return character (CR) to the class.
- Note: Adding line feed and carriage return characters to a class is not the same as matching newlines. While a newline pattern matches any combination of valid newline characters as a single newline, a character class with line feed and carriage return characters always matches individual characters and thus potentially breaks newlines that consist of multiple characters.
- Example: Pattern [\w\W\r\n] matches any single character.
- \xhh
- Adds a character given in hh hex notation.
- \uhhhh
- Adds a character given in hhhh unicode notation (unicode version only).
- \cK
- Adds a character given in control notation. The letter K can be any capital letter from A to Z, where A stands for \x01, B stands for \x02, C stands for \x03, etc.
- \\
- Adds a \ to the class. Note: Unescaped single \ has a special meaning of introducing escapes.
- \^
- Adds a ^ to the class. Note: Unescaped single ^ has a special meaning, though only at the very beginning of a class.
- \[
- Adds a [ to the class. Note: Unescaped single [ has a special meaning, if followed by a : character.
- \]
- Adds a ] to the class. Note: Unescaped single ] has a special meaning, unless it is at the very beginning of a class.
- \-
- Adds a - to the class. Note: Unescaped single - might have a special meaning, unless it is at the very beginning or very end of a class.
- \:
- Adds a : to the class. Note: Unescaped single : has a special meaning, if following a [ character.
- [:posix:]
- Adds a standard POSIX character class to the class.
- [:alpha:], [:alnum:], [:blank:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], [:xdigi:].
- Example: Pattern [[:alnum:]] matches any single POSIX alphanumeric character.
- Example: Pattern [[:alpha:][:digit:]] matches any single POSIX letter or digit.
- Example: Pattern [[:xdigi:]:-]+ matches a sequence of hexadecimal numbers, colons and dashes.
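The basic class syntax can be tried in Python's re module, again only as an approximation. Two assumed differences are worth keeping in mind: a negated class in Python also matches a newline, and Python's re does not support the [:posix:] classes, so neither is exercised in this sketch.

```python
import re

# [abz] matches exactly one of the listed letters.
one_of = re.findall(r"[abz]", "a-b-c-z")

# [a-z]+ matches runs of US-ASCII lowercase letters.
runs = re.findall(r"[a-z]+", "ted Notepad 6")

# A dash right after the opening [ is an ordinary dash.
dash_or_a = re.findall(r"[-a]", "a-b")

# [^abc] matches any one character that is not listed.
negated = re.findall(r"[^abc]", "abcd-e")

# A ] right after the opening [ is an ordinary closing bracket.
bracket = re.findall(r"[]]", "x]y")

print(one_of, runs, dash_or_a, negated, bracket)
```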
Capture groups
- (
- Begins a new capture group. Capture groups are useful for back-references in both search and replace patterns.
- )
- Ends current capture group. Note: Capture groups can be nested.
- \1, \2, ..., \9
- Back-reference to a specific captured group. Matches exactly the same text as was previously matched by the given capture group. Note that this does not try to repeat the sub-pattern within that capture group, but matches against the specific text matched by that group.
- Iteration through a capture group and a back-reference to it may be repeated several times upon matching by an interval on an enclosing group. In such case, the back-reference always matches against the most recent text captured by the capture group.
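The point that a back-reference repeats the captured text, not the sub-pattern, can be sketched in Python's re module, which uses the same \1 notation. The sample texts are made up.

```python
import re

# (\w)\1 matches a word character followed by the very same character.
doubled = [m.group(0) for m in re.finditer(r"(\w)\1", "a letter in a book")]

# (\w+) \1 needs the same word twice, not just any two words.
same_word = re.search(r"(\w+) \1", "ab ab") is not None
other_word = re.search(r"(\w+) \1", "ab cd") is not None

print(doubled, same_word, other_word)
```

If \1 merely repeated the sub-pattern \w+, the text ab cd would match as well; it does not, because cd differs from the captured ab.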
Alternations
Alternations allow the user to combine two or more separate patterns into one bigger pattern, matching whatever any of those separate patterns would match. Alternations thus provide a logical or in the pattern syntax, meaning that the bigger pattern can match either what one separate pattern would match or what another would match.
- |
- Divides pattern alternations. As long as any of the alternations matches, the entire pattern matches.
- Example: abc|xy can match either abc or xy.
- Note: Enclosing groups always demarcate borders for alternations.
- Example: (abc|xy)2 can match either abc2 or xy2.
- Example: (abc|xy)2|uw can match either abc2 or xy2 or uw.
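How groups demarcate the borders of alternations can be checked with a short Python sketch. The candidate strings are made up, and fullmatch is used so that each string has to be matched as a whole.

```python
import re

grouped = re.compile(r"(abc|xy)2|uw")
candidates = ["abc2", "xy2", "uw", "abc", "xy2uw"]

# Strings accepted as a whole by (abc|xy)2|uw.
accepted = [s for s in candidates if grouped.fullmatch(s)]

# Without the group, the | splits the entire pattern into abc and xy2.
ungrouped = [s for s in ["abc", "xy2", "abc2"] if re.fullmatch(r"abc|xy2", s)]

print(accepted)
print(ungrouped)
```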
Cluster groups
- (?:pattern)
- Cluster group. Cluster groups are non-capturing groups. They act like capturing groups, but do not consume resources for capturing and do not consume capture group numbers for back-referencing.
- (?|pattern)
- Branch reset cluster group, also called Alternation reset cluster group. Inside a branch reset group, capture groups are numbered from the same starting group number in each alternation. Thus several capture groups are assigned the same group number, and then, depending on which alternation actually matches, the number references the correct matching capture group.
- Example: In (abc)|(xy)|(ted), the abc is matched by the capture group number 1 and can be referenced by \1, the xy is matched by the capture group number 2 and can be referenced by \2, and the ted is matched by the capture group number 3 and can be referenced by \3. However, in (?|(abc)|(xy)|(ted)), either abc or xy or ted is matched by the capture group number 1 and can be referenced by \1. This is because there are three capture groups with number 1 thanks to the enclosing branch reset cluster group. Note: The cluster group itself is not a capture group and does not take a number and cannot be back-referenced.
- Example: In (before)(?|(abc)|(xy)|(ted)), the before is matched by the capture group number 1, and either abc or xy or ted is matched by the capture group number 2.
- Example: In (?|(abc)|(xy)(z)|(ted)(npad))(after), either abc or xy or ted is matched by the capture group 1, either z or npad is matched by the capture group 2, and the after is matched by the capture group 3. The numbering of groups following a branch reset cluster group continues according to which branch used up the most group numbers.
- Note: If abc is matched by the capture group 1, then capture group number 2 is undefined and cannot be referenced, since there is no capture group number 2 in that particular alternation. The after is still matched by the capture group number 3.
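Python's re module does not offer the (?|pattern) group, so the following sketch only illustrates the plain numbering that a branch reset group avoids: in (abc)|(xy)|(ted) each alternative gets its own group number, and the caller has to find out which one matched. The helper name matched_group_number is hypothetical.

```python
import re

plain = re.compile(r"(abc)|(xy)|(ted)")


def matched_group_number(text):
    """Return the number of the capture group that matched, or None."""
    m = plain.fullmatch(text)
    return m.lastindex if m else None


numbers = {s: matched_group_number(s) for s in ["abc", "xy", "ted"]}

# Groups belonging to alternatives that did not match stay undefined.
unused_group = plain.fullmatch("xy").group(1)

print(numbers, unused_group)
```

With a branch reset group, all three alternatives would report group number 1 instead of 1, 2 and 3.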
Other groups
- (?>pattern)
- Possessive independent give-nothing-back sub-pattern. This cluster group effectively prevents back-tracking upon matching. Note: Back-tracking is allowed inside of the group before the group ends, but once the group matches as an independent sub-pattern, further back-tracking inside of that group is not performed and the group can only be unmatched at that point as a whole. This is like a grab all you can, and then give nothing back operator.
Look-behind assertion
- \K
- Removes everything that is to the left of the current matching position from the \& replace back-reference. This effectively provides a look-behind assertion, since it can be used to verify that a match is preceded by some other pattern.
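Python's re module has no \K, but the effect described above can be approximated in two ways. The pattern price=\K\d+ mentioned in the comments is a hypothetical example written for this sketch, as is the sample text.

```python
import re

text = "price=10 cost=20"  # made-up sample text

# Hypothetical TED Notepad search: price=\K\d+ replaced with 99.
# Sketch 1: a look-behind checks the prefix without matching it.
with_lookbehind = re.sub(r"(?<=price=)\d+", "99", text)

# Sketch 2: capture the prefix and put it back in the replacement.
with_capture = re.sub(r"(price=)\d+", r"\g<1>99", text)

print(with_lookbehind)
print(with_capture)
```

In both sketches only the digits after price= are replaced, while the prefix itself is left in place.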
More escape sequences
Since many characters have special meanings in regular expressions, escapes are provided to allow using these characters in searches.
- \\
- Matches character \. Note: Unescaped single \ has a special meaning of introducing escapes.
- \^
- Matches character ^. Note: Unescaped single ^ has a special meaning of matching at line beginnings.
- \$
- Matches character $. Note: Unescaped single $ has a special meaning of matching at line ends.
- \.
- Matches character .. Note: Unescaped single . has a special meaning of matching almost anything.
- \|
- Matches character |. Note: Unescaped single | has a special meaning of dividing alternations or starting a branching cluster group.
- \(
- Matches character (. Note: Unescaped single ( has a special meaning of starting a capture group.
- \)
- Matches character ). Note: Unescaped single ) has a special meaning of ending a capture group.
- \[
- Matches character [. Note: Unescaped single [ has a special meaning of starting a character class.
- \]
- Matches character ]. Note: Unescaped single ] has a special meaning of ending a character class.
- \*
- Matches character *. Note: Unescaped single * has a special meaning of repetition quantifier.
- \+
- Matches character +. Note: Unescaped single + has a special meaning of repetition quantifier.
- \?
- Matches character ?. Note: Unescaped single ? has a special meaning of repetition quantifier or starting a cluster group.
- \{
- Matches character {. Note: Unescaped single { has a special meaning of starting a repetition quantifier.
- \}
- Matches character }. Note: Unescaped single } has a special meaning of ending a repetition quantifier.
- \<
- Matches character <. Note: Unescaped single < might have a special meaning in the future.
- \>
- Matches character >. Note: Unescaped single > has a special meaning of starting a possessive cluster group.
- \:
- Matches character :. Note: Unescaped single : has a special meaning of starting a simple cluster group.
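The effect of escaping can be seen in a short Python sketch; the escapes used here, \( \) and \., behave the same way in Python's re module. The sample texts are made up.

```python
import re

# \( \) and \. stand for the literal characters ( ) and .
literal_match = re.search(r"\(\d+\.\d+\)", "area (3.14) cm").group(0)

# Without the backslash, the dot matches almost any character.
loose = re.findall(r"1.5", "1.5 1x5 125")

# With the backslash, only a real dot is accepted.
strict = re.findall(r"1\.5", "1.5 1x5 125")

print(literal_match, loose, strict)
```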
Replace patterns
Any of these constructs may appear anywhere in the replace pattern, as long as regular expressions are turned on.
- \\
- Inserts a backslash.
- \n
- Inserts a newline sequence (CR, NL or CR/NL; depends on current document options).
- \t
- Inserts a horizontal tab character (TAB).
- \f
- Inserts a form feed character (FF).
- \a
- Inserts a bell character (BEL).
- \v
- Inserts a vertical tab (VT).
- \e
- Inserts an escape character (ESC).
- \0
- Inserts a null character (NUL).
- \xhh
- Inserts a character in hex notation.
- \uhhhh
- Inserts a character in unicode notation (unicode version only).
- \cA
- Inserts a character in control notation.
- \Q ... \E
- Quoted string. Anything between \Q and \E is treated as plain-text string and is inserted exactly as it appears in the pattern.
- \&
- Back-reference to the entire match.
- \1, \2, ..., \9
- Back-reference to a specific captured group.
- \+
- Back-reference to the last successfully captured group. Consider having several alternations, each with a group inside it. Only one of the alternations will match, thus only one of those groups will be valid upon replacing. This back-reference allows referencing the correct one of those groups, based on which of the alternations matched.
- Note: This can also be achieved by using branch reset cluster groups.
- \L&
- Inserts the length of the entire match.
- \L1, \L2, ..., \L9
- Inserts the length of a specific captured group.
- \L+
- Inserts the length of the last successfully captured group. See \+ above for details.
- \#d
- Inserts the number of the current replacement as a decimal number.
- Note: This replace pattern only makes sense during Replace All or in the Extended Replace tool.
- When using Replace on individual replacements, the replacement number is always 1.
- \#x
- Inserts the number of the current replacement as a lowercase hexadecimal number.
- \#X
- Inserts the number of the current replacement as an uppercase hexadecimal number.
- \#b
- Inserts the number of the current replacement as a binary number.
- \#o
- Inserts the number of the current replacement as an octal number.
- \*d
- Inserts a random decimal number.
- \*x
- Inserts a random lowercase hexadecimal number.
- \*X
- Inserts a random uppercase hexadecimal number.
- \*b
- Inserts a random binary number.
- \*o
- Inserts a random octal number.
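Some of these replace constructs have no direct counterpart in Python's re module, but their effect can be sketched with a replacement function. In this sketch, m.group(0) plays the role of \&, len() plays the role of \L&, and a counter plays the role of \#d; the helper name numbered_replace and the output format are hypothetical.

```python
import re
from itertools import count


def numbered_replace(pattern, text):
    """Replace each match with '<number>:<match>(<length>)'."""
    counter = count(1)  # replacement number, like \#d, starting at 1

    def build(m):
        whole = m.group(0)  # the entire match, like \&
        return f"{next(counter)}:{whole}({len(whole)})"  # len() like \L&

    return re.sub(pattern, build, text)


result = numbered_replace(r"\w+", "ted notepad")

# Back-references to captured groups use the same \1, \2 notation.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")

print(result)
print(swapped)
```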
Expression examples
- Find e-mail address (rather simplified method)
- Find [a-z0-9.+-]+@[a-z0-9.-]+
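The e-mail pattern above can be tried in Python as follows. The sample text and addresses are made up, and the second result shows why the method is called rather simplified.

```python
import re

email = re.compile(r"[a-z0-9.+-]+@[a-z0-9.-]+")
text = "write to jane.doe+news@example.org or admin@example.com."

found = email.findall(text)
print(found)
```

Since the dot is part of the second character class, the full stop that ends the sentence is matched as part of the last address.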
Words
- Find words beginning with KEY
- Find \bKEY\w*\b
- Find words beginning with letters K or R or M
- Find \b[KRM]\w*\b
- Find words ending with KEY
- Find \b\w*KEY\b
- Find words ending with letters K or R or M
- Find \b\w*[KRM]\b
- Find words containing KEY
- Find \b\w*KEY\w*\b
- Find words containing no letter K
- Find \b[^K\W]+\b
- Find words 3 to 5 characters long
- Find \b\w{3,5}\b
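Several of the word patterns above can be run side by side in Python, where \b and \w behave closely enough for this purpose. The sample text is made up, and Python matches case-sensitively by default, which is an assumption about the search options in use.

```python
import re

text = "KEY KEYS MONKEY monKEYs Kite Room arm be"  # made-up sample text

patterns = {
    "begin with KEY": r"\bKEY\w*\b",
    "end with KEY": r"\b\w*KEY\b",
    "contain KEY": r"\b\w*KEY\w*\b",
    "contain no K": r"\b[^K\W]+\b",
    "3 to 5 long": r"\b\w{3,5}\b",
}

# Collect all matches of each pattern in the sample text.
results = {name: re.findall(p, text) for name, p in patterns.items()}

for name, words in results.items():
    print(name, words)
```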
Newlines and spaces
- Add one extra newline after each existing line
- Find $ and replace with \n
- Spacify the text - Add one extra space between each two characters
- Find nothing and replace with space
- Join each line ending with character = with the following line
- Find =\n and replace with nothing
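Two of these newline recipes can be sketched in Python. Note the assumptions: in Python, \n stands for a single line feed only, not for any newline sequence, and $ needs the re.MULTILINE flag to match at the end of every line. The sample texts are made up and do not end with a newline.

```python
import re

text = "total =\n1 + 2\nend"  # made-up sample text

# Join each line ending with = with the following line.
joined = re.sub(r"=\n", "", text)

# Add one extra newline after each existing line.
doubled = re.sub(r"$", "\n", "a\nb", flags=re.MULTILINE)

print(repr(joined))
print(repr(doubled))
```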
HTML/XML tags
- Find and remove all html/xml tags
- Find <[^>]*> and replace with nothing
- Find and remove all html/xml multi-line tags
- Find <([^>]*|\n)*> and replace with nothing
- Find all <h1> headline tags and surround the text with ==
- Find </?h1> and replace with ==
- Find and remove html/xml tags starting with word pre
- Find <pre[^>]*> and replace with nothing
- Note: This does not find closing tags, because they start with a / slash.
- Search for </?pre[^>]*> to find for both opening and closing tags.
- Find and remove html/xml tags containing word pre
- Find <[^>]*pre[^>]*> and replace with nothing
- Find all html entities, e.g. &amp; or &#39;
- Find &[#a-z0-9]+; with ignore case turned on
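The single-line tag and entity recipes above can be tried in Python as follows. The HTML snippet is made up, and re.IGNORECASE is used as the assumed counterpart of the ignore case option; the multi-line tag pattern is not shown.

```python
import re

html = "<h1>Title</h1><p class='x'>Tom &amp; Jerry</p>"  # made-up snippet

# Find and remove all html/xml tags.
no_tags = re.sub(r"<[^>]*>", "", html)

# Find all <h1> headline tags and surround the text with ==.
wiki_heading = re.sub(r"</?h1>", "==", html)

# Find all html entities, ignoring case.
entities = re.findall(r"&[#a-z0-9]+;", html, re.IGNORECASE)

print(no_tags)
print(wiki_heading)
print(entities)
```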