Perl regular expressions cheat sheet


Chapter 6.4. Regular expressions

6.4.1. Regular Expression Syntax

Regular expressions are patterns used to find particular combinations of characters in text strings and to replace them with other combinations of characters (these operations are called pattern matching and substitution, respectively). A regular expression in Perl has the form

  /pattern/modifiers

Here pattern is a string that specifies the regular expression, and modifiers are optional one-letter flags that control how the regular expression is applied.

A regular expression can consist of ordinary characters only; in that case it matches that exact sequence of characters in the string. For example, the expression /cat/ matches the substring "cat" in each of the following strings: "catalog", "tomcat", "scatter". However, the real power of Perl regular expressions comes from the special metacharacters that can be used in them.

Table 6.9. Metacharacters in regular expressions
Symbol Description
\          Before a character that is normally taken literally, indicates that the character is a metacharacter. For example, /n/ matches the letter n, while /\n/ matches a newline character.
           Before a metacharacter, indicates that the character is to be taken literally. For example, /^/ denotes the beginning of the string, while /\^/ matches the character ^ itself. /\\/ matches a backslash \.
^          Matches the beginning of the string (compare modifier m).
$          Matches the end of the string (compare modifier m).
.          Matches any character except a newline (compare modifier s).
*          Matches the preceding character repeated zero or more times.
+          Matches the preceding character repeated one or more times.
?          Matches the preceding character repeated zero or one time.
(pattern)  Matches pattern and remembers the match found.
x|y        Matches x or y.
{n}        n is a nonnegative number. Matches exactly n occurrences of the preceding character.
{n,}       n is a nonnegative number. Matches n or more occurrences of the preceding character. /x{1,}/ is equivalent to /x+/. /x{0,}/ is equivalent to /x*/.
{n,m}      n and m are nonnegative numbers. Matches at least n and at most m occurrences of the preceding character. /x{0,1}/ is equivalent to /x?/.
[xyz]      Matches any one of the characters enclosed in the brackets.
[^xyz]     Matches any character except those enclosed in the brackets.
[a-z]      Matches any character in the specified range.
[^a-z]     Matches any character outside the specified range.
\a         Matches the bell character (BEL).
\A         Matches only the beginning of the string, even with the modifier m.
\b         Matches a word boundary, i.e. the position between \w and \W in either order.
\B         Matches any position other than a word boundary.
\cX        Matches the character Ctrl+X. For example, /\cI/ is equivalent to /\t/.
\C         Matches one byte, even under the use utf8 directive.
\d         Matches a digit. Equivalent to [0-9].
\D         Matches a non-digit character. Equivalent to [^0-9].
\e         Matches the escape character (ESC).
\E         Ends the transformations \L, \Q, \U.
\f         Matches the form feed character (FF).
\G         Matches the position in the string equal to pos().
\l         Converts the next character to lowercase.
\L         Converts characters to lowercase up to \E.
\n         Matches a newline character.
\pproperty Matches Unicode characters that have the property property. If the property name is longer than one character, use the syntax \p{property}.
\Pproperty Matches Unicode characters that do not have the property property. If the property name is longer than one character, use the syntax \P{property}.
\Q         Adds the character "\" before metacharacters up to \E.
\r         Matches a carriage return character (CR).
\s         Matches a whitespace character. Equivalent to /[ \f\n\r\t]/.
\S         Matches any non-whitespace character. Equivalent to /[^ \f\n\r\t]/.
\t         Matches the tab character (HT).
\u         Converts the next character to uppercase.
\U         Converts characters to uppercase up to \E.
\w         Matches a Latin letter, a digit, or an underscore. Equivalent to /[A-Za-z0-9_]/.
\W         Matches any character except a Latin letter, a digit, or an underscore. Equivalent to /[^A-Za-z0-9_]/.
\X         Matches a sequence of Unicode characters consisting of a base character and a set of diacritical marks. Equivalent to the expression /(?:\PM\pM*)/.
\z         Matches only the end of the string, even with the modifier m.
\Z         Matches only the end of the string or the position before a newline at the end of the string, even with the modifier m.
\n         n is a positive number. Matches the n-th stored substring. If fewer than n left parentheses precede this metacharacter and n > 9, it is equivalent to \0n.
\0n        n is an octal number not greater than 377. Matches the character with octal code n. For example, /\011/ is equivalent to /\t/.
\xn        n is a hexadecimal number of two digits. Matches the character with hexadecimal code n. For example, /\x31/ is equivalent to /1/.
\x{n}      n is a hexadecimal number of four digits. Matches the Unicode character with hexadecimal code n. For example, /\x{2663}/ is equivalent to /♣/.
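
As a rough illustration of how a few of the metacharacters from the table combine, here is a minimal sketch. The sample string and all variable names are made up for this example.

```perl
use strict;
use warnings;

my $text = "Order 42: 3 cats, 12 dogs (cost: 7.50)";

# \b and {n,m}: whole numbers of one or two digits.
# "7" and "50" qualify as well, because "." is not a word character.
my @short = $text =~ /\b\d{1,2}\b/g;
print "@short\n";                      # 42 3 12 7 50

# ^ anchor, negated class [^...] and +: leading run of characters
# that are neither digits nor whitespace.
my ($first_word) = $text =~ /^([^\d\s]+)/;
print "$first_word\n";                 # Order

# \Q...\E quotes metacharacters, so "(" is matched literally.
my $literal = "(cost:";
my $found_literal = $text =~ /\Q$literal\E/ ? 1 : 0;
print "$found_literal\n";              # 1

# \u uppercases the next character, \U...\E a whole run.
my $name  = "perl";
my $shout = "\u$name and \U$name\E";
print "$shout\n";                      # Perl and PERL
```

Without \Q...\E the pattern built from $literal would contain an unmatched "(" and fail to compile, which is the usual reason to quote interpolated text.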

6.4.2. Modifiers

Different operations with regular expressions use different modifiers to refine the operation being performed. Four modifiers, however, are common to all of them.

i
Ignores case when matching against the pattern. Under the use locale directive, case folding takes the current locale settings into account.
m
Treats the source string as a buffer of several lines of text separated by newlines. This means that the metacharacters ^ and $ match not only at the beginning and end of the whole string, but also at the beginning and end of each line of text delimited by newlines.
s
Treats the source string as a single line of text, ignoring newlines. This means that the metacharacter . matches any character, including a newline.
x
Allows whitespace and comments in the pattern. Whitespace that is not preceded by \ and not enclosed in [] is ignored. The # character starts a comment, which is also ignored.
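
The effect of the four common modifiers can be sketched on a small two-line buffer. The string and the variable names below are assumptions chosen only for this illustration.

```perl
use strict;
use warnings;

my $text = "first line\nSecond LINE\n";

# i: case is ignored
my $has_line = $text =~ /second line/i ? 1 : 0;
print "$has_line\n";                         # 1

# m: ^ matches at the start of every line in the buffer
my @starts = $text =~ /^(\w+)/mg;
print "@starts\n";                           # first Second

# s: . also matches the newline, so a match can span two lines
my ($span) = $text =~ /(line.Second)/s;
print length($span), "\n";                   # 11

# x: whitespace and # comments inside the pattern are ignored
my ($word1, $word2) = $text =~ /
    ^   (\w+)    # first word
    \s+ (\w+)    # second word
/x;
print "$word1 $word2\n";                     # first line
```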

6.4.3. Unicode and POSIX character classes

Regular expressions can use the syntax

  [:class:]

where class is the name of a POSIX character class, that is, a class defined by the POSIX portability standard. This syntax is valid only inside a bracketed character class, for example [[:digit:]]. Under the use utf8 directive, Unicode character classes can be used instead of POSIX classes, with the construct

  \p{class}

The following table lists all POSIX character classes, the corresponding Unicode character classes, and the equivalent metacharacters, if any.

Table 6.10. Character classes
POSIX    Unicode    Metacharacter  Description
alpha    IsAlpha                   Letters
alnum    IsAlnum                   Letters and digits
ascii    IsASCII                   ASCII characters
cntrl    IsCntrl                   Control characters
digit    IsDigit    \d             Digits
graph    IsGraph                   Letters, digits, and punctuation marks
lower    IsLower                   Lowercase letters
print    IsPrint                   Letters, digits, punctuation marks, and the space
punct    IsPunct                   Punctuation marks
space    IsSpace    \s             Whitespace characters
upper    IsUpper                   Uppercase letters
word     IsWord     \w             Letters, digits, and the underscore
xdigit   IsXDigit                  Hexadecimal digits

For example, a decimal number can be specified in any of the following three ways:

  /\d+/
  /[[:digit:]]+/
  /\p{IsDigit}+/    # use utf8

To indicate that a character does not belong to a given class, use the constructs

  [:^class:]
  \P{class}

For example, the expressions in each row below have the same meaning:

  [[:^digit:]]  \D  \P{IsDigit}
  [[:^space:]]  \S  \P{IsSpace}
  [[:^word:]]   \W  \P{IsWord}
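
The equivalences above can be checked with a short sketch. It assumes an ASCII sample string, for which all three notations select the same characters; the variable names are invented for the example.

```perl
use strict;
use warnings;

my $s = "abc 123 def";

# Three spellings of "a run of digits".
my @by_escape  = $s =~ /\d+/g;
my @by_posix   = $s =~ /[[:digit:]]+/g;    # POSIX class inside a bracketed class
my @by_unicode = $s =~ /\p{IsDigit}+/g;
print "@by_escape | @by_posix | @by_unicode\n";   # 123 | 123 | 123

# Three spellings of "a run of non-whitespace characters".
my @n_posix   = $s =~ /[[:^space:]]+/g;
my @n_escape  = $s =~ /\S+/g;
my @n_unicode = $s =~ /\P{IsSpace}+/g;
print "@n_posix | @n_escape | @n_unicode\n";
# abc 123 def | abc 123 def | abc 123 def
```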

6.4.4. Memorizing substrings

Parentheses in a regular expression cause the substring matched by the parenthesized pattern to be stored in a special buffer. To refer to the n-th stored substring inside the regular expression, use the construct \n, and outside it $n, where n is any number starting from 1. Remember, however, that Perl also uses \10, \11, and so on as synonyms for the octal character codes \010, \011, and so on. The ambiguity is resolved as follows. \10 is treated as a reference to the 10th stored substring if at least ten left parentheses precede it in the regular expression; otherwise it is the character with octal code 10. The metacharacters \1, ... \9 are always treated as references to stored substrings. Examples:

  if (/(.)\1/) {                   # look for the first doubled character
    print "'$1' is the first doubled character\n";
  }
  if (/Time: (..):(..):(..)/) {    # extract the time components
    $hours   = $1;
    $minutes = $2;
    $seconds = $3;
  }

In addition to the variables $1, $2, ... there are several special variables that hold the results of the last regular expression operation:

Variable  Description
$&        The last matched substring.
$`        The substring preceding the last matched substring.
$'        The substring following the last matched substring.
$+        The last stored substring.

For example:

  'AAA111BBB222' =~ /(\d+)/;
  print "$`\n";   # AAA
  print "$&\n";   # 111
  print "$'\n";   # BBB222
  print "$+\n";   # 111

All these special variables keep their values until the end of the enclosing block or until the next successful pattern match.
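
The lifetime rule for these variables can be sketched as follows. The string and the @seen array are assumptions made for this example; the point is only that a match inside a nested block does not disturb the values seen after that block ends, and that a failed match leaves them untouched.

```perl
use strict;
use warnings;

my $s = 'alpha beta';
my @seen;

if ($s =~ /(\w+)/) {                 # $1 is 'alpha'
    push @seen, $1;
    {
        if ($s =~ /\s(\w+)/) {       # inside this block $1 is 'beta'
            push @seen, $1;
        }
    }
    push @seen, $1;                  # 'alpha' again once the inner block has ended

    my $failed = $s =~ /(\d+)/;      # no digits: the match fails
    push @seen, $1;                  # still 'alpha', a failed match changes nothing
}
print "@seen\n";                     # alpha beta alpha alpha
```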

6.4.5. Extended patterns

Perl provides several additional constructs that can be used in regular expressions to extend their capabilities. All of these constructs are enclosed in parentheses and begin with the character ?, which distinguishes them from capturing groups.

(?#text)
A comment. The whole construct is ignored.
(?modifiers-modifiers)
Turns the specified modifiers on or off. Modifiers before the - character are turned on, those after it are turned off. Example:
  if (/aaa/) {...}      # case-sensitive matching
  if (/(?i)aaa/) {...}  # case-insensitive matching
(?:pattern)
(?modifiers-modifiers:pattern)
Groups subexpressions of a regular expression without remembering the match found. The second form additionally turns the specified modifiers on or off. For example, the expression /ко(?:т|шка)/ is a shorthand for the expression /кот|кошка/.
(?=pattern)
Positive lookahead, without remembering the match found. For example, the expression /Windows (?=95|98|NT|2000)/ matches "Windows" in the string "Windows 98", but not in the string "Windows 3.1". After a match, the search continues from the position following the match found; the lookahead part is not counted.
(?!pattern)
Negative lookahead, without remembering the match found. For example, the expression /Windows (?!95|98|NT|2000)/ matches "Windows" in the string "Windows 3.1", but not in the string "Windows 98". After a match, the search continues from the position following the match found; the lookahead part is not counted.
(?<=pattern)
Positive lookbehind, without remembering the match found. For example, the expression /(?<=\t)\w+/ matches a word that follows a tab character, and the tab character is not included in $&. The fragment matched by a lookbehind must have a fixed width.
(?<!pattern)
Negative lookbehind, without remembering the match found. For example, the expression /(?<!\t)\w+/ matches a word that is not preceded by a tab character. The fragment matched by a lookbehind must have a fixed width.
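
The lookaround constructs described above can be tried out with a small sketch. The list of strings and the variable names are assumptions for this example only.

```perl
use strict;
use warnings;

my @os = ('Windows 98', 'Windows 3.1', 'Windows NT');

# Positive lookahead: keep strings where one of the listed versions follows.
my @listed = grep { /Windows (?=95|98|NT|2000)/ } @os;
print join(', ', @listed), "\n";        # Windows 98, Windows NT

# Negative lookahead: keep strings where none of them follows.
my @other = grep { /Windows (?!95|98|NT|2000)/ } @os;
print join(', ', @other), "\n";         # Windows 3.1

# The lookahead text is not part of the match itself.
my ($matched) = 'Windows 98' =~ /(Windows (?=95|98|NT|2000))/;
print "[$matched]\n";                   # [Windows ]

# Lookbehind: the word after a tab, without the tab.
my ($after_tab) = "key\tvalue" =~ /((?<=\t)\w+)/;
print "$after_tab\n";                   # value

# Inline modifier and non-capturing group.
my $ci = 'AAA' =~ /(?i)aaa/ ? 1 : 0;
my @groups = 'industries' =~ /industr(?:y|ies)/;
print "$ci ", scalar(@groups), "\n";    # 1 1
```

The last line prints 1 for @groups because a successful match whose pattern has no capturing parentheses returns the list (1); the (?:...) group stores nothing.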

6.4.6. Operations with regular expressions

Until now we have enclosed regular expressions in the characters //. In fact, the delimiters of a regular expression are determined by the quote-like operator applied to it. This section describes all Perl operations on regular expressions in detail.

6.4.6.1. Pattern matching

  Syntax: /pattern/modifiers
          m/pattern/modifiers

This operation matches the specified string against the pattern pattern and returns true or false depending on the result. The string to be matched is given by the left operand of the =~ or !~ operation, for example:

  $mynumber = '12345';
  if ($mynumber =~ /^\d+$/) {   # if the string $mynumber consists of decimal digits, then ...
    ...
  }

If no string is specified, the contents of the special variable $_ are matched. In particular, the previous example can be rewritten as follows:

  $_ = '12345';
  if (/^\d+$/) {
    ...
  }

In addition to the standard modifiers, the following can be used here:

Modifier  Description
c         Do not reset the search position when a match fails (only with the g modifier).
g         Global matching, that is, a search for all occurrences of the pattern.
o         Compile the regular expression only once.

If the regular expression is enclosed in //, the initial m is optional. The form with the initial m lets us use as delimiters any characters allowed in quote-like operators. Useful special cases:

  • If the delimiters are '', the pattern is not interpolated. Otherwise the pattern is interpolated, and if it contains variables, it is recompiled each time it is evaluated. To avoid this, use the o modifier (provided, of course, that you are sure the values of the variables in the pattern do not change).
  • If the delimiters are ??, the one-time matching rule applies.

If pattern is an empty string, the last successfully matched regular expression is used instead.

If the g modifier is not given and the result of the match is assigned to a list, an empty list is returned when the match fails. The result of a successful match depends on whether the pattern contains parentheses. If it does not, the list (1) is returned. Otherwise the list consisting of the values of $1, $2, and so on is returned, that is, the list of all stored substrings. The following example

  ($w1, $w2, $rest) = ($x =~ /^(\S+)\s+(\S+)\s*(.*)/);

puts the first word of the string $x into the variable $w1, the second word into the variable $w2, and the remainder of the string into the variable $rest.
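
The three possible outcomes of a list-context match without g can be seen side by side in the following sketch. The sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $x = "hello brave new world";

# Parentheses present: the list of stored substrings is returned.
my ($w1, $w2, $rest) = ($x =~ /^(\S+)\s+(\S+)\s*(.*)/);
print "$w1 | $w2 | $rest\n";            # hello | brave | new world

# No parentheses: a successful match returns the list (1).
my @ok = ($x =~ /brave/);
print scalar(@ok), " element: $ok[0]\n";   # 1 element: 1

# Failed match: the empty list.
my @none = ($x =~ /(\d+)/);
print scalar(@none), "\n";              # 0
```

Because a failed match yields an empty list, the assignment itself can be used as a condition, for example `if (my ($n) = $x =~ /(\d+)/) { ... }`.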

The g modifier enables global matching, that is, a search for all matches in the string. Its behavior depends on the context. If the result of the match is assigned to a list, a list of all stored substrings is returned. If the pattern contains no parentheses, a list of all matches of the pattern is returned, as if the whole pattern were enclosed in parentheses. The following example

  $_ = "12:23:45";
  @result = /\d+/g;
  foreach $elem (@result) {
    print "$elem\n";
  }

prints the lines 12, 23, and 45.

In scalar context, a match with the g modifier looks for the next match of the pattern each time it is executed and returns true or false depending on the result of the search. The position in the string after the last match can be read or changed with the pos() function. A failed search normally resets the search position to the beginning of the string, but this can be avoided by adding the c modifier. Modifying the string also resets its search position.
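
A minimal sketch of stepping through a string in scalar context follows. The string and variable names are assumptions for this example; note that after a reset pos() returns undef, and the next search starts from the beginning.

```perl
use strict;
use warnings;

my $s = "12:23:45";
my @found;

# Each iteration finds the next match and advances pos($s).
while ($s =~ /(\d+)/g) {
    push @found, "$1 ends at " . pos($s);
}
print join(', ', @found), "\n";   # 12 ends at 2, 23 ends at 5, 45 ends at 8

# The failed search that ended the loop reset the position.
my $after_loop = defined pos($s) ? pos($s) : 'undef';
print "$after_loop\n";            # undef

my $hit  = $s =~ /\d+/g;          # matches "12", pos($s) becomes 2
my $miss = $s =~ /[a-z]+/gc;      # fails, but c keeps pos($s) at 2
my $kept = pos($s);
print "$kept\n";                  # 2

pos($s) = 3;                      # move the search position by hand
my $next = $s =~ /(\d+)/g ? $1 : undef;
print "$next\n";                  # 23
```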

Further possibilities are provided by the \G metacharacter, which is meaningful only together with the g modifier. This metacharacter matches the current search position in the string. The construct m/\G.../gc is convenient, in particular, for writing lexical analyzers that perform different actions for the different tokens found in the text being analyzed. The following example

  $_ = 'Word1, word2, and 12345.';
  LOOP:
  {
    print("number "),  redo LOOP if /\G\d+\b[,.]?\s*/gc;
    print("word "),    redo LOOP if /\G[A-Za-z0-9]+\b[,.]?\s*/gc;
    print("unknown "), redo LOOP if /\G[^A-Za-z0-9]+/gc;
  }

prints the string word word word number.

6.4.6.2. One-time pattern matching

  Syntax: ?pattern?
          m?pattern?

This construct is exactly analogous to m/pattern/, with one difference: a successful match occurs only once between calls to the reset() function. Since Perl 5.22 only the m?pattern? form is accepted; the bare ?pattern? form is a syntax error. One-time matching is convenient, for example, when we need to find only the first occurrence of a pattern in each file of a set being processed:

  while (<>) {
    if (m?^$?) {
      ...          # process the first empty line of the file
    }
  } continue {
    reset if eof;  # reset the ?? state for the next file
  }
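
The same rule can be observed without reading files. In the sketch below, first_with_digit is a hypothetical helper written for this example; it wraps a single m?...? operation so that the effect of reset is visible across calls.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the items for which the one-time match succeeded.
sub first_with_digit {
    my @hits;
    for my $item (@_) {
        push @hits, $item if $item =~ m?\d?;
    }
    return @hits;
}

my @first  = first_with_digit('a1', 'b2', 'c3');   # only the first success counts
my @second = first_with_digit('d4', 'e5');         # the match is already used up
reset;                                             # re-arm the m?? matches of this package
my @third  = first_with_digit('d4', 'e5');

print "first:  @first\n";     # first:  a1
print "second: @second\n";    # second:
print "third:  @third\n";     # third:  d4
```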

6.4.6.3. Creating a Regular Expression

  Syntax: qr/string/modifiers

This construct creates a regular expression with the text string and the given modifiers and compiles it. If the delimiters are '', string is not interpolated. Otherwise the pattern is interpolated, and if it contains variables, it is recompiled each time it is evaluated. To avoid this, use the o modifier (provided, of course, that you are sure the values of the variables in the pattern do not change).

Once created, a regular expression can be used both on its own and as a fragment of other regular expressions. Examples:

  $re = qr/\d+/;
  $string =~ /\s*${re}\s*/;   # include in another regular expression
  $string =~ $re;             # independent use
  $string =~ /$re/;           # same

  $re = qr/$header/is;
  s/$re/text/;                # same as s/$header/text/is
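
A small sketch of composing a larger pattern from precompiled pieces is shown below. The names $num and $word and the sample string are assumptions for this example.

```perl
use strict;
use warnings;

my $num  = qr/\d+/;
my $word = qr/[a-z]+/i;      # the i modifier travels with the compiled fragment

my $s = "Width = 120";

# Both fragments are embedded in a larger pattern; $word stays case-insensitive
# even though the enclosing pattern has no i modifier.
my ($name, $value) = $s =~ /^($word)\s*=\s*($num)$/;
print "$name -> $value\n";   # Width -> 120

# A compiled expression can also be used directly as the right operand of =~.
my $standalone = $s =~ $num ? 1 : 0;
print "$standalone\n";       # 1
```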

6.4.6.4. Substitution

  Syntax: s/pattern/string/modifiers

This operation matches the specified string against the pattern pattern and replaces the fragments found with string. It returns the number of replacements made, or false (more precisely, an empty string) if the match fails. The string to be matched is given by the left operand of the =~ or !~ operation. It must be a scalar variable, an array element, or an element of an associative array, for example:

  $path = '/usr/bin/perl';
  $path =~ s|/usr/bin|/usr/local/bin|;

If no string is specified, the substitution is performed on the special variable $_. In particular, the previous example can be rewritten as follows:

  $_ = '/usr/bin/perl';
  s|/usr/bin|/usr/local/bin|;

In addition to the standard modifiers, the following can be used here:

Modifier  Description
e         Treat string as a Perl expression.
g         Global substitution, that is, replacement of all occurrences of the pattern.
o         Compile the regular expression only once.

Instead of //, we can use any delimiter characters allowed in quote-like operators. If pattern is enclosed in paired brackets, string must have its own pair of delimiters, for example s(foo)[bar] or s<foo>/bar/.

If the delimiters are '', the pattern is not interpolated. Otherwise the pattern is interpolated, and if it contains variables, it is recompiled each time it is evaluated. To avoid this, use the o modifier (provided, of course, that you are sure the values of the variables in the pattern do not change).

If pattern is an empty string, the last successfully matched regular expression is used instead.

By default, only the first match found is replaced. To replace all occurrences of the pattern in the string, use the g modifier.

The e modifier indicates that string is an expression. In this case string is first evaluated, as if by the eval() function, and then the substitution is performed with the result. Example:

  $_ = '123';
  s/\d+/$&*2/e;    # $_ = '246'
  s/\d/$&*2/eg;    # applied to '123' instead, gives the same result
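
The return value of a substitution and the effect of the g and e modifiers can be compared in one short sketch. The sample string and variable names are assumptions for this example.

```perl
use strict;
use warnings;

my $s = "3 apples and 10 pears";

# Without g, only the first match is replaced.
(my $once = $s) =~ s/\d+/N/;
print "$once\n";                        # N apples and 10 pears

# With g, every match is replaced and the count is returned.
my $all   = $s;
my $count = ($all =~ s/\d+/N/g);
print "$count: $all\n";                 # 2: N apples and N pears

# With e, the replacement is evaluated as an expression.
(my $doubled = $s) =~ s/(\d+)/$1 * 2/eg;
print "$doubled\n";                     # 6 apples and 20 pears

# No match: the result is false (an empty string) and the string is unchanged.
my $copy = $s;
my $none = ($copy =~ s/zebra/N/);
print "[$none] $copy\n";                # [] 3 apples and 10 pears
```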

Here are a few more typical uses of the substitution operation. Removing comments of the form /* ... */ from the text of a Java or C program:

  $program =~ s {
    /\*     # beginning of the comment
    .*?     # minimal number of characters
    \*/     # end of the comment
  } []gsx;

Removing leading and trailing whitespace from the string $var:

  for ($var) {
    s/^\s+//;
    s/\s+$//;
  }

Swapping the first two fields in $_. Note that the replacement string uses the variables $1 and $2, not the metacharacters \1 and \2:

  s/([^ ]*) *([^ ]*)/$2 $1/;

Replacing tabs with spaces, aligned to columns that are multiples of eight:

  1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;

6.4.6.5. Transliteration

  Syntax: tr/list1/list2/modifiers
          y/list1/list2/modifiers

Transliteration replaces all characters from the list list1 with the corresponding characters from the list list2. It returns the number of characters replaced or deleted. The lists must consist of individual characters and/or ranges of the form a-z. The string to be converted is given by the left operand of the =~ or !~ operation. It must be a scalar variable, an array element, or an element of an associative array, for example:

  $test = 'ABCDEabcde';
  $test =~ tr/A-Z/a-z/;   # replace uppercase letters with lowercase letters

If no string is specified, the transliteration is performed on the special variable $_. In particular, the previous example can be rewritten as follows:

  $_ = 'ABCDEabcde';
  tr/A-Z/a-z/;

Instead of //, we can use any delimiter characters allowed in quote-like operators. If list1 is enclosed in paired brackets, list2 must have its own pair of delimiters, for example tr(A-Z)[a-z] or tr<A-Z>/a-z/.

This operation is usually called tr. The synonym y was introduced for devotees of the sed editor and is used only by them. Transliteration supports the following modifiers:

Modifier  Description
c         Replace the characters that are not in list1.
d         Delete the characters for which there is no replacement.
s         Squeeze runs of duplicate characters produced by the replacement.
U         Convert to/from the UTF-8 encoding.
C         Convert to/from a single-byte encoding.

The c modifier causes all characters that are not in list1 to be transliterated. For example, the operation tr/a-zA-Z/ /c replaces every character that is not a Latin letter with a space.

By default, if list2 is shorter than list1, it is padded with its last character, and if it is empty, it is taken to be equal to list1 (this is convenient for counting the characters of a certain class in a string). The d modifier changes these rules: all characters from list1 that have no counterpart in list2 are deleted from the string. For example, the operation tr/a-zA-Z//cd removes from the string all characters that are not Latin letters.

The s modifier squeezes duplicates: if several consecutive characters are replaced by the same character, only one instance of that character is kept. For example, the operation tr/ / /s collapses repeated spaces in the string.
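
The rules for counting, padding, and the c, d, and s modifiers can be checked with the following sketch. The sample strings and variable names are assumptions for this example.

```perl
use strict;
use warnings;

my $s = "Hello,  World!  123";

# Empty list2 without d: nothing changes, but the matching characters are counted.
my $letters = ($s =~ tr/a-zA-Z//);
print "$letters\n";                   # 10

# list2 shorter than list1: it is padded with its last character.
(my $padded = 'abcde') =~ tr/a-e/xy/;
print "$padded\n";                    # xyyyy

# c: every character that is not a Latin letter is replaced.
(my $masked = $s) =~ tr/a-zA-Z/_/c;
print "$masked\n";                    # Hello___World______

# cd: every character that is not a Latin letter is deleted.
(my $only_letters = $s) =~ tr/a-zA-Z//cd;
print "$only_letters\n";              # HelloWorld

# s: runs of the replaced character are squeezed to one.
(my $squeezed = $s) =~ tr/ / /s;
print "$squeezed\n";                  # Hello, World! 123
```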

The modifiers C and U are intended for converting characters from the system encoding to UTF-8 and back. The first of the two letters specifies the source encoding, and the second the encoding of the result. For example, tr/\0-\xFF//CU converts the string from the system encoding to UTF-8, and tr/\0-\xFF//UC performs the reverse conversion.

Transliteration is performed without interpolating the character lists, so to use variables in them you must call the eval() function, for example:

  eval "tr/$oldlist/$newlist/";
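
A slightly fuller sketch of the eval technique is given below, applied to a named variable rather than $_. The variables $oldlist, $newlist, and $text hold made-up values; since the string passed to eval is compiled as Perl code, this should only be done with lists that come from a trusted source.

```perl
use strict;
use warnings;

my $oldlist = 'a-c';
my $newlist = 'A-C';
my $text    = 'abcabc-xyz';

# The lists are interpolated into the string first; eval then compiles and
# runs the resulting tr operation. \$text is escaped so that the variable
# itself, not its value, appears in the generated code.
my $count = eval "\$text =~ tr/$oldlist/$newlist/";
die $@ if $@;

print "$count: $text\n";    # 6: ABCABC-xyz
```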