
Perl Regular Expressions Cheat Sheet

Chapter 6.4. Regular expressions

6.4.1. Regular expression syntax

Regular expressions are patterns used to search text strings for specific combinations of characters and to replace them with other combinations (these operations are called pattern matching and substitution, respectively). A Perl regular expression has the form

  /pattern/modifiers

Here, pattern is a string that specifies the regular expression, and modifiers are optional single-letter flags that control how the regular expression is applied.

A regular expression may consist of ordinary characters only; in that case it matches exactly that sequence of characters in the string. For example, the expression /cat/ matches the substring "cat" wherever it occurs in a string. However, the real power of Perl regular expressions comes from the special metacharacters that can be used in them.
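
As a minimal sketch of a purely literal pattern, the following filters a list of strings with /cat/. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings; only the first two contain "cat".
my @lines = ('concatenate', 'a cat sat', 'dog');

# grep keeps the elements for which the match against $_ succeeds.
my @hits = grep { /cat/ } @lines;

print "$_\n" for @hits;
```

The pattern is not anchored, so it succeeds wherever "cat" occurs, including in the middle of a longer word.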

Table 6.9. Regex metacharacters
Symbol      Description
\           Before a character that is normally taken literally, marks the next character as special. For example, /n/ matches the letter n, while /\n/ matches a newline character.
            Before a metacharacter, makes that character literal. For example, /^/ matches the beginning of the string, while /\^/ matches the character ^ itself, and /\\/ matches a backslash \.
^           Matches the beginning of the string (cf. modifier m).
$           Matches the end of the string (cf. modifier m).
.           Matches any character except a newline (cf. modifier s).
*           Matches the preceding element (a character or a group) zero or more times.
+           Matches the preceding element one or more times.
?           Matches the preceding element zero or one time.
(pattern)   Matches pattern and remembers the match found.
x|y         Matches x or y.
{n}         n is a non-negative number. Matches exactly n occurrences of the preceding element.
{n,}        n is a non-negative number. Matches n or more occurrences of the preceding element. /x{1,}/ is equivalent to /x+/. /x{0,}/ is equivalent to /x*/.
{n,m}       n and m are non-negative numbers. Matches at least n and at most m occurrences of the preceding element. /x{0,1}/ is equivalent to /x?/.
[xyz]       Matches any one of the characters enclosed in the square brackets.
[^xyz]      Matches any character except those enclosed in the square brackets.
[a-z]       Matches any character in the specified range.
[^a-z]      Matches any character except those in the specified range.
\a          Matches the bell character (BEL).
\A          Matches only at the beginning of the string, even with the m modifier.
\b          Matches a word boundary, i.e. the position between \w and \W in either order.
\B          Matches any position that is not a word boundary.
\cX         Matches the Ctrl+X character. For example, /\cI/ is equivalent to /\t/.
\C          Matches a single byte, even under the use utf8 directive. (Deprecated, and removed in Perl 5.24.)
\d          Matches a digit. Equivalent to [0-9].
\D          Matches a non-digit character. Equivalent to [^0-9].
\e          Matches the escape character (ESC).
\E          Ends the \L, \Q or \U transformation.
\f          Matches the form feed character (FF).
\G          Matches the position in the string equal to pos().
\l          Converts the next character to lower case.
\L          Converts characters to lower case up to \E.
\n          Matches a newline character.
\pproperty  Matches Unicode characters that have the property property. If the property name is longer than one character, use the syntax \p{property}.
\Pproperty  Matches Unicode characters that do not have the property property. If the property name is longer than one character, use the syntax \P{property}.
\Q          Puts a "\" character in front of every metacharacter up to \E.
\r          Matches the carriage return character (CR).
\s          Matches a whitespace character. Equivalent to /[ \f\n\r\t]/.
\S          Matches any non-whitespace character. Equivalent to /[^ \f\n\r\t]/.
\t          Matches the tab character (HT).
\u          Converts the next character to upper case.
\U          Converts characters to upper case up to \E.
\w          Matches a Latin letter, a digit or an underscore. Equivalent to /[A-Za-z0-9_]/.
\W          Matches any character except a Latin letter, a digit or an underscore. Equivalent to /[^A-Za-z0-9_]/.
\X          Matches a sequence of Unicode characters consisting of a base character followed by any combining diacritical marks. Equivalent to /(?:\PM\pM*)/.
\z          Matches only at the end of the string, even with the m modifier.
\Z          Matches only at the end of the string or before a newline at the end of the string, even with the m modifier.
\n          Here n is a positive number. Matches the nth remembered substring. If fewer than n left parentheses precede this construct and n > 9, it is equivalent to \0n.
\0n         n is an octal number not greater than 377. Matches the character with the octal code n. For example, /\011/ is equivalent to /\t/.
\xn         n is a hexadecimal number of two digits. Matches the character with the hexadecimal code n. For example, /\x31/ is equivalent to /1/.
\x{n}       n is a hexadecimal number, typically of four digits. Matches the Unicode character with the hexadecimal code n. For example, /\x{2663}/ is equivalent to /♣/.
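
The short sketch below combines several of the metacharacters from the table: anchors, \d, counted repetition, alternation and \Q...\E. All variable names and sample strings are illustrative assumptions, not part of the original text.

```perl
use strict;
use warnings;

# Anchors (^ $), the \d class, {n} counts and capturing groups together.
my $date_re = qr/^(\d{4})-(\d{2})-(\d{2})$/;

my @parts  = '2024-03-15' =~ $date_re;            # ('2024', '03', '15')
my $strict = ('2024-3-15' =~ $date_re) ? 1 : 0;   # 0: the month has only one digit

# ? makes the preceding element optional; | separates alternatives.
my @spelled = grep { /^colou?r$/ }    qw(color colour colouur);   # ('color', 'colour')
my @pets    = grep { /^(cat|dog)s?$/ } qw(cat dogs bird);         # ('cat', 'dogs')

# \Q...\E quotes metacharacters, so the dot in $literal matches only a dot.
my $literal = 'a.b';
my @exact   = grep { /^\Q$literal\E$/ } qw(a.b axb);              # ('a.b')

print "@parts | @spelled | @pets | @exact\n";
```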

6.4.2. Modifiers

Different regular expression operations use different modifiers to refine what they do. However, four modifiers are common to all of them.

i
Ignores case when matching. Under the use locale directive, case folding follows the current locale settings.
m
Treats the string as a buffer of several lines of text separated by newlines. This means that the metacharacters ^ and $ match not only the beginning and end of the whole string, but also the beginning and end of each line within it.
s
Treats the string as a single line of text, ignoring newlines. This means that the metacharacter . matches any character, including a newline.
x
Allows whitespace and comments in the pattern. Whitespace that is not preceded by a \ character and is not enclosed in [] is ignored. The # character starts a comment, which is also ignored.
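
A small sketch of the four general-purpose modifiers applied to one two-line string follows. The sample text and variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $text = "first line\nSecond LINE\n";

my $ci      = ($text =~ /second line/i)   ? 1 : 0;   # 1: case is ignored
my @starts  = $text =~ /^(\w+)/mg;                   # ('first', 'Second'): ^ matches at each line
my $plain   = ($text =~ /first.*Second/)  ? 1 : 0;   # 0: . stops at the newline
my $dot_all = ($text =~ /first.*Second/s) ? 1 : 0;   # 1: . also matches the newline

# x lets the pattern be laid out with whitespace and comments.
my $verbose = ($text =~ /
    ^ second     # start of a line (m), any case (i)
    \s+ line     # whitespace, then the second word
    $            # end of that line
/xim) ? 1 : 0;                                       # 1

print "$ci @starts $plain $dot_all $verbose\n";
```

Note that a comment inside an x pattern must not contain the pattern delimiter, because the delimiter is located before comments are stripped.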

6.4.3. Unicode and POSIX character classes

Inside a bracketed character class, regular expressions can use the syntax

  [:class:]

where class is the name of a POSIX character class, i.e. a class defined by the portable standard for the C language. Because this construct is valid only inside square brackets, a complete class is written, for example, as [[:digit:]]. When the use utf8 directive is in effect, the Unicode character classes can be used instead of the POSIX classes, with the construct

  \p{class}

The following table lists all the POSIX character classes, the corresponding Unicode character classes, and the equivalent metacharacters, if any.

Table 6.10. Character classes
POSIX    Unicode    Metacharacter  Description
alpha    IsAlpha                   Letters
alnum    IsAlnum                   Letters and digits
ascii    IsAscii                   ASCII characters
cntrl    IsCntrl                   Control characters
digit    IsDigit    \d             Digits
graph    IsGraph                   Letters, digits and punctuation marks
lower    IsLower                   Lower case letters
print    IsPrint                   Letters, digits, punctuation marks and space
punct    IsPunct                   Punctuation marks
space    IsSpace    \s             Whitespace characters
upper    IsUpper                   Upper case letters
word     IsWord     \w             Letters, digits and underscores
xdigit   IsXdigit                  Hexadecimal digits

For example, a decimal number can be specified in any of the following three ways:

  /\d+/
  /[[:digit:]]+/
  /\p{IsDigit}+/  # use utf8

To indicate that a character does not belong to a given class, use the constructs

  [:^class:]
  \P{class}

For example, the expressions in each of the following rows have the same meaning:

  [[:^digit:]]  \D  \P{IsDigit}
  [[:^space:]]  \S  \P{IsSpace}
  [[:^word:]]   \W  \P{IsWord}
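
The following sketch exercises a few POSIX classes, written inside an enclosing bracketed class as Perl requires, on a made-up sample string. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $s = 'abc 123, DEF!';

my @digits = $s =~ /([[:digit:]]+)/g;       # ('123')
my @upper  = $s =~ /([[:upper:]]+)/g;       # ('DEF')
my @chunks = $s =~ /([[:^space:]]+)/g;      # ('abc', '123,', 'DEF!')

(my $no_punct = $s) =~ s/[[:punct:]]//g;    # 'abc 123 DEF'

# The metacharacter and the Unicode property pick the same characters here.
my $same = join('', $s =~ /\p{IsDigit}/g) eq join('', $s =~ /\d/g) ? 1 : 0;   # 1

print "@digits | @upper | @chunks | $no_punct | $same\n";
```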

6.4.4. Memorization of substrings

Parentheses in a regular expression cause the substring matched by the parenthesized part of the pattern to be remembered in a special buffer. The nth remembered substring is accessed as \n inside the regular expression and as $n outside it, where n can be any number starting from 1. Remember, however, that Perl also treats \10, \11, etc. as synonyms for the octal character codes \010, \011, etc. The ambiguity is resolved as follows: \10 is a reference to the 10th remembered substring if at least ten left parentheses precede it in the regular expression; otherwise it is the character with octal code 10. The metacharacters \1, ... \9 are always treated as references to remembered substrings. Examples:

  if (/(.)\1/) {  # look for the first doubled character
    print "'$1' - the first doubled character\n";
  }
  if (/Time: (..):(..):(..)/) {  # extract the time components
    $hours = $1;
    $minutes = $2;
    $seconds = $3;
  }

In addition to the variables $1, $2, ..., several special variables hold the results of the last regular expression operation:

Variable  Description
$&        The last matched substring.
$`        The substring preceding the last matched substring.
$'        The substring following the last matched substring.
$+        The last remembered substring.

Here is an example:

  'AAA111BBB222' =~ /(\d+)/;
  print "$`\n";  # AAA
  print "$&\n";  # 111
  print "$'\n";  # BBB222
  print "$+\n";  # 111

All of these special variables keep their values until the end of the enclosing block or until the next successful pattern match.
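
As a further sketch, $+ is handy when a pattern has alternatives and it is not known in advance which group matched, and \1 refers back to a group from inside the pattern itself. The strings and variable names below are assumptions chosen for the illustration.

```perl
use strict;
use warnings;

# $+ holds the last group that actually took part in the match.
my $line = 'Revision: 42';
my $number;
if ($line =~ /Version: (.*)|Revision: (.*)/) {
    $number = $+;                 # '42', whichever alternative matched
}

# \1 inside the pattern refers back to group 1; $1 reads it afterwards.
my $doubled;
if ('bookkeeper' =~ /(.)\1/) {
    $doubled = $1;                # 'o', from the first doubled letter "oo"
}

print "$number $doubled\n";
```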

6.4.5. Extended patterns

Perl provides several additional constructs that extend the capabilities of regular expressions. All of them are enclosed in parentheses and begin with the character ?, which distinguishes them from capturing parentheses.

(?#text)
Comment. The whole construct is ignored.
(?modifiers-modifiers)
Turns the specified modifiers on or off. Modifiers before the - character are turned on; those after it are turned off. Example:
  if (/aaa/) {...}      # case-sensitive match
  if (/(?i)aaa/) {...}  # case-insensitive match
(?:pattern)
(?modifiers-modifiers:pattern)
Groups subexpressions of a regular expression without remembering the match found. The second form additionally turns the specified modifiers on or off. For example, the expression /ко(?:т|шка)/ is shorthand for /кот|кошка/.
(?=pattern)
Positive lookahead: succeeds if pattern matches at this position, without remembering the match found. For example, the expression /Windows (?=95|98|NT|2000)/ matches "Windows" in the string "Windows 98", but does not match in the string "Windows 3.1". The lookahead text is not consumed: after a match, the search continues from the position following the match itself.
(?!pattern)
Negative lookahead: succeeds if pattern does not match at this position, without remembering anything. For example, the expression /Windows (?!95|98|NT|2000)/ matches "Windows" in the string "Windows 3.1", but does not match in the string "Windows 98". The lookahead text is not consumed: after a match, the search continues from the position following the match itself.
(?<=pattern)
Positive lookbehind: succeeds if pattern matches immediately before this position, without remembering the match found. For example, the expression /(?<=\t)\w+/ matches a word that follows a tab character, and the tab character is not included in $&. The lookbehind fragment must have a fixed width.
(?<!pattern)
Negative lookbehind: succeeds if pattern does not match immediately before this position. For example, the expression /(?<!\t)\w+/ matches a word that is not preceded by a tab character. The lookbehind fragment must have a fixed width.
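
The sketch below tries the lookahead examples from the list on two strings and adds lookbehind and inline-modifier cases. The price string and all variable names are made up for the illustration.

```perl
use strict;
use warnings;

my @systems = ('Windows 98', 'Windows 3.1');

my @ahead     = grep { /Windows (?=95|98|NT|2000)/ } @systems;   # ('Windows 98')
my @not_ahead = grep { /Windows (?!95|98|NT|2000)/ } @systems;   # ('Windows 3.1')

# The lookahead text is not part of the match itself.
my ($matched) = 'Windows 98' =~ /(Windows (?=95|98|NT|2000))/;   # 'Windows '

my $prices = 'cost: $25 or 30 EUR';
my ($dollars) = $prices =~ /(?<=\$)(\d+)/;         # '25': digits right after a dollar sign
my @others    = $prices =~ /(?<![\$\d])(\d+)/g;    # ('30'): digits not after $ or a digit

# (?i:...) turns i on for one group only and does not capture it.
my ($suffix) = 'KITten' =~ /^(?i:kit)(ten|ty)$/;   # 'ten'

print "@ahead | @not_ahead | [$matched] | $dollars | @others | $suffix\n";
```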

6.4.6. Regular expression operations

So far we have enclosed regular expressions in // characters. In fact, the delimiters of a regular expression are determined by the quote-like operation applied to it. This section describes all Perl regular expression operations in detail.

6.4.6.1. Pattern matching

  Syntax: /pattern/modifiers
          m/pattern/modifiers

This operation matches the given string against pattern and returns true or false depending on the result. The string to be matched is given by the left operand of the =~ or !~ operator, for example:

  $mynumber = '12345';
  if ($mynumber =~ /^\d+$/) {  # if the string $mynumber consists of decimal digits, then ...
    ...
  }

If no string is given, the match is made against the contents of the special variable $_. In particular, the previous example can be rewritten as:

  $_ = '12345';
  if (/^\d+$/) {
    ...
  }

In addition to the standard modifiers, the following can be used here:

Modifier  Description
c         Do not reset the search position after a failed match (only with the g modifier).
g         Global matching, i.e. search for all occurrences of the pattern.
o         Compile the regular expression only once.

If the regular expression is enclosed in //, the leading m is optional. The form with a leading m allows any character permitted in quote-like operations to be used as the regular expression delimiter. Useful special cases:

  • If the delimiters are '' characters, the pattern string is not interpolated. Otherwise the pattern is interpolated, and if it contains variables it is recompiled at every match. To avoid this, use the o modifier (provided, of course, that you are sure the values of the variables in the pattern do not change).
  • If the delimiters are ?? characters, the one-time match rule applies.

If pattern is an empty string, the last successfully matched regular expression is used instead.

If the g modifier is not given and the result of the match is assigned to a list, an empty list is returned when the match fails. The result of a successful match depends on whether the pattern contains parentheses. If it does not, the list (1) is returned. Otherwise the list of the values of $1, $2 and so on is returned, that is, the list of all remembered substrings. The following example

  ($w1, $w2, $rest) = ($x =~ /^(\S+)\s+(\S+)\s*(.*)/);

puts the first word of the string $x into the variable $w1, its second word into the variable $w2, and the rest of the string into the variable $rest.

The g modifier turns on global matching, i.e. a search for all matches in the string. Its behavior depends on the context. If the result of the match is assigned to a list, the list of all remembered substrings is returned. If the pattern contains no parentheses, the list of all matches of the pattern is returned, as if the whole pattern were enclosed in parentheses. The following example

  $_ = "12:23:45";
  @result = /\d+/g;
  foreach $elem (@result) {
    print "$elem\n";
  }

displays the lines 12, 23 and 45.

In scalar context, a match with the g modifier searches for the next match of the pattern each time it is executed and returns true or false depending on the result. The position in the string after the last match can be read or changed with the pos() function. A failed search normally resets the search position to the beginning of the string, but this can be avoided by adding the c modifier. Modifying the string also resets its search position.
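
A minimal sketch of scalar-context matching with g, pos() and the c modifier follows. The sample string and variable names are assumptions for the illustration.

```perl
use strict;
use warnings;

my $str = 'aXbXc';

my @ends;
while ($str =~ /X/g) {
    push @ends, pos($str);            # offset just past each match: 2, then 4
}
my $after_loop = pos($str);           # undef: the failed final search reset the position

my $hit  = ($str =~ /X/gc) ? 1 : 0;   # 1: the search starts over and finds the first X
my $miss = ($str =~ /Z/gc) ? 1 : 0;   # 0: fails, but c keeps the position
my $kept = pos($str);                 # 2

print "@ends $hit $miss $kept\n";
```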

Additional possibilities are provided by the \G metacharacter, which is meaningful only together with the g modifier. It matches the current search position in the string. The m/\G.../gc construct is convenient, in particular, for writing lexical analyzers that perform different actions for the lexemes found in the text being analyzed. The following example

  $_ = 'Word1, word2, and 12345.';
  LOOP:
  {
    print("number "), redo LOOP if /\G\d+\b[,.;]?\s*/gc;
    print("word "), redo LOOP if /\G[A-Za-z0-9]+\b[,.;]?\s*/gc;
    print("unknown "), redo LOOP if /\G[^A-Za-z0-9]+/gc;
  }

displays the string word word word number.
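
The same \G technique can collect tokens instead of printing them. This is only a sketch: the token names (NUM, ID, OP), the tiny grammar and the variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $input = 'x = 42 + y';
my @tokens;

# Each branch tries to match exactly at the current position (\G);
# the c modifier keeps that position when a branch fails.
while (1) {
    if    ($input =~ /\G\s+/gc)            { next }                      # skip whitespace
    elsif ($input =~ /\G(\d+)/gc)          { push @tokens, "NUM($1)" }
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID($1)" }
    elsif ($input =~ /\G([=+])/gc)         { push @tokens, "OP($1)" }
    else                                   { last }                      # end of input or unknown character
}

print "@tokens\n";
```

The loop stops both at the end of the input and at the first character that no branch recognizes; a real analyzer would probably distinguish the two cases by comparing pos($input) with the string length.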

6.4.6.2. One-time pattern matching

  Syntax: ?pattern?
          m?pattern?

This construct is entirely analogous to m/pattern/, with one difference: a successful match happens only once between calls to the reset() function. In Perl 5.22 and later the form without the leading m is no longer accepted, so write m?pattern?. The construct is convenient, for example, when only the first occurrence of the pattern in each file of a set is needed:

  while (<>) {
    if (m?^$?) {
      ...  # process the first empty line of the file
    }
  } continue {
    reset if eof;  # reset the ?? status for the next file
  }
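
The one-time rule can also be seen without files. In the sketch below, first_match is a hypothetical helper and the word lists are made up; the m?a? inside it succeeds only once until reset is called.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the elements on which the one-time match succeeded.
sub first_match {
    my @found;
    for my $line (@_) {
        push @found, $line if $line =~ m?a?;   # can succeed only once between resets
    }
    return @found;
}

my @first  = first_match('alpha', 'beta');     # ('alpha')
my @second = first_match('gamma', 'delta');    # (): the match is already used up
reset;                                         # re-arm the m?? matches of this package
my @third  = first_match('gamma', 'delta');    # ('gamma')

print scalar(@first), scalar(@second), scalar(@third), "\n";
```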

6.4.6.3. Creating a regular expression

  Syntax: qr/string/modifiers

This construct creates a regular expression from the text string with the given modifiers and compiles it. If the delimiters are '' characters, string is not interpolated. Otherwise the pattern is interpolated, and if it contains variables it is recompiled at every match. To avoid this, use the o modifier (provided, of course, that you are sure the values of the variables in the pattern do not change).

Once a regular expression has been created, it can be used both on its own and as a fragment of other regular expressions. Examples:

  $re = qr/\d+/;
  $string =~ /\s*${re}\s*/;  # use inside another regular expression
  $string =~ $re;            # use on its own
  $string =~ /$re/;          # same

  $re = qr/$header/is;
  s/$re/text/;  # same as s/$header/text/is
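
As a sketch of building larger patterns from qr fragments, the example below composes a pattern for a pair of integers and shows that modifiers given to qr stay with that fragment only. The names $int, $pair and $name and the sample strings are assumptions.

```perl
use strict;
use warnings;

my $int  = qr/[-+]?\d+/;                                 # reusable fragment
my $pair = qr/\(\s*($int)\s*,\s*($int)\s*\)/;            # built from the fragment

my ($x, $y) = 'point (3, -7)' =~ $pair;                  # ('3', '-7')

# The i modifier travels with $name but does not affect the rest of the pattern.
my $name      = qr/header/i;
my $any_case  = ('x-HEADER' =~ /^x-$name$/) ? 1 : 0;     # 1
my $prefix_cs = ('X-header' =~ /^x-$name$/) ? 1 : 0;     # 0: "x-" is still case-sensitive

print "$x $y $any_case $prefix_cs\n";
```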

6.4.6.4. Substitution

  Syntax: s/pattern/string/modifiers

This operation matches the given string against pattern and replaces the fragments found with string. It returns the number of replacements made, or false (more precisely, an empty string) if the match failed. The string to be processed is given by the left operand of the =~ or !~ operator. It must be a scalar variable, an array element, or an element of an associative array, for example:

  $path = '/usr/bin/perl';
  $path =~ s|/usr/bin|/usr/local/bin|;

If no string is given, the substitution is performed on the special variable $_. In particular, the previous example can be rewritten as:

  $_ = '/usr/bin/perl';
  s|/usr/bin|/usr/local/bin|;

In addition to the standard modifiers, the following can be used here:

Modifier  Description
e         Treat string as a Perl expression.
g         Global substitution, i.e. replacement of all occurrences of the pattern.
o         Compile the regular expression only once.

Instead of //, any character permitted in quote-like operations can be used as the delimiter. If pattern is enclosed in paired brackets, string must have its own pair of delimiters, for example s(foo)[bar] or s<foo>/bar/.

If the delimiters are '' characters, the pattern string is not interpolated. Otherwise the pattern is interpolated, and if it contains variables it is recompiled at every match. To avoid this, use the o modifier (provided, of course, that you are sure the values of the variables in the pattern do not change).

If pattern is an empty string, the last successfully matched regular expression is used instead.

By default, only the first match of the pattern is replaced. To replace all occurrences of the pattern in the string, use the g modifier.
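
The difference between a single and a global substitution, and the value that s/// returns, can be sketched as follows. The sample text and variable names are made up.

```perl
use strict;
use warnings;

my $text = 'one fish two fish';

(my $first = $text) =~ s/fish/cat/;      # 'one cat two fish': only the first match
(my $all   = $text) =~ s/fish/cat/g;     # 'one cat two cat': every match

my $count = ($text =~ s/fish/cat/g);     # 2 replacements; $text itself is modified
my $none  = ($text =~ s/dog/cat/);       # empty string (false): nothing matched

print "$first | $all | $count | [$none]\n";
```

The (my $copy = $original) =~ s/.../.../ idiom copies the string first, so the original is left untouched.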

The e modifier indicates that string is an expression. In this case string is first evaluated as Perl code, as if by the eval() function, and its result is used as the replacement. Example:

  $_ = '123';
  s/\d+/$&*2/e;  # $_ = '246'
  s/\d/$&*2/eg;  # gives the same result when applied to the original '123'
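
Two more hedged sketches of the e modifier, using a captured group instead of $&. The input strings and variable names are illustrative.

```perl
use strict;
use warnings;

# The replacement is Perl code: each number is doubled.
(my $doubled = 'a1 b22 c333') =~ s/(\d+)/$1 * 2/ge;              # 'a2 b44 c666'

# Any expression works, for example a sprintf call that zero-pads numbers.
(my $padded = 'v1.2.10') =~ s/(\d+)/sprintf('%03d', $1)/ge;      # 'v001.002.010'

print "$doubled $padded\n";
```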

Here are some more typical uses of the substitution operation. Deleting comments of the form /* ... */ from the text of a Java or C program:

  $program =~ s {
    /\*   # Beginning of the comment
    .*?   # Minimal number of characters
    \*/   # End of the comment
  } []gsx;

Removing leading and trailing whitespace from the string $var:

  for ($var) {
    s/^\s+//;
    s/\s+$//;
  }

Swapping the first two fields in $_. Note that the replacement string uses the variables $1 and $2, not the metacharacters \1 and \2:

  s/([^ ]*) *([^ ]*)/$2 $1/;

Replacing tabs with spaces, aligning on columns that are multiples of eight:

  1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
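
To see what the last two idioms do, here they are applied to made-up sample strings. The variable names and inputs are assumptions for the illustration.

```perl
use strict;
use warnings;

# Swap the first two space-separated fields.
my $fields = 'hello world again';
$fields =~ s/([^ ]*) *([^ ]*)/$2 $1/;           # 'world hello again'

# Expand tabs: each run of tabs is padded out to the next multiple of eight columns.
my $line = "ab\tc\t\td";
1 while $line =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;

my $col_c = index($line, 'c');                  # 8
my $col_d = index($line, 'd');                  # 24

print "$fields | $col_c $col_d\n";
```

For each run of tabs, length($&) * 8 is the width the run would have starting from a tab stop, and length($`) % 8 is how far the text before it already extends past the previous stop.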

6.4.6.5. Transliteration

  Syntax: tr/list1/list2/modifiers
          y/list1/list2/modifiers

Transliteration replaces every character from list1 with the corresponding character from list2. It returns the number of characters replaced or deleted. The lists must consist of individual characters and/or ranges of the form a-z. The string to be converted is given by the left operand of the =~ or !~ operator. It must be a scalar variable, an array element, or an element of an associative array, for example:

  $test = 'ABCDEabcde';
  $test =~ tr/A-Z/a-z/;  # replace uppercase letters with lowercase ones

If no string is given, the transliteration is performed on the special variable $_. In particular, the previous example can be rewritten as:

  $_ = 'ABCDEabcde';
  tr/A-Z/a-z/;

Instead of //, any character permitted in quote-like operations can be used as the delimiter. If list1 is enclosed in paired brackets, list2 must have its own pair of delimiters, for example tr(A-Z)[a-z] or tr<A-Z>/a-z/.

This operation is usually written tr. The synonym y was introduced for sed fanatics and is used only by them. Transliteration supports the following modifiers:

Modifier  Description
c         Replace the characters that are not in list1.
d         Delete the characters for which there is no replacement.
s         Squeeze runs of identical replaced characters into one.
U         Convert to/from the UTF-8 encoding.
C         Convert to/from a single-byte encoding.

The c modifier causes all characters that are not in list1 to be transliterated. For example, the operation tr/a-zA-Z/ /c replaces every character that is not a Latin letter with a space.

By default, if list2 is shorter than list1, it is padded with its last character, and if it is empty, it is taken to be equal to list1 (this is convenient for counting the characters of a particular class in a string). The d modifier changes these rules: all characters from list1 that have no counterpart in list2 are deleted from the string. For example, the operation tr/a-zA-Z//cd removes from the string every character that is not a Latin letter.

The s modifier removes repetitions: if several consecutive characters are replaced by the same character, only one instance of that character is kept. For example, the operation tr/ / /s squeezes repeated spaces in a string.
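
The c, d and s modifiers and the counting idiom can be tried on small strings. This is a sketch with made-up inputs and variable names.

```perl
use strict;
use warnings;

my $dna     = 'AACGTTTA';
my $a_count = ($dna =~ tr/A//);                     # 3: an empty list2 only counts

(my $lower    = 'Hello World') =~ tr/A-Z/a-z/;      # 'hello world'
(my $spaced   = 'a1b22c')      =~ tr/a-zA-Z/ /c;    # 'a b  c': non-letters become spaces
(my $letters  = 'a1b2-c3')     =~ tr/a-zA-Z//cd;    # 'abc': non-letters are deleted
(my $squeezed = 'a   b    c')  =~ tr/ / /s;         # 'a b c': runs of spaces are squeezed

print "$a_count | $lower | $spaced | $letters | $squeezed\n";
```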

The C and U modifiers translate characters from the system encoding to UTF-8 and back. The first of the two names the source encoding, and the second names the resulting encoding. For example, tr/\0-\xFF//CU recodes a string from the system encoding to UTF-8, and tr/\0-\xFF//UC performs the reverse conversion. Note that these two modifiers existed only in old Perl releases and were removed as of Perl 5.8.

Transliteration is performed without interpolating the character lists, so to use variables in it you must call the eval() function, for example:

  eval "tr/$oldlist/$newlist/";
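
A self-contained sketch of the eval approach follows. The variable names and lists are hypothetical, and the lists are assumed to come from a trusted source, since they become part of evaluated code.

```perl
use strict;
use warnings;

# Hypothetical character lists known only at run time (must be trusted input).
my ($oldlist, $newlist) = ('abc', 'xyz');
my $s = 'aabbcc';

# \$s is escaped so that the variable itself, not its value, appears in the code;
# the trailing 1 makes eval return true on success.
eval "\$s =~ tr/$oldlist/$newlist/; 1" or die $@;

print "$s\n";
```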