JavaScript regular expressions cheat sheet

General description

Regular expressions are patterns for finding specified combinations of characters in text strings (such a search is called pattern matching). There are two ways to assign a regular expression to a variable:
Using a regular expression literal: var re = /pattern/flags.
Using the RegExp constructor: var re = new RegExp("pattern"[, "flags"]).
Here pattern is the regular expression, and flags are optional search options.

Literals, for example var re = /ab+c/, should be used when the regular expression stays the same while the script is running. Such regular expressions are compiled when the script is loaded and therefore run faster.

A call to the constructor, for example var re = new RegExp("ab+c"), should be used when the pattern will change or is only known at run time. Older engines also offered a compile method for recompiling a regular expression that is used many times; it is deprecated in current engines and should be avoided in new code.

When creating a regular expression, you should consider that enclosing it in quotation marks entails the need to use escape sequences, as in any other string constant. For example, the following two expressions are equivalent:

	var re = /\w+/g;
	var re = new RegExp("\\w+", "g");  // Inside a string, "\" must be written as "\\"
	

Note: a regular expression literal cannot be empty, because two slashes in a row (//) start a comment. To specify an empty regular expression, use /(?:)/.
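The difference between the two forms can be sketched as follows. This is a minimal illustration, not part of the original article: the variable word stands for a hypothetical value supplied at run time, and console.log is used instead of document.write so the snippet also runs outside a browser.

```javascript
// Literal and constructor forms of the same pattern.
var literal = /\w+/g;
var constructed = new RegExp("\\w+", "g"); // the backslash is doubled inside the string

var sameSource = literal.source === constructed.source;
console.log(sameSource); // true

// The constructor is the natural choice when the pattern is only known at run time.
// `word` is a hypothetical user-supplied value.
var word = "script";
var dynamic = new RegExp(word, "gi");
var hits = "JavaScript and VBScript".match(dynamic);
console.log(hits); // ["Script", "Script"]
```

If the run-time value may contain special characters, it has to be escaped before being passed to the constructor.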

Regular expressions are used by the exec and test methods of the RegExp object and by the match, replace, search, and split methods of the String object. If we only need to check whether a string contains a substring that matches the pattern, we use test or search. If we need to extract the substring (or substrings) that match the pattern, we use exec or match. The replace method searches for a substring and replaces it with another string, and the split method splits a string into several substrings using a regular expression or an ordinary text string as the separator. For more information about using regular expressions, see the descriptions of the corresponding methods.
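As a quick orientation, the six methods can be tried on one sample string. This is an illustrative sketch; the sample string and variable names are assumptions made for the example.

```javascript
var s = "one, two, three";

var hasTwo = /two/.test(s);                // true: the pattern occurs in s
var position = s.search(/two/);            // 5: index of the first match
var firstMatch = /t(\w+)/.exec(s)[0];      // "two": the first matching substring
var allMatches = s.match(/t\w+/g);         // ["two", "three"]
var replaced = s.replace(/,\s*/g, ";");    // "one;two;three"
var pieces = s.split(/,\s*/);              // ["one", "two", "three"]

console.log(hasTwo, position, firstMatch, allMatches, replaced, pieces);
```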

Regular Expression Syntax

A regular expression can consist of ordinary characters only; in that case it matches exactly that combination of characters in the string. For example, the expression /com/ matches a substring in each of the following strings: "comet", "welcome", "telecom operator". However, the flexibility and power of regular expressions come from the special characters that can be used in them, which are listed in the following table.

Special characters in regular expressions:

\ - For characters that are usually treated literally, indicates that the next character is special. For example, /n/ matches the letter n, and /\n/ matches the newline character. For characters that are usually treated as special, indicates that the character must be taken literally. For example, /^/ denotes the beginning of the line, and /\^/ matches just the character ^. /\\/ matches the backslash \.

^ - Matches the beginning of the line.

$ - Matches the end of the line.

* - Matches the repetition of the previous character zero or more times.

+ - Matches the repetition of the previous character one or more times.

? - Matches the repetition of the previous character zero or one time.

. - Matches any character except the newline character.

(pattern) - Matches pattern and remembers the match found.

(?:pattern) - Matches pattern but does not remember the match found. Used to group parts of a pattern; for example, /ca(?:t|r)/ is shorthand for /cat|car/.

(?=pattern) - Positive lookahead: succeeds only if pattern matches at this position, without remembering the match found. For example, /Windows (?=95|98|NT|2000)/ matches "Windows" in the string "Windows 98" but does not match in the string "Windows 3.1". After a match, the search continues from the position following the match found; the text examined by the lookahead is not consumed.

(?!pattern) - Negative lookahead: succeeds only if pattern does not match at this position, without remembering the match found. For example, /Windows (?!95|98|NT|2000)/ matches "Windows" in the string "Windows 3.1" but does not match in the string "Windows 98". After a match, the search continues from the position following the match found; the text examined by the lookahead is not consumed.

x|y - Matches x or y.

{n} - n is a nonnegative number. Matches exactly n occurrences of the previous character.

{n,} - n is a nonnegative number. Matches n or more occurrences of the previous character. /x{1,}/ is equivalent to /x+/. /x{0,}/ is equivalent to /x*/.

{n,m} - n and m are nonnegative numbers. Matches at least n and no more than m occurrences of the previous character. /x{0,1}/ is equivalent to /x?/.

[xyz] - Matches any one of the characters enclosed in the square brackets.

[^xyz] - Matches any character except those enclosed in the square brackets.

[a-z] - Matches any character in the specified range.

[^a-z] - Matches any character outside the specified range.

\b - Matches a word boundary, i.e., the position between a word and a space or a line feed.

\B - Matches any position other than a word boundary.

\cX - Matches the character Ctrl + X. For example, /\cI/ is equivalent to /\t/.

\d - Matches a digit. Equivalent to [0-9].

\D - Matches a non-digit character. Equivalent to [^0-9].

\f - Matches the form feed (FF) character.

\n - Matches the line feed (LF) character.

\r - Matches the carriage return (CR) character.

\s - Matches a whitespace character. Equivalent to /[ \f\n\r\t\v]/.

\S - Matches any non-whitespace character. Equivalent to /[^ \f\n\r\t\v]/.

\t - Matches the tab character (HT).

\v - Matches the vertical tab (VT) character.

\w - Matches a Latin letter, digit, or underscore. Equivalent to /[A-Za-z0-9_]/.

\W - Matches any character except a Latin letter, digit, or underscore. Equivalent to /[^A-Za-z0-9_]/.

\n - Here n is a positive number. Matches the nth remembered substring (a backreference). Groups are numbered by counting left parentheses. If there are fewer than n left parentheses before this reference, it is equivalent to \0n.

\0n - n is an octal number not greater than 377. Matches the character with octal code n. For example, /\011/ is equivalent to /\t/.

\xn - n is a hexadecimal number consisting of two digits. Matches the character with hexadecimal code n. For example, /\x31/ is equivalent to /1/.

\un - n is a hexadecimal number consisting of four digits. Matches the Unicode character with hexadecimal code n. For example, /\u00A9/ is equivalent to /©/.
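A few of the less obvious entries in the table can be checked with a short script. This is an illustrative sketch; the sample strings (other than the Windows ones taken from the table) are assumptions made for the example.

```javascript
// Positive and negative lookahead: the version is checked but not consumed.
var ahead = /Windows (?=95|98|NT|2000)/;
var notAhead = /Windows (?!95|98|NT|2000)/;
var r1 = ahead.test("Windows 98");      // true
var r2 = ahead.test("Windows 3.1");     // false
var r3 = notAhead.test("Windows 3.1");  // true
var r4 = notAhead.test("Windows 98");   // false

// The backreference \1 repeats whatever the first group matched.
var doubled = /\b(\w+) \1\b/;
var repeatedWord = doubled.exec("it was the the best")[1]; // "the"

// \b restricts the match to whole words.
var wholeWords = "cat concat cat".replace(/\bcat\b/g, "dog"); // "dog concat dog"

// Hexadecimal and Unicode escapes.
var hexOk = /\x31/.test("1");      // true
var uniOk = /\u00A9/.test("©");    // true

console.log(r1, r2, r3, r4, repeatedWord, wholeWords, hexOk, uniOk);
```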

Regular expressions are evaluated much like other JavaScript expressions, that is, according to the priority of their operations: operations with higher priority are performed first. If operations have equal priority, they are performed from left to right. The following table lists the regular expression operations in descending order of priority; operations in the same row have the same priority.

Operations:
	\
	() (?:) (?=) (?!) []
	* + ? {n} {n,} {n,m}
	^ $ \metacharacter
	|
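The effect of these priorities is easiest to see by comparing patterns with and without grouping. This is a small illustrative sketch with made-up sample strings.

```javascript
// Alternation has the lowest priority: /^ab|cd$/ means (^ab) or (cd$).
var loose = /^ab|cd$/.test("abxx");     // true: the "^ab" branch matches
var strict = /^(ab|cd)$/.test("abxx");  // false: the whole string must be "ab" or "cd"

// A quantifier applies only to the preceding character unless a group is used.
var single = "ababab".match(/ab+/)[0];      // "ab"
var grouped = "ababab".match(/(?:ab)+/)[0]; // "ababab"

console.log(loose, strict, single, grouped);
```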
	

Search Options

When creating a regular expression, we can specify additional search options:
i (ignore case). Do not distinguish between uppercase and lowercase letters.
g (global search). Search for all occurrences of the pattern.
m (multiline). Multiline search.
Any combination of these three options, for example ig or gim.

Example

	var s = "Learning the JavaScript language";
	var re = /JAVA/;
	var result = re.test(s) ? "\" matches" : "\" does not match";
	document.write("The string \"" + s + result + " the pattern " + re);


Because regular expressions distinguish between lowercase and uppercase letters by default, this example displays the following text in the browser window:

The string "Learning the JavaScript language" does not match the pattern /JAVA/

If we now replace the second line of the example with var re = /JAVA/i;, the following text is displayed:

The string "Learning the JavaScript language" matches the pattern /JAVA/i


Now consider the global search option. It is usually used by the replace method when searching for a pattern and replacing the found substring with a new one. The fact is that by default this method replaces only the first found substring and returns the result. Consider the following scenario:

	var s = "We write scripts in JavaScript, " +
	"but JavaScript is not the only scripting language.";
	var re = /JavaScript/;
	document.write(s.replace(re, "VBScript"));


It outputs text that clearly is not the desired result: We write scripts in VBScript, but JavaScript is not the only scripting language. For all occurrences of the string "JavaScript" to be replaced with "VBScript", we must change the regular expression to var re = /JavaScript/g;. Then the resulting string looks like this:

We write scripts in VBScript, but VBScript is not the only scripting language.

Finally, the multiline option is intended for strings that consist of several lines of text joined by line break characters. It changes the interpretation of two special characters. Normally the ^ character matches only at the very beginning of the string; with the multiline option enabled, it also matches at any position preceded by a line break character. Normally the $ character matches only at the very end of the string; with the multiline option enabled, it also matches at any position followed by a line break character.
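The difference can be seen on a three-line string. This is an illustrative sketch; the sample text is an assumption made for the example.

```javascript
var text = "first line\nsecond line\nthird line";

// Without m, ^ matches only at the start of the whole string.
var withoutM = text.match(/^\w+/g);   // ["first"]

// With m, ^ also matches after every line break.
var withM = text.match(/^\w+/gm);     // ["first", "second", "third"]

// Likewise, with m, $ matches before every line break.
var lineEnds = text.match(/\w+$/gm);  // ["line", "line", "line"]

console.log(withoutM, withM, lineEnds);
```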

Memorizing found substrings

If part of a regular expression is enclosed in parentheses, the corresponding substring is stored for later use. To access stored substrings, use the properties $1, ..., $9 of the RegExp object, or the elements of the array returned by the exec and match methods. In the latter case, the number of found and stored substrings is unlimited.

The following script uses the replace method to rearrange the words in a string. In the replacement text, $1 and $2 stand for the first and second stored substrings.

	var re = /(\w+)\s(\w+)/i;
	var str = "Mikhail Bulgakov";
	document.write(str.replace(re, "$2, $1"));
	

This script will display the text in the browser window:

Bulgakov, Mikhail

Since \w is equivalent to [A-Za-z0-9_], it does not match Russian letters. If we want to work with Russian letters, we have to modify the code slightly:

	var re = /([а-я]+)\s([а-я]+)/i;
	var str = "Михаил Булгаков";
	document.write(str.replace(re, "$2, $1"));  // Булгаков, Михаил


This script will display the text in the browser window:

Булгаков, Михаил
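The stored substrings are also available as elements of the array returned by exec, which is not limited to nine groups. A minimal sketch using the same name as in the first example of this section:

```javascript
var re = /(\w+)\s(\w+)/;
var found = re.exec("Mikhail Bulgakov");

console.log(found[0]);     // "Mikhail Bulgakov" (the whole match)
console.log(found[1]);     // "Mikhail"  (first group)
console.log(found[2]);     // "Bulgakov" (second group)
console.log(found.index);  // 0 (position of the match in the string)
```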

Introduction

Basic concepts

Regular expressions are a powerful tool for processing incoming data. A task that requires a replacement or search for text can be beautifully solved with the help of this "language within the language". And although the maximum effect from regular expressions can be achieved when using server-side languages, you should not underestimate the capabilities of this application on the client side.

A regular expression is a tool for processing strings: a sequence of characters that defines a text pattern.

Modifier - gives additional "instructions" to the regular expression.
Metacharacters - special characters that serve as commands of the regular expression language.

A regular expression is assigned like an ordinary variable, except that slashes are used instead of quotes, for example:

  var reg = /reg_expression/

By the simplest patterns we mean patterns that do not need any special characters.

Let's say our task is to replace every Cyrillic letter "р" (lowercase and uppercase) with the Latin capital letter "R" in the phrase "Регулярные выражения" ("Regular expressions").

We create the pattern var reg = /р/ and use the replace method to do what we intended:

	<script language="JavaScript">
		var str = "Регулярные выражения"
		var reg = /р/
		var result = str.replace(reg, "R")
		document.write(result)
	</script>


As a result, we get the string 'РегуляRные выражения': only the first occurrence of the letter "р" was replaced, and the match was case-sensitive. This result does not satisfy the conditions of our task. Here we need the modifiers "g" and "i", which can be used separately or together.
These modifiers are placed at the end of the regular expression pattern, after the closing slash. The modifier "g" makes the search "global", i.e., in our case every occurrence of the letter "р" is replaced. The pattern now looks like this: var reg = /р/g. Substituting it into our code:

	<script language="JavaScript">
		var str = "Регулярные выражения"
		var reg = /р/g
		var result = str.replace(reg, "R")
		document.write(result)
	</script>

We get the string 'РегуляRные выRажения'.

The modifier "i" makes the search case-insensitive. After adding this modifier to our pattern, var reg = /р/gi, the script gives the result our task asked for: 'RегуляRные выRажения'.

Special characters (metacharacters)

Metacharacters specify the kind of characters to search for, where the searched text must be located in the string, and how many characters of a given kind must occur. Metacharacters can therefore be divided into three groups:

Match metacharacters.
Quantitative metacharacters.
Positional metacharacters.

Match metacharacters

\b a word boundary; requires the pattern to match at the beginning or end of a word.

\B not a word boundary; requires the pattern to match somewhere other than the beginning or end of a word.

\d a digit from 0 to 9.

\D not a digit.

\s a single whitespace character, such as a space.

\S a single non-whitespace character, i.e., any one character except whitespace.

\w a letter, digit, or underscore.

\W not a letter, digit, or underscore.

. any single character: letter, digit, and so on (except a line break).

[] a character set; the pattern matches any one of the characters enclosed in the square brackets.

[^] a negated character set; the pattern matches any one character that is not listed in the brackets.

Quantitative metacharacters

* zero or more times.

? zero or one time.

+ one or more times.

{n} exactly n times.

{n,} n or more times.

{n,m} at least n times, but not more than m times.

Positional metacharacters

^ at the beginning of the line.

$ at the end of the line.
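One pattern often combines all three groups. The following sketch uses a made-up time format (two digits, a colon, two digits) purely for illustration.

```javascript
// ^ and $ are positional, \d is a match metacharacter, {2} is a quantifier.
var time = /^\d{2}:\d{2}$/;
var t1 = time.test("09:30");        // true
var t2 = time.test("9:30");         // false: only one digit before the colon
var t3 = time.test("at 09:30 am");  // false: the anchors require the whole string to match

// Character sets combined with a quantifier.
var digits = "a1b22c333".match(/[0-9]+/g);      // ["1", "22", "333"]
var nonDigits = "a1b22c333".match(/[^0-9]+/g);  // ["a", "b", "c"]

console.log(t1, t2, t3, digits, nonDigits);
```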

Some methods for working with templates

replace - we already used this method at the very beginning of the article; it searches for a pattern and replaces the found substring with a new substring.

test - this method checks whether the string contains a match for the pattern; it returns false if the match fails and true otherwise.

Example

	<script language="JavaScript">
		var str = "JavaScript"
		var reg = /PHP/
		var result = reg.test(str)
		document.write(result)
	</script>


This prints false, because the string "JavaScript" does not contain the substring "PHP".

The test method itself always returns true or false, but with the conditional operator its result can be turned into any string the programmer chooses.
For example:

	<script language="JavaScript">
		var str = "JavaScript"
		var reg = /PHP/
		var result = reg.test(str) ? "String matched" : "String did not match"
		document.write(result)
	</script>

In this case, the result is the string 'String did not match'.

exec - this method matches the string against the pattern. If the match fails, null is returned. Otherwise, the result is an array of substrings: the first element is the substring that matched the whole pattern, and the following elements are the substrings stored by the groups in parentheses.
For example:

	<script language="JavaScript">
		// The dots are escaped so that they match a literal "." only
		var reg = /(\d+)\.(\d+)\.(\d+)/
		var arr = reg.exec("I was born on 15.09.1980")
		document.write("Date of birth: ", arr[0], "<br>")
		document.write("Birthday: ", arr[1], "<br>")
		document.write("Month of birth: ", arr[2], "<br>")
		document.write("Year of birth: ", arr[3], "<br>")
	</script>


The result is four lines:
Date of birth: 15.09.1980
Birthday: 15
Month of birth: 09
Year of birth: 1980

Conclusion

This article does not cover every feature of regular expressions; for a deeper understanding, I advise studying the RegExp object. I also want to point out that regular expression syntax is largely the same in JavaScript and in PHP. For example, a regular expression that checks whether an e-mail address was entered correctly looks the same in both languages:
/[0-9a-z_]+@[0-9a-z_^.]+\.[a-z]{2,3}/i
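A sketch of how such a check might be used in JavaScript follows. It assumes a cleaned-up form of the pattern above (range [a-z], escaped dot) and adds the anchors ^ and $ so that the whole string has to match. The helper name looksLikeEmail is hypothetical, and the check is deliberately loose; it is not a complete address validator.

```javascript
// Anchored variant of the e-mail pattern; a loose sanity check only.
var emailRe = /^[0-9a-z_]+@[0-9a-z_^.]+\.[a-z]{2,3}$/i;

// Hypothetical helper name, introduced for this example.
function looksLikeEmail(value) {
  return emailRe.test(value);
}

var e1 = looksLikeEmail("user_1@example.com"); // true
var e2 = looksLikeEmail("user_1@example");     // false: no dot followed by a 2-3 letter suffix
var e3 = looksLikeEmail("not an email");       // false: no @ sign

console.log(e1, e2, e3);
```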

Regular expressions are a kind of pattern for finding specific combinations of characters in strings, with regular expressions enclosed in slashes. Below are the literals of regular expressions.

Literals

Literal - Description
\ - Part of special character sequences; also tells the interpreter that the next character is not a literal. Used before octal character codes, to retrieve stored subexpressions from memory, and to use a special character literally in the regular expression.
^ - Start of line
$ - End of line
* - The preceding character may occur in the string any number of times, or not at all
+ - The preceding character must occur in the string one or more times
? - The preceding character may occur in the string once, or not at all
{number} - The preceding character must occur in the string the specified number of times
{number,} - The preceding character must occur in the string the specified number of times or more
{number1,number2} - The preceding character must occur in the string from number1 to number2 times
. - Any character other than \n (new line)
(subexpression) - Searches for the subexpression and stores the found group of characters
\group number - Retrieves from memory the previously stored group of characters with the specified number
character1|character2 - Searches for one of the two characters
[character set] - Searches for a character from the given set
[^character set] - Searches for any character not included in the set
\b - A word boundary: the position between a word and a space
\B - Any position that is not a word boundary
\c - Matches a control character of the form Ctrl + character
\d - Any digit
\D - Any non-digit character
\f - Form feed character
\n - New line character
\r - Carriage return character
\s - Space, tab, new line, or carriage return
\t - Tab
\v - Vertical tab
\w - Letter, digit, or underscore
\xcode - Character with the specified hexadecimal code
\0code - Character with the specified octal code

Example

	/(\w+)@([\w\._]+)/


This expression looks for an email address and divides it into two parts, the mailbox name and the server name, which are stored in memory as groups of characters numbered 1 and 2.
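A short sketch of how the two stored groups can be read back with exec; the sample sentence and address are made up for the example.

```javascript
var mailRe = /(\w+)@([\w\._]+)/;
var parts = mailRe.exec("Write to admin@mail.example.org today");

console.log(parts[1]); // "admin" (mailbox name, group 1)
console.log(parts[2]); // "mail.example.org" (server name, group 2)
```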

Class RegExp

This class is responsible for processing strings with the help of regular expressions. Its constructor has the following form:

  RegExp(regular expression, flags)


It has only one mandatory parameter, the regular expression, which is passed as a quoted string. The flags parameter sets additional search conditions and can take the following values:

  • g - global search; if this flag is set, the expression returns all matches.
  • i - ignore the case of characters
  • m - multiline search

Three methods of the String class are used to work with regular expressions:

  • match - searches the string using the regular expression passed as a parameter and returns an array with the search results. If nothing is found, null is returned.
  • replace - performs a search and replace in the string using a regular expression and returns the resulting string.
  • search - searches the string using the regular expression passed as a parameter and returns the position of the first substring that matches the regular expression.

Properties

Property - Description
lastIndex - Specifies the position in the string at which the next search starts
source - Returns the text of the regular expression (read-only)
global - Indicates whether the g flag is set; returns true or false
ignoreCase - Indicates whether the i flag is set; returns true or false
multiline - Indicates whether the m flag is set; returns true or false
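These properties can be inspected directly, and lastIndex is what lets exec walk through all matches of a global pattern. The following is an illustrative sketch with a made-up sample string.

```javascript
var re = /\d+/g;
console.log(re.source);      // "\d+"
console.log(re.global);      // true
console.log(re.ignoreCase);  // false
console.log(re.multiline);   // false

// With the g flag, each exec call resumes from lastIndex.
var str = "10 apples, 20 pears";
var m;
var found = [];
while ((m = re.exec(str)) !== null) {
  found.push(m[0] + " ends at " + re.lastIndex);
}
console.log(found); // ["10 ends at 2", "20 ends at 13"]
```

After the loop, exec has returned null and lastIndex is reset to 0, so the same object can be reused.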

Methods

Method - Description
compile(regular expression, flags g, i, m) - Compiles a regular expression into an internal format to speed up its work; can also be used to change a regular expression (deprecated in current engines)
exec(string) - Similar to the match method of the String class, but the string to search is passed as a parameter
test(string) - Similar to the search method of the String class, but returns true or false depending on the search result

Example

	var result, re, str;
	str = "http://www.netscape.com";
	re = new RegExp("w{3}", "i");
	result = str.match(re);


Here the script searches, ignoring case, for the text "www" in the string assigned to the variable str, and the match method returns an array containing the search results.

Global RegExp object

This global object gives access to the results of searches made with regular expressions. It is created by the interpreter and is always available. Its properties are accessed as follows:

  RegExp.property
 

Properties

Property - Description
$n (subexpression number) - Returns one of the most recently found subexpressions, depending on the number. The number can be 1-9, because the interpreter stores only the last nine found subexpressions in these properties; to access the rest, use the array returned by the match or exec methods
index - Returns the position of the found substring in the string
input | $_ - Returns the string being searched
lastIndex - Specifies the position in the string at which the next search starts
lastMatch | $& - Returns the last found substring
lastParen | $+ - Returns the last found group of characters if the regular expression used subexpressions
leftContext | $` - Returns the string made up of all characters from the beginning of the string up to the last found substring, not including it
rightContext | $' - Returns the string made up of all characters from the end of the last found substring, not including it, to the end of the source string
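A sketch of reading some of these properties after a successful match. They are legacy, non-standard features, so this assumes an engine that still supports them (current browsers and Node.js do at the time of writing). The values are copied into a plain object immediately, because any later regular expression operation overwrites them; the object name snapshot and the sample string are assumptions made for the example.

```javascript
var re = /(\w+)@(\w+)/;
re.test("mail: bob@example now");

// Copy the legacy static properties right away; the next match would replace them.
var snapshot = {
  first: RegExp.$1,                  // "bob"
  second: RegExp.$2,                 // "example"
  lastMatch: RegExp.lastMatch,       // "bob@example"
  lastParen: RegExp.lastParen,       // "example"
  leftContext: RegExp.leftContext,   // "mail: "
  rightContext: RegExp.rightContext  // " now"
};

console.log(snapshot);
```

In new code, the array returned by exec or match carries the same information without relying on global state.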

Examples of using Regular Expressions

Parsing a URL

	var re, str, protocol, address, filename, result;
	str = "http://www.somedomain.ru/index2.html";
	// Inside a string, the backslash of \w must be doubled
	re = new RegExp("((\\w+)://)?([^/]+)(.*)?", "i");
	result = re.exec(str);
	if (result != null)
	{
		protocol = RegExp.$2;
		address = RegExp.$3;
		filename = RegExp.$4;
	}


This script splits an Internet address into its component parts. The regular expression uses several subexpressions, which the interpreter numbers as follows: the outer subexpression comes first, then the inner one. The call result = re.exec(str) performs the match; the result is then checked and, if it is not null, the corresponding parts of the address are assigned to the variables.
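The same parts can be taken from the array returned by exec instead of the global RegExp object. The following is an alternative sketch, not the article's own code: it uses a literal and a non-capturing group for the protocol prefix, so the group numbers differ from the script above.

```javascript
// (?:...) groups the optional "protocol://" prefix without storing it.
var urlRe = /^(?:(\w+):\/\/)?([^\/]+)(.*)$/;
var result = urlRe.exec("http://www.somedomain.ru/index2.html");

var protocol, address, filename;
if (result !== null) {
  protocol = result[1]; // "http"
  address = result[2];  // "www.somedomain.ru"
  filename = result[3]; // "/index2.html"
}

console.log(protocol, address, filename);
```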

Function of removing spaces at the beginning and end of a line

	function trim(str)
	{
		return str.toString().replace(/^[ ]+/, '').replace(/[ ]+$/, '');
	}

Another variant:
	function trim(str) {
		return str.replace(/^\s+|\s+$/g, '');
	}
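For comparison, engines that support ECMAScript 5 or later provide a built-in String.prototype.trim method. The sketch below checks that the second regular expression variant gives the same result on a sample string; the function name trimWithRegex and the sample string are assumptions made for the example.

```javascript
// Hypothetical name; the body is the second variant shown above.
function trimWithRegex(str) {
  return str.replace(/^\s+|\s+$/g, "");
}

var padded = "  \t hello world \n";
var same = trimWithRegex(padded) === padded.trim();
var bracketed = "[" + padded.trim() + "]";

console.log(same);      // true
console.log(bracketed); // "[hello world]"
```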
	

Find music on the page

	var mmReg = /(?:http:\/\/[\w.]+\/)?(?!:\/\/)[^<^>^"^'^\s]+\.(?:aiff|au|avi|flv|mid|mov|mp3|ogg|ra|rm|spl|swf|wav|wma|wmv)(?!\w)/ig;
	this.tmp = this.allText.match(mmReg);
	if (this.tmp && this.search_common_embeds)
		if ((this.tmp.length > 1))
			if ((this.tmp.length > 2) || (this.tmp[0] != this.tmp[1])) ...