
JavaScript regular expressions cheat sheet

General description

Regular expressions are patterns for searching for specified character combinations in text strings (such a search is called pattern matching). There are two ways to assign a regular expression to a variable:
Using the object initializer: var re = /pattern/flags.
Using the RegExp constructor: var re = new RegExp("pattern"[, "flags"]).
Here pattern is the regular expression and flags are optional search options.

Object initializers, for example var re = /ab+c/, should be used when the value of the regular expression remains unchanged while the script is running. Such regular expressions are compiled when the script is loaded and are therefore faster.

The constructor call, for example var re = new RegExp("ab+c"), should be used when the pattern will change or is not known until the script runs. If you are going to use the same regular expression several times, create it once and reuse the object. Older engines offered the compile method for recompiling a pattern, but that method is now deprecated.
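A minimal sketch of the case where the constructor is needed: the pattern is assembled at run time from a variable. The helper names escapeRegExp and countOccurrences are hypothetical and do not come from the original text.

```javascript
// Hypothetical helper: escape characters that are special in a pattern,
// so that arbitrary text can be searched for literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: count how many times `word` occurs in `text`.
// The pattern is only known at run time, so a literal cannot be used.
function countOccurrences(text, word) {
  var re = new RegExp(escapeRegExp(word), "g");
  var found = text.match(re);
  return found ? found.length : 0;
}

console.log(countOccurrences("a+b, a+b and ab", "a+b")); // 2
```

Without the escaping step, the "+" in "a+b" would be read as a quantifier and the count would be wrong.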

When creating a regular expression, it should be noted that enclosing it in quotation marks entails the need to use escape sequences, as in any other string constant. For example, the following two expressions are equivalent:

	 var re = /\w+/g;
	 var re = new RegExp("\\w+", "g");  // Inside a string, "\" must be written as "\\"
	

Note: A regular expression literal cannot be empty: two consecutive slashes // start a comment. To specify an empty regular expression, use the expression /(?:)/.

Regular expressions are used by the exec and test methods of the RegExp object and by the match, replace, search, and split methods of the String object. If we only need to check whether a string contains a substring matching the pattern, the test or search methods are used. If we need to extract the substring (or substrings) that match the pattern, we use the exec or match methods. The replace method searches for a given substring and replaces it with another string, and the split method splits a string into several substrings, using a regular expression or an ordinary text string as the separator. For more information about the use of regular expressions, see the description of the corresponding methods.
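The division of labor between these methods can be sketched as follows; the sample string and the variable names are made up for illustration.

```javascript
var text = "one two  three";

// test and search only report whether (or where) a match exists.
var hasDigit = /\d/.test(text);          // false
var firstSpace = text.search(/\s/);      // 3

// exec and match extract the matching substrings.
var firstWord = /\w+/.exec(text)[0];     // "one"
var allWords = text.match(/\w+/g);       // ["one", "two", "three"]

// replace substitutes matches; split cuts the string at matches.
var dashed = text.replace(/\s+/g, "-");  // "one-two-three"
var parts = text.split(/\s+/);           // ["one", "two", "three"]

console.log(hasDigit, firstSpace, firstWord, allWords, dashed, parts);
```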

Regular Expression Syntax

A regular expression can consist of ordinary characters; in this case it matches that exact combination of characters in the string. For example, the expression /com/ matches a substring in each of the following strings: "comet", "welcome", "naval commander". However, the flexibility and power of regular expressions come from the special characters that can be used in them, which are listed in the following table.

Special characters in regular expressions:

\ - For characters that are usually treated literally, means that the next character is special. For example, /n/ matches the letter n, and /\n/ matches the newline character. For characters that are usually treated as special, it means that the character must be understood literally. For example, /^/ denotes the beginning of the line, and /\^/ matches simply the character ^. /\\/ matches the backslash \.

^ - Matches the beginning of the line.

$ - Matches the end of the line.

* - Matches the repetition of the previous character zero or more times.

+ - Corresponds to repeating the previous character one or more times.

? - Matches the repetition of the previous character zero or one time.

. - Matches any character except the newline character.

(pattern) - Matches pattern and remembers the match found.

(?:pattern) - Matches pattern but does not remember the match found. Used to group parts of a pattern; for example, /industr(?:y|ies)/ is a shorter way of writing /industry|industries/.

(?=pattern) - Positive lookahead: matches at a position where pattern follows, without remembering the match found. For example, /Windows (?=95|98|NT|2000)/ matches "Windows" in the string "Windows 98" but does not match in the string "Windows 3.1". After a match, the search continues from the position following the match found, not counting the characters examined by the lookahead.

(?!pattern) - Negative lookahead: matches at a position where pattern does not follow, without remembering the match found. For example, /Windows (?!95|98|NT|2000)/ matches "Windows" in the string "Windows 3.1" but does not match in the string "Windows 98". After a match, the search continues from the position following the match found, not counting the characters examined by the lookahead.
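The two lookahead forms can be tried out with a short sketch like the one below, which reuses the "Windows" patterns from the descriptions above; the variable names are illustrative.

```javascript
var positive = /Windows (?=95|98|NT|2000)/;
var negative = /Windows (?!95|98|NT|2000)/;

console.log(positive.test("Windows 98"));   // true
console.log(positive.test("Windows 3.1"));  // false
console.log(negative.test("Windows 3.1"));  // true
console.log(negative.test("Windows 98"));   // false

// The text examined by the lookahead is not part of the match itself.
var matched = "Windows 98".match(positive)[0];
console.log(JSON.stringify(matched));       // "Windows "
```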

x | y - Matches x or y.

{n} - n is a nonnegative number. Matches exactly n occurrences of the previous character.

{n,} - n is a nonnegative number. Matches n or more occurrences of the previous character. /x{1,}/ is equivalent to /x+/. /x{0,}/ is equivalent to /x*/.

{n,m} - n and m are nonnegative numbers. Matches at least n and no more than m occurrences of the previous character. /x{0,1}/ is equivalent to /x?/.

[xyz] - Matches any one of the characters enclosed in square brackets.

[^xyz] - Matches any character except those enclosed in square brackets.

[a-z] - Matches any character in the specified range.

[^a-z] - Matches any character outside the specified range.

\b - Matches a word boundary, i.e., the position between a word character and a non-word character such as a space or a line feed.

\B - Matches any position other than a word boundary.

\cX - Matches the control character Ctrl+X. For example, /\cI/ is equivalent to /\t/.

\d - Matches a digit. Equivalent to [0-9].

\D - Matches a non-digit character. Equivalent to [^0-9].

\f - Matches the form feed (FF) character.

\n - Matches the line feed (LF) character.

\r - Matches the carriage return (CR) character.

\s - Matches a whitespace character. For ASCII text, equivalent to /[ \f\n\r\t\v]/.

\S - Matches any non-whitespace character. For ASCII text, equivalent to /[^ \f\n\r\t\v]/.

\t - Matches the tab character (HT).

\v - Matches the vertical tab (VT) character.

\w - Matches a letter, digit, or underscore. Equivalent to /[A-Za-z0-9_]/.

\W - Matches any character except a letter, digit, or underscore. Equivalent to /[^A-Za-z0-9_]/.

\n - n is a positive number. Matches the nth stored substring (a backreference). Stored substrings are numbered by counting left parentheses. If fewer than n left parentheses precede this reference, it is equivalent to \0n.

\0n - n is an octal number not greater than 377. Matches the character with the octal code n. For example, /\011/ is equivalent to /\t/.

\xn - n is a hexadecimal number consisting of two digits. Matches the character with the hexadecimal code n. For example, /\x31/ is equivalent to /1/.

\un - n is a hexadecimal number consisting of four digits. Matches the Unicode character with the hexadecimal code n. For example, /\u00A9/ is equivalent to /©/.
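A short sketch of a backreference and of the character-code escapes from the table; the test strings are chosen only for illustration.

```javascript
// \1 refers back to whatever the first group matched,
// so this pattern finds a doubled word character.
var doubled = /(\w)\1/;
var inLetter = doubled.test("letter");   // true ("tt")
var inWord = doubled.test("word");       // false

// Hexadecimal, Unicode and control escapes name a character by its code.
var hexOne = /\x31/.test("1");           // true
var copyright = /\u00A9/.test("©");      // true
var ctrlI = /\cI/.test("\t");            // true

console.log(inLetter, inWord, hexOne, copyright, ctrlI);
```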

Regular expressions are evaluated similarly to other JavaScript expressions, i.e., taking into account the priority of the operations: operations that have higher priority are performed first. If the operations have equal priority, they are executed from left to right. The following table lists the regular expression operations in descending order of priority; the operations in one row of the table have the same priority.

Operations:
	 \
	 () (?:) (?=) (?!) []
	 * + ? . {n} {n,} {n,m}
	 ^ $ \metacharacter
	 |
	

Search Options

When creating a regular expression, we can specify additional search options:
i (ignore case). Do not distinguish between uppercase and lowercase letters.
g (global search). Global search for all occurrences of a pattern.
m (multiline). Multiline search.
Any combinations of these three options, for example ig or gim.

Example

	 var s = "Learning the JavaScript language";
	 var re = /JAVA/;
	 var result = re.test(s) ? "\" matches" : "\" does not match";
	 document.write("The string \"" + s + result + " the pattern " + re);
	

Because regular expressions distinguish between lowercase and uppercase letters, this example will display the following text in the browser window:

The string "Learning the JavaScript language" does not match the pattern /JAVA/

If we now replace the second line of the example with var re = /JAVA/i;, then the following text will be displayed:

The string "Learning the JavaScript language" matches the pattern /JAVA/i


Now consider the global search option. It is usually used by the replace method when searching for a pattern and replacing the found substring with a new one. The fact is that by default this method replaces only the first substring found and returns the result. Consider the following scenario:

	 var s = "We write scripts in JavaScript, " +
	 "but JavaScript is not the only scripting language.";
	 var re = /JavaScript/;
	 document.write(s.replace(re, "VBScript"));
	

It outputs text that clearly does not match the desired result: We write scripts in VBScript, but JavaScript is not the only scripting language. In order for all occurrences of the string "JavaScript" to be replaced by "VBScript", we must change the regular expression to var re = /JavaScript/g;. Then the resulting string will look like:

We write scripts in VBScript, but VBScript is not the only scripting language.

Finally, the multiline search option applies to strings that consist of several lines of text joined by line break characters. It does not change how ordinary characters are matched, since the search always covers the entire source string. What it changes is the interpretation of two special characters. Normally the ^ character matches only at the very beginning of the string; with the multiline option enabled, it also matches at any position immediately preceded by a line break character. Normally the $ character matches only at the very end of the string; with the multiline option enabled, it also matches at any position immediately followed by a line break character.
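The effect of the multiline option on ^ and $ can be seen in a small sketch; the two-line sample string is made up for illustration.

```javascript
var text = "first line\nsecond line";

// Without m, ^ and $ anchor only to the whole string.
var startsPlain = text.match(/^\w+/g);   // ["first"]
var endsPlain = text.match(/\w+$/g);     // ["line"] (only the final one)

// With m, they also anchor to the start and end of every line.
var startsMulti = text.match(/^\w+/gm);  // ["first", "second"]
var endsMulti = text.match(/\w+$/gm);    // ["line", "line"]

console.log(startsPlain, endsPlain, startsMulti, endsMulti);
```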

Memorizing found substrings

If part of the regular expression is enclosed in parentheses, the corresponding substring will be stored for later use. To access stored substrings, use the properties $1, ..., $9 of the RegExp object, or the elements of the array returned by the exec and match methods. In the latter case, the number of found and stored substrings is not limited.
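Access through the returned array can be sketched as follows; the date string and variable names are invented for the example.

```javascript
var re = /(\d+)-(\d+)-(\d+)/;
var found = re.exec("Released on 2009-12-03.");

// found[0] is the whole match; found[1], found[2], ... are the
// substrings stored by the first, second, ... pair of parentheses.
var whole = found[0];  // "2009-12-03"
var year = found[1];   // "2009"
var month = found[2];  // "12"
var day = found[3];    // "03"

console.log(whole, year, month, day);
```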

The following script uses the replace method to rearrange words in a string. In the replacement text, $1 and $2 refer to the stored substrings.

	 var re = /(\w+)\s(\w+)/i;
	 var str = "Mikhail Bulgakov";
	 document.write(str.replace(re, "$2, $1"));
	

This script will display the text in the browser window:

Bulgakov, Mikhail

Since \w is equivalent to [A-Za-z0-9_], it does not match Russian letters. If we want to handle Russian letters, we have to modify the code slightly:

	 var re = /([а-я]+)\s([а-я]+)/i;
	 var str = "Михаил Булгаков";
	 document.write(str.replace(re, "$2, $1"));  // Булгаков, Михаил
	

This script will display the text in the browser window:

Булгаков, Михаил

Introduction

Basic concepts

Regular expressions are a powerful tool for processing incoming data. A task that requires replacement or search for text can be beautifully solved with the help of this "language within the language". And although the maximum effect of regular expressions can be achieved when using server-side languages, you should not underestimate the capabilities of this application on the client side.

A regular expression is a tool for processing strings or a sequence of characters that defines a text template.

A modifier is a flag that tells the regular expression how the search should be performed.
Metacharacters are special characters that serve as commands of the regular expression language.

A regular expression is assigned like an ordinary variable, except that slashes are used instead of quotes, for example:

  var reg = /reg_expression/

By the simplest patterns we mean patterns that do not need any special characters.

Let's say that our task is to replace all the Cyrillic letters "р" (lowercase and uppercase) with the Latin capital letter "R" in the phrase Регулярные выражения ("Regular expressions").

We create the pattern var reg = /р/ and use the replace method to do what we intended:

	 <script language="JavaScript">
		 var str = "Регулярные выражения"
		 var reg = /р/
		 var result = str.replace(reg, "R")
		 document.write(result)
	 </script>
	

As a result, we get the string 'РегуляRные выражения': the replacement occurred only at the first occurrence of the letter "р", and the search was case-sensitive. This result does not satisfy the conditions of our task. Here we need the modifiers "g" and "i", which can be used separately or together.
These modifiers are placed at the end of the regular expression pattern, after the closing slash, and have the following meanings. The modifier "g" makes the search "global", i.e., in our case the replacement will occur for all occurrences of the letter "р". Now the pattern looks like this: var reg = /р/g. Substituting it into our code

	 <script language="JavaScript">
		 var str = "Регулярные выражения"
		 var reg = /р/g
		 var result = str.replace(reg, "R")
		 document.write(result)
	 </script>
	
we get the string 'РегуляRные выRажения'.

The modifier "i" makes the search case-insensitive. After adding this modifier to our pattern, var reg = /р/gi, the script produces the result our task requires: 'RегуляRные выRажения'.

Special characters (metacharacters)

Metacharacters specify the type of characters being searched for, the position of the search string within the text, and the number of characters of a given type in the text being examined. Therefore, metacharacters can be divided into three groups:

Match metacharacters.
Quantitative metacharacters.
Metacharacters of positioning.

Match metacharacters

\b a word boundary; the pattern must match at the beginning or end of a word.

\B not a word boundary; the pattern must match somewhere other than the beginning or end of a word.

\d a digit from 0 to 9.

\D not a digit.

\s a single whitespace character, such as a space.

\S a single non-whitespace character.

\w a letter, digit, or underscore.

\W not a letter, digit, or underscore.

. any single character (letter, digit, etc.) except a line break.

[] a character set; the pattern matches any one of the characters enclosed in the square brackets.

[^] a negated character set; the pattern matches any one character that is not enclosed in the square brackets.

Quantitative metacharacters

* zero or more times.

? zero or one time.

+ one or more times.

{n} exactly n times.

{n,} n or more times.

{n, m} at least n times, but not more than m times.

Metacharacters of positioning

^ at the beginning of the line.

$ at the end of the line.
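Quantifiers and positioning metacharacters are often combined to check a whole string. The sketch below assumes a made-up rule (a string of 3 to 5 digits) and a hypothetical function name isShortNumber.

```javascript
// Hypothetical check: the whole string must consist of 3 to 5 digits.
// ^ and $ tie the pattern to the start and end of the string.
function isShortNumber(s) {
  return /^\d{3,5}$/.test(s);
}

console.log(isShortNumber("1234"));    // true
console.log(isShortNumber("12"));      // false (too short)
console.log(isShortNumber("123456"));  // false (too long)
console.log(isShortNumber("12a4"));    // false (not only digits)

// Without ^ and $, a match anywhere inside the string is enough.
var loose = /\d{3,5}/.test("abc123456");
console.log(loose);                    // true
```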

Some methods for working with patterns

replace - we already used this method at the very beginning of the article; it searches for a pattern and replaces the found substring with a new substring.

test - this method checks whether the string contains a match for the pattern; it returns true if a match is found and false otherwise.

Example

	 <script language="JavaScript">
		 var str = "JavaScript"
		 var reg = /PHP/
		 var result = reg.test(str)
		 document.write(result)
	 </script>
	

will print false as the result, because the string "JavaScript" does not contain the substring "PHP".

The test method itself always returns true or false, but with the conditional operator the result can be turned into any strings chosen by the programmer.
For example:

	 <script language="JavaScript">
		 var str = "JavaScript"
		 var reg = /PHP/
		 var result = reg.test(str) ? "String matched" : "String did not match"
		 document.write(result)
	 </script>
	
In this case, the result is the string 'String did not match'.

exec - this method matches the string against the pattern. If the match fails, null is returned. Otherwise, the result is an array of substrings corresponding to the pattern. The first element of the array is the part of the original string that matches the whole pattern; the following elements are the stored substrings.
For example:

	 <script language="JavaScript">
		 var reg = /(\d+)\.(\d+)\.(\d+)/
		 var arr = reg.exec("I was born on 15.09.1980")
		 document.write("Date of birth: ", arr[0], "<br>")
		 document.write("Day of birth: ", arr[1], "<br>")
		 document.write("Month of birth: ", arr[2], "<br>")
		 document.write("Year of birth: ", arr[3], "<br>")
	 </script>
	

the result is four lines:
Date of birth: 15.09.1980
Day of birth: 15
Month of birth: 09
Year of birth: 1980

Conclusion

Not all features of regular expressions are covered in this article; for a deeper study of the subject, I advise you to study the RegExp object. I also want to point out that the basic syntax of regular expressions is the same in JavaScript and in PHP. For example, a regular expression that checks the format of an e-mail address looks the same in both languages:
/[0-9a-z_]+@[0-9a-z_^.]+\.[a-z]{2,3}/i
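A hedged sketch of how such a check might be wrapped in a function. It is a variation on the pattern above, not the author's exact expression: it assumes the dot before the domain suffix is escaped, adds ^ and $ so that the whole string must match, and uses a hypothetical function name looksLikeEmail. Like the original, it is deliberately loose and rejects some valid addresses.

```javascript
// Hypothetical helper: a rough format check, not a full validator.
function looksLikeEmail(s) {
  return /^[0-9a-z_]+@[0-9a-z_.]+\.[a-z]{2,3}$/i.test(s);
}

console.log(looksLikeEmail("user_1@example.com"));     // true
console.log(looksLikeEmail("user@mail.example.org"));  // true
console.log(looksLikeEmail("user@example"));           // false (no suffix)
console.log(looksLikeEmail("user@example.c"));         // false (suffix too short)
```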

Regular expressions are a kind of pattern for finding specific combinations of characters in strings, with regular expressions enclosed in slashes. The literals of regular expressions are listed below.

Literals

Literal - Description
\ - Begins a special character sequence and tells the interpreter that the next character is not to be taken literally. Used before octal character codes, to refer to stored subexpressions, and to use a special character as a literal in the regular expression.
^ - Start of line
$ - End of line
* - The preceding character must occur zero or more times
+ - The preceding character must occur one or more times
? - The preceding character must occur once or not at all
{number} - The preceding character must occur exactly the specified number of times
{number,} - The preceding character must occur the specified number of times or more
{number1,number2} - The preceding character must occur at least number1 and at most number2 times
. - Any character other than \n (new line)
(subexpression) - Searches for the subexpression and stores the found group of characters
\group number - Retrieves the stored group of characters with the specified number
character1|character2 - Matches either of the two alternatives
[character set] - Matches one character from the given set
[^character set] - Matches any character not included in the set
\b - A word boundary, for example the position between a word and a space
\B - Any position that is not a word boundary
\cX - The control character of the form "Ctrl" + "character"
\d - Any digit
\D - Any non-digit character
\f - Form feed character
\n - New line character
\r - Carriage return character
\s - Space, tab, newline, or form feed
\t - Tab
\v - Vertical tab
\w - Letter, digit, or underscore
\xcode - Character with the specified hexadecimal code
\0code - Character with the specified octal code

Example

	 /(\w+)@([\w\._]+)/
	

This expression looks for any email address, dividing it into two parts: the mailbox name and the server name, and stores them in memory as groups of characters under numbers 1 and 2.
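A sketch of how the two stored groups could be read back. It assumes the second group is a character set in square brackets, and the sample sentence and variable names are invented.

```javascript
var re = /(\w+)@([\w._]+)/;
var found = re.exec("Write to admin@example.com today");

var mailbox = found[1];  // group 1: the mailbox name
var server = found[2];   // group 2: the server name

console.log(mailbox);    // "admin"
console.log(server);     // "example.com"
```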

Class RegExp

This class is responsible for processing strings with the help of regular expressions. Its constructor has the following form:

  RegExp ( Regular expression , Flags )
	

It takes only one mandatory parameter, the "Regular expression", which is passed as a quoted string. The "Flags" parameter sets additional search conditions and can take the values:

  • g - global search; if this flag is set, the expression returns all matches rather than only the first.
  • i - ignore the case of characters
  • m - multiline search

To work with regular expressions, three methods of the String class are used:

  • match - searches the string using the regular expression passed as a parameter and returns an array with the search results. If nothing is found, null is returned.
  • replace - performs a search and replace in the string using a regular expression and returns the resulting string.
  • search - searches the string using the regular expression passed as a parameter and returns the position of the first substring that matches the regular expression.

Properties

Property - Description
lastIndex - Specifies the position in the string at which the search starts
source - Returns the text of the regular expression (read-only)
global - Reports whether the g flag is set; returns true or false
ignoreCase - Reports whether the i flag is set; returns true or false
multiline - Reports whether the m flag is set; returns true or false
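The interplay of lastIndex with the g flag can be sketched with a loop over exec; the sample text and variable names are made up for illustration.

```javascript
var re = /\d+/g;
var text = "10 apples, 25 pears";
var report = [];
var found;

// With the g flag, each exec call resumes at re.lastIndex and,
// after a successful match, moves lastIndex just past that match.
while ((found = re.exec(text)) !== null) {
  report.push(found[0] + " ends at " + re.lastIndex);
}

console.log(report);     // ["10 ends at 2", "25 ends at 13"]
console.log(re.source);  // \d+
console.log(re.global, re.ignoreCase, re.multiline);  // true false false
```

After the loop, exec has returned null and lastIndex is reset to 0, so the same object can be used for another search.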

Methods

Method - Description
compile(regular expression, flags g, i, m) - Compiles the regular expression into an internal format to speed up work; can also be used to change the regular expression
exec(string) - Similar to the match method of the String class, but the string to be searched is passed as a parameter
test(string) - Similar to the search method of the String class; returns true or false depending on the search result

Example

	var result, re, str;
	str="http://www.netscape.com";
	re=new RegExp ("w{3}","i");
	result=str.match(re)
	

This part of the script searches, without regard to case, for the text "www" in the string assigned to the variable "str", and the match method returns the array result containing the search results.

The global RegExp object

This global object provides access to the results of a search performed with regular expressions. It is created by the interpreter itself and is always available. Its properties are accessed as follows:

 RegExp.property

Properties

Property - Description
$n - Returns one of the most recently found subexpressions (depending on the number n). The number can be from 1 to 9, because the interpreter stores only the nine most recently found subexpressions in these properties; to access the rest, use the array returned by the match or exec methods
index - Returns the position of the found substring in the string
input | $_ - Returns the string being searched
lastIndex - Specifies the position in the string at which the search starts
lastMatch | $& - Returns the last found substring
lastParen | $+ - Returns the last found group of characters, if subexpressions were used in the regular expression
leftContext | $` - Returns the string made up of all characters from the beginning of the string up to, but not including, the last found substring
rightContext | $' - Returns the string made up of all characters from the end of the last found substring, not including it, to the end of the source string

Examples of using regular expressions

Parsing a URL

	var re, str, protocol, address, filename, result;
	str="http://www.somedomain.ru/index2.html";
	re=new RegExp("((\\w+)://)?([^/]+)(.*)?","i");
	result=re.exec(str);
	if (result !=null)
	 {
		protocol=RegExp.$2;
		address=RegExp.$3;
		filename=RegExp.$4;
	 }
	

This script splits an internet address into several component parts. The regular expression uses several subexpressions, which the interpreter stores as follows: the outer expression is stored first, then the inner one. The regular expression is followed by the line (result=re.exec(str);) that starts the splitting of the address; then the address is checked for validity and, if the result is positive, the corresponding parts of the address are assigned to the variables.
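The same split can also be read from the array returned by exec rather than from the global RegExp properties. The following is a sketch under that assumption; the function name splitUrl and the shape of the returned object are hypothetical, and a non-capturing group is used so that the protocol becomes group 1.

```javascript
// Hypothetical helper: split a URL into protocol, address and file name.
function splitUrl(url) {
  var re = /^(?:(\w+):\/\/)?([^\/]+)(.*)$/;
  var found = re.exec(url);
  if (found === null) {
    return null;  // the string does not look like an address
  }
  return {
    protocol: found[1] || "",  // empty when no "xxx://" prefix is present
    address: found[2],
    filename: found[3]
  };
}

var parts = splitUrl("http://www.somedomain.ru/index2.html");
console.log(parts.protocol);  // "http"
console.log(parts.address);   // "www.somedomain.ru"
console.log(parts.filename);  // "/index2.html"
```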

A function that removes spaces at the beginning and end of a string

	function trim(str)
	 {
	 return str.toString().replace(/^[ ]+/, '').replace(/[ ]+$/, '');
	 }
	
Another variant:
	function trim(str) {
	 return str.replace(/^\s+|\s+$/g, '');
	  }
	

Searching for music on a page

	var mmReg = /(?:http:\/\/[\w.]+\/)?(?!:\/\/)[^<^>^"^'^\s]+\.(?:aiff|au|avi|flv|mid|mov|mp3|ogg|ra|rm|spl|swf|wav|wma|wmv)(?!\w)/ig;
	this.tmp = this.allText.match(mmReg);
	if (this.tmp && this.search_common_embeds)
		if ((this.tmp.length > 1))
			if ((this.tmp.length > 2) || (this.tmp[0] != this.tmp[1])) ...