
HTTP protocol

To begin with, let us establish what a protocol is in general. A protocol is a set of rules and conventions intended for communication between devices. It is needed so that computers or their components can understand one another unambiguously.

A protocol is the language computers speak on the network.

In principle, any set of commands can be called a protocol, but in practice the term is applied only to the so-called network protocols: the languages spoken by computers on the network. Each protocol has a specific purpose and is supported by specialized software.

URL, IP and DNS addresses, domains

So, a URL (Uniform Resource Locator) is the full path to a document. A URL is an address that uniquely identifies a document (file) on the Internet. The line that you type in the "address" field of your browser is also the URL of a document.

A URL can look fairly complex and consist of several parts. To begin, consider the simplest URL:

This URL contains three components: the name of the host where the document is located, the name of the protocol used to transfer the document, and the name of the document itself (file name plus extension). The core of the address (and the only required part for the http protocol) is the host name. It identifies the machine on which the document is located (on a network, individual computers are referred to as hosts). Each computer on the network is a host and has a name that is unique on that network. In the example above, rambler.ru is the name of the computer on which we want to find the document.
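As a rough illustration of this three-part structure, the sketch below splits a URL with Python's standard urllib.parse module. The address http://rambler.ru/index.html serves as the sample, the variable names are arbitrary, and nothing is fetched from the network.

```python
from urllib.parse import urlsplit

# urlsplit only parses the string; it does not contact the host.
parts = urlsplit("http://rambler.ru/index.html")

protocol = parts.scheme    # name of the protocol
host = parts.hostname      # name of the host
document = parts.path      # path and name of the document

print(protocol, host, document)
```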

Host names can be specified in two ways: using a DNS name or using an IP address. An IP address (in IPv4) consists of four numbers separated by dots. Each number can range from 0 to 255. For example, 192.168.2.1 .
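A minimal sketch of the "four numbers from 0 to 255" rule, assuming Python and its standard ipaddress module. The helper name is_ipv4 is made up for this illustration.

```python
import ipaddress

def is_ipv4(text):
    """Return True if text is a valid dotted IPv4 address (four numbers, each 0..255)."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False

print(is_ipv4("192.168.2.1"))    # a well-formed address
print(is_ipv4("192.168.2.256"))  # 256 is out of range
print(is_ipv4("rambler.ru"))     # a DNS name, not an IP address
```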

However, in practice, using IP addresses is inconvenient, since numbers are hard to remember. Therefore the Domain Name System (DNS) was introduced, in which each IP address is associated with one or more names consisting of letters and digits. In the example above, the DNS name was rambler.ru , and it corresponds to the IP address 217.73.192.109 .

It should be noted that each DNS name corresponds to a definite IP address, while the same IP address can correspond to several different DNS names. For example, such different DNS names as www.rambler.ru and rambler.ru have one and the same IP address. URLs may use either DNS names or IP addresses. Thus, the two URLs http://rambler.ru/index.html and http://217.73.192.109/index.html are equivalent. Some ways of writing an IP address are described here: http://www.xakep.ru/post/11980/default.htm .

Note also that, in principle, a host does not have to have a domain name. That is, some hosts can be reached only by IP address.

You have probably already noticed that any DNS name consists of separate words separated by dots. Each word denotes a domain to which the host belongs. The entire DNS system is hierarchical. All 1st-level domains (com, org, ru, etc.) belong to the root domain of level 0 (which is usually not written in a DNS name, since it is implied by default). Second-level domains (for example rambler, mail or kiev) belong to first-level domains, and so on. Domains in a DNS name are written from right to left in order of increasing level.
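The right-to-left ordering can be sketched in Python as follows. The function name domain_levels is hypothetical, and the sample name www.rambler.ru is the one mentioned in this article.

```python
def domain_levels(name):
    """List the domains of a DNS name starting from the 1st level, i.e. right to left."""
    # A trailing dot, if present, stands for the implied root domain of level 0.
    labels = name.rstrip(".").lower().split(".")
    return list(reversed(labels))

levels = domain_levels("www.rambler.ru")
for level, label in enumerate(levels, start=1):
    print(level, label)
```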

We note two important features: 1. The domain is a purely administrative unit and does not represent a host. 2. The IP address is in no way dependent on the domain in which the host is located.

Thus, the domain system was introduced simply to classify sites by geographic or purpose-related criteria, and has no relation to the physical structure of the Internet.

In the example URL we explicitly gave the name of the document we were interested in, index.html , but every site has a document that is opened by default. As a rule it is named index.html or default.html and is located in the root folder of the site. If we enter the URL of the site without specifying the file we need, the server will automatically serve us the default document. Thus, the address http://crackchat.h1.ru is equivalent to the address http://crackchat.h1.ru/index.html . Just as there is a file opened by default, there is also a default folder. On most servers, the default folder for HTTP documents is named www.
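A simplified sketch of the default-document rule, in Python. It assumes the default file is called index.html; real servers make this configurable, and resolve_document is a name invented for the example.

```python
from urllib.parse import urlsplit

DEFAULT_DOCUMENT = "index.html"  # assumption: the server's configured default file

def resolve_document(url):
    """Return the document path a server would serve for the given URL."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        # No file name was given, so fall back to the default document.
        path += DEFAULT_DOCUMENT
    return path

print(resolve_document("http://crackchat.h1.ru"))
print(resolve_document("http://crackchat.h1.ru/index.html"))
```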

After the DNS name in the URL comes the name of the document we are requesting. This implies that the file is in the root folder. If this is not the case, we can give the full path to the document, listing the subfolders separated by forward slashes:

In this example, we refer to a file in the cgi-bin/perl/ folder. This path is relative to the root folder. So, for example, if the path to the root folder is f:/www , then in our example we are accessing the file f:/www/cgi-bin/perl/search.pl . It is important to keep the following in mind: since most Web servers run on UNIX-like systems, the path to the file is case-sensitive. So if we had requested the URL http://rambler.ru/CGI-BIN/perl/Search.pl , the server would not have found such a file. Case matters only in the path to the file; the DNS name is case-insensitive (the address rambler.ru is equivalent to RAMBLER.RU ).
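The mapping and the case rules can be modelled roughly as below. This is only a sketch: the root folder f:/www comes from the text, the function names are invented, and an actual server decides both the mapping and the case sensitivity through its own configuration and file system.

```python
import posixpath
from urllib.parse import urlsplit

DOCUMENT_ROOT = "f:/www"  # root folder used as the example in the text

def local_path(url):
    """Map the path part of a URL onto the server's root folder."""
    relative = urlsplit(url).path.lstrip("/")
    return posixpath.join(DOCUMENT_ROOT, relative)

def same_document(url_a, url_b):
    """Compare host names ignoring case, but paths with case taken into account."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # .hostname is already lower-cased by urlsplit; .path is left untouched.
    return a.hostname == b.hostname and a.path == b.path

print(local_path("http://rambler.ru/cgi-bin/perl/search.pl"))
print(same_document("http://RAMBLER.RU/cgi-bin/perl/search.pl",
                    "http://rambler.ru/cgi-bin/perl/search.pl"))
print(same_document("http://rambler.ru/CGI-BIN/perl/Search.pl",
                    "http://rambler.ru/cgi-bin/perl/search.pl"))
```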

As already noted, a DNS name corresponds to a strictly defined IP address, but this does not mean that the DNS name is equivalent to the host we are accessing. Often a host itself contains domains of deeper levels. For example, the site h1.ru is a host in a second-level domain, but it itself contains third-level domains, for example crackchat.h1.ru or crosswords.h1.ru . Both of these sites therefore belong to the same host and naturally have the same IP address. Physically, in this case, third-level domains are just folders on the disk of the host h1.ru , and they could also be accessed like this: h1.ru/crackchat/ and h1.ru/crosswords/ . The means of access (through the 3rd-level domain or through the disk path) is determined by the server settings.

The root folder is likewise treated as a domain, and therefore most URLs can be written in two forms: with the www domain (for example www.crackchat.h1.ru ) and without it ( crackchat.h1.ru ). In the latter case the server still automatically directs you to the www folder, since it is the default.

Protocols, ports, CGI protocol

As we have already seen, a URL consists of three main elements: the DNS name, the file path, and the name of the protocol. While the first two elements locate the document, the protocol determines how the document is accessed. In other words, when the client tries to obtain the document, it must tell the server how the server should transfer the document to the client. There are many data transfer protocols on the network; the most common among them are http (Hypertext Transfer Protocol), ftp (File Transfer Protocol), mailto (the prefix of the family of mail protocols) and file (the protocol for accessing files or folders). The type of protocol determines the program that can process data in the format of that protocol. Thus Internet Explorer can work with the http , file and ftp protocols, but cannot work with the mailto protocol. Therefore, if you type mailto:microsoft.com in the address bar of your browser, a dedicated mail program that can work with this protocol will be launched (for example, Outlook Express or The Bat!). The protocol name comes first in the URL and must end with a colon. Case does not matter.

Among the protocols there are some rather exotic ones, such as res or about (out of curiosity, you can type about:<a href="mailto:bill@microsoft.com">send hi to Bill</a> in the address bar of your browser and see what happens :) ). Another interesting protocol is ldap (try, for example, ldap://microsoft.com ).

Not every protocol can act as the protocol of a URL. The about and javascript protocols, for example, have nothing to do with the full path to a document, and therefore "addresses" with these protocols are not URLs.

The protocol prefix tells the client in which "language" the communication with the server will be conducted. The client knows in advance which program should conduct this communication, which cannot be said of the server. For the server to start "speaking" with us in the required protocol language, it must run an appropriate program that understands this protocol. Ports are used to solve this problem. So while the DNS name or IP address identifies the machine we are accessing, the port identifies the program we are accessing on that host. A port is an integer in the range from 0 to 65535.

Each protocol is assigned a default port on which the server program waits for client requests. For example, if the server supports the http protocol, the corresponding server program (for example, Apache) will wait for client requests on port 80 (this port is the default for the http protocol). If the host also supports the ftp protocol, another server program will wait for requests on port 21 (this port is reserved for the ftp protocol).

The port we are accessing is determined automatically, depending on the protocol we chose in the URL. But the port can also be specified explicitly. The port number is given after the DNS name or IP address, separated by a colon:

In the example above, we refer to a program listening on port 8080 and ask it to give us the index.html file using the http protocol. If there is no such program on the server (no program is listening for requests on port 8080 ), the browser will give us an error message.

Since port 80 is the default for http servers, the address http://rambler.ru:80 is equivalent to the address http://rambler.ru . In principle, though, hosts are not required to serve http on port 80 . The server may be configured, for example, on port 3128 , and then, to communicate with this host over http, you always need to give the port number explicitly: http://rambler.ru:3128
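The choice between the default and an explicit port might be sketched like this in Python. The table DEFAULT_PORTS lists only the two defaults named in the text, and effective_port is a hypothetical helper; no connection is opened.

```python
from urllib.parse import urlsplit

# Default ports mentioned in the text; real clients know many more.
DEFAULT_PORTS = {"http": 80, "ftp": 21}

def effective_port(url):
    """Return the port a client would connect to for the given URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port               # port given explicitly after the colon
    return DEFAULT_PORTS[parts.scheme]  # otherwise the protocol's default

print(effective_port("http://rambler.ru"))
print(effective_port("http://rambler.ru:80"))
print(effective_port("http://rambler.ru:8080/index.html"))
print(effective_port("ftp://yahoo.com"))
```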

When accessing a server, it is sometimes necessary to give, besides the address of the document, the name of the user who is accessing the server (or whom we are addressing on the server), as well as an access password. A URL allows this information to be passed. To do so, the DNS name is preceded by a prefix with the username:

Typically, the http protocol does not require user identification, but protocols such as ftp or mailto do. In addition to the username, an access password can also be given. The password follows the username, separated by a colon. For example: ftp://masha:kasha@yahoo.com . This URL requests, over the ftp protocol, the root directory of the host yahoo.com for the user masha with the password kasha . The address mailto:masha@mail.ru , in turn, refers to the mailbox of the user masha on the mail.ru host.

The username can likewise be built on the domain principle and consist of several elements separated by dots. For example, mailto:bill.geits@microsoft.com .
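As an illustration, Python's urllib.parse exposes the user name and password of such a URL directly. The sketch below only parses the ftp example from the text and does not log in anywhere.

```python
from urllib.parse import urlsplit

parts = urlsplit("ftp://masha:kasha@yahoo.com")

user = parts.username      # text before the colon
password = parts.password  # text between the colon and the @ sign
host = parts.hostname      # text after the @ sign

print(user, password, host)
```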

As already noted, a URL is the full path to a document. A document means any file, which may be text (for example html, doc or pdf files), a picture (jpg or gif), or a program. The http protocol assumes that if text or a picture is requested in the URL, it must be sent to the user to be displayed in the browser, whereas if a program or script is requested, it must be run on the server and the result of its work sent to the user. The result itself can be either text or a picture. The type of the resulting document is determined inside the program, and the user does not know in advance what type of document calling the program will return. A server program is called by the ordinary URL of the program or script itself. As a rule, scripts on the network have the extensions .pl , .php and .cgi (the first two denote programs written in Perl and PHP, while the last can be used for any executable module, including Perl, PHP and EXE). For example, the URL http://www.rambler.ru/cgi-bin/top.cgi asks for some application top.cgi to be run on the rambler.ru host and for the result of its work (for example, an html document or an image) to be sent to the client.

But server applications would be of little use if parameters could not be passed to them. A URL allows this. To pass parameters to server applications (also called gateways ), a data format known as the Common Gateway Interface (CGI) is used. This format allows a program's input data to be given in a single line.

In the example above, the URL calls a server gateway named search.pl and passes it one input parameter named user with the value masha . The CGI string is separated from the script name by a question mark ? . If several parameters need to be passed to the script, they are listed one after another separated by the ampersand character & , for example: http://rambler.ru/cgi-bin/perl/search.pl?user=masha&password=kasha .
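How a gateway might pull its parameters out of such a URL can be sketched with Python's standard urllib.parse module. This is only an illustration of the format, not how search.pl itself is written.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://rambler.ru/cgi-bin/perl/search.pl?user=masha&password=kasha"
parts = urlsplit(url)

script = parts.path                    # everything before the question mark
params = dict(parse_qsl(parts.query))  # name=value pairs split on the ampersand

print(script)
print(params)
```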

We note the following peculiarity: since most Web technologies are based on textual data formats, sooner or later the problem arises of distinguishing commands from data. For example, suppose we want to pass a CGI parameter named expression with the value C=A+B : http://site.com/script.cgi?expression=C=A+B . Such a request will be misunderstood by CGI, since the second = sign will be taken as a separator between a parameter name and its value. Therefore, in the CGI format (as in any part of a URL), a special character encoding known as URL encoding is used. This encoding leaves Latin letters and digits as they are and represents the remaining characters in the form %nn , where nn is the hexadecimal character code. For example, the double quote character becomes %22 , and the character = becomes %3D . The exception is the space character, which, in addition to the standard encoding %20 , can also be encoded as + . Thus, the example URL above should be written like this: http://site.com/script.cgi?expression=C%3DA%2BB .
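The encoding can be tried out with Python's standard urllib.parse functions, as in the sketch below. Note that Python's own query parser happens to split only on the first = sign, yet it still reads the unencoded + as a space, so the value is distorted either way.

```python
from urllib.parse import quote, quote_plus, unquote_plus, parse_qsl

value = "C=A+B"

encoded = quote(value, safe="")   # every reserved character becomes %nn
decoded = unquote_plus(encoded)   # what the server recovers
print(encoded, decoded)

# The same value sent without encoding does not survive the round trip.
broken = dict(parse_qsl("expression=C=A+B"))
intact = dict(parse_qsl("expression=" + encoded))
print(broken, intact)

# A space may be written either as %20 or as +.
print(quote("a b", safe=""), quote_plus("a b"))
```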

HTTP protocol

HTTP (Hypertext Transfer Protocol) is the main protocol used on the Web. Although the protocol is called a hypertext (i.e. HTML) transfer protocol, in reality HTTP can be used (and is used) to transfer almost any data over the network, including both text and image files. The popularity of HTTP is, in my opinion, due to several factors: the use of fairly universal URL addressing, the ability to transfer any data (both from client to server and vice versa), and operation in on-line mode (i.e. data is transferred directly between the client and the server, without intermediaries). HTTP can be called bidirectional, in the sense that in a client-server system data can move in both directions, from client to server and from server to client. Yet the HTTP syntax itself is aimed precisely at transferring data from the client to the server.

So let us consider the simplest example of an HTTP request. If we type the address http://yandex.ru in the address bar of the browser, the browser will determine the IP address of the server yandex.ru and send the following HTTP request to its port 80:

GET http://yandex.ru/ HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Accept-Language: ru
Cookie: yandexuid=2464977781018373381
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
Host: yandex.ru
Referer: narod.ru
Proxy-Connection: Keep-Alive

The request is transmitted as plain text. The main part of the request is in the first line: the request type ( GET ), the URL of the requested document ( http://yandex.ru ) and the version of the HTTP protocol ( HTTP/1.0 ). The request parameters follow. Each line corresponds to one parameter: the name of the parameter comes at the start of the line, then a colon, then the parameter value. The meaning of the parameters is intuitively clear, but we will describe the main ones: Accept - the types of data that the browser can accept (as MIME types). Accept-Language - the preferred language in which the browser wants to receive data. User-Agent - the type of program that sent the request. Host - the DNS name (or IP address) of the host to which the request is addressed. Cookie - cookies (data that the server saved on the client's local disk during the last visit to this host). Referer - the host from which we send the request. For example, if we are on the page http://narod.ru and click a link to http://yandex.ru there, the request will be sent to the host yandex.ru, while the Referer field of the request will hold the host name narod.ru.
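The layout just described (a first line followed by "name: value" lines) is simple enough to take apart by hand. Below is a minimal Python sketch; parse_request is an invented helper, the sample is a shortened copy of the request from the text, and a real server needs far more careful parsing than this.

```python
RAW_REQUEST = (
    "GET http://yandex.ru/ HTTP/1.0\r\n"
    "Accept-Language: ru\r\n"
    "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)\r\n"
    "Host: yandex.ru\r\n"
    "Referer: narod.ru\r\n"
    "\r\n"
)

def parse_request(raw):
    """Split a raw HTTP request into its first line, its parameters and its body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Parameter names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers, body

method, target, version, headers, body = parse_request(RAW_REQUEST)
print(method, target, version)
print(headers["host"], headers["referer"])
```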

The set of request parameters is not fixed. In addition to those listed above, other parameters may be present.

The most interesting parameters are Referer and Cookie . They are mainly used by the server to identify the user.

A GET request may carry data sent by the client to the server. The data is transmitted directly in the URL using the CGI format. For example, to enter a chat, the browser can send the following request to the server:

GET http://chat.ru/?login=Algol&pass=Algol HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Accept-Language: ru
Cookie: yandexuid=2464977781018373381
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
Host: chat.ru
Referer: narod.ru
Proxy-Connection: Keep-Alive

As we can see, the query string contains the user's login and password, transmitted in the URL. This way of transferring data to the server is convenient, but limited in capacity: very large amounts of data cannot be transferred via the URL. For such purposes there is another type of request, the POST request. A POST request is very similar to GET , the only difference being that the data in a POST request is transmitted separately from the request header. The example above in its POST version looks like this:

POST http://chat.ru/ HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Accept-Language: ru
Cookie: yandexuid=2464977781018373381
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
Host: chat.ru
Referer: narod.ru
Proxy-Connection: Keep-Alive

login=Algol&pass=Algol

As we can see, the login and password are transmitted separately, in the request body. The request body must be separated from the header by an empty line. When the server encounters an empty line in a POST request, it treats everything that follows as the request body (the transmitted data). Note that the format of the data in the body of a POST request is arbitrary. Although the CGI format is used most often, it is not required. Moreover, a POST request does not have to have a body; it can also transmit data via the URL, just as GET does.

In addition to the CGI format, the so-called multipart format is sometimes used to transfer large amounts of information (for example, files):
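The header / empty line / body layout can be assembled by hand, as in the Python sketch below. The helper build_post is hypothetical. The Content-Type and Content-Length lines do not appear in the example from the text; they are added here as an assumption, because in practice they tell the server how to interpret the body and where it ends.

```python
from urllib.parse import urlencode

def build_post(url, host, fields):
    """Assemble a raw POST request whose body carries the fields in CGI format."""
    body = urlencode(fields)
    head = [
        f"POST {url} HTTP/1.0",
        f"Host: {host}",
        # Assumed additions, not shown in the example from the text:
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body.encode('ascii'))}",
    ]
    # The empty line (two consecutive line breaks) separates header and body.
    return "\r\n".join(head) + "\r\n\r\n" + body

request = build_post("http://chat.ru/", "chat.ru", {"login": "Algol", "pass": "Algol"})
print(request)
```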

POST http://photo.bigmir.net/form.php HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Referer: http://photo.bigmir.net/form.php
Accept-Language: ru
Content-Type: multipart/form-data; boundary=---------------------------7d20345dc
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)
Host: photo.bigmir.net
Proxy-Connection: Keep-Alive
Pragma: no-cache
Cookie: Ukrainian=2; BSX_TestCookie=Yes; rich_ad=1; b=1

-----------------------------7d20345dc
Content-Disposition: form-data; name="id"

254353
-----------------------------7d20345dc
Content-Disposition: form-data; name="d"

22
-----------------------------7d20345dc
Content-Disposition: form-data; name="login"

Algol
-----------------------------7d20345dc
Content-Disposition: form-data; name="passw"

Algol
-----------------------------7d20345dc
Content-Disposition: form-data; name="email"

tps99@mail.ru
-----------------------------7d20345dc
Content-Disposition: form-data; name="submit"

To add
-----------------------------7d20345dc--

Pay attention to the Content-Type header line: multipart/form-data; boundary=---------------------------7d20345dc . This parameter tells the server that the client is sending data in the multipart format with the delimiter ---------------------------7d20345dc . The delimiter is generated by the client at random and is needed so that the server can separate the different elements sent in the request body (in the body, each delimiter line is the boundary preceded by two extra hyphens). As you can see, the body contains several elements that are transmitted as plain text (and not URL-encoded, as CGI requires) and are separated by the string given in the Content-Type parameter. Each part carries information about the type of data being transferred as well as the name of the part. The convenience of the multipart format is that the transmitted data can be of unlimited size and does not require any preliminary encoding.
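A body of this shape can be produced with a few lines of Python. The sketch below handles plain text fields only; build_multipart is an invented name, and the random delimiter is drawn from the uuid module merely as one convenient way to get a string unlikely to occur in the data.

```python
import uuid

def build_multipart(fields, boundary=None):
    """Return (Content-Type value, body) for a multipart/form-data request."""
    boundary = boundary or uuid.uuid4().hex  # random delimiter chosen by the client
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")        # delimiter line: two hyphens + boundary
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")                     # empty line before the part's data
        lines.append(value)                  # the data itself, sent without URL encoding
    lines.append(f"--{boundary}--")          # closing delimiter ends with two hyphens
    body = "\r\n".join(lines) + "\r\n"
    return f"multipart/form-data; boundary={boundary}", body

content_type, body = build_multipart({"login": "Algol", "email": "tps99@mail.ru"},
                                     boundary="7d20345dc")
print(content_type)
print(body)
```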

In addition to GET and POST requests there are others, for example TRACE and PUT . But they are rarely used, and we will not dwell on them.

Once again I draw attention to the fact that ALL the information transmitted by the client to the server is contained in the header and body of the request. The server has no other way of obtaining information from the client over the HTTP protocol.

Conversely, the server can transfer information to the client only in response to a request. Any data exchange in the HTTP protocol is initiated only by the client; the server cannot transmit anything "just like that", but only at the client's request.

Thus, if we are able to control the requests being sent, we fully control the information received by the server and the client. This is convenient, since to modify the transmitted or requested data there is no need to change the HTML files of the pages, change the cookies, etc.; it is enough to make changes in the HTTP request and send it to the server. But that is another story :) ...