
HTTP protocol

To begin with, let's establish what a protocol is in general. A protocol is a set of rules and keywords designed for communication between devices. It is needed so that computers, or their components, can understand one another unambiguously.

A protocol is the language computers speak on a network.

In fact, any set of commands can be called a protocol, but in practice the term is applied only to so-called network protocols: the languages of computers on a network. Each protocol has a specific purpose and is supported by specialized software.

URL, IP and DNS addresses, domains

A URL (Uniform Resource Locator) is the full path to a document. It is the address by which a document (file) can be unambiguously found on the Internet. The line that you type into the address bar of your browser is also the URL of a document.

A URL can look rather complex and consist of several parts. To get started, consider the simplest URL:

This URL contains three components: the name of the host where the document is located, the name of the protocol used to transfer the document, and the name of the document itself (file name plus extension). The core of the address (and the only part required for the http protocol) is the host name. It identifies the machine on which the document is located (on a network, individual computers are called hosts). Each computer on the network is a host and has a name that is unique on that network. In the example above, rambler.ru is the name of the computer on which we want to find the document.

A host can be specified in two ways: by a DNS name or by an IP address. An IP address consists of four numbers separated by dots. Each number can range from 0 to 255. For example, 192.168.2.1.
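The rule above (four numbers, each from 0 to 255) can be checked mechanically. The following is a minimal sketch using Python's standard ipaddress module; the function name is_ipv4 is chosen here for illustration only.

```python
import ipaddress

def is_ipv4(text: str) -> bool:
    """Return True if text is a dotted IPv4 address: four numbers, each 0..255."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ipaddress.AddressValueError:
        # Raised for a wrong number of parts, non-numeric parts, or a part above 255.
        return False

print(is_ipv4("192.168.2.1"))    # address from the text
print(is_ipv4("192.168.2.256"))  # last number is out of range
```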

However, in practice, using IP addresses is inconvenient, since numbers are hard to remember. Therefore the Domain Name System (DNS) was introduced, in which an IP address is associated with a name consisting of letters and digits. In the example above the DNS name was rambler.ru, and the IP address 217.73.192.109 corresponds to it.
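As a rough illustration of the name-to-address lookup, the sketch below asks the operating system's resolver through Python's standard socket module. The helper name resolve is an assumption made for this example. Looking up a real name such as rambler.ru needs network access, and the answer may differ from the address quoted in the text, since DNS records change over time.

```python
import socket

def resolve(host: str) -> str:
    """Return the IPv4 address that the system resolver reports for host."""
    return socket.gethostbyname(host)

# An IPv4 address passed as the host is returned unchanged,
# so this call works without any network access.
print(resolve("127.0.0.1"))

# With a network connection, a DNS name can be looked up the same way:
# print(resolve("rambler.ru"))
```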

Note that different DNS names usually correspond to different IP addresses, but a single IP address can also correspond to several DNS names. For example, the different DNS names www.rambler.ru and rambler.ru share the same IP address. A URL may use either a DNS name or an IP address. Thus, the two URLs http://rambler.ru/index.html and http://217.73.192.109/index.html are equivalent. Some ways of writing an IP address are described at http://www.xakep.ru/post/11980/default.htm .

Note also that, in principle, a host does not have to have a domain name. That is, some hosts can be accessed only by IP address.

You have probably already noticed that any DNS name consists of separate words separated by dots. Each word denotes a domain to which the host belongs. The entire DNS system is built hierarchically. All first-level domains (com, org, ru, etc.) belong to the zero-level root domain (which is usually not written in DNS names, as it is implied by default). Second-level domains (for example rambler, mail or kiev) belong to first-level domains, and so on. Domains in a DNS name are written from right to left, in order of increasing level.
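To make the right-to-left ordering concrete, here is a small sketch that lists the domains of a DNS name by level. The function name domain_levels is hypothetical, and the code only splits the string; it does not consult DNS.

```python
def domain_levels(name: str) -> list[str]:
    """List the domains in a DNS name, starting from the first level."""
    # A trailing dot stands for the (normally omitted) root domain; drop it.
    labels = name.rstrip(".").lower().split(".")
    # Levels grow from right to left, so reverse the written order.
    return list(reversed(labels))

for level, label in enumerate(domain_levels("www.rambler.ru"), start=1):
    print(level, label)
```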

We note two important features: 1. A domain is a purely administrative unit and does not represent a host. 2. The IP address in no way depends on the domain in which the host is located.

Thus, the domain system was introduced simply to classify sites by geography or purpose, and it has no relation to the physical structure of the Internet.

In the sample URL we explicitly gave the name of the document we are interested in, index.html, but every site has a document that opens by default. As a rule it is named index.html or default.html and is located in the root folder of the site. If we enter the site URL without specifying a file name, the server will automatically serve the default document. Thus, the address http://crackchat.h1.ru is equivalent to the address http://crackchat.h1.ru/index.html . Just as there is a default file, there is also a default folder. On most servers, the default folder for HTTP documents is named www.

In a URL, the DNS name is followed by the name of the document we are requesting. This implies that the file is located in the root folder. If this is not the case, we can give the full path to the document, listing the subfolders separated by forward slashes:

In this example, we refer to a file in the cgi-bin/perl/ folder. This path is relative to the root folder. So, for example, if the path to the root folder is f:/www , then in our example we refer to the file f:/www/cgi-bin/perl/search.pl . It is important to keep the following in mind: since most web servers run on UNIX-like systems, the path to a file is case-sensitive. So if we requested the URL http://rambler.ru/CGI-BIN/perl/Search.pl , the server would not find such a file. Case matters only in the path to the file; DNS names are case-insensitive (that is, RAMBLER.RU is equivalent to rambler.ru ).
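The mapping from a URL to a file under the root folder, and the different treatment of case in the host and in the path, can be sketched as follows. The helper locate and the constant DOCUMENT_ROOT are illustrative names; a real server would also have to guard against paths that escape the root folder, which this sketch ignores.

```python
import posixpath
from urllib.parse import urlsplit

DOCUMENT_ROOT = "f:/www"  # root folder taken from the example in the text

def locate(url: str) -> tuple[str, str]:
    """Return (host, file path under the root folder) for a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""        # hostname is lowercased: DNS ignores case
    relative = parts.path.lstrip("/")  # the path keeps its case as written
    return host, posixpath.join(DOCUMENT_ROOT, relative)

print(locate("http://rambler.ru/cgi-bin/perl/search.pl"))
print(locate("http://RAMBLER.RU/CGI-BIN/perl/Search.pl"))
```

Both calls report the same host, but the second one points to a differently named file, which a case-sensitive file system would treat as a different (and here missing) document.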

As already noted, a DNS name corresponds to a specific IP address, but this does not mean that a DNS name is equivalent to the host we are accessing. Often a single host serves domains of deeper levels. For example, the site h1.ru is a host in a second-level domain, but it also contains third-level domains, for example crackchat.h1.ru or crosswords.h1.ru . These sites therefore live on the same host and naturally have the same IP address. Physically, in this case, the third-level domains are just folders on the disk of the h1.ru host, and they could also be accessed like this: h1.ru/crackchat/ and h1.ru/crosswords/ . The access method (through a third-level domain or through a disk path) is determined by the server settings.

The root folder is likewise treated as a domain, which is why most URLs can be written in two forms: with the www domain (for example www.crackchat.h1.ru ) and without it ( crackchat.h1.ru ). In either case the server automatically directs you to the www folder, since it is the default.

Protocols, Ports, CGI Protocol

As we have already seen, a URL consists of three main elements: the DNS name, the file path and the protocol name. While the first two elements locate the document, the protocol determines how the document is accessed. In other words, when the client tries to retrieve a document, it must tell the server how the server should transfer the document to the client. There are many different data transfer protocols on the network; the most common are http (Hypertext Transfer Protocol, a protocol for transferring hypertext files), ftp (File Transfer Protocol, a protocol for transferring files), mailto (a prefix for a family of mail protocols) and file (a protocol for accessing files or folders). The type of protocol determines which program can process data in that protocol's format. Internet Explorer, for instance, can work with the http, file and ftp protocols, but cannot work with the mailto protocol. Therefore, if you type mailto:microsoft.com into the browser's address bar, a dedicated mail program that can work with this protocol will start (for example, Outlook Express or The Bat!). The protocol name comes first in the URL and must end with a colon. Letter case does not matter.
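As a small illustration, the protocol name can be read off an address with Python's standard urllib.parse module, which also lowercases it, in line with the remark that case does not matter. The function name protocol_of and the sample addresses are chosen for this sketch only.

```python
from urllib.parse import urlsplit

def protocol_of(address: str) -> str:
    """Return the protocol name (the part before the first colon), lowercased."""
    return urlsplit(address).scheme

for address in ("http://rambler.ru/index.html",
                "FTP://yahoo.com",
                "mailto:bill@microsoft.com"):
    print(protocol_of(address))
```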

Among the protocols there are some quite exotic ones, for example res or about (for fun, you can type about:<a href="mailto:bill@microsoft.com">send greetings to Bill</a> into the browser address bar and see what happens :) ). Another interesting protocol is ldap (try, for example, ldap://microsoft.com ).

Not every protocol can act as the protocol of a URL. The about and javascript protocols have nothing to do with the full path to a document, and therefore addresses with these protocols are not URLs at all.

The protocol prefix tells the client in which "language" communication with the server will take place. The client knows in advance which program should conduct this communication, which cannot be said of the server. For the server to start "talking" to us in the required protocol, it must run an appropriate program that understands this protocol. Ports are used to solve this problem. So, while the DNS name or IP address identifies the machine we are accessing, the port identifies the program we are accessing on that host. A port is an integer from 0 to 65535.

Each protocol is assigned a default port on which the server program waits for client requests. For example, if the server supports the http protocol, the corresponding server program (for example, Apache) will wait for client requests on port 80 (the default port for the http protocol). If the same host also supports the ftp protocol, another server program will wait for requests on port 21 (the port reserved for the ftp protocol).

The port we are accessing is determined automatically from the protocol selected in the URL. But the port can also be specified explicitly. The port number is given after a colon following the DNS name or IP address:

In this example, we address a program listening on port 8080 and ask it to give us the index.html file via the http protocol. If there is no such program on the server (that is, no program is listening for requests on port 8080 ), the browser will report an error for this URL.

Since port 80 is the default for http servers, the address http://rambler.ru:80 is equivalent to the address http://rambler.ru . In principle, though, hosts are not required to serve http on port 80. The server may be configured to use, for example, port 3128 , and then to communicate with this host over http you must specify the port number explicitly: http://rambler.ru:3128
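The rule "use the explicit port if there is one, otherwise the protocol's default" can be sketched in a few lines. The table DEFAULT_PORTS holds only the two defaults named in the text, and effective_port is a name made up for this example.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "ftp": 21}  # defaults mentioned in the text

def effective_port(url: str) -> int:
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port               # port given explicitly after the colon
    return DEFAULT_PORTS[parts.scheme]  # otherwise the protocol's default port

for url in ("http://rambler.ru", "http://rambler.ru:80", "http://rambler.ru:3128"):
    print(url, effective_port(url))
```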

When accessing a server, it is sometimes necessary to give, in addition to the document address, the name of the user accessing the server (or the user we are addressing on the server), and an access password as well. A URL allows this information to be passed. To do so, put an @ sign before the DNS name, preceded by the user name:

As a rule, the http protocol does not require user authentication, but protocols like ftp or mailto do. In addition to the user name, an access password can also be specified. The password follows the user name after a colon. For example: ftp://masha:kasha@yahoo.com . This URL requests, over the ftp protocol, the root directory of the yahoo.com host for the user masha with the password kasha . And the address mailto://masha@mail.ru is used to address the mailbox of the user masha on the mail.ru host.
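The user name, password and host can be pulled apart with the standard urllib.parse module. This is only a sketch; the helper name credentials is an assumption of the example.

```python
from typing import Optional
from urllib.parse import urlsplit

def credentials(url: str) -> tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (user name, password, host) found in a URL; missing parts are None."""
    parts = urlsplit(url)
    return parts.username, parts.password, parts.hostname

print(credentials("ftp://masha:kasha@yahoo.com"))
print(credentials("ftp://masha@yahoo.com"))  # no password given
```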

The user name can likewise follow a domain-style pattern and consist of several elements separated by dots. For example mailto://bill.geits@microsoft.com .

As already noted, a URL is the full path to a document. A document means any file, which can be text (for example, html, doc or pdf files), an image (jpg or gif), or a program. The http protocol assumes that if text or an image is requested in the URL, it must be sent to the user to be displayed in the browser, whereas if a program or script is requested, it must be run on the server and the result of its work sent to the user. The result itself can be either text or an image. The type of the resulting document is determined inside the program, and the user does not know in advance what type of document calling the program will return. A server program is called by the ordinary URL of the program or script. As a rule, scripts with the extensions .pl, .php and .cgi are used on the web (the first two are programs written in Perl and PHP respectively, while the last extension can be applied to any executable module, including Perl and PHP scripts and EXE files). For example, the URL http://www.rambler.ru/cgi-bin/top.cgi asks the rambler.ru host to run an application named top.cgi and send its result (for example, an html document or an image) to the client.

But server applications would be of little use if parameters could not be passed to them. A URL allows this. To pass parameters to server applications (also called gateways), a data format known as CGI (Common Gateway Interface) is used. This format allows the input data of a program to be given in a single line.

In the example above, the URL calls a server gateway named search.pl and passes it as input a single parameter named user with the value masha . The CGI string is separated from the script name by a question mark ? . If several parameters need to be passed to the script, they are listed in sequence separated by the ampersand symbol & , for example: http://rambler.ru/cgi-bin/perl/search.pl?user=masha&password=kasha .
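A gateway receiving such a URL has to split the string after the question mark into name and value pairs. A minimal sketch with Python's urllib.parse follows; the function name cgi_parameters is made up for the example.

```python
from urllib.parse import parse_qsl, urlsplit

def cgi_parameters(url: str) -> list[tuple[str, str]]:
    """Return the (name, value) pairs passed after the ? in a URL."""
    query = urlsplit(url).query  # everything between ? and the end of the URL
    return parse_qsl(query)      # split on & and then on the first = of each pair

url = "http://rambler.ru/cgi-bin/perl/search.pl?user=masha&password=kasha"
for name, value in cgi_parameters(url):
    print(name, "=", value)
```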

We note the following feature: since most web technologies are based on text data formats, sooner or later the problem of distinguishing commands from data arises. For example, if we want to pass a parameter named expression with the value C=A+B as a CGI parameter: http://site.com/script.cgi?expression=C=A+B , the request will be misinterpreted, since the second = sign will be taken for a separator between a parameter name and its value. Therefore a special character encoding, known as URL encoding, is used in the CGI format (as well as in any other part of a URL). This encoding leaves Latin letters and digits as they are and writes the remaining characters as %nn , where nn is the hexadecimal code of the character. For example, the double quotation mark " becomes %22 and the = symbol becomes %3D . The exception is the space character, which, in addition to the standard encoding %20 , can also be encoded as + . Thus, the example URL should be written like this: http://site.com/script.cgi?expression=C%3DA%2BB .
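The encoding described above is available in Python's standard library as urllib.parse.quote and quote_plus. The sketch below encodes the value from the example; the helper name encode_value is hypothetical.

```python
from urllib.parse import quote, quote_plus, unquote_plus

def encode_value(value: str) -> str:
    """Percent-encode a parameter value, leaving no reserved character unescaped."""
    return quote(value, safe="")

url = "http://site.com/script.cgi?expression=" + encode_value("C=A+B")
print(url)

print(quote("a b", safe=""))       # space written as %20
print(quote_plus("a b"))           # space written as +
print(unquote_plus("C%3DA%2BB"))   # decoding restores the original value
```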

HTTP protocol

HTTP (Hypertext Transfer Protocol) is the main protocol used on the Web. Although it is called a protocol for transferring hypertext (i.e. HTML), in practice HTTP can be (and is) used to transfer almost any data on the network: text, images and files. The popularity of HTTP, in my opinion, comes from several factors: the use of fairly universal URL addressing, the ability to transfer any data (from the client to the server or vice versa), and working in online mode (i.e. transferring data directly between the client and the server, without intermediaries). HTTP can be called two-way, in the sense that in a client-server system data can move in both directions: from client to server and back from server to client. Nevertheless, the HTTP syntax is aimed specifically at transferring data from the client to the server.

So let's consider the simplest example of an HTTP request. If we type the address http://yandex.ru into the browser's address bar, the browser will determine the IP address of the server yandex.ru and send the following HTTP request to port 80:

GET http://yandex.ru/ HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Accept-Language: ru
Cookie: yandexuid=2464977781018373381
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
Host: yandex.ru
Referer: narod.ru
Proxy-Connection: Keep-Alive

The request is transmitted as unencrypted text. The main part of the request is in the first line: the request type ( GET ), the URL of the requested document ( http://yandex.ru ) and the HTTP protocol version ( HTTP/1.0 ). The request parameters follow. Each line corresponds to one parameter: the parameter name comes at the start of the line, followed by a colon and the parameter value. The meaning of the parameters is intuitively clear, but we will describe the main ones:
Accept - the types of data the browser can accept (as MIME types).
Accept-Language - the preferred language in which the browser wants to receive data.
User-Agent - the type of program that sent the request.
Host - the DNS name (or IP address) of the host to which the request is addressed.
Cookie - cookies (data that the server stored on the client's local disk during a previous visit to this host).
Referer - the host from which we send the request. For example, if we are on the page http://narod.ru and click a link to http://yandex.ru there, the request will be sent to the yandex.ru host, while the Referer field of the request will hold the host name narod.ru.
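To show how regular this format is, here is a minimal sketch of how a server might take such a request apart. The function parse_request_head and the shortened SAMPLE request are assumptions made for this illustration; real servers handle many more cases (malformed lines, repeated parameters, and so on).

```python
def parse_request_head(head: str):
    """Split a request into (type, url, version, parameters)."""
    lines = head.splitlines()
    # First line: request type, URL and protocol version, separated by spaces.
    method, url, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line:          # an empty line ends the parameter list
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, url, version, headers

# A shortened version of the request shown in the text.
SAMPLE = (
    "GET http://yandex.ru/ HTTP/1.0\r\n"
    "Accept-Language: ru\r\n"
    "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)\r\n"
    "Host: yandex.ru\r\n"
    "Referer: narod.ru\r\n"
    "\r\n"
)

method, url, version, headers = parse_request_head(SAMPLE)
print(method, url, version)
print(headers["host"], headers["referer"])
```

Parameter names are stored in lowercase here because HTTP treats them as case-insensitive.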

The set of request parameters is not fixed. Besides those listed above, other parameters may also be present.

The most interesting parameters are Referer and Cookie . They are mainly used by the server to identify the user.

A GET request may carry data sent by the client to the server. The data is transmitted directly in the URL using the CGI format. For example, to log in to a chat, the browser can send the server the following request:

GET http://chat.ru/?login=Algol&pass=Algol HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Accept-Language: ru
Cookie: yandexuid=2464977781018373381
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
Host: yandex.ru
Referer: narod.ru
Proxy-Connection: Keep-Alive

As we can see, the request line contains the user's login and password, passed through the URL. This way of sending data to the server is convenient, but it is limited in capacity. Very large data sets cannot be passed via URLs. For such purposes there is another type of request: POST . A POST request is very similar to GET , the only difference being that the data in a POST request is transmitted separately from the request header. So the example above in its POST version looks like this:

POST http://chat.ru/ HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Accept-Language: ru
Cookie: yandexuid=2464977781018373381
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
Host: yandex.ru
Referer: narod.ru
Proxy-Connection: Keep-Alive

login=Algol&pass=Algol

As we can see, the login and password are transmitted separately, in the request body. The request body must be separated from the header by an empty line. When the server encounters an empty line in a POST request, it treats everything that follows as the body of the request (the transmitted data). Note the following: the format of the data in the body of a POST request is arbitrary. Although the CGI format is the most commonly used, it is not required. Moreover, a POST request is not required to have a body; it can also pass data through the URL.
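The role of the empty line can be sketched as follows: everything before it is the header, everything after it is the body. The names split_request and RAW are made up for this example, and the request is shortened to a single header line. Real clients usually also send a Content-Length parameter so that the server knows how long the body is; that is omitted here to stay close to the example in the text.

```python
from urllib.parse import parse_qs

def split_request(raw: str) -> tuple[str, str]:
    """Split a raw request into (header, body) at the first empty line."""
    head, _, body = raw.partition("\r\n\r\n")
    return head, body

# A shortened version of the POST request shown in the text.
RAW = (
    "POST http://chat.ru/ HTTP/1.0\r\n"
    "Accept-Language: ru\r\n"
    "\r\n"
    "login=Algol&pass=Algol"
)

head, body = split_request(RAW)
print(body)
print(parse_qs(body))  # the body here happens to be in CGI format
```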

In addition to the CGI format, the so-called multipart format is sometimes used:

POST http://photo.bigmir.net/form.php HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Referer: http://photo.bigmir.net/form.php
Accept-Language: ru
Content-Type: multipart/form-data; boundary=---------------------------7d20345dc
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)
Host: photo.bigmir.net
Proxy-Connection: Keep-Alive
Pragma: no-cache
Cookie: Ukrainian=2; BSX_TestCookie=Yes; rich_ad=1; b=1

-----------------------------7d20345dc
Content-Disposition: form-data; name="id"

254353
-----------------------------7d20345dc
Content-Disposition: form-data; name="d"

22
-----------------------------7d20345dc
Content-Disposition: form-data; name="login"

Algol
-----------------------------7d20345dc
Content-Disposition: form-data; name="passw"

Algol
-----------------------------7d20345dc
Content-Disposition: form-data; name="email"

tps99@mail.ru
-----------------------------7d20345dc
Content-Disposition: form-data; name="submit"

Add
-----------------------------7d20345dc--

Note the Content-Type header line: multipart/form-data; boundary=---------------------------7d20345dc . This parameter tells the server that the client is sending data in multipart format with the delimiter ---------------------------7d20345dc . The delimiter, which the client generates at random, is needed so that the server can separate the different elements sent in the request body. As you can see, the body contains several elements, which are transmitted as plain text (rather than URL-encoded, as the CGI format requires) and are separated by the line given in the Content-Type parameter. Each element carries information about the type of data transmitted and the name of that element. The convenience of the multipart format is that the amount of transmitted data is unlimited and no preliminary encoding is required.
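The layout of such a body can be reproduced with a short sketch. Each element starts with a line made of two hyphens plus the delimiter, and the body ends with the same line followed by two more hyphens. The function build_multipart is a hypothetical helper, and BOUNDARY is only written in the same style as the delimiter in the text; a real client would generate it at random and check that it does not occur in the data.

```python
def build_multipart(fields: dict[str, str], boundary: str) -> str:
    """Build a multipart/form-data body for simple text fields."""
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)  # delimiter line opening an element
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")               # empty line between element header and data
        lines.append(value)            # the data itself, sent as is
    lines.append("--" + boundary + "--")  # closing delimiter
    return "\r\n".join(lines) + "\r\n"

BOUNDARY = "-" * 27 + "7d20345dc"  # illustrative delimiter in the style of the text
body = build_multipart({"login": "Algol", "passw": "Algol"}, BOUNDARY)
print("Content-Type: multipart/form-data; boundary=" + BOUNDARY)
print()
print(body)
```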

Besides GET and POST , there are other request types, such as TRACE and PUT . But they are rarely used, and we will not dwell on them.

Once again, note that ALL information sent by the client to the server is contained in the header and body of the request. The server cannot obtain information from the client over the HTTP protocol in any other way.

On the other hand, the server can send information to the client only in response to a request. Any data exchange in the HTTP protocol is initiated by the client; the server cannot send anything "just like that", only at the request of the client.

Thus, if we can control the requests being sent, we fully control the information received by the server and the client. This is convenient, since to modify the transmitted or requested data there is no need to change the HTML files of pages, change cookies, and so on; you only need to modify the HTTP request and send it to the server. But that is another story :) ...
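As a closing illustration of that idea, the sketch below assembles a request by hand, so that every line of it, including the Referer and Cookie parameters, is whatever we choose to write. The function build_request is a hypothetical helper; it only builds the text and does not send anything over the network.

```python
def build_request(method: str, url: str, headers: dict[str, str], body: str = "") -> str:
    """Assemble a raw HTTP/1.0 request from its parts."""
    lines = [f"{method} {url} HTTP/1.0"]                          # request line
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # An empty line separates the header from the (possibly empty) body.
    return "\r\n".join(lines) + "\r\n\r\n" + body

request = build_request(
    "GET",
    "http://yandex.ru/",
    {"Host": "yandex.ru", "Referer": "narod.ru"},
)
print(request)
```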