
HTTP protocol

To begin with, let's work out what a protocol is in general. A protocol is a set of rules and key symbols intended for communication between devices. It is needed so that computers or their components can understand one another unambiguously.

A protocol is the language computers use to talk to each other on a network.

Strictly speaking, any set of commands can be called a protocol, but in practice the term is applied only to network protocols, the languages computers use to communicate over a network. Each protocol has a specific purpose and is supported by specialized software.

URL, IP and DNS addresses, domains

A URL (Uniform Resource Locator) is the complete path to a document. It is the address by which a document (file) can be uniquely located on the Internet. The line you type into the "address" box of your browser is the URL of a document.

A URL can look fairly complex and consist of several parts. To get started, consider the simplest URL:

http://rambler.ru/index.html

This URL contains three components: the name of the host where the document is located, the name of the protocol used to transfer the document, and the name of the document itself (file name plus extension). The core of the address (and the only mandatory part for the http protocol) is the host name. It identifies the machine on which the document is located (on a network, individual computers are called hosts). Each computer on the network is a host and has a name that is unique within that network. In the example, rambler.ru is the name of the computer on which we want to find the document.

A host can be specified in two ways: by a DNS name or by an IP address. An IP address consists of four numbers separated by periods. Each number can range from 0 to 255. For example, 192.168.2.1 .
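The dotted format described above can be checked with a few lines of code. This is only a sketch: the function name is_dotted_ipv4 is made up for this illustration, and it checks nothing beyond the "four numbers from 0 to 255" rule given in the text.

```python
def is_dotted_ipv4(text: str) -> bool:
    """Return True if text is four decimal numbers (0-255) separated by periods."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Each part must consist of plain ASCII digits only.
        if not (part.isascii() and part.isdigit()):
            return False
        # Each number must fit in the range 0..255.
        if int(part) > 255:
            return False
    return True


print(is_dotted_ipv4("192.168.2.1"))    # True
print(is_dotted_ipv4("192.168.2.256"))  # False
```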

In practice, however, IP addresses are inconvenient because numbers are hard to remember. For that reason the Domain Name System (DNS) was introduced, in which an IP address is associated with a name made up of letters and digits. In the example above the DNS name was rambler.ru , which corresponds to the IP address 217.73.192.109 .

Note that different IP addresses correspond to different DNS names, while one IP address may correspond to several DNS names. For example, the different DNS names www.rambler.ru and rambler.ru have one and the same IP address. URLs may use either DNS names or IP addresses, so the two URLs http://rambler.ru/index.html and http://217.73.192.109/index.html are equivalent. Some ways of writing an IP address are described here: http://www.xakep.ru/post/11980/default.htm .

Note also that, in principle, a host does not have to have a domain name. Some hosts can be reached only by IP address.

You have probably noticed that a DNS name consists of separate words separated by dots. Each word names a domain to which the host belongs. The whole DNS system is hierarchical. All first-level domains (com, org, ru, etc.) belong to the level-0 root domain (which is usually not written in DNS names because it is implied by default). Second-level domains (e.g. rambler, mail or kiev) belong to first-level domains, and so on. Domains in a DNS name are written from right to left in order of increasing level.
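The right-to-left numbering of levels can be illustrated with a short sketch. The helper domain_levels is a name invented for this example; it only splits the text of a name and does not query DNS.

```python
def domain_levels(name: str) -> list[str]:
    """Split a DNS name into its words, ordered from level 1 upward."""
    # DNS names are case-insensitive; a trailing dot (the root) is dropped.
    labels = name.rstrip(".").lower().split(".")
    # The rightmost word is level 1, so reverse the written order.
    return list(reversed(labels))


for level, label in enumerate(domain_levels("crackchat.h1.ru"), start=1):
    print(level, label)
```

For crackchat.h1.ru this lists ru as level 1, h1 as level 2 and crackchat as level 3.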

Two features are important here: 1. A domain is a purely administrative unit and does not itself represent a host. 2. The IP address does not depend in any way on the domain in which the host is located.

Thus the domain system was introduced simply to classify sites by geographic or thematic characteristics, and it has no relation to the physical structure of the Internet.

In the sample URL we explicitly gave the name of the document we want, index.html, but every site also has a default document. As a rule it is named index.html or default.html and is located in the root folder of the site. If we enter the URL of the site without a file name, the default document opens automatically. Thus the address http://crackchat.h1.ru is equivalent to the address http://crackchat.h1.ru/index.html . Just as there is a default file, there is also a default folder. On most servers the default folder for HTTP documents is named www.

After the DNS name, the URL contains the name of the document we are referring to. It is assumed that this file is in the root folder. If that is not so, we can specify the full path to the document, listing the subfolders separated by forward slashes:

http://rambler.ru/cgi-bin/perl/search.pl

In this example we access a file in the cgi-bin/perl/ folder. This path is relative to the root folder. So if the path to the root folder is f:/www , then in our example we refer to the file f:/www/cgi-bin/perl/search.pl . It is important to keep the following in mind: since most Web servers run on UNIX-like systems, the path to the file is case-sensitive. If we requested the file at the URL http://rambler.ru/CGI-BIN/perl/Search.pl , the server would not find it. Case matters only in the path to the file; DNS names are case-insensitive ( rambler.ru and RAMBLER.RU are equivalent).
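The mapping from a URL path to a file under the root folder might be sketched as follows. The function document_path is hypothetical and real servers apply many more rules; the root f:/www is the one used in the text.

```python
import posixpath


def document_path(root: str, url_path: str) -> str:
    """Join the server root folder with the path part of a URL."""
    # Normalise the URL path and drop the leading slash,
    # so that it is treated as relative to the root folder.
    relative = posixpath.normpath(url_path).lstrip("/")
    return posixpath.join(root, relative)


print(document_path("f:/www", "/cgi-bin/perl/search.pl"))
# The comparison is case-sensitive, as on a UNIX-like file system.
print(document_path("f:/www", "/CGI-BIN/perl/Search.pl")
      == document_path("f:/www", "/cgi-bin/perl/search.pl"))
```

The second line printed is False: to a case-sensitive file system the two paths name different files.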

As already noted, a DNS name corresponds to a definite IP address, but this does not mean that the DNS name is the same thing as the host we are addressing. Often a host contains domains of deeper levels within itself. For example, the site h1.ru is a host in a second-level domain, but it contains third-level domains such as crackchat.h1.ru or crosswords.h1.ru . These two sites therefore belong to the same host and have the same IP address! Physically, the third-level domains in this case are just folders on the disk of the h1.ru host, and they could equally be reached like this: h1.ru/crackchat/ and h1.ru/crosswords/ . The method of access (via a third-level domain or via a disk path) is determined by the server settings.

The root folder is also treated as a domain, so most URLs can be written in either of two forms: with the www domain (for example, www.crackchat.h1.ru ) or without it ( crackchat.h1.ru ). In the latter case the server automatically directs you to the www folder anyway, because it is the default.

Protocols, ports, CGI protocol

As we have seen, a URL consists of three main elements: the DNS name, the path to the file, and the protocol name. While the first two elements locate the document, the protocol determines how the document is accessed. In other words, when a client tries to get a document, it must tell the server how the server should transfer that document to the client. There are many data transfer protocols on the network; the most common are http (Hypertext Transfer Protocol), ftp (File Transfer Protocol), mailto (the prefix of the mail protocol family) and file (access to local files and folders).

The protocol determines which program can process the data. Internet Explorer, for example, can work with the http , file and ftp protocols, but not with mailto . So if you type mailto:microsoft.com into the browser's address bar, a separate mail program that can handle this protocol will start (for example, Outlook Express or The Bat!). The protocol name comes first in the URL and must end with a colon. Its case does not matter.
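The three elements can be pulled apart with Python's standard urllib.parse module. This is a minimal sketch; the URL is assembled from the rambler.ru examples in this article, with the protocol name deliberately written in upper case.

```python
from urllib.parse import urlsplit

parts = urlsplit("HTTP://rambler.ru/cgi-bin/perl/search.pl")

# The protocol name is case-insensitive, so urlsplit lowercases it.
print(parts.scheme)    # http
print(parts.hostname)  # rambler.ru
print(parts.path)      # /cgi-bin/perl/search.pl
```

In urlsplit terms the protocol name is called the scheme.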

Some protocols are rather exotic, for example res or about (out of curiosity, type about:<a href="mailto:bill@microsoft.com">send greetings to Bill</a> into the browser's address bar and see what happens :) ). Another amusing protocol is ldap (try, for example, ldap://microsoft.com ).

Not every protocol can serve as the protocol of a URL. The about and javascript protocols have nothing to do with the full path to a document, so "addresses" with these protocols are not URLs.

The protocol prefix tells the client in which "language" it will communicate with the server. The client knows in advance which program should carry out this communication, but the same cannot be said of the server. For the server to "talk" to us in the required protocol, it must be running a program that understands that protocol. Ports solve this problem. If the DNS name or IP address identifies the machine we are addressing, the port identifies the program we are addressing on that host. Ports are denoted by an integer in the range from 0 to 65535.

Each protocol is assigned a default port on which the server program waits for client requests. For example, if the server supports the http protocol, the corresponding server program (for example, Apache) waits for client requests on port 80 (the default port for http). If the same host also supports the ftp protocol, another server program waits for requests on port 21 (the port reserved for ftp).

The port we address is determined automatically by the protocol chosen in the URL, but it can also be specified explicitly. The port number is given after a colon following the DNS name or IP address:

In the example above we address some program "listening" on port 8080 and ask it to give us the file index.html over the http protocol. If there is no such program on the server (that is, no program is watching for requests on port 8080 ), the browser will report an error for this URL.

Since port 80 is the default for http servers, the address http://rambler.ru:80 is equivalent to the address http://rambler.ru . In principle, though, hosts are not required to serve http on port 80. A server can be configured, for example, on port 3128 , and then to communicate with that host over http you must specify the port number explicitly: http://rambler.ru:3128
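The rule "explicit port if given, otherwise the protocol's default" can be sketched as follows. The table DEFAULT_PORTS and the function effective_port are names invented for this illustration, and the table holds only the two defaults mentioned in the text.

```python
from urllib.parse import urlsplit

# Default ports named in the text; a real table would be longer.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def effective_port(url: str) -> int:
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        # A port written after the colon overrides the default.
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


print(effective_port("http://rambler.ru"))       # 80
print(effective_port("http://rambler.ru:80"))    # 80
print(effective_port("http://rambler.ru:3128"))  # 3128
```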

When accessing a server, you sometimes need to specify, in addition to the address of the document, the name of the user who is accessing the server (or whom we are addressing on the server), as well as an access password. A URL can carry this information. To do so, the DNS name is preceded by an @ sign, which in turn is preceded by the user name:

As a rule, the http protocol does not require user identification, but for protocols such as ftp or mailto it is mandatory. Besides the user name, you can also specify an access password. The password is separated from the name by a colon. For example: ftp://masha:kasha@yahoo.com . This URL requests the root directory of the host yahoo.com over the ftp protocol for the user masha with the password kasha. An address such as mailto://masha@mail.ru is used to reach the mailbox of the user masha on the host mail.ru.

Like a domain name, the user name can consist of several elements separated by dots. For example, mailto://bill.geits@microsoft.com .
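The user name and password can be read back out of such a URL with the standard urllib.parse module. A small sketch using the ftp example from the text:

```python
from urllib.parse import urlsplit

parts = urlsplit("ftp://masha:kasha@yahoo.com")

print(parts.username)  # masha
print(parts.password)  # kasha
print(parts.hostname)  # yahoo.com

# Without a colon there is a user name but no password.
no_password = urlsplit("ftp://bill.geits@microsoft.com")
print(no_password.username, no_password.password)
```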

As already noted, a URL is the full path to a document. A document means any file: text (for example, html, doc or pdf files), a picture (jpg or gif), or a program. The http protocol implies that if text or a picture is requested in the URL, it is sent to the user to be displayed in the browser, but if a program or script is requested, it is run on the server and the user is sent the result of its work. The result itself can be either text or a picture. The type of the resulting document is determined inside the program, and the user does not know in advance what type of document a call to the program will return. A server program is called by the ordinary URL of the program or script. Scripts on the network typically have the extensions .pl, .php or .cgi (the first two denote programs written in Perl and PHP, while the last can be applied to any executable module, including Perl, PHP and EXE). For example, the URL http://www.rambler.ru/cgi-bin/top.cgi asks the host rambler.ru to run the application top.cgi and send the client the result of its work (e.g. an html document or an image).

But server applications would be of little use if parameters could not be passed to them. A URL allows this. To pass parameters to server applications (also called gateways), a data format known as CGI (Common Gateway Interface) is used. This format lets you specify the input data of a program in a single line:

http://rambler.ru/cgi-bin/perl/search.pl?user=masha

In this example the URL calls a server gateway named search.pl and passes it one input parameter named user with the value masha . The CGI string is separated from the script name by a question mark ? . If several parameters need to be passed to the script, they are listed one after another, separated by the ampersand & , for example: http://rambler.ru/cgi-bin/perl/search.pl?user=masha&password=kasha .
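On the receiving side, the parameter string can be split back into name and value pairs. A minimal sketch with Python's standard library, using the two-parameter URL from the text:

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://rambler.ru/cgi-bin/perl/search.pl?user=masha&password=kasha"
parts = urlsplit(url)

# Everything after the "?" is the parameter string.
print(parts.query)  # user=masha&password=kasha

# parse_qsl splits it on "&" and then on "=".
params = parse_qsl(parts.query)
print(params)  # [('user', 'masha'), ('password', 'kasha')]
```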

Note the following feature: since most Web technologies are based on text data formats, the problem of telling commands from data arises sooner or later. For example, suppose we want to pass a parameter named expression with the value C=A+B as a CGI parameter: http://site.com/script.cgi?expression=C=A+B . Such a request will be misinterpreted, because the second = sign will be treated as a separator between a parameter name and its value. Therefore CGI (like any URL in general) uses a special character encoding called URL Data Format. This encoding leaves the letters of the Latin alphabet and digits as they are and writes the remaining characters in the form %nn , where nn is the hexadecimal character code. For example, the double quotation mark " becomes %22 and the = character becomes %3D . The exception is the space character, which besides the standard encoding %20 can also be encoded as + . The example URL should therefore be written like this: http://site.com/script.cgi?expression=C%3DA%2BB .
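The encoding can be tried out with the quote family of functions in Python's urllib.parse. This is a sketch of the idea only; note that how an unencoded value is misread depends on the parser, and Python's parse_qs happens to keep the second = but turns the + into a space.

```python
from urllib.parse import quote, quote_plus, parse_qs

# safe="" asks quote() to encode every character that is not
# a letter, a digit or one of "_.-~".
encoded = quote("C=A+B", safe="")
print(encoded)              # C%3DA%2BB
print(quote('"', safe=""))  # %22
print(quote(" ", safe=""))  # %20
print(quote_plus(" "))      # +

# Unencoded, the "+" is read as a space; encoded, the value survives.
print(parse_qs("expression=C=A+B"))         # {'expression': ['C=A B']}
print(parse_qs("expression=" + encoded))    # {'expression': ['C=A+B']}
```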

HTTP protocol

HTTP (Hypertext Transfer Protocol) is the main protocol used on the Web. Although it is called a hypertext (i.e. HTML) transfer protocol, in fact HTTP can be (and is) used to transfer practically any data over the network: text, images and files. The popularity of HTTP, in my opinion, comes from several factors: fairly universal URL addressing, the ability to transmit any data (both from the client to the server and back), and operation in on-line mode (i.e. directly between client and server, without intermediaries). HTTP can be called two-way in the sense that in a client-server system data can move in both directions, from the client to the server and from the server to the client. The HTTP syntax itself, however, is aimed specifically at transferring data from the client to the server.

So, let's look at the simplest example of an HTTP request. If we type the address http://yandex.ru into the browser's address bar, the browser will determine the IP address of the server yandex.ru and send the following HTTP request to its port 80:

GET http://yandex.ru/ HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Accept-Language: en
Cookie: yandexuid=2464977781018373381
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
Host: yandex.ru
Referer: narod.ru
Proxy-Connection: Keep-Alive

The request is transmitted as plain text. The most important part of the request is in the first line: the request type ( GET ), the URL of the requested document ( http://yandex.ru/ ) and the version of the HTTP protocol ( HTTP/1.0 ). The request parameters are listed below it, one parameter per line. The name of the parameter comes at the start of the line, followed by a colon and the value of the parameter.

The meaning of the parameters is intuitively clear, but here are the main ones. Accept: the types of data the browser can accept (as MIME types). Accept-Language: the preferred language in which the browser wants to receive data. User-Agent: the type of program that sent the request. Host: the DNS name (or IP address) of the host to which the request is addressed. Cookie: cookies (data saved by the server on the client's local disk during the previous visit to this host). Referer: the host from whose page the request is sent. For example, if we are on the page http://narod.ru and click a link to http://yandex.ru there, the request will be sent to the host yandex.ru, while the Referer field of the request will contain the name of the host narod.ru.
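The layout described above (a first line, then "Name: value" lines) is simple enough to build and take apart by hand. The two helper functions below are hypothetical names for this sketch, not part of any library; the CRLF line endings and the closing empty line follow the HTTP specification.

```python
def build_request(method: str, target: str, headers: dict[str, str],
                  version: str = "HTTP/1.0") -> str:
    """Assemble the text of a request without a body."""
    lines = [f"{method} {target} {version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # An empty line marks the end of the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_request_head(raw: str):
    """Split request text into method, target, version and parameters."""
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    method, target, version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        # The name comes first, then a colon, then the value.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, target, version, headers


raw = build_request("GET", "http://yandex.ru/",
                    {"Host": "yandex.ru", "Referer": "narod.ru"})
print(raw)
print(parse_request_head(raw))
```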

The query parameter set is not fixed. In addition to the above, other parameters may be present.

The most interesting parameters are Referer and Cookie . They are used mainly by the server to identify the user.

A GET request can carry data from the client to the server. The data is passed directly in the URL in CGI format. For example, to enter a chat the browser might send the following request to the server:

GET http://chat.ru/?login=Algol&pass=Algol HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Accept-Language: en
Cookie: yandexuid=2464977781018373381
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
Host: yandex.ru
Referer: narod.ru
Proxy-Connection: Keep-Alive

As we can see, the request line contains the user's login and password passed through the URL. This way of sending data to the server is convenient but limited in capacity: very large amounts of data cannot be passed through the URL. For that purpose there is another type of request, POST . A POST request is very similar to GET , the only difference being that the data in a POST request is transmitted separately from the request header. The example above in its POST version looks like this:

POST http://chat.ru/ HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Accept-Language: en
Cookie: yandexuid=2464977781018373381
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
Host: yandex.ru
Referer: narod.ru
Proxy-Connection: Keep-Alive

login=Algol&pass=Algol

As we can see, the login and password are transmitted separately, in the request body. The body of the request must be separated from the header by an empty line. When the server encounters an empty line in a POST request, everything that follows is treated as the request body (the data being sent). Note that the format of the data in the body of a POST request is arbitrary. Although the CGI format is used most often, it is not mandatory. Moreover, a POST request is not required to have a body at all; it can also pass data through the URL, just as GET does.
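The role of the empty line can be shown with a small sketch that builds a POST request and then splits it the way a server would. The helpers build_post and split_request are invented for this example. The Content-Type and Content-Length headers do not appear in the article's sample; they are added here as an assumption, since real clients normally send them with a body.

```python
from urllib.parse import urlencode, parse_qs


def build_post(target: str, host: str, fields: dict[str, str]) -> str:
    """Assemble a POST request whose body is in CGI (URL-encoded) format."""
    body = urlencode(fields)
    head = "\r\n".join([
        f"POST {target} HTTP/1.0",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        # The encoded body is pure ASCII, so characters equal bytes.
        f"Content-Length: {len(body)}",
    ])
    # The empty line separates the header from the body.
    return head + "\r\n\r\n" + body


def split_request(raw: str) -> tuple[str, str]:
    """Return (header text, body text), split at the first empty line."""
    head, _, body = raw.partition("\r\n\r\n")
    return head, body


raw = build_post("http://chat.ru/", "chat.ru", {"login": "Algol", "pass": "Algol"})
head, body = split_request(raw)
print(body)            # login=Algol&pass=Algol
print(parse_qs(body))  # {'login': ['Algol'], 'pass': ['Algol']}
```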

Besides the CGI format, large amounts of information (for example, files) are sometimes sent in the so-called multipart format:

POST http://photo.bigmir.net/form.php HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
Referer: http://photo.bigmir.net/form.php
Accept-Language: en
Content-Type: multipart/form-data; boundary=---------------------------7d20345dc
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)
Host: photo.bigmir.net
Proxy-Connection: Keep-Alive
Pragma: no-cache
Cookie: Ukrainian=2; BSX_TestCookie=Yes; rich_ad=1; b=1

-----------------------------7d20345dc
Content-Disposition: form-data; name="id"

254353
-----------------------------7d20345dc
Content-Disposition: form-data; name="d"

22
-----------------------------7d20345dc
Content-Disposition: form-data; name="login"

Algol
-----------------------------7d20345dc
Content-Disposition: form-data; name="passw"

Algol
-----------------------------7d20345dc
Content-Disposition: form-data; name="email"

tps99@mail.ru
-----------------------------7d20345dc
Content-Disposition: form-data; name="submit"

Upload
-----------------------------7d20345dc--

Note the header line Content-Type: multipart/form-data; boundary=---------------------------7d20345dc . This parameter tells the server that the client is sending data in multipart format with the delimiter ---------------------------7d20345dc (in the body, each delimiter line is prefixed with two extra hyphens). The delimiter is generated randomly by the client and is needed so that the server can separate the different elements sent in the body of the request. As you can see, the body contains several elements, which are transmitted as plain ASCII (rather than URL-encoded, as CGI requires) and are separated by the line specified in the Content-Type parameter. Each part carries information about the type of the data transferred and the name of that part. The convenience of the multipart format is that the transmitted data can be of unlimited size and does not need any prior encoding.
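How a server might use the delimiter to separate the parts can be sketched in a few lines. The function parse_multipart is a hypothetical, simplified illustration: it handles only plain text fields like those in the sample, with no file uploads and no per-part Content-Type.

```python
import re


def parse_multipart(body: str, boundary: str) -> dict[str, str]:
    """Return {field name: value} for a simple multipart/form-data body."""
    fields = {}
    # In the body every delimiter line is "--" followed by the boundary.
    delimiter = "--" + boundary
    # The piece before the first delimiter is an empty preamble; skip it.
    for piece in body.split(delimiter)[1:]:
        if piece.startswith("--"):
            break  # the closing delimiter ends with an extra "--"
        part = piece.removeprefix("\r\n").removesuffix("\r\n")
        # Inside a part, an empty line separates its header from its data.
        head, _, value = part.partition("\r\n\r\n")
        match = re.search(r'name="([^"]*)"', head)
        if match:
            fields[match.group(1)] = value
    return fields


boundary = "---------------------------7d20345dc"
body = "\r\n".join([
    "--" + boundary,
    'Content-Disposition: form-data; name="login"',
    "",
    "Algol",
    "--" + boundary,
    'Content-Disposition: form-data; name="email"',
    "",
    "tps99@mail.ru",
    "--" + boundary + "--",
    "",
])
print(parse_multipart(body, boundary))
```

This prints the two fields, login and email, with the values taken from the sample request.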

Besides GET and POST , there are other request types, such as TRACE and PUT . But they are rarely used, and we will not dwell on them.

Once more, note that ALL information sent by the client to the server is contained in the header and body of the request. The server cannot receive information from the client over HTTP in any other way.

Likewise, the server can give the client information only in the response to a request. Any exchange of data over HTTP is initiated only by the client; the server cannot send anything "just like that", only at the client's request.

Thus, if we can control the requests being sent, we fully control the information received by both the server and the client. This is convenient, because to modify the data being sent or requested there is no need to edit HTML page files, change cookies and so on; it is enough to change the HTTP request and send it to the server. But that is another story :) ...