Robots.txt file

If you have ever looked at your site's visitor statistics, you will have noticed that it is periodically visited by various search engines. These visitors are not people but special programs, commonly called "robots". Robots crawl the site and index it, so that the resource can later be found through the search engine whose robot did the indexing.

All the "robots" before indexing the resource are looking for a file named robots.txt in the root directory of your site. This file contains information about which files "robots" can index, but which are not. This is useful in cases where you do not want to index some pages, for example, containing "private" information.

The robots.txt file must be a plain text file with Unix line endings. Some editors can convert ordinary Windows files, and some FTP clients can do it as well. The file consists of records, each of which contains a pair of fields: a line with the name of the client application (User-agent), and one or more lines beginning with the Disallow directive:
<Field> ":" <value>

The User-agent string contains the name of the robot. For example:
User-agent: googlebot

If you want to address all robots at once, you can use the wildcard character "*":
User-agent: *

Robot names can be found in your web server's access logs.

The second part of a record consists of Disallow lines. These lines are directives for the given robot: they tell it which files and/or directories it is not allowed to index. For example:
Disallow: /email.htm

The directive can also contain the name of a directory:
Disallow: /cgi-bin/

Disallow values are treated as path prefixes. According to the standard, the directive Disallow: /bob will prevent spiders from indexing both /bob.html and /bob/index.html.
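
Putting it all together, a complete record might look like this (the /private/ directory is only a placeholder for illustration; the other paths come from the examples above):

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /email.htm

This record tells every robot to skip the /cgi-bin/ and /private/ directories and the /email.htm page; everything else remains open for indexing.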

An empty Disallow directive means that the robot may index all files. At least one Disallow line must be present for each User-agent field for robots.txt to be considered valid. A completely empty robots.txt is treated the same as if it did not exist at all.
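
If you want to check how a robot would interpret your rules, Python's standard library includes urllib.robotparser. The sketch below parses a small robots.txt inline and asks whether a given robot may fetch a given URL; the host name and paths are only examples.

from urllib.robotparser import RobotFileParser

# A small robots.txt written inline for the example; normally you would
# point the parser at the live file with set_url() and read().
rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /email.htm
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse() accepts an iterable of lines

# can_fetch(user_agent, url) reports whether the URL is allowed for that robot.
print(parser.can_fetch("googlebot", "https://example.com/cgi-bin/script"))  # False
print(parser.can_fetch("googlebot", "https://example.com/index.html"))      # True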