The mod_rewrite module, part 3

In the two previous parts we covered the basics of URL rewrite rules and rewrite conditions. Now let us consider two examples illustrating more complex applications.

The first example deals with dynamic pages; the second shows how requests for ".txt" files can be intercepted and processed by a script.

Suppose we run an online store selling various goods. Customers reach the product descriptions through a script:

http://www.yoursite.com/cgi-bin/shop.cgi?product1
http://www.yoursite.com/cgi-bin/shop.cgi?product2
http://www.yoursite.com/cgi-bin/shop.cgi?product3

These addresses are presented as links on most pages of the site.

Now suppose you decide to submit the site to search engines for indexing. A small problem awaits you: not every search engine accepts, understands and indexes URLs that contain the "?" character.

A URL of the following form is more natural and acceptable to a search engine:

http://www.yoursite.com/cgi-bin/shop.cgi/product1

In this case, the "?" character is replaced by "/".

An even more convenient URL from the search engine's point of view would look like this:

http://www.yoursite.com/shop/product1

To the search engine, "shop" now looks like a directory containing the products product1, product2, and so on.

When a user follows such a link from a search results page, the request must be transformed back into a call to shop.cgi?product1.

To achieve this, you can use mod_rewrite by placing the following construct in the .htaccess file:

RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^(.*)shop/(.*)$ $1cgi-bin/shop.cgi?$2

The variables $1 and $2 are so-called backreferences. They refer to the parenthesized groups in the pattern: the requested URL is split into parts, and everything before "shop" plus everything after "shop/" is captured and stored in these two variables, $1 and $2.
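
To make the capturing concrete, here is a minimal sketch of how the rule above decomposes a request for /shop/product1 (in a DocumentRoot-level .htaccess the leading slash is stripped before the pattern is applied, so the pattern sees "shop/product1"):

# Pattern applied to:  shop/product1
#   $1 = ""            (everything before "shop")
#   $2 = "product1"    (everything after "shop/")
# Substitution:        $1cgi-bin/shop.cgi?$2  becomes  cgi-bin/shop.cgi?product1
# With RewriteBase / this is served internally as /cgi-bin/shop.cgi?product1
RewriteRule ^(.*)shop/(.*)$ $1cgi-bin/shop.cgi?$2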

Up to this point, our examples have used rules of the following type:

RewriteRule ^.htaccess*$ - [F]

However, we had not yet performed a true rewriting of URLs, in the sense of mapping one URL onto another.

For our rule

RewriteRule ^(.*)shop/(.*)$ $1cgi-bin/shop.cgi?$2

the general syntax is:

RewriteRule currentURL rewrittenURL

As you can see, this directive performs a genuine rewrite of the URL.

In addition to the entries in the .htaccess file, you also need to replace all links on the site of the form "cgi-bin/shop.cgi?product" with links of the form "shop/product".

Now, when a search engine finds a page with such links, it will index the site without any visible problems.

Thus a purely dynamic site can be given the appearance of a static structure, which clearly helps with indexing by the various search engines. Note the form of the URLs on such a site: besides everything else, they are also easy for a person to read - so-called human-readable URLs. But that is a topic for another article.

In our second example, we will discuss how to forward requests for ".txt" files to a CGI script.

Many hosting providers running Apache supply log files only in the Common Log Format. This means the log entries contain no fields for the referring page or the user agent.
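
For comparison, here is a sketch of the two standard LogFormat definitions as they typically appear in an Apache configuration; the combined format adds the Referer and User-Agent fields that the common format lacks (the exact lines in your provider's configuration may differ):

# Common Log Format: host, identity, user, time, request, status, bytes
LogFormat "%h %l %u %t \"%r\" %>s %b" common
# Combined Log Format: the same fields plus Referer and User-Agent
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog logs/access_log common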

For requests to the "robots.txt" file, however, it is preferable to have access to all of this data, so that we know more about visiting search engine robots than just their IP addresses. To arrange this, the following entries should be in .htaccess:

RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^robots.txt$ /text.cgi?%{REQUEST_URI}

Now, when "robots.txt" is requested, our RewriteRule redirects the visitor (robot) to the text.cgi script. In addition, a variable is passed to the script to be processed according to your needs: "REQUEST_URI" holds the name of the requested file, in this example "robots.txt". The script then reads the contents of "robots.txt" and sends them to the web browser or search engine robot. In this way we can count visitor hits and keep our own log file.

To this end the script uses environment variables such as $ENV{'HTTP_USER_AGENT'}, which ensures that all the required information is obtained. Here is the source of the CGI script mentioned above:

#!/usr/bin/perl
# If required, adjust the line above to point to Perl 5.
#################################
# (c) Copyright 2000 by fantomaster.com
# All rights reserved.
#################################

$stats_dir     = "stats";
$log_file      = "stats.log";
$remote_host   = "$ENV{'REMOTE_HOST'}";
$remote_addr   = "$ENV{'REMOTE_ADDR'}";
$user_agent    = "$ENV{'HTTP_USER_AGENT'}";
$referer       = "$ENV{'HTTP_REFERER'}";
$document_name = "$ENV{'QUERY_STRING'}";

# Read the real robots.txt so its contents can be returned to the client.
open (FILE, "robots.txt");
@TEXT = <FILE>;
close (FILE);

&get_date;

# Append one line per hit to the log file.
&log_hits ("$date $remote_host $remote_addr $user_agent $referer $document_name\n");

# Return the file as plain text (the blank line ends the HTTP header).
print "Content-type: text/plain\n\n";
print @TEXT;

exit;

sub get_date {
    ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime();
    $mon++;
    $sec  = sprintf ("%02d", $sec);
    $min  = sprintf ("%02d", $min);
    $hour = sprintf ("%02d", $hour);
    $mday = sprintf ("%02d", $mday);
    $mon  = sprintf ("%02d", $mon);
    # Extract the four-digit year from the full localtime string.
    $year = scalar localtime;
    $year =~ s/.*?(\d{4})/$1/;
    $date = "$year-$mon-$mday, $hour:$min:$sec";
}

sub log_hits {
    open (HITS, ">>$stats_dir/$log_file");
    print HITS @_;
    close (HITS);
}

Upload a file with this content to the server's root (DocumentRoot) directory and set its permissions to 755 (chmod 755). Then create the "stats" directory.

If your server settings do not allow you to run CGI scripts in the main directory (DocumentRoot), try the following variant instead:

RewriteRule ^robots.txt$ /cgi-bin/text.cgi?%{REQUEST_URI}

Note that in this case, it will be necessary to change the paths in the script code!

Finally, here is the solution to the problem posed in the previous part of this series:

RewriteCond %{REMOTE_ADDR} ^212.37.64
RewriteRule ^.*$ - [F]

If we write "^212.37.64" in the regular expression instead of "^212.37.64." (with a dot at the end), will it have the same effect, and will the same IP addresses be blocked?

The regular expression ^212.37.64 matches the following strings:

212.37.64
212.37.640
212.37.641
212.37.64a
212.37.64abc
212.37.64.12
212.37.642.12

Without the trailing dot, the final "4" may be followed by any string of characters. However, since no octet of an IP address can exceed 255, a string such as 212.37.642.12 is not a valid IP address. The only valid IP address in the list above is 212.37.64.12 - so in practice both patterns block the same real addresses.
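
For clarity, here is a sketch of a stricter variant of the same condition: escaping the dots so that they match only literal dots, and keeping the trailing dot so that the pattern can match only the 212.37.64.x range:

# Block requests from 212.37.64.x; the backslashes make the dots literal,
# and the trailing \. rules out strings such as 212.37.640 or 212.37.64a.
RewriteCond %{REMOTE_ADDR} ^212\.37\.64\.
RewriteRule ^.*$ - [F]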

To be continued...