
The mod_rewrite module, part 3

In the two previous parts, we learned the basics of URL rewriting rules and of rule conditions. Let me offer two examples for consideration that illustrate more complex applications.

The first example deals with dynamic pages, while the second shows how requests for “.txt” files can be intercepted and various actions performed on them.

Suppose we have an online store selling some goods. Customers access the product descriptions via a script:

http://www.yoursite.com/cgi-bin/shop.cgi?product1
http://www.yoursite.com/cgi-bin/shop.cgi?product2
http://www.yoursite.com/cgi-bin/shop.cgi?product3

These addresses are presented as links on most pages of the site.

Now suppose you decide to submit the site to search engines for indexing. A small problem awaits you: not all search engines accept, understand, and index URLs that contain the “?” character.

A more natural and acceptable URL for a search engine looks like this:

http://www.yoursite.com/cgi-bin/shop.cgi/product1

Here the “?” character is simply replaced by “/”.
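For clarity, here is how a CGI script sees the two URL styles (a sketch based on the standard CGI environment variables; exact behavior depends on your server configuration):

# /cgi-bin/shop.cgi?product1  ->  QUERY_STRING = "product1"
# /cgi-bin/shop.cgi/product1  ->  PATH_INFO    = "/product1"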

An even more convenient URL from the search engine's point of view would be:

http://www.yoursite.com/shop/product1

For the search engine, “shop” now looks like a directory containing the products product1, product2, and so on.

If a user follows such a link from a search engine's results page, the request must be transformed back into a call to shop.cgi?product1.

To achieve this effect, you can use mod_rewrite with the following construct in the .htaccess file:

RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^(.*)shop/(.*)$ $1cgi-bin/shop.cgi?$2

The variables $1 and $2 are the so-called backreferences. They are bound to the parenthesized groups: the requested URL is broken into parts, and everything before “shop/” plus everything after “shop/” is captured and stored in these two variables, $1 and $2.
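To make this concrete, here is a hypothetical trace of the rule for one request (the values follow from the pattern above, using a product name from our example URLs):

# Request:                 /shop/product1
#   $1 = ""                (everything before "shop/")
#   $2 = "product1"        (everything after "shop/")
# Internally rewritten to: /cgi-bin/shop.cgi?product1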

Up to this point, our examples used rules of the form:

RewriteRule ^.htaccess*$ - [F]

(here “-” means that no substitution is performed, and the F flag makes Apache answer with 403 Forbidden).

However, so far we have not performed a true rewrite of a URL, in the sense of transforming one URL into another.

For our rule

RewriteRule ^(.*)shop/(.*)$ $1cgi-bin/shop.cgi?$2

the general syntax is:

RewriteRule currentURL rewrittenURL

As you can see, this directive performs an actual “rewrite” of the URL.
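Note that, as written, this is an internal rewrite: the visitor's browser continues to display /shop/product1. Purely as a sketch (and not what we want here, since the static-looking URL should stay visible), appending the R flag would instead send an external redirect:

RewriteRule ^(.*)shop/(.*)$ $1cgi-bin/shop.cgi?$2 [R]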

In addition to these entries in the .htaccess file, you need to replace all links on the site that have the form “cgi-bin/shop.cgi?product” with links of the form “shop/product”.

Now, when a search engine finds a page with such links, it will index the site without any visible problems.

In this way, you can turn a purely dynamic site into one with a seemingly static structure, which is obviously useful for indexing by various search engines. Note the form of the URLs on this site: on top of everything else, they have a structure that is easy for humans to read, so-called human-readable URLs. But we will talk about this in another article.

In our second example, we will discuss how to redirect requests for “.txt” files to a script.

Many Apache hosting providers supply log files in the Common Log Format only. This means the entries will contain no referrer or user-agent fields.
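For comparison, a Common Log Format entry looks roughly like this (hypothetical values); the combined format would additionally append the referrer and user agent in quotes:

127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /robots.txt HTTP/1.0" 200 2326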

However, for requests to the file “robots.txt” it is preferable to have access to all of this data, so as to know more about search engine visits than just their IP addresses. To arrange this, the following entries should go into “.htaccess”:

RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^robots\.txt$ /text.cgi?%{REQUEST_URI}

Now, when the “robots.txt” file is requested, our RewriteRule redirects the visitor (or robot) to the request-processing script text.cgi. In addition, a variable is passed to the script and can be processed according to your needs: “REQUEST_URI” holds the name of the requested file, in this example “robots.txt”. The script reads the contents of “robots.txt” and sends them to the web browser or search engine robot. This way, we can count visitor hits and maintain our own log files.
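Step by step, the request flow looks roughly like this (an assumed trace based on the rule above):

# Robot requests:  GET /robots.txt
# Rewritten to:    /text.cgi?/robots.txt
# In text.cgi:     $ENV{'QUERY_STRING'} contains "/robots.txt"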

For this purpose, the script uses environment variables such as $ENV{'HTTP_USER_AGENT'}, etc. This provides all the required information. Here is the source code of the CGI script mentioned above:

#!/usr/bin/perl
# If required, adjust the line above to point to Perl 5.
##########################################
# (c) Copyright 2000 by fantomaster.com  #
# All rights reserved.                   #
##########################################

$stats_dir = "stats";
$log_file  = "stats.log";

# Collect the request data from the CGI environment.
$remote_host   = "$ENV{'REMOTE_HOST'}";
$remote_addr   = "$ENV{'REMOTE_ADDR'}";
$user_agent    = "$ENV{'HTTP_USER_AGENT'}";
$referer       = "$ENV{'HTTP_REFERER'}";
$document_name = "$ENV{'QUERY_STRING'}";

# Read the real robots.txt from disk.
open (FILE, "robots.txt");
@TEXT = <FILE>;
close (FILE);

&get_date;
&log_hits ("$date $remote_host $remote_addr $user_agent $referer $document_name\n");

# Return the file contents to the requesting client.
print "Content-type: text/plain\n\n";
print @TEXT;
exit;

sub get_date {
    ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime();
    $mon++;
    $sec  = sprintf ("%02d", $sec);
    $min  = sprintf ("%02d", $min);
    $hour = sprintf ("%02d", $hour);
    $mday = sprintf ("%02d", $mday);
    $mon  = sprintf ("%02d", $mon);
    $year = scalar localtime;
    $year =~ s/.*?(\d{4})/$1/;
    $date = "$year-$mon-$mday, $hour:$min:$sec";
}

sub log_hits {
    open (HITS, ">>$stats_dir/$log_file");
    print HITS @_;
    close (HITS);
}
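With this in place, a line appended to stats/stats.log might look like the following (all values hypothetical):

2000-10-10, 13:55:36 crawler.example.com 212.37.64.12 ExampleBot/1.0 http://www.yoursite.com/ /robots.txt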

Upload the file with this content to the server's root (DocumentRoot) directory and set its permissions to 755 (chmod 755). Then create the “stats” directory; note that it must be writable by the web server user, or the log file cannot be created.

If your server settings do not allow CGI scripts to execute in the main (DocumentRoot) directory, try the following variant instead:

RewriteRule ^robots\.txt$ /cgi-bin/text.cgi?%{REQUEST_URI}

Please note that in this case you will also have to adjust the paths in the script code, since, for example, open (FILE, "robots.txt") will no longer find the file relative to the cgi-bin directory.

Finally, here is the solution to the problem posed in the previous part of this publication:

RewriteCond %{REMOTE_ADDR} ^212.37.64
RewriteRule ^.*$ - [F]

If we write “^212.37.64” in the regular expression instead of “^212.37.64.” (with a dot at the end), will it have the same effect, and will the same IP addresses be excluded?

The regular expression ^212.37.64 matches, and the rule therefore applies to, the following strings:

212.37.64
212.37.640
212.37.641
212.37.64a
212.37.64abc
212.37.64.12
212.37.642.12

So the last digit “4” may be followed by an arbitrary string of characters. However, the maximum IP value is the address 255.255.255.255, which implies that, for example, 212.37.642.12 is an incorrect (invalid) IP address. The only valid IP address in the list above is 212.37.64.12! In practice, then, both variants block the same real-world addresses, but the version with the trailing dot expresses the intent precisely.
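Strictly speaking, an unescaped “.” in a regular expression matches any character, so the most precise form escapes the dots as well. A stricter sketch of the blocking rule (assuming the same 212.37.64.* range as above):

RewriteCond %{REMOTE_ADDR} ^212\.37\.64\.
RewriteRule ^.*$ - [F]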

To be continued...