
Tactics for dealing with misbehaving robots

@Angela700

Posted in: #Apache #UserAgent #WebCrawlers

I have a site that, for regulatory reasons, may not be indexed or searched automatically. This means that we need to keep all robots away and prevent them from spidering the site.

Obviously we've had a robots.txt file that disallows everything right from the start. However, obeying robots.txt is something only well-behaved robots do. Recently we've had issues with some less well-behaved robots. I've configured Apache to ban a few user agents, but it is pretty easy to get around that.
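For reference, the current setup amounts to roughly the following (the bot names are just examples of user agents we've banned):

    # robots.txt - keeps compliant crawlers away
    User-agent: *
    Disallow: /

    # .htaccess - refuse requests from a few known user agents (easily spoofed)
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper|SiteSnagger) [NC]
    RewriteRule ^ - [F]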

So, the question is, is there some way to configure Apache (perhaps by installing some module?) to detect robot-like behavior and respond? Any other ideas?

At the moment all I can do is ban IP addresses based on manual inspection of the logs, and that is simply not a viable long-term strategy.


3 Comments


 

@Si4351233

As Gisle Hannemyr mentioned in a comment, the best way to do this is to require all users to log in, and not serve the restricted content to anyone who isn't logged in.

If you can't require logins for some reason, there are still a couple of fallbacks you can use (disclaimer: both of them are either partly or completely my fault):


The OWASP ModSecurity Core Rule Set contains a number of rules designed to detect automation, even when the bot has taken steps to disguise itself as a browser (e.g. faking its User-Agent string). If you are in full control of your server, such as a VPS, dedicated server, or something larger than that, then you can use these rules with ModSecurity (a minimal configuration sketch follows at the end of this answer).

This rule set also contains other rules meant to stop a wide variety of inappropriate activity; if you haven't looked at it, you definitely should.
If you aren't in full control of your server (i.e. you're on shared web hosting) and your host doesn't allow you to use your own ModSecurity rules, you can try something at the application level, such as my own Bad Behavior. I started this project in 2005 to fight blog spam and content scrapers such as those that concern you. It can be added to any PHP-based web site.

I should also note that many of Bad Behavior's rules have been incorporated into the ModSecurity Core Rule Set, so as long as you've enabled those rules, running both would be rather redundant. These rules are annotated in the Core Rule Set as originating from Bad Behavior.
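If you do go the ModSecurity route, enabling the engine and loading the Core Rule Set looks roughly like this in the Apache configuration; treat it as a sketch, since file paths and the CRS layout vary by distribution and version:

    <IfModule security2_module>
        SecRuleEngine On
        # Load the OWASP CRS setup file, then its rule files
        Include /etc/modsecurity/crs/crs-setup.conf
        Include /etc/modsecurity/crs/rules/*.conf
    </IfModule>

You'll likely want to start with SecRuleEngine DetectionOnly and watch the audit log for false positives before switching to blocking.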



 

@Sue5673885

You can piggyback on work other people have done in identifying bad IPs by using an Apache module which interfaces with Project Honeypot's IP blacklist. If you're doing this on a large scale, it would probably be polite to offer to run a honeypot.
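For context, http:BL is a DNS-based blacklist, so the module does a lookup for each visitor behind the scenes; the details below are illustrative, with a placeholder access key and IP address. For the visitor 203.0.113.42 the query would look something like

    youraccesskey.42.113.0.203.dnsbl.httpbl.org

and a response such as 127.3.5.1 roughly means: last seen 3 days ago, threat score 5, visitor type 1 (suspicious). Check the module's own documentation for its exact directives and thresholds.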



 

@Pope3001725

You can link to a hidden page that, when visited, captures the user agent and IP address of the bot and then appends one or both of them to a .htaccess file which blocks them permanently. It's automated, so you don't have to do anything to maintain it.
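A rough sketch of the idea, assuming the trap URL is disallowed in robots.txt (so well-behaved crawlers never reach it) and a small server-side script appends entries like these to the .htaccess file; the address and agent string are placeholders:

    # Appended automatically by the trap script
    SetEnvIfNoCase User-Agent "EvilScraper" trapped_bot
    <RequireAll>
        Require all granted
        Require not ip 203.0.113.42
        Require not env trapped_bot
    </RequireAll>

(On Apache 2.2 the equivalent would be Deny from lines.)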


