Which tools can help limit maximum page views per IP to limit scrapers and bots?

@Shanna517

Posted in: #Googlebot #Iptables #ScraperSites #WebCrawlers

I would like to prevent scrapers grabbing all my content except Google, Bing and other search engines. I am thinking of going with Fail2ban and limiting hits from an IP maybe at around 1000 per day. Is this a good idea? Would there be a better way?
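
For reference, a minimal sketch of that kind of Fail2ban setup (the jail name, file paths, and the 1000-hits-per-day threshold below are illustrative placeholders, not tested recommendations):

# /etc/fail2ban/filter.d/http-rate-limit.conf
[Definition]
# Match every request line in a combined-format access log
failregex = ^<HOST> -.*"(GET|POST|HEAD)
ignoreregex =

# /etc/fail2ban/jail.local
[http-rate-limit]
enabled  = true
port     = http,https
filter   = http-rate-limit
logpath  = /var/log/apache2/access.log
# Count hits over one day and ban an IP after roughly 1000 requests
findtime = 86400
maxretry = 1000
bantime  = 86400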




5 Comments

Sorted by latest first

 

@Ann8826881

I think the most efficient ways to limit unwanted hosts and IPs are:


Block them outside your server to reduce the load on it.
Use internal IP filtering/firewall rules, which reduces the load on your web server application.
Block them using your web server.


The first requires dedicated hardware or a proxy server.

The second can be done via a control panel (e.g., cPanel or Plesk), or manually by creating IP filtering/firewall rules (covered in other answers).

The third can be done in IIS via its GUI, in Apache using modules (covered in other answers), or in Apache's configuration like this:

# Return 403 Forbidden for requests referred from unwanted domains
RewriteEngine on
RewriteCond %{HTTP_REFERER} baddomain01\.com [NC,OR]
RewriteCond %{HTTP_REFERER} baddomain02\.com [NC]
RewriteRule .* - [F]


This last approach is a good option since you won't be banning specific IP addresses or ranges, which could lock out schools, large businesses, or libraries that share a single outgoing IP address behind NAT.

You can often spot frequent scraper and bot hosts in your web server's access and error logs, which is easy to do using a stats application.
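
For example, a quick shell one-liner over the access log (the log path is illustrative; adjust it for your setup) will surface the busiest client IPs:

# List the 20 client IPs with the most requests in the access log
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20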



 

@Turnbaugh106

There are many ways this can be done within Apache using modules; alternatively, you can set up iptables to do the job, though personally I just use the modules.

mod_security

I've personally used this and it does the job well; a good article about limiting requests can be found here.
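
As a rough sketch of the idea (not taken from that article; the rule IDs and the 1000-requests threshold are arbitrary, and persistent collections assume SecDataDir is already configured), a per-IP request counter in ModSecurity v2 might look something like this:

# Count requests per client IP in a persistent collection that expires after one day
SecAction "id:100001,phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR},setvar:ip.requests=+1,expirevar:ip.requests=86400"

# Deny clients that exceed the per-day request limit
SecRule IP:REQUESTS "@gt 1000" "id:100002,phase:1,deny,status:429,log,msg:'Per-IP request limit exceeded'"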

mod_evasive


Detection is performed by creating an internal dynamic hash table of
IP Addresses and URIs, and denying any single IP address from any of
the following:

Requesting the same page more than a few times per second
Making more than 50 concurrent requests on the same child per second
Making any requests while temporarily blacklisted (on a blocking list)
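
A typical configuration sketch (the values are illustrative rather than tuned recommendations, and assume the module registers itself as mod_evasive20):

<IfModule mod_evasive20.c>
    DOSHashTableSize  3097
    # Same URI requested more than 5 times within 1 second
    DOSPageCount      5
    DOSPageInterval   1
    # More than 50 requests to the whole site within 1 second
    DOSSiteCount      50
    DOSSiteInterval   1
    # Block the offending IP for 10 minutes
    DOSBlockingPeriod 600
    # Never block localhost
    DOSWhitelist      127.0.0.1
</IfModule>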


Another option:

mod_qos


The current release of the mod_qos module implements control
mechanisms to manage:

The maximum number of concurrent requests to a location/resource (URL) or virtual host.
Limitation of the bandwidth, such as the maximum allowed number of requests per second to a URL or the maximum/minimum of downloaded kbytes per second.
Limits the number of request events per second (special request conditions).
It can also "detect" very important persons (VIP) which may access the web server without or with fewer restrictions.
Generic request line and header filter to deny unauthorized operations.
Request body data limitation and filtering (requires mod_parp).
Limitations on the TCP connection level, e.g., the maximum number of allowed connections from a single IP source address or dynamic
keep-alive control.
Prefers known IP addresses when server runs out of free TCP connections.
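
A minimal sketch of how a few of these controls are expressed in the Apache configuration (the limits are arbitrary examples and /downloads is a placeholder location):

<IfModule mod_qos.c>
    # Limit concurrent TCP connections per client IP
    QS_SrvMaxConnPerIP  20
    # Limit concurrent requests to a single location
    QS_LocRequestLimit  /downloads 10
    # Stop offering keep-alive once 600 connections are open
    QS_SrvMaxConnClose  600
</IfModule>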


mod_dosevasive


The IP address of the client is checked in the temporary blacklist of the hash table. If the IP address is listed, then the client is denied access with a 403 Forbidden.

If the client is not currently on the blacklist, then the IP address of the client and the Universal Resource Identifier (URI) being requested are hashed into a key. Mod_Dosevasive will then check the listener's hash table to verify if any of the same hashes exist. If it does, it will then evaluate the total number of matched hashes and the timeframe that they were requested in versus the thresholds specified in the httpd.conf file by the Mod_Dosevasive directives.

If the request does not get denied by the preceding check, then just the IP address of the client is hashed into a key. The module will then check the hash table in the same fashion as above. The only difference with this check is that it doesn't factor in what URI the client is checking. It checks to see if the client request number has gone above the threshold set for the entire site per the time interval specified.


Iptables Solution

# Assumes the web server runs on this host. Reject a new connection from a
# source IP that already has 5 or more recorded hits in the last 600 seconds:
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --rcheck --seconds 600 --hitcount 5 --name ATTACK --rsource -j REJECT --reject-with icmp-port-unreachable

# Otherwise record the source IP in the ATTACK list and accept the connection:
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --set --name ATTACK --rsource -j ACCEPT



 

@Heady270

Another approach to limiting scrapers and bots would be to implement a honeypot. Put up a page that only bots would find, and disallow it in robots.txt so that legitimate crawlers stay away. Any bot that then hits that URL gets blacklisted.

WPoison is a project that provides the source code for doing exactly that.
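
A rough sketch of how the trap can be wired together with robots.txt and Fail2ban (the /bot-trap/ URL, file paths, and ban time are hypothetical placeholders):

# robots.txt -- well-behaved crawlers are told to stay out of the trap
User-agent: *
Disallow: /bot-trap/

# /etc/fail2ban/filter.d/bot-trap.conf
[Definition]
# Any request for the trap URL marks the client IP for banning
failregex = ^<HOST> -.*"GET /bot-trap/
ignoreregex =

# /etc/fail2ban/jail.local
[bot-trap]
enabled  = true
port     = http,https
filter   = bot-trap
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400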



 

@Heady270

There is an Apache module called "robotcop" that is designed for this particular purpose.

Unfortunately, the website for that Apache module (www.robotcop.org) is no longer in service. Here is a Slashdot article announcing the launch of the robotcop module.

The module's source code (open source, Apache license) is still available from various places:

fossies.org/linux/privat/old/robotcop-src_0.6.tar.gz
http://slackware.org.uk/people/alphageek/pending/robotcop/



 

@Hamm4606531

A CDN service could sit in front of your site and filter out known crawlers; they also filter out spammers, and since they cache your images at locations around the world, your site would be faster.

I have been using CloudFlare for about a month on a site for a client, and they have seen a decrease in bandwidth use and an increase in traffic. CloudFlare also offers a free app called ScrapeShield (www.cloudflare.com/apps/scrapeshield), but scraping isn't a big problem for that site, so it hasn't caught anybody yet.


