Which tools can help limit maximum page views per IP to limit scrapers and bots?

@Shanna517

Posted in: #Googlebot #Iptables #ScraperSites #WebCrawlers

I would like to prevent scrapers grabbing all my content except Google, Bing and other search engines. I am thinking of going with Fail2ban and limiting hits from an IP maybe at around 1000 per day. Is this a good idea? Would there be a better way?
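
For reference, a minimal sketch of that kind of Fail2ban setup (the jail name, file paths, and the 1000-hits-per-day threshold below are illustrative placeholders, not tested recommendations):

# /etc/fail2ban/filter.d/http-rate-limit.conf
[Definition]
# Match every request line in a combined-format access log
failregex = ^<HOST> -.*"(GET|POST|HEAD)
ignoreregex =

# /etc/fail2ban/jail.local
[http-rate-limit]
enabled  = true
port     = http,https
filter   = http-rate-limit
logpath  = /var/log/apache2/access.log
# Count hits over one day and ban an IP after roughly 1000 requests
findtime = 86400
maxretry = 1000
bantime  = 86400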




5 Comments

Sorted by latest first

 

@Ann8826881

I think the most efficient ways to limit unwanted hosts and IPs are:


Block them outside your server to reduce the load on it.
Use internal IP filtering/firewall rules, which reduces the load on your web server application.
Block them using your web server.


The first requires dedicated hardware or a proxy server.

The second can be done via a control panel (e.g., cPanel or Plesk), or manually by creating IP filtering/firewall rules (covered in other answers).

The third can be done in IIS via its GUI, in Apache using modules (covered in other answers), or in Apache's configuration like this:

# Return 403 Forbidden for requests referred from unwanted domains
RewriteEngine on
RewriteCond %{HTTP_REFERER} baddomain01\.com [NC,OR]
RewriteCond %{HTTP_REFERER} baddomain02\.com [NC]
RewriteRule .* - [F]


This last approach is a good option since you won't be banning specific IP addresses or ranges, which could lock out schools, large businesses, or libraries that share a single outgoing IP address behind NAT.

You can often spot frequent scraper and bot hosts in your web server's access and error logs, which is easy to do using a stats application.
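
For example, a quick shell one-liner over the access log (the log path is illustrative; adjust it for your setup) will surface the busiest client IPs:

# List the 20 client IPs with the most requests in the access log
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20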



 

@Turnbaugh106

There are many ways this can be done within Apache using modules; alternatively, you can set up iptables to do the job, though personally I just use the modules.

mod_security

I've personally used this and it does the job well; a good article about limiting requests can be found here.
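
As a rough sketch of the idea (not taken from that article; the rule IDs and the 1000-requests threshold are arbitrary, and persistent collections assume SecDataDir is already configured), a per-IP request counter in ModSecurity v2 might look something like this:

# Count requests per client IP in a persistent collection that expires after one day
SecAction "id:100001,phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR},setvar:ip.requests=+1,expirevar:ip.requests=86400"

# Deny clients that exceed the per-day request limit
SecRule IP:REQUESTS "@gt 1000" "id:100002,phase:1,deny,status:429,log,msg:'Per-IP request limit exceeded'"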

mod_evasive


Detection is performed by creating an internal dynamic hash table of
IP Addresses and URIs, and denying any single IP address from any of
the following:

Requesting the same page more than a few times per second
Making more than 50 concurrent requests on the same child per second
Making any requests while temporarily blacklisted (on a blocking list)
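
A typical configuration sketch (the values are illustrative rather than tuned recommendations, and assume the module registers itself as mod_evasive20):

<IfModule mod_evasive20.c>
    DOSHashTableSize  3097
    # Same URI requested more than 5 times within 1 second
    DOSPageCount      5
    DOSPageInterval   1
    # More than 50 requests to the whole site within 1 second
    DOSSiteCount      50
    DOSSiteInterval   1
    # Block the offending IP for 10 minutes
    DOSBlockingPeriod 600
    # Never block localhost
    DOSWhitelist      127.0.0.1
</IfModule>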


Another option:

mod_qos


The current release of the mod_qos module implements control
mechanisms to manage:

The maximum number of concurrent requests to a location/resource (URL) or virtual host.
Limitation of the bandwidth, such as the maximum allowed number of requests per second to a URL or the maximum/minimum of downloaded kbytes per second.
Limits the number of request events per second (special request conditions).
It can also "detect" very important persons (VIP) which may access the web server without or with fewer restrictions.
Generic request line and header filter to deny unauthorized operations.
Request body data limitation and filtering (requires mod_parp).
Limitations on the TCP connection level, e.g., the maximum number of allowed connections from a single IP source address or dynamic
keep-alive control.
Prefers known IP addresses when server runs out of free TCP connections.
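
A minimal sketch of how a few of these controls are expressed in the Apache configuration (the limits are arbitrary examples and /downloads is a placeholder location):

<IfModule mod_qos.c>
    # Limit concurrent TCP connections per client IP
    QS_SrvMaxConnPerIP  20
    # Limit concurrent requests to a single location
    QS_LocRequestLimit  /downloads 10
    # Stop offering keep-alive once 600 connections are open
    QS_SrvMaxConnClose  600
</IfModule>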


mod_dosevasive


The IP address of the client is checked in the temporary blacklist of the hash table. If the IP address is listed, then the client is denied access with a 403 Forbidden.

If the client is not currently on the blacklist, then the IP address of the client and the Universal Resource Identifier (URI) being requested are hashed into a key. Mod_Dosevasive will then check the listener's hash table to verify if any of the same hashes exist. If it does, it will then evaluate the total number of matched hashes and the timeframe that they were requested in versus the thresholds specified in the httpd.conf file by the Mod_Dosevasive directives.

If the request does not get denied by the preceding check, then just the IP address of the client is hashed into a key. The module will then check the hash table in the same fashion as above. The only difference with this check is that it doesn't factor in what URI the client is checking. It checks to see if the client request number has gone above the threshold set for the entire site per the time interval specified.


Iptables Solution

# Assumes the web server runs on this host. Reject a new connection from a
# source IP that already has 5 or more recorded hits in the last 600 seconds:
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --rcheck --seconds 600 --hitcount 5 --name ATTACK --rsource -j REJECT --reject-with icmp-port-unreachable

# Otherwise record the source IP in the ATTACK list and accept the connection:
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --set --name ATTACK --rsource -j ACCEPT



 

@Heady270

Another approach to limiting scrapers and bots would be to implement a honeypot. Put up a page that only bots would find, and disallow it in robots.txt so that legitimate crawlers stay away. Any bot that then hits that URL gets blacklisted.

WPoison is a project that provides the source code for doing exactly that.
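
A rough sketch of how the trap can be wired together with robots.txt and Fail2ban (the /bot-trap/ URL, file paths, and ban time are hypothetical placeholders):

# robots.txt -- well-behaved crawlers are told to stay out of the trap
User-agent: *
Disallow: /bot-trap/

# /etc/fail2ban/filter.d/bot-trap.conf
[Definition]
# Any request for the trap URL marks the client IP for banning
failregex = ^<HOST> -.*"GET /bot-trap/
ignoreregex =

# /etc/fail2ban/jail.local
[bot-trap]
enabled  = true
port     = http,https
filter   = bot-trap
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400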



 

@Heady270

There is an Apache module called "robotcop" that is designed for this particular purpose.

Unfortunately, the website for that Apache module (www.robotcop.org) is no longer in service. Here is a Slashdot article announcing the launch of the robotcop module.

The module's source code (open source, Apache license) is still available from various places:

fossies.org/linux/privat/old/robotcop-src_0.6.tar.gz
http://slackware.org.uk/people/alphageek/pending/robotcop/



 

@Hamm4606531

A CDN service could sit in front of your site and filter out known crawlers; they also filter out spammers, and since they cache your images at locations around the world, your site would be faster.

I have been using CloudFlare for about a month on a site for a client, and they have seen a decrease in bandwidth use and an increase in traffic. CloudFlare also offers a free app called ScrapeShield (www.cloudflare.com/apps/scrapeshield), but scraping isn't a big problem for that site, so it hasn't caught anybody yet.


