Can I limit content scrapers by counting the number of hits from an IP?

Posted in: #Logs #Protection #ScraperSites #WebCrawlers

I would like to ban aggressive scrapers that access more than a set number of pages in an hour, say 1000. I plan to implement this check with fail2ban, counting the hits from the same IP in the access logs. Should I be looking for other clues as well? I would whitelist the biggest crawlers, such as Google and Bing, and ban the rest. I understand there may be some casualties if hundreds of users are behind one IP, but the ban would be temporary.
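For illustration only, here is a minimal Python sketch of the counting idea described above. It is not a fail2ban configuration. It assumes the access log uses the Common or Combined Log Format and that lines arrive in chronological order; the function name, the `limit` default and the `whitelist` parameter are hypothetical.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Matches the start of a Common/Combined Log Format line: IP, then [timestamp].
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def find_heavy_hitters(lines, limit=1000, window=timedelta(hours=1), whitelist=()):
    """Return the IPs that made more than `limit` requests within any `window`."""
    recent = defaultdict(deque)   # ip -> timestamps still inside the window
    flagged = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue              # skip lines that are not in the assumed format
        ip, stamp = match.groups()
        if ip in whitelist:
            continue              # e.g. crawlers you have decided to trust
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hits = recent[ip]
        hits.append(when)
        # Drop timestamps that have fallen out of the sliding window.
        while when - hits[0] >= window:
            hits.popleft()
        if len(hits) > limit:
            flagged.add(ip)
    return flagged
```

Calling `find_heavy_hitters(open("access.log"))` on a log file (the file name is an assumption) would return the set of addresses to consider for a temporary ban.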


@Jamie184


2 Comments


@Shakeerah822

Alas, if your data is worth getting, it will be gotten.
When we started limiting requests per IP address, the scrapers went horizontal on us and came back with thousands of IP addresses, each accessing us well under the limit we set.
I am still hunting for new ideas on how to stop robots that do not respect any protection.


@Becky754

Yes. You can. I do not know the tool you are using, but here are some clues.

Not a search engine (duh!).
High number of requests over a longer period of time, more than a user would make.
Does not request an image, CSS, or JS (some do, most don't).
Changes agent name and OS over a longer period of time.
Comes from a web-hosting IP address block/range.

Check for access over various periods of time: one hour, one day, one week, one month. The reason for this is that stealth bots stay under any hourly limit but still make a high number of requests over a longer period of time. If the number of requests is higher than that of a user, then block them. Bots will stick out big-time. There will be no confusion on this one.
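A rough Python sketch of checking several periods at once, for one client whose request times have already been pulled from the logs and sorted. The limits in `WINDOW_LIMITS` are made-up numbers for illustration, not recommendations; they would need tuning against what real visitors do on your site.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Hypothetical per-window request limits; tune them against real visitor traffic.
WINDOW_LIMITS = {
    timedelta(hours=1): 1000,
    timedelta(days=1): 5000,
    timedelta(weeks=1): 20000,
    timedelta(days=30): 50000,
}

def exceeded_windows(timestamps, now, limits=WINDOW_LIMITS):
    """Return the windows whose limit one client exceeded.

    `timestamps` must be that client's request times in ascending order.
    """
    exceeded = []
    for window, limit in limits.items():
        # Binary search for the first request at or after the window start.
        first = bisect_left(timestamps, now - window)
        if len(timestamps) - first > limit:
            exceeded.append(window)
    return exceeded
```

A client that stays quiet within each hour can still show up in the one-day or one-week window, which is the stealth case described above.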

Some bots will request images to escape automated blocking, but may or may not request CSS and JS (JavaScript). Some will. If you have a request for a page but no requests for images, CSS, or JS, it is a bot. If you have a request for a page and only one or two of the three, then it is a bot. Create a bogus JS file (with code that does virtually nothing) and request it in your page. Keep it very small so that you are not affecting download speeds. You just want to see if it is requested.
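As a sketch of that check in Python, assuming you have already extracted `(ip, path)` pairs from the access log: the file extensions in `ASSET_KINDS` and both function names are assumptions for illustration. A bogus JS file would simply show up as a `js` request here. Keep in mind that browsers cache assets, so a returning visitor may not re-request them; looking at a long enough period reduces that effect.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed file extensions for each asset kind; adjust to what your site serves.
ASSET_KINDS = {
    "image": (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg", ".ico"),
    "css": (".css",),
    "js": (".js",),
}

def classify(path):
    """Return 'image', 'css', 'js', or 'page' for a requested URL path."""
    clean = urlsplit(path).path.lower()   # ignore any query string
    for kind, suffixes in ASSET_KINDS.items():
        if clean.endswith(suffixes):
            return kind
    return "page"

def suspicious_clients(requests):
    """Given (ip, path) pairs, return IPs that fetched pages but not all three asset kinds."""
    seen = defaultdict(set)
    for ip, path in requests:
        seen[ip].add(classify(path))
    return {ip for ip, kinds in seen.items()
            if "page" in kinds and not {"image", "css", "js"} <= kinds}
```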

Some bots will change the user agent (browser) and OS, some won't. If over a longer period of time these elements change, it is a bot.
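One way to spot this, sketched in Python on `(ip, user_agent)` pairs taken from the logs. The `max_agents` threshold is an arbitrary assumption: an IP shared by many real users will also show several agents, so treat the result as one clue among the others rather than proof.

```python
from collections import defaultdict

def agents_per_ip(requests):
    """Map each IP to the set of distinct User-Agent strings it sent."""
    agents = defaultdict(set)
    for ip, user_agent in requests:
        agents[ip].add(user_agent)
    return agents

def rotating_clients(requests, max_agents=3):
    """Return IPs that used more than `max_agents` distinct User-Agent strings."""
    return {ip for ip, seen in agents_per_ip(requests).items()
            if len(seen) > max_agents}
```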

Bots can sometimes come from subscriber lines (telcos), but most do not. If you check the IP address and it belongs to a web hosting company, then it is a bot. Over a period of time, you will learn where bots come from. It will become a no-brainer for the most part.
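Once you have collected the address ranges you consider hosting space, testing an address against them is simple with Python's standard `ipaddress` module. The ranges below are placeholders from the reserved documentation blocks, not real hosting providers, and `is_hosting_ip` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks, NOT real hosting providers.
HOSTING_RANGES = [ipaddress.ip_network(cidr)
                  for cidr in ("203.0.113.0/24", "2001:db8::/32")]

def is_hosting_ip(address, ranges=HOSTING_RANGES):
    """Return True if `address` falls inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    # An IPv4 address is never "in" an IPv6 network and vice versa.
    return any(ip in network for network in ranges)
```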

If you are not sure whether an IP address or access is a bot, come back here and ask, making sure to post the original request. I should be able to help. I do this all day every day.

I have a much longer list of criteria, but those determinations get complicated and are protected processes. Sorry. But the list above should be more than enough for anyone to identify a bot 99% of the time.

