Can I limit content scrapers by counting the number of hits from an IP?

@Jamie184

Posted in: #Logs #Protection #ScraperSites #WebCrawlers

I would like to ban aggressive scrapers that access x number of pages in an hour, say 1000. I plan to implement such a check via fail2ban, counting the hits from the same IP in the access logs. Should I be looking for other clues as well? I would whitelist the biggest crawlers, like Google and Bing, and ban the rest. I understand there may be some casualties if hundreds of users are behind one IP, but the ban would be temporary.

2 Comments

@Shakeerah822

Alas, if your data is worth getting, it will be gotten.
When we decided to handle IP addresses, the scrapers went horizontal on us and came back with thousands of IP addresses, each accessing us well under the limit we set.
I am myself hunting for the newest hints on how to stop robots that do not respect any protection.

@Becky754

Yes. You can. I do not know the tool you are using, but here are some clues.


Not a search engine (duh!).
High number of requests over a longer period of time, more than a user would make.
Does not request an image, CSS, or JS (some do, most don't).
Changes agent name and OS over a longer period of time.
Comes from a web-hosting IP address block/range.


Check for access over various periods of time: one hour, one day, one week, one month. The reason for this is that stealth bots will spread a high number of requests over a longer period of time. If the number of requests is higher than that of a user, then block them. Bots will stick out big-time. There will be no confusion on this one.
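
As a rough sketch of that multi-window check, assuming your access log is in the usual Common/Combined Log Format: the thresholds in `LIMITS` and the function names are made up for illustration, so tune them against what your real users do.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical thresholds: (window, most requests a human visitor might make).
LIMITS = [
    (timedelta(hours=1), 1000),
    (timedelta(days=1), 5000),
    (timedelta(weeks=1), 20000),
]

def parse_line(line):
    """Return (ip, timestamp) from a Common/Combined Log Format line."""
    ip = line.split(" ", 1)[0]
    stamp = line[line.index("[") + 1 : line.index("]")]
    return ip, datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")

def flag_ips(lines, now, limits=LIMITS):
    """Return {ip: [windows whose limit was exceeded]} as of `now`."""
    hits = defaultdict(list)
    for line in lines:
        ip, ts = parse_line(line)
        hits[ip].append(ts)

    flagged = {}
    for ip, stamps in hits.items():
        exceeded = [
            window
            for window, limit in limits
            if sum(1 for ts in stamps if now - window <= ts <= now) > limit
        ]
        if exceeded:
            flagged[ip] = exceeded
    return flagged
```

An IP that stays under the hourly limit but trips the daily or weekly one is the stealth case described above.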

Some bots will request images to escape automated blocking, but may or may not request CSS and JS (JavaScript). If you have a request for a page but none of the above, it is a bot. If you have a request for a page and only one or two of the three, then it is a bot. Create a bogus JS file (with code that does virtually nothing) and request it in your page. Keep it very small so that you are not affecting download speeds. You just want to see if it is requested.
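
A minimal sketch of the page-versus-assets check, again assuming Common/Combined Log Format. The extension list, the `min_pages` cutoff, and the function names are my own placeholders. Keep in mind that real browsers cache assets, so a very low `min_pages` will catch ordinary visitors.

```python
import re
from collections import defaultdict

# Client IP and requested path from a Common/Combined Log Format line.
REQUEST_RE = re.compile(r'^(\S+) .*?"(?:GET|POST|HEAD) (\S+)')

# Assumed image extensions; extend to match what your pages actually load.
IMAGE_EXT = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg", ".ico")

def classify(path):
    """Bucket a request path as 'image', 'css', 'js' or 'page'."""
    path = path.split("?", 1)[0].lower()
    if path.endswith(IMAGE_EXT):
        return "image"
    if path.endswith(".css"):
        return "css"
    if path.endswith(".js"):
        return "js"
    return "page"

def suspects(lines, min_pages=5):
    """Return {ip: [asset kinds never requested]} for IPs with enough page hits."""
    seen = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = REQUEST_RE.match(line)
        if match:
            ip, path = match.groups()
            seen[ip][classify(path)] += 1

    result = {}
    for ip, kinds in seen.items():
        if kinds["page"] < min_pages:
            continue
        missing = [k for k in ("image", "css", "js") if kinds[k] == 0]
        if missing:
            result[ip] = missing
    return result
```

The bogus JS file would simply show up here as a `js` request; an IP that loads many pages and never fetches it lands in the result.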

Some bots will change the user agent (browser) and OS, some won't. If over a longer period of time these elements change, it is a bot.
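
Here is one way this could be checked, assuming the Combined Log Format where the user agent is the last quoted field. The `max_agents` cutoff is an assumption of mine: an office or mobile carrier behind one IP will legitimately show several agents, so do not set it to 1.

```python
import re
from collections import defaultdict

# Capture the client IP and the final quoted field (the user agent).
UA_RE = re.compile(r'^(\S+) .*"([^"]*)"\s*$')

def agents_by_ip(lines):
    """Return {ip: set of distinct user-agent strings seen}."""
    agents = defaultdict(set)
    for line in lines:
        match = UA_RE.match(line)
        if match:
            agents[match.group(1)].add(match.group(2))
    return agents

def rotating_agents(lines, max_agents=3):
    """Return {ip: sorted agents} for IPs showing more than max_agents agents."""
    return {
        ip: sorted(uas)
        for ip, uas in agents_by_ip(lines).items()
        if len(uas) > max_agents
    }
```

Run it over a week or a month of logs rather than a single hour, since the change tends to show only over a longer period.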

Bots can sometimes come from subscriber lines (telcos), but most will not. If you check the IP address and it belongs to a web hosting company, then it is a bot. Over a period of time, you will learn where bots come from. It will become a no-brainer for the most part.
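
Once you have collected the hosting ranges you care about, the lookup itself is simple. This sketch uses Python's standard `ipaddress` module; the ranges in `HOSTING_RANGES` are placeholders (reserved documentation prefixes), not real hosting companies, so substitute the blocks you find in your own logs or in whois data.

```python
import ipaddress

# Placeholder ranges only (RFC 5737 documentation prefixes), not real hosts.
HOSTING_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

def is_hosting_ip(ip, ranges=HOSTING_RANGES):
    """Return True if the address falls inside any known hosting range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)
```

For example, `is_hosting_ip("192.0.2.55")` is true with the placeholder list above, while an address outside both ranges is not.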

If you are not sure whether an IP address or access is a bot, come back here and ask, making sure to post the original request. I should be able to help. I do this all day every day.

I have a much longer list of criteria, but those determinations get complicated and are protected processes. Sorry. But the list above should be more than enough for anyone to determine a bot 99% of the time.
