
How to protect SHTML pages from crawlers/spiders/scrapers?

@Reiling115

Posted in: #ScraperSites #Security

I have A LOT of SHTML pages I want to protect from crawlers, spiders & scrapers.

I understand the limitations of SSIs. Feel free to suggest an implementation of the following in conjunction with any technology or technologies you wish:


The idea is that if you request too many pages too fast, you're added to a blacklist for 24 hours and shown a captcha instead of content on every page you request. If you enter the captcha correctly, you're removed from the blacklist.
There is a whitelist so Googlebot, etc. will never get blocked.


What is the best/easiest way to implement this idea?

Server = IIS

Cleaning the old tuples out of the database every 24 hrs is easily done, so no need to explain that.
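
A minimal sketch of that flow, using Flask purely for illustration (the actual stack here is IIS/SSI, where the same checks would live in an ASP.NET module or handler); the whitelist entry, thresholds, and captcha helper are made-up placeholders:

```python
# Sketch of the "too many pages too fast -> 24 h blacklist + captcha" idea.
# Flask keeps the example short; thresholds and the whitelist are placeholders.
import time
from collections import defaultdict, deque

from flask import Flask, request

app = Flask(__name__)

WHITELIST = {"66.249.66.1"}          # e.g. verified Googlebot addresses
MAX_REQUESTS = 30                    # pages allowed...
WINDOW_SECONDS = 10                  # ...within this many seconds
BLACKLIST_SECONDS = 24 * 60 * 60     # 24 hr ban

recent = defaultdict(deque)          # ip -> timestamps of recent requests
blacklist = {}                       # ip -> time the ban expires

@app.before_request
def rate_limit():
    ip = request.remote_addr
    if ip in WHITELIST:
        return None                  # whitelisted crawlers are never blocked

    now = time.time()

    # Already blacklisted? Show the captcha instead of content.
    if blacklist.get(ip, 0) > now:
        return serve_captcha(ip)

    # Slide the window and count this IP's recent requests.
    window = recent[ip]
    window.append(now)
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()

    if len(window) > MAX_REQUESTS:
        blacklist[ip] = now + BLACKLIST_SECONDS
        return serve_captcha(ip)
    return None                      # fall through to the real page

def serve_captcha(ip):
    # Placeholder: render a captcha form; a correct answer posted back
    # would delete blacklist[ip] (and the DB row backing it).
    return "Too many requests - solve the captcha to continue", 429
```

In practice the recent-request and blacklist stores would live in the database, so the 24-hour cleanup job mentioned above can purge expired rows.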


4 Comments


 

@Sherry384

Make a spider trap:


Make a page like /spider-trap/somepage.html
Block the page in robots.txt: Disallow: /spider-trap/
Place a link to this page, but hide it from human eyes
Block the IP of anything that reaches this page
Show a human-readable hint and an "unlock my IP" captcha on this page
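
A minimal sketch of the trap, again in Flask purely for illustration; the ban store, the 24-hour duration, and the response text are placeholders:

```python
# Sketch of a spider trap: robots.txt disallows /spider-trap/, so anything
# that still requests it ignored robots.txt and gets its IP banned.
import time

from flask import Flask, request

app = Flask(__name__)
banned = {}  # ip -> time the ban expires

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers read this and stay out of the trap.
    return ("User-agent: *\nDisallow: /spider-trap/\n",
            200, {"Content-Type": "text/plain"})

@app.route("/spider-trap/<path:anything>")
def spider_trap(anything):
    # Only clients that ignore robots.txt and follow a link hidden from
    # humans (e.g. styled display:none) ever land here.
    banned[request.remote_addr] = time.time() + 24 * 60 * 60
    return "You hit the spider trap. Solve the captcha to unblock your IP.", 403

@app.before_request
def enforce_ban():
    # Banned IPs see a captcha placeholder instead of any content.
    if banned.get(request.remote_addr, 0) > time.time():
        return "Blocked - solve the captcha to unblock your IP.", 403
    return None
```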



 

@Hamm4606531

Are you providing instructions to bots in a robots.txt file?

Are you using a meta element with a "noindex" value in its content attribute?

Have you specified a slower crawl rate in Google Webmaster Tools (or the equivalent for whichever crawler you are having issues with)?



 

@Speyer207

Well, you can render everything into an image format... that sort of thing tends to cause me issues, and it can be done reliably using ImageMagick, for example. I tend to scrape a lot of government sites, and they store vast quantities of information in scanned documents. Blech. What a pain. But judicious use of OCR will foil this sort of security.

If it can be viewed, it can be scraped. Your only recourse is to attempt to identify "mechanical" traffic by monitoring incoming requests and checking the intervals between page requests. If an IP is requesting multiple pages a second, it's almost certainly a scraper or a spider. Or if it requests one page every 10 seconds, or some other impossibly precise interval. Google uses a learning algorithm to spot scraper-like traffic, but I can count on one hand the number of times I've tripped it (though I very seldom run up against Google content).
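
A minimal sketch of that interval check, with made-up thresholds; it flags an IP whose requests are either too fast or suspiciously regular:

```python
# Sketch of interval-based bot detection: flag an IP that requests pages
# faster than a human could, or at intervals so regular they look scripted.
# The thresholds are arbitrary placeholders.
import statistics

def looks_mechanical(timestamps, min_requests=10):
    """timestamps: sorted request times (in seconds) for one IP."""
    if len(timestamps) < min_requests:
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap < 1.0:                          # several pages per second
        return True
    # Near-zero spread means impossibly precise, clockwork requests.
    return statistics.pstdev(gaps) < 0.05 * mean_gap

# Example: one page exactly every 10 seconds is flagged as mechanical.
print(looks_mechanical([t * 10.0 for t in range(12)]))   # True
```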

A clever scripter will have a random amount of delay built in, however. If they are patient, there is effectively nothing you can do to stop them. Perhaps set an upper limit per IP? You risk alienating your biggest users.

Some people try blocking unknown HTTP_USER_AGENTs, but that's a waste of time: it'll only stop the same people who would respect a robots.txt file.



 

@Shanna517

You didn't specify a server technology, so this answer may not apply, but can't you simply move the SSI pages into a directory that the identity the web server runs as has access to, but the anonymous ID does not?


