How to protect SHTML pages from crawlers/spiders/scrapers?
I have A LOT of SHTML pages I want to protect from crawlers, spiders & scrapers.
I understand the limitations of SSIs. An implementation of the following can be suggested in conjunction with any technology/technologies you wish:
The idea is that if you request too many pages too fast, you're added to a blacklist for 24 hrs and shown a captcha instead of content on every page you request. If you enter the captcha correctly, you're removed from the blacklist.
There is a whitelist so GoogleBot, etc. will never get blocked.
Which is the best/easiest way to implement this idea?
Server = IIS
Cleaning out the old tuples from a DB every 24 hrs is easily done so no need to explain that.
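A minimal, framework-agnostic sketch of the flow described above, for illustration only: the names (MAX_HITS, should_show_captcha, etc.) and the in-memory dictionaries are invented here, and on IIS this logic would sit in an HTTP module or handler in front of the SHTML content, with the dictionaries replaced by the database table mentioned in the question.

    import time

    MAX_HITS = 30            # requests allowed per window (arbitrary)
    WINDOW_SECONDS = 10      # sliding-window size (arbitrary)
    BLOCK_SECONDS = 24 * 3600

    WHITELIST = {"66.249.66.1"}   # example entry; verify crawler IPs via reverse DNS

    hits = {}        # ip -> list of recent request timestamps
    blacklist = {}   # ip -> time at which the block expires


    def should_show_captcha(ip):
        """Return True if this request should get a captcha instead of content."""
        now = time.time()

        if ip in WHITELIST:
            return False

        # Still blacklisted?
        if ip in blacklist:
            if now < blacklist[ip]:
                return True
            del blacklist[ip]            # 24 hours elapsed, block expires

        # Record the hit and drop timestamps that fell out of the window.
        recent = [t for t in hits.get(ip, []) if now - t < WINDOW_SECONDS]
        recent.append(now)
        hits[ip] = recent

        if len(recent) > MAX_HITS:
            blacklist[ip] = now + BLOCK_SECONDS
            return True
        return False


    def captcha_solved(ip):
        """Call when the captcha is answered correctly: lift the block."""
        blacklist.pop(ip, None)
        hits.pop(ip, None)

In production the two dictionaries would be the database table you already plan to purge every 24 hours, and the whitelist check should verify Googlebot and friends by reverse DNS rather than by user-agent string, since the latter is trivially spoofed.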
Make a spider trap:
Make a page like /spider-trap/somepage.html.
Block the page in robots.txt: Disallow: /spider-trap/
Place a link to this page, but hide it from human eyes.
Block the IP of ANYTHING that reaches this page.
Show a human-readable hint and an unlock-IP captcha on this page (see the sketch below).
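A rough sketch of the trap itself, reusing the same kind of per-IP blacklist as in the earlier snippet; the link markup and function names here are illustrative, not any particular framework's API.

    import time

    BLOCK_SECONDS = 24 * 3600
    blacklist = {}   # ip -> block expiry time (or reuse the table/dict above)

    # Link placed on normal pages: invisible to humans, but followed by bots
    # that ignore the Disallow: /spider-trap/ rule in robots.txt.
    TRAP_LINK = ('<a href="/spider-trap/somepage.html" '
                 'style="display:none" rel="nofollow">do not follow</a>')


    def handle_spider_trap(ip):
        """Whatever reaches /spider-trap/ gets its IP blocked and is shown a
        human-readable explanation plus an unlock captcha."""
        blacklist[ip] = time.time() + BLOCK_SECONDS
        return (
            "<p>This page exists only to catch misbehaving crawlers; "
            "your IP address has been blocked.</p>"
            "<p>If you are a human who clicked here by accident, solve the "
            "captcha below to unblock yourself.</p>"
            # ...render the captcha form here...
        )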
Are you providing instructions to bots in a robots.txt file?
Are you using a meta element with a "noindex" value in its content attribute?
Have you specified a slower crawl rate in the Google Webmaster (or whichever crawler you are having issues with) interface?
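For reference, the first two items translate into something like the snippet below. Note that Crawl-delay is honored by crawlers such as Bing and Yandex but not by Google, whose rate is set in the Webmaster/Search Console interface mentioned above, and none of this binds a crawler that chooses to ignore it.

    # robots.txt (at the site root) -- only well-behaved crawlers obey this
    User-agent: *
    Disallow: /spider-trap/
    Crawl-delay: 10

    <!-- in the <head> of pages that should stay out of search indexes -->
    <meta name="robots" content="noindex, nofollow">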
Well, you can render everything into an image format... That sort of thing tends to cause me issues, and it can be done reliably using ImageMagick, for example. I tend to scrape a lot of government sites, and they store vast quantities of information in scanned documents. Blech. What a pain. But judicious use of OCR will foil this sort of security.
If it can be viewed, it can be scraped. Your only recourse is to attempt to identify "mechanical" traffic by monitoring incoming requests and checking the intervals between page requests. If an IP is requesting multiple pages a second, it's almost certainly a scraper or a spider. Or if it requests one page every 10 seconds, or some other impossibly precise interval. Google uses a learning algorithm to spot scraper-like traffic, but I can count on one hand the number of times I've tripped it (though I very seldom run up against Google content).
A clever scripter will have a random amount of delay built in, however. If they are patient, there is effectively nothing you can do to stop them. Perhaps set an upper limit per IP? You risk alienating your biggest users.
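For illustration, a rough interval check along the lines described above; the thresholds are arbitrary, and the per-IP timestamp list is assumed to be collected elsewhere (for example by the sketch under the question).

    from statistics import mean, pstdev


    def looks_mechanical(timestamps):
        """True if the gaps between requests are very short or suspiciously regular."""
        if len(timestamps) < 5:
            return False
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        avg = mean(gaps)
        if avg < 1.0:                  # several pages per second
            return True
        if pstdev(gaps) < 0.05 * avg:  # near-constant interval, e.g. exactly every 10 s
            return True
        return False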
Some people try blocking unknown HTTP_USER_AGENTs, but that's a waste of time: it'll only stop the same people who would respect a robots.txt file.
You didn't specify a server technology so this answer may not apply, but can't you simply move the SSI pages to a directory that the identity the web server runs as can access, but the anonymous ID cannot?
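If you go that route on IIS, the NTFS side can be as simple as denying read access to the anonymous account. This is only a sketch: the path is made up, and IUSR is assumed to be the configured anonymous identity; anonymous requests for files that account cannot read are refused with a 401.3.

    rem Deny the anonymous IIS account read access to the protected directory
    icacls "C:\inetpub\wwwroot\protected" /deny "IUSR:(OI)(CI)R"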