How to protect SHTML pages from crawlers/spiders/scrapers?
I have A LOT of SHTML pages I want to protect from crawlers, spiders & scrapers.
I understand the limitations of SSIs. An implementation of the following can be suggested in conjunction with any technology/technologies you wish:
The idea is that if you request too many pages too fast, you're added to a blacklist for 24 hrs and shown a captcha instead of content on every page you request. If you enter the captcha correctly, you're removed from the blacklist.
There is a whitelist so GoogleBot, etc. will never get blocked.
Which is the best/easiest way to implement this idea?
Server = IIS
Cleaning out the old tuples from a DB every 24 hrs is easily done so no need to explain that.
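A minimal, framework-agnostic sketch of the flow described above, for illustration only: the names (MAX_HITS, should_show_captcha, etc.) and the in-memory dictionaries are invented here, and on IIS this logic would sit in an HTTP module or handler in front of the SHTML content, with the dictionaries replaced by the database table mentioned in the question.

    import time

    MAX_HITS = 30            # requests allowed per window (arbitrary)
    WINDOW_SECONDS = 10      # sliding-window size (arbitrary)
    BLOCK_SECONDS = 24 * 3600

    WHITELIST = {"66.249.66.1"}   # example entry; verify crawler IPs via reverse DNS

    hits = {}        # ip -> list of recent request timestamps
    blacklist = {}   # ip -> time at which the block expires


    def should_show_captcha(ip):
        """Return True if this request should get a captcha instead of content."""
        now = time.time()

        if ip in WHITELIST:
            return False

        # Still blacklisted?
        if ip in blacklist:
            if now < blacklist[ip]:
                return True
            del blacklist[ip]            # 24 hours elapsed, block expires

        # Record the hit and drop timestamps that fell out of the window.
        recent = [t for t in hits.get(ip, []) if now - t < WINDOW_SECONDS]
        recent.append(now)
        hits[ip] = recent

        if len(recent) > MAX_HITS:
            blacklist[ip] = now + BLOCK_SECONDS
            return True
        return False


    def captcha_solved(ip):
        """Call when the captcha is answered correctly: lift the block."""
        blacklist.pop(ip, None)
        hits.pop(ip, None)

In production the two dictionaries would be the database table you already plan to purge every 24 hours, and the whitelist check should verify Googlebot and friends by reverse DNS rather than by user-agent string, since the latter is trivially spoofed.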
Make a spider trap:
Make a page like /spider-trap/somepage.html.
Block the page in robots.txt: Disallow: /spider-trap/
Place a link to this page, but hide it from human eyes.
Block the IP of ANYTHING that reaches this page.
Show a human-readable hint and an unlock-IP captcha on this page (see the sketch below).
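A rough sketch of the trap itself, reusing the same kind of per-IP blacklist as in the earlier snippet; the link markup and function names here are illustrative, not any particular framework's API.

    import time

    BLOCK_SECONDS = 24 * 3600
    blacklist = {}   # ip -> block expiry time (or reuse the table/dict above)

    # Link placed on normal pages: invisible to humans, but followed by bots
    # that ignore the Disallow: /spider-trap/ rule in robots.txt.
    TRAP_LINK = ('<a href="/spider-trap/somepage.html" '
                 'style="display:none" rel="nofollow">do not follow</a>')


    def handle_spider_trap(ip):
        """Whatever reaches /spider-trap/ gets its IP blocked and is shown a
        human-readable explanation plus an unlock captcha."""
        blacklist[ip] = time.time() + BLOCK_SECONDS
        return (
            "<p>This page exists only to catch misbehaving crawlers; "
            "your IP address has been blocked.</p>"
            "<p>If you are a human who clicked here by accident, solve the "
            "captcha below to unblock yourself.</p>"
            # ...render the captcha form here...
        )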
Are you providing instructions to bots in a robots.txt file?
Are you using a meta element with a "noindex" value in its content attribute?
Have you specified a slower crawl rate in the Google Webmaster (or whichever crawler you are having issues with) interface?
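For reference, the first two items translate into something like the snippet below. Note that Crawl-delay is honored by crawlers such as Bing and Yandex but not by Google, whose rate is set in the Webmaster/Search Console interface mentioned above, and none of this binds a crawler that chooses to ignore it.

    # robots.txt (at the site root) -- only well-behaved crawlers obey this
    User-agent: *
    Disallow: /spider-trap/
    Crawl-delay: 10

    <!-- in the <head> of pages that should stay out of search indexes -->
    <meta name="robots" content="noindex, nofollow">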
Well, you can render everything into an image format... That sort of thing tends to cause me issues, and it can be done reliably using ImageMagick, for example. I tend to scrape a lot of government sites, and they store vast quantities of information in scanned documents. Blech. What a pain. But judicious use of OCR will foil this sort of security.
If it can be viewed, it can be scraped. Your only recourse is to attempt to identify "mechanical" traffic by monitoring incoming requests and checking the intervals between page requests. If an IP is requesting multiple pages a second, it's almost certainly a scraper or a spider. Or if it requests one page every 10 seconds, or some other impossibly precise interval. Google uses a learning algorithm to spot scraper-like traffic, but I can count on one hand the number of times I've tripped it (though I very seldom run up against Google content).
A clever scripter will have a random amount of delay built in, however. If they are patient, there is effectively nothing you can do to stop them. Perhaps set an upper limit per IP? You risk alienating your biggest users.
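For illustration, a rough interval check along the lines described above; the thresholds are arbitrary, and the per-IP timestamp list is assumed to be collected elsewhere (for example by the sketch under the question).

    from statistics import mean, pstdev


    def looks_mechanical(timestamps):
        """True if the gaps between requests are very short or suspiciously regular."""
        if len(timestamps) < 5:
            return False
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        avg = mean(gaps)
        if avg < 1.0:                  # several pages per second
            return True
        if pstdev(gaps) < 0.05 * avg:  # near-constant interval, e.g. exactly every 10 s
            return True
        return False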
Some people try blocking unknown HTTP_USER_AGENTs, but that's a waste of time: it'll only stop the same people who would respect a robots.txt file.
You didn't specify a server technology so this answer may not apply, but can't you simply move the SSI pages to a directory that the identity the web server runs as can access, but the anonymous ID cannot?
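If you go that route on IIS, the NTFS side can be as simple as denying read access to the anonymous account. This is only a sketch: the path is made up, and IUSR is assumed to be the configured anonymous identity; anonymous requests for files that account cannot read are refused with a 401.3.

    rem Deny the anonymous IIS account read access to the protected directory
    icacls "C:\inetpub\wwwroot\protected" /deny "IUSR:(OI)(CI)R"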