
Is there a way to make Alexa's ia_archiver slow down its crawling of my website?

@Angela700

Posted in: #Cache #InternetArchive #RobotsTxt #WebCrawlers

Alexa's ia_archiver bot is the main contributor to the Internet Archive's "Wayback Machine" web collection, and there are advantages to having my website included in that collection. There are other robots that do other useful things too.

What's a quick and easy way to make ia_archiver crawl my website more slowly, in order to put less load on the server? I haven't tested the Crawl-delay directive: if you have, and it works, please tell me so. If it doesn't work, please leave a comment. If you've never tested Crawl-delay, then instead please recommend some other solution that takes fifteen minutes or less to implement. Maybe there exists some easy-to-implement software-based solution which will let me throttle too-rapid hits from ia_archiver?

Please assume my website is running on Apache 2.4.3 on Debian Linux 6.0.6 on a dedicated server which I administer.


@Berumen354

Crawling for the Internet Archive is done both by Alexa and by the Internet Archive's own crawlers. Support for the Crawl-delay directive in robots.txt is fairly haphazard between the two, because the directive is not part of the official robots.txt standard. In addition, in my experience the way each of them treats Crawl-delay, when they accept it at all, seems to change over time. I have tried this in the past: sometimes both respected the directive, sometimes only one did, and sometimes neither did, with no apparent pattern to when it is respected. The only thing I can suggest that will definitely work is to add a Disallow directive for both the Alexa crawler and the ia_crawler.
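
As a rough sketch of the two robots.txt options discussed above, you can sanity-check the rules locally with Python's standard urllib.robotparser before deploying them. The ia_archiver token is the user agent named in the question; the 10-second delay, the /index.html path and the SomeOtherBot name are placeholder assumptions of mine. This only confirms that the file parses the way you intend. It says nothing about whether either crawler will actually honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking ia_archiver to wait 10 seconds between hits.
# Crawl-delay is non-standard, so a crawler is free to ignore it.
THROTTLE_RULES = """\
User-agent: ia_archiver
Crawl-delay: 10

User-agent: *
Disallow:
"""

# Hypothetical robots.txt that blocks ia_archiver entirely but leaves
# every other robot unrestricted.
BLOCK_RULES = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


throttle = load_rules(THROTTLE_RULES)
block = load_rules(BLOCK_RULES)

# Delay requested from ia_archiver, and whether it may still fetch pages.
print(throttle.crawl_delay("ia_archiver"))
print(throttle.can_fetch("ia_archiver", "/index.html"))

# With the blocking rules, ia_archiver is shut out but other robots are not.
print(block.can_fetch("ia_archiver", "/index.html"))
print(block.can_fetch("SomeOtherBot", "/index.html"))
```

If you go the Disallow route, the second check matters most: it confirms the block applies only to the named user agent and does not lock out the other robots you want to keep.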
