Dunderdale272

Control over the Internet Archive besides just "Disallow /"?

@Dunderdale272

Posted in: #Cache #InternetArchive

Are there any mechanisms to control what the Internet Archive archives on a site? I know to disallow all pages I could add:

User-agent: ia_archiver
Disallow: /



1. Can I tell the bot that I want it to crawl my site once a month, or once a year?
2. Some of my pages don't get archived correctly because their assets aren't picked up. Is there a way to tell the Internet Archive bot which assets it needs if it's going to grab the site?


2 Comments


@Cofer257

Many crawlers support the non-standard "Crawl-delay" directive (Google's does not), but I don't know if IA does. You could try it though:

User-agent: ia_archiver
Crawl-delay: 3600


This would ask the bot to wait at least 3600 seconds (i.e. 1 hour) between requests, which works out to at most about 720 requests in a 30-day month.
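To sanity-check the arithmetic, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `max_requests_per_month` and the 30-day month are my own assumptions for illustration, and it only shows how a compliant parser reads the rule, not whether the IA bot actually honors it.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the answer above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Crawl-delay: 3600
"""


def max_requests_per_month(robots_txt: str, agent: str, days: int = 30) -> Optional[int]:
    """Upper bound on requests per month implied by Crawl-delay (hypothetical helper).

    Returns None when no Crawl-delay applies to the given user agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)  # seconds, or None if not set for this agent
    if delay is None:
        return None
    seconds_per_month = days * 24 * 60 * 60
    return seconds_per_month // delay


print(max_requests_per_month(ROBOTS_TXT, "ia_archiver"))  # 720
print(max_requests_per_month(ROBOTS_TXT, "googlebot"))    # None
```

Note that the directive only spaces requests out; it can't express "visit once a month".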

I don't think #2 is possible - the IA bot grabs the assets as and when it sees fit. It may have a file size limit to avoid using too much storage.


@Angela700

Note: This answer is increasingly out-of-date.

The largest contributor to the Internet Archive's web collection has been Alexa Internet. Material that Alexa crawls for its own purposes is donated to IA a few months later. Adding the disallow rule mentioned in the question does not affect those crawls, but the Wayback Machine will honor it 'retroactively': access is denied, although the material will still be in the archive. If you really want to keep your material out of the Internet Archive, you should exclude Alexa's robot.

There may be ways to affect Alexa's crawls, but I'm not familiar with them.

Since IA developed its own crawler (Heritrix), it has started doing its own crawls, but those tend to be targeted crawls (election crawls for the Library of Congress, national crawls for France and Australia, etc.). IA does not engage in the kind of sustained, world-scale crawls that Google and Alexa conduct. IA's largest crawl was a special project to crawl 2 billion pages.

As these crawls run on schedules that derive from project-specific factors, you cannot affect how often they visit your site, or whether they visit it at all.

The only way to directly affect how and when IA crawls your site is to use their Archive-It service, which allows you to specify custom crawls. The resulting data will (eventually) be incorporated into IA's web collection. This is, however, a paid subscription service.
