Control over the Internet Archive besides just "Disallow /"?

Are there any mechanisms to control what the Internet Archive archives on a site? I know to disallow all pages I could add:

User-agent: ia_archiver
Disallow: /

1. Can I tell the bot that I want it to crawl my site once a month, or once a year?
2. Some of my pages don't get archived correctly because their assets aren't picked up. Is there a way to tell the Internet Archive bot which assets it needs when it grabs the site?


2 Comments


@Cofer257

Several search engines support the nonstandard "Crawl-delay" directive (Google ignores it), but I don't know if IA does. You could try it though:

User-agent: ia_archiver
Crawl-delay: 3600

This would impose a minimum delay of 3600 seconds (i.e. 1 hour) between requests, which works out to at most ~720 requests in a 30-day month.
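If you want to sanity-check the arithmetic, here is a minimal sketch using Python's standard `urllib.robotparser`. It only shows how a parser that honors Crawl-delay would read the rule; it says nothing about whether the IA bot actually respects it. The helper name `max_requests_per_month` and the 30-day month are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule from the answer above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Crawl-delay: 3600
"""


def max_requests_per_month(robots_txt, user_agent, days=30):
    """Upper bound on requests in `days` days for a crawler honoring Crawl-delay.

    Hypothetical helper: returns None when no Crawl-delay applies to the agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # seconds, or None if not set
    if delay is None:
        return None
    seconds_in_period = days * 24 * 60 * 60
    return seconds_in_period // delay


print(max_requests_per_month(ROBOTS_TXT, "ia_archiver"))
```

With a one-hour delay that is 24 requests per day, so a crawler that obeyed the rule could still fetch several hundred pages a month; the directive throttles the rate rather than scheduling a monthly or yearly visit.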

I don't think #2 is possible: the IA bot grabs assets as and when it sees fit. It may have a file size limit to avoid using too much storage.


@Angela700

Note: This answer is increasingly out-of-date.

The largest contributor to the Internet Archive's web collection has been Alexa Internet. Material that Alexa crawls for its own purposes has been donated to IA a few months later. Adding the disallow rule mentioned in the question does not affect those crawls, but the Wayback Machine will 'retroactively' honor it: access is denied, but the material is still in the archive. If you really want to keep your material out of the Internet Archive, you should exclude Alexa's robot.

There may be ways to affect Alexa's crawls, but I'm not familiar with them.

Since IA developed its own crawler (Heritrix), they have started doing their own crawls, but those tend to be targeted crawls (they do election crawls for the Library of Congress and have done national crawls for France, Australia, etc.). They do not engage in the kind of sustained, world-scale crawls that Google and Alexa conduct. IA's largest crawl was a special project to crawl 2 billion pages.

As these crawls run on schedules that derive from project-specific factors, you cannot affect how often, or whether, they visit your site.

The only way to directly affect how and when IA crawls your site is to use their Archive-It service. That service allows you to specify custom crawls. The resulting data will (eventually) be incorporated into IA's web collection. This is, however, a paid subscription service.
