
How to properly (dis)allow the archive.org bot? Did things change, if so when?

@Nickens628

Posted in: #InternetArchive #RobotsTxt #WebCrawlers

I have a website that I mostly don't want to be indexed by search engines, but I do want to preserve it for eternity on archive.org. So my robots.txt starts with this:

User-agent: *
Disallow: /


Today, according to archive.org I have to add the following in my robots.txt to allow their bots:

User-agent: ia_archiver
Disallow:


But I had already done what they indicated a couple of years ago; at least, I added the following:

User-agent: archive.org_bot
Disallow:


Then there's another source claiming that you have to add the two records above, plus another one:

User-agent: ia_archiver-web.archive.org
Disallow:


Note that you need to put Disallow: / if you don't want the bot to archive your site.

Has there been a change with the IA bot? If so, when?

What is the recommended way? Should I just allow all three for now and hope that IA will not change their bot name again in the future?


4 Comments


 

@Tiffany637

The robots.txt ia_archiver Disallow entry (with the "/") should be fine for the need you describe (to "preserve for eternity", but not yet publicly).

I just did a quick test, commenting out the ia_archiver Disallow entry for a site that had had it for at least the past 10 years. Then I looked the site up on archive.org/web, and it showed captures it had collected in 2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016 and 2017! This means that Archive.org never strictly honored what others thought to be a "do not archive" statement during these years; it was merely not exposing the archived copies.



 

@Margaret670

Update 2017

The Archive bot no longer honors your robots.txt.

If you really want to block it, send them an email according to this page, or block their IP addresses via .htaccess.



 

@Fox8124981

There are really 2 issues here:


1. Will the robots.txt on your site Disallow (block) Wayback from crawling your site?
2. Will Wayback crawl your site?


For point #1 :
As others have said, the correct entry for robots.txt is:

User-agent: ia_archiver
Disallow:


Keep in mind that it might take a while (perhaps a good long while), for Wayback to notice any changes you have made to robots.txt.
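As a local sanity check before waiting on Wayback, the record above can be run through a robots.txt parser. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the `ROBOTS_TXT` contents (the catch-all block from the question plus the `ia_archiver` record) and the helper name `is_allowed` are assumptions made for illustration. It only shows how Python's parser reads the file, which is not necessarily how the Archive's crawler interprets it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block every crawler, then exempt ia_archiver.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if Python's parser lets user_agent fetch path."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


for agent in ("ia_archiver", "Googlebot"):
    print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent))
```

With this file, the parser should report `ia_archiver` as allowed and any agent without its own record (here `Googlebot`) as blocked, because an empty `Disallow:` value means "nothing is disallowed" for that record.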

To check if the robots.txt on your site will allow Wayback to crawl your site:


Go to this URL: archive.org/web/. In the box at the top of the page, enter the URL of a page on your site and click the "Browse History" button.
Or, in the box under "Save Page Now" (currently near the bottom on the right), enter the URL of a page on your site and click the "Save Page" button.


At this point, you should see 1 of 3 things:


You will see an error message indicating that Wayback can't access pages on that site due to "robots.txt".
You will see the "calendar" of historical save points for the page on your site. In this case, you know that Wayback is NOT blocked from crawling your site.
Or, you will see a message indicating that Wayback doesn't have an archive of that page, and an offer to click a link to add the page to Wayback. In this case also, you know that Wayback is NOT blocked from crawling your site.
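If you repeat this check often, the two lookups can be opened directly by URL instead of going through the form. The sketch below builds both links in Python; the helper name `wayback_urls` is hypothetical, and the `/web/*/` (capture calendar) and `/save/` (Save Page Now) URL patterns are assumptions based on how the Wayback Machine's public site is laid out, not something stated in this thread. It only builds links to open in a browser; it does not itself tell you whether robots.txt is blocking the site.

```python
def wayback_urls(page_url: str) -> dict[str, str]:
    """Build Wayback Machine links for a page (URL patterns are assumed)."""
    base = "https://web.archive.org"
    return {
        # Calendar of existing captures, like the "Browse History" button.
        "history": f"{base}/web/*/{page_url}",
        # Request a fresh capture, like the "Save Page" button.
        "save": f"{base}/save/{page_url}",
    }


links = wayback_urls("http://example.com/")
print(links["history"])
print(links["save"])
```

The page URL is appended as-is rather than percent-encoded, since that is the form the Wayback Machine shows in its own links.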




Now, for point #2 :

Will Wayback crawl your site?

Just because you allow Wayback to crawl your site doesn't mean that they will (ever) crawl it.

According to the Wayback FAQ (emphasis added):


How can I get my site included in the Wayback Machine?

Much of our archived web data comes from our own crawls or from Alexa Internet's crawls. Neither organization has a "crawl my site now!" submission process. Internet Archive's crawls tend to find sites that are well linked from other sites. The best way to ensure that we find your web site is to make sure it is included in online directories and that similar/related sites link to you.

Alexa Internet uses its own methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar and visit the site you want crawled to make sure they know about it.

Regardless of who is crawling the site, you should ensure that your site's 'robots.txt' rules and in-page META robots directives do not tell crawlers to avoid your site.



Update: 09-May-2017

Others have left comments/answers indicating that Archive.org no longer honors robots.txt. Perhaps this is a "work-in-progress" and it will eventually be the case, but I have not seen this new behavior yet.

The case for this seems to come from this article: Robots.txt: ROBOTS.TXT IS A SUICIDE NOTE by archiveteam.org. While that page has little if anything good to say about "Robots.txt", it doesn't mention anywhere that Archive.org will no longer honor robots.txt.

Also of note: that article is hosted on archiveteam.org, which is most definitely not archive.org, and I'm not sure there is any (official) relationship between archive.org and archiveteam.org.

In fact, this page on archive.org about Archive Team, seems to declare a distinction between archive.org and archiveteam.org (emphasis added):


Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. ...


In any case, I decided to give this a try, and I found that, at least at this time, Archive.org STILL honors robots.txt:


I found a random item on eBay: Item #: 131795294232
Click to view the sold items:





The "Items sold" page opens: offer.ebay.com/ws/eBayISAPI.dll?ViewBidsLogin&item=131795294232 Copy the link to the clipboard.
Go to web.archive.org, and paste the link from eBay.
You will see that archive.org indicates that the "Page cannot be displayed due to robots.txt."




So, at this time, I remain unconvinced, but I would love to be proven wrong... it would be great if it were true.



 

@Ann8826881

Update: As @KevinFegan notes in the comments, their documentation has changed. The part below describes how the Internet Archive handled it in the past (at least in 2014).



Their FAQ "How can I have my site's pages excluded from the Wayback Machine?" refers to "Removing Documents From the Wayback Machine", which documents that their bot is called ia_archiver.

So this record should allow their bot to crawl your entire site:

User-agent: ia_archiver
Disallow:
