
How can I block who.is and archive.org from getting information about my website with htaccess?

@Hamaas447

Posted in: #Htaccess #SpamBlocker

who.is is a service that gives people the whois information of websites, and archive.org automatically saves copies of people's websites. How can I use htaccess to block them from accessing my website?


2 Comments


 

@Kevin317

Whois does not crawl your website to gather its information, so there is nothing to block. However, you can ask your registrar to make your registration details private.

For the Internet Archive, you can block its bot from accessing your site by adding this to .htaccess:

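RewriteEngine On
# Match the Internet Archive crawler by User-Agent, case-insensitively
RewriteCond %{HTTP_USER_AGENT} archive\.org_bot [NC]
# Refuse the request with 403 Forbidden
RewriteRule ^ - [F]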


This will return a 403. I have found the bot name here. The catch is that the rule matches on the User-Agent string alone, so it does nothing if the bot name is wrong or the bot sends a different name.

Blocking crawlers is a job better suited to robots.txt, like this:

User-agent: archive.org_bot
Disallow: /


However, the Internet Archive says that robots.txt files are meant for search engines, so it may well ignore yours.


A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.


Source: blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
Your best bet would be to send them an email to ask them not to list your site.



 

@Chiappetta492

You can block archive.org from crawling your site in your robots.txt file with

User-agent: ia_archiver
Disallow: /
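As a quick sanity check, the robots.txt rules suggested in this thread can be run through Python's standard `urllib.robotparser`. This is only a sketch of how a well-behaved crawler would read the file; it says nothing about whether the Internet Archive actually honours it. The helper name `allowed` and the combined file contents are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Both user-agent names mentioned in this thread, combined into one file
# (combining them is an assumption made for this example).
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /
"""


def allowed(user_agent: str, path: str = "/") -> bool:
    """Return True if a crawler that obeys robots.txt may fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# Googlebot is included only as an example of a crawler no rule names.
for agent in ("archive.org_bot", "ia_archiver", "Googlebot"):
    print(agent, "allowed" if allowed(agent) else "disallowed")
```

A crawler that no rule names stays allowed, so these rules affect only the two listed user agents.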


I believe putting this in your .htaccess file will also block archive.org from accessing your site:

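# Flag requests whose User-Agent starts with "ia_archiver" (case-insensitive)
SetEnvIfNoCase User-Agent "^ia_archiver" bad_bot

# Apache 2.2-style access control; Apache 2.4 needs mod_access_compat for these directives
<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>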
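The two .htaccess approaches in this thread match the User-Agent differently: one searches anywhere in the string, the other is anchored to the start with `^`. The sketch below uses Python's `re` module as a stand-in for Apache's regex engine to show the difference. The sample User-Agent strings are illustrative assumptions, not values taken from real logs, and `is_blocked` is a hypothetical helper.

```python
import re

# Patterns modelled on the two .htaccess snippets in this thread.
# [NC] and SetEnvIfNoCase both mean case-insensitive matching.
UNANCHORED = re.compile(r"archive\.org_bot", re.IGNORECASE)  # matches anywhere
ANCHORED = re.compile(r"^ia_archiver", re.IGNORECASE)        # must start the string


def is_blocked(user_agent: str) -> bool:
    """Return True if either pattern would flag this User-Agent."""
    return bool(UNANCHORED.search(user_agent) or ANCHORED.search(user_agent))


# Illustrative User-Agent strings (assumptions for this example).
samples = [
    "Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)",
    "ia_archiver",
    "Mozilla/5.0 (compatible; ia_archiver)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0",
]

for ua in samples:
    print("blocked" if is_blocked(ua) else "allowed", "-", ua)
```

The third sample is not flagged because `^ia_archiver` only matches when the User-Agent begins with that name. Dropping the `^` would catch it, at the cost of a broader match.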
