
Deny access to Archive.is

@Lee4591628

Posted in: #Noarchive #WebCrawlers

I would like to deny archive.is from having access to my website. (I do not want this website to cache mine without my consent).

Do you know if it is possible?


4 Comments


@Jennifer507

Also consider contacting the registry at isnic.is, Iceland's domain registry.
isnic at isnic dot is

Iceland has copyright law, and the Registry recognizes it.
The Registry has existed since the late 1980s, and is not under ICANN.


@Nimeshi995

To block the disgusting stealing practices of archive.is (ignoring robots.txt, overriding link canonical, fake user agent, no way to perform a site-wide removal), I want to add the following to the solutions above.

Find their IP-addresses

To find their IP addresses, submit a URL to them that is under your control, then watch your web server logs to see who requested that URL. The URL doesn't even have to exist, as long as the web server receives the request. (So it is better to use a non-existent URL.) For example, use a URL like: example.com/fuck-you-archive.is
Then check your logs to see who accessed the URL. You can use grep to check it:

grep "fuck-you-archive.is" web-server-log.txt


Once you have the IP address, you can block it using the solutions from the other answers. Then repeat the process to find other IP addresses they use. You need to submit a different URL each time to make them perform an HTTP request again; for example, simply change example.com/fuck-you-archive.is to example.com/fuck-you-archive.is?2 etc.
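The log check above can also be scripted once you are repeating it for several probe URLs. The following is only a minimal sketch: it assumes the access log uses the common/combined log format, where the client IP is the first whitespace-separated field, and the function name probe_ips, the marker "probe-page" and the sample log lines are all made up for illustration.

```python
from collections import Counter


def probe_ips(log_lines, marker):
    """Count client IPs of requests whose log line contains `marker`.

    Assumes the client address is the first whitespace-separated field,
    as in the common/combined log format.
    """
    hits = Counter()
    for line in log_lines:
        if marker in line:
            fields = line.split()
            if fields:
                hits[fields[0]] += 1
    return hits


if __name__ == "__main__":
    # Hypothetical log lines; read your real access log file instead.
    sample = [
        '203.0.113.7 - - [01/Jan/2016:10:00:00 +0000] "GET /probe-page?2 HTTP/1.1" 404 0',
        '198.51.100.9 - - [01/Jan/2016:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 512',
    ]
    for ip, count in probe_ips(sample, "probe-page").items():
        print(ip, count)
```

If your server sits behind a proxy or CDN, the first field may be the proxy's address rather than the real client, so check which log field actually holds the visitor IP.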

If you don't want to expose your website at all while finding their IP addresses, you can use this handy HTTP request inspection website: requestb.in. The steps are: create a RequestBin > submit the "BinURL" to Archive.is with "?SomeRandomNumber" appended > use the "?inspect" page of RequestBin to monitor the incoming request from Archive.is and read their IP address from the "Cf-Connecting-Ip" HTTP header. (Make sure you don't submit the "?inspect" URL to Archive.is.) Then repeat with a different "?SomeRandomNumber" to find other IP addresses.

Block their ip addresses

Note that with IP-tables you can block using

/sbin/iptables -A INPUT -s 78.108.190.21 -j DROP


but often the 'INPUT' chain has a 'DROP' policy together with a rule that accepts HTTP traffic. Because iptables acts on the first matching rule, a DROP rule appended after that ACCEPT rule is never reached for HTTP requests. In that case, insert (prepend) the rule instead of appending it:

/sbin/iptables -I INPUT -s 78.108.190.21 -j DROP
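To see why the position of the rule matters, here is a deliberately simplified model of first-match rule evaluation. It is not how iptables is implemented, just an illustration; the function names and the packet dictionary are hypothetical.

```python
BLOCKED = "78.108.190.21"


def accept_http(packet):
    """Stand-in for an existing rule that accepts web traffic."""
    return packet["dport"] in (80, 443)


def from_blocked(packet):
    """Stand-in for the new rule matching the unwanted source address."""
    return packet["src"] == BLOCKED


def evaluate(chain, packet, policy="DROP"):
    """Return the target of the first matching rule, else the chain policy."""
    for matches, target in chain:
        if matches(packet):
            return target
    return policy


packet = {"src": BLOCKED, "dport": 80}
existing = [(accept_http, "ACCEPT")]
appended = existing + [(from_blocked, "DROP")]  # like iptables -A
inserted = [(from_blocked, "DROP")] + existing  # like iptables -I

print(evaluate(appended, packet))  # ACCEPT: the HTTP rule matched first
print(evaluate(inserted, packet))  # DROP: the block rule is checked first
```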


However, they have a lot of IP addresses, so it may be easier to block complete IP ranges. You can do this conveniently with iptables (without having to work out subnet masks) using:

iptables -I INPUT -m iprange --src-range 46.166.139.110-46.166.139.180 -j DROP


This range (46.166.139.110-46.166.139.180) appears to be largely theirs: I have seen multiple addresses between 46.166.139.110 and 46.166.139.173.
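The iprange match only helps on iptables. Nginx's deny directive, for example, takes single addresses or CIDR blocks, so the same range would have to be split up first. A minimal sketch using Python's standard ipaddress module (the helper name range_to_cidrs is my own):

```python
import ipaddress


def range_to_cidrs(first, last):
    """Split an inclusive IP range into the smallest list of CIDR blocks."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return list(ipaddress.summarize_address_range(start, end))


if __name__ == "__main__":
    # Print one nginx-style deny line per CIDR block in the range above.
    for net in range_to_cidrs("46.166.139.110", "46.166.139.180"):
        print(f"deny {net};")
```

The blocks cover exactly the 71 addresses in the range and nothing outside it.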

Send an abuse complaint to their web host

They are currently using NFOrce as web host. See www.nforce.com/abuse for how to file a complaint about Archive.is.
Mention: 1) the URL of your webpage that archive.is has copied, 2) the URL at archive.is that contains the copied content, and 3) the IP addresses that they used.

You may also want to complain to Cloudflare, their CDN, which caches their copied pages and images for performance reasons. www.cloudflare.com/abuse/


@Shanna517

robots.txt

Archive.is does not use a bot that autonomously crawls pages (e.g., by following hyperlinks), so robots.txt does not apply, because it’s always a user that gives the command to archive a certain page.

For the same reason, services like Google’s Feedfetcher (Why isn't Feedfetcher obeying my robots.txt file?) and W3C’s Validator (details) don’t obey robots.txt.

See the archive.is FAQ: Why does archive.is not obey robots.txt?

meta-robots / X-Robots-Tag

I’m not sure if archive.is should (ideally) honor the noindex or noarchive value in meta-robots/X-Robots-Tag, or if these technologies also apply to autonomous bots only. But as archive.is doesn’t document it, they don’t seem to support it currently.

(FWIW, each archived page seems to get a <meta name="robots" content="index,noarchive"/>.)

User-Agent

archive.is doesn’t document that a certain User-Agent is used (they probably don’t identify themselves, to get the pages as if viewed by a usual browser), so you can’t use it to block their access on the server-level.

Blocking their IP addresses

So as neither robots.txt nor meta-robots / X-Robots-Tag work here, and you can’t block them via their User-Agent, you would have to block accesses from archive.is IPs. See closetnoc’s answer about IP blocking, but note that this might block more than intended, and you might never catch all of their IPs (and/or keep up to date).

Side note: Report function

Each archived version links to a form where you can report possible abuse (append /abuse), e.g., with the reasons "SEO Issue" or "Copyright". But I don’t know if or how they handle these cases.


@Sherry384

Okay. This is a new one (to me at least) and quite interesting so far. I will not get into the weeds on this.

When I wrote this, I was working on little or no sleep. I missed a few things which @unor has kindly pointed out and so I must temper my answer and give credit where credit is due. Thank you @unor !

Archive.is is registered to Denis Petrov who is using a Google webhost account on IP address 104.196.7.222 [AS15169 GOOGLE - Google Inc.] according to Domain Tools though I have it on 46.17.100.191 [AS57043 HOSTKEY-AS HOSTKEY B.V.]. It is likely that the host company has recently changed.

Archive.today is also owned by Denis Petrov and is similar to Archive.is if not identical. For the purpose of this answer, I will address Archive.is and you can assume that it applies to Archive.today. Archive.today does exist on another IP address 78.108.190.21 [AS62160 GM-AS Yes Networks Unlimited Ltd]. Please understand that Denis Petrov owns 70 domains. Without digging deeper, it is possible that there are more sites to be concerned about. I will provide blocking code for all three IP addresses.

Archive.is is user directed. It is assumed that you are archiving your own page. Other than this scenario, Archive.is can be considered as a content scraper spam site.

Archive.is is walking a dangerous line. It is using other sites' content through single-page scraping. Ultimately, the original content's search potential is at least diluted and potentially usurped altogether. Worse yet, the original site is not cited as the originator of the content. Archive.is uses a canonical tag, but it points to its own site/page.

Example: <link rel="canonical" href="http://archive.is/Eo267"/>

Couple this with the lack of controls over who is submitting a site and whether they have the right to it, the lack of clear take-down information, and the somewhat fuzzy and potentially weak contact mechanism, and Archive.is has the potential for real trouble.

You can find out more IP address information here: www.robtex.com/#!dns=archive.is
How to block by IP address 78.108.190.21.

Using Cisco Firewall.

access-list block-78-108-190-21-32 deny ip 78.108.190.21 0.0.0.0 any
permit ip any any


**Note: You can replace the [provided acl name] with the ACL name of your choice.

Using Nginx.

Edit nginx.conf and insert include blockips.conf; if it does not exist. Edit blockips.conf and add the following:

deny 78.108.190.21/32;


Using Linux IPTables Firewall.
**Note: Use with caution.

/sbin/iptables -A INPUT -s 78.108.190.21/32 -j DROP


Using Microsoft IIS Web Server

<rule name="abort ip address block 78.108.190.21/32" stopProcessing="true">
<match url=".*" />
<conditions>
<add input="{REMOTE_ADDR}" pattern="^78\.108\.190\.21$" />
</conditions>
<action type="AbortRequest" />
</rule>


Using Apache .htaccess.

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^78\.108\.190\.21$
RewriteRule .* - [F,L]
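A note on the patterns used in the IIS and Apache rules: in a regular expression an unescaped dot matches any character, so address patterns are normally written with escaped dots. In practice REMOTE_ADDR is always a well-formed address, so a loose pattern is unlikely to misfire, but escaping is the safer habit. The sketch below only illustrates the difference; exact_ip_pattern is a hypothetical helper, not part of any server.

```python
import re


def exact_ip_pattern(ip):
    """Build an anchored pattern that matches only the literal address."""
    return re.compile("^" + re.escape(ip) + "$")


loose = re.compile(r"^78.108.190.21$")  # dots left unescaped
strict = exact_ip_pattern("78.108.190.21")

print(bool(loose.match("78.108.190.21")))   # True
print(bool(loose.match("78a108b190c21")))   # True: each dot matched a letter
print(bool(strict.match("78a108b190c21")))  # False
print(strict.pattern)
```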


How to block by IP address 46.17.100.191.

Using Cisco Firewall.

access-list block-46-17-100-191-32 deny ip 46.17.100.191 0.0.0.0 any
permit ip any any


**Note: You can replace the [provided acl name] with the ACL name of your choice.

Using Nginx.

Edit nginx.conf and insert include blockips.conf; if it does not exist. Edit blockips.conf and add the following:

deny 46.17.100.191/32;


Using Linux IPTables Firewall.
**Note: Use with caution.

/sbin/iptables -A INPUT -s 46.17.100.191/32 -j DROP


Using Microsoft IIS Web Server

<rule name="abort ip address block 46.17.100.191/32" stopProcessing="true">
<match url=".*" />
<conditions>
<add input="{REMOTE_ADDR}" pattern="^46\.17\.100\.191$" />
</conditions>
<action type="AbortRequest" />
</rule>


Using Apache .htaccess.

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^46\.17\.100\.191$
RewriteRule .* - [F,L]


How to block by IP address 104.196.7.222.

Using Cisco Firewall.

access-list block-104-196-7-222-32 deny ip 104.196.7.222 0.0.0.0 any
permit ip any any


**Note: You can replace the [provided acl name] with the ACL name of your choice.

Using Nginx.

Edit nginx.conf and insert include blockips.conf; if it does not exist. Edit blockips.conf and add the following:

deny 104.196.7.222/32;


Using Linux IPTables Firewall.
**Note: Use with caution.

/sbin/iptables -A INPUT -s 104.196.7.222/32 -j DROP


Using Microsoft IIS Web Server

<rule name="abort ip address block 104.196.7.222/32" stopProcessing="true">
<match url=".*" />
<conditions>
<add input="{REMOTE_ADDR}" pattern="^104\.196\.7\.222$" />
</conditions>
<action type="AbortRequest" />
</rule>


Using Apache .htaccess.

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^104\.196\.7\.222$
RewriteRule .* - [F,L]


You may need to block more than one of these IP addresses, whichever set of code you use. It is not clear which of them archive.is actually uses to fetch pages.
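Since the rules are identical apart from the address, they can be generated rather than copied by hand. This is only a sketch: the helper names are hypothetical, the address list is the three addresses discussed above, and the iptables line uses -I (insert) rather than -A, as suggested in another answer in this thread.

```python
import ipaddress

# The three addresses discussed above; extend the list as you find more.
ADDRESSES = ["78.108.190.21", "46.17.100.191", "104.196.7.222"]


def nginx_rules(addresses):
    """Return one nginx deny line per address (raises ValueError if invalid)."""
    return [f"deny {ipaddress.ip_address(a)};" for a in addresses]


def iptables_rules(addresses):
    """Return one iptables insert command per address."""
    return [
        f"/sbin/iptables -I INPUT -s {ipaddress.ip_address(a)} -j DROP"
        for a in addresses
    ]


if __name__ == "__main__":
    print("\n".join(nginx_rules(ADDRESSES)))
    print("\n".join(iptables_rules(ADDRESSES)))
```

The script only prints the lines; review them before pasting them into blockips.conf or running them as root.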
