
I don't want my site to be analyzed on WooRank or builtwith.com

@Ravi8258870

Posted in: #RobotsTxt

I don't want my site to be analyzed on WooRank or builtwith.com.

Is there any way I can do that by editing the robots.txt file or any other possible way?


2 Comments


@Becky754

There are literally thousands of sites similar to these. Most of them are scraper sites that build their pages from Alexa data and possibly other sources, but some fetch pages from your site specifically. Woorank.com and builtwith.com are two of the many that access your server directly.

I did a cursory search and found no indication that either site respects robots.txt, so here is another approach.

You can block both of these easily in your .htaccess file. I am assuming Apache. I do not know how to block using IIS or other web servers that do not use .htaccess.
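
# Enable mod_rewrite (only needed once per .htaccess file)
RewriteEngine On
# woorank.com: 103.21.244.0 - 103.21.247.255
RewriteCond %{REMOTE_ADDR} ^103\.21\.24[4-7]\.
RewriteRule .* - [F,L]
# builtwith.com: 5.39.0.0 - 5.39.127.255 (third octet 0-127)
RewriteCond %{REMOTE_ADDR} ^5\.39\.([0-9]|[1-9][0-9]|1[01][0-9]|12[0-7])\.
RewriteRule .* - [F,L]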


These .htaccess rules block two IP address ranges: 103.21.244.0 - 103.21.247.255 (AS13335, CloudFlare) and 5.39.0.0 - 5.39.127.255 (AS16276, OVH Systems). Both ranges belong to hosting companies rather than subscriber lines, so you would not be blocking ordinary users.
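
As a sanity check on the two ranges quoted above, the following sketch uses Python's standard `ipaddress` module to collapse each range into CIDR notation and to test whether a given address falls inside it. The helper name `is_blocked` and the sample addresses are illustrative assumptions, not something the sites or Apache provide.

```python
import ipaddress

# Ranges quoted in the answer above (first address, last address).
RANGES = {
    "woorank": ("103.21.244.0", "103.21.247.255"),
    "builtwith": ("5.39.0.0", "5.39.127.255"),
}


def to_networks(first, last):
    """Collapse an inclusive address range into the CIDR blocks covering it."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))


NETWORKS = {name: to_networks(*bounds) for name, bounds in RANGES.items()}


def is_blocked(addr):
    """Hypothetical helper: True if addr lies inside any range above."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for nets in NETWORKS.values() for net in nets)


if __name__ == "__main__":
    for name, nets in NETWORKS.items():
        print(name, [str(n) for n in nets])
    print(is_blocked("103.21.245.10"))  # sample address inside the first range
    print(is_blocked("5.39.128.1"))     # sample address just past the second range
```

The two ranges collapse to 103.21.244.0/22 and 5.39.0.0/17, which can be handy if you prefer to block by CIDR block in a firewall or server configuration instead of by regular expression.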

If your site is already listed, it is generally too late; it will likely remain listed. However, I have noticed that some of these sites drop an entry after a period of time once they determine that the server is inaccessible. Please note that these sites exist for monetization and may not update a page unless a user requests it, if at all. It can therefore take years for a site to try to update its data, so it may not know that your site is unavailable. Even then, it may not care as long as it is getting search traffic.

@Goswami781

Robots.txt is a technical courtesy. It is not a legal standard, and search engines and other indexers are under no legal obligation to follow it.

Yes, big search engines like Google are designed to follow the standard; that's why you get "a description for this page is not available because of this site's robots.txt".

However, many sites don't follow it. In the worst case, malicious crawlers may treat robots.txt as a list of pages to crawl rather than pages to ignore.

So if you think these two sites are likely to follow your robots.txt, go for it. If not, you're going to need to record thousands of IPs for the two sites and specifically block them with htaccess or similar.
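
To illustrate that robots.txt only has an effect when the crawler chooses to consult it, here is a minimal sketch using Python's standard `urllib.robotparser` to evaluate a rule the way a well-behaved crawler would. The user-agent name `ExampleBot` and the URL are placeholders; the actual user-agent strings used by WooRank or BuiltWith are not given in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleBot" is a placeholder user-agent,
# not the real name of any crawler mentioned in this thread.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent, url):
    """Return what a compliant crawler would conclude for this URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("ExampleBot", "https://example.com/page"))
    print(allowed("OtherBot", "https://example.com/page"))
```

The check runs entirely on the crawler's side. A crawler that never performs it is unaffected by the file, which is why server-side blocking is the fallback.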
