Hide a site entirely from search engines (Google, Bing etc.)

@Kaufman445

Posted in: #Apache #RobotsTxt #SearchEngines #WebCrawlers

My company is running a few internal websites that we do not want indexed by search engines such as Google, Bing etc.

However, the websites still need to be accessible for our customers, and therefore, I do not wish to use HTTP password protection.

Obviously, I already have a robots.txt containing:

User-agent: *
Disallow: /


When I search for the domain name, it still shows up, and Google says: "A description for this result is not available because of this site's robots.txt", while Bing says "We would like to show you a description here but the site won’t allow us.".

How can I ensure that the websites are totally hidden in the search results?


3 Comments


 

@Cugini213

Any method that relies on the crawler's good behaviour may fail, so the best option is to use the strongest authority available: the web server itself. If you have access to the main web server configuration, or at least to an .htaccess file, you should use a method that enforces access at that level.

The most robust way is HTTP password protection, but if you really don't want to use that, you still have another option.

If you know the IPs of your clients, you can restrict access in your .htaccess with a simple access-control block like this (Apache 2.2 syntax):

Order deny,allow
Deny from all
Allow from x.x.x.x
Allow from y.y.y.y


An IP can also be given in the form x.x.x instead of x.x.x.x, which allows the entire block whose last octet is missing (for example, Allow from 192.168.1 permits every address in 192.168.1.0/24).
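If the server runs Apache 2.4 or later, the Order/Deny/Allow directives are deprecated in favour of mod_authz_core. A minimal equivalent sketch, with placeholder addresses standing in for your clients' real IPs, would be:

```apache
# Apache 2.4+ equivalent of the Deny/Allow block above (mod_authz_core).
# 203.0.113.10 and 198.51.100.0/24 are placeholder addresses/ranges.
<RequireAny>
    Require ip 203.0.113.10
    Require ip 198.51.100.0/24
</RequireAny>
```

Requests from any other address receive a 403 Forbidden, which matches the behaviour of the Deny from all block.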

You can combine that with the HTTP response code. A 403 Forbidden tells the bot not to come back; crawlers usually retry a few times, just in case, but it should take effect quickly when combined with the deny directive.

You can use the HTTP response code even if you don't know your clients' IPs.

Another option is to redirect the request to the home page with, for instance, a 301 status code, although I wouldn't recommend this method: even though it works, you are not telling the truth about the resource and what happened to it, so it's not a precise approach.

Update considering your comment

You can use a list of crawler user-agent strings to block them in your .htaccess; this simple syntax would do what you want:

RewriteEngine On

# Match common crawler user agents (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|yahoo|yandex) [NC]
# Return 403 Forbidden; [F] is the idiomatic flag for this
RewriteRule .* - [F,L]


Just add the most common ones, or the ones that have actually visited your site.
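To see which user agents the pattern would catch, you can reproduce the same case-insensitive alternation locally with grep before deploying it. This is a sketch using a hypothetical helper function, not part of the Apache configuration itself:

```shell
# Hypothetical helper: returns success when a User-Agent string matches
# the same case-insensitive pattern used in the RewriteCond above.
ua_blocked() {
  echo "$1" | grep -qiE '(googlebot|bingbot|yahoo|yandex)'
}

ua_blocked "Mozilla/5.0 (compatible; Googlebot/2.1)" && echo "blocked"
ua_blocked "Mozilla/5.0 (Windows NT 10.0) Firefox/115.0" || echo "allowed"
```

Running it prints "blocked" for the Googlebot string and "allowed" for the ordinary browser string.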



 

@Alves908

This happens when Google or Bing discovers your site, for example through a link or redirect pointing to it, while robots.txt blocks the search engine from crawling it. Blocking crawling is not the same as telling a search engine not to index the site.

Put <meta name="robots" content="noindex"> in the head of all of your HTML pages (preferable) or at least the home page, and search engines should remove your site from the index in time. Note that a crawler has to be able to fetch a page in order to see this tag, so the robots.txt Disallow rule would need to be lifted for it to take effect. It normally takes 30-60 days for Google, but it may take longer; it all depends on how quickly the search engine revisits your site and on processing within the search engine. It can also take less than 30 days. I just wanted to warn you that it may take some time.
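As a minimal sketch, the tag goes inside the document head of each page:

```html
<!DOCTYPE html>
<html>
<head>
  <title>Internal site</title>
  <!-- Ask compliant crawlers not to include this page in their index -->
  <meta name="robots" content="noindex">
</head>
<body>
  <!-- page content -->
</body>
</html>
```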

For now, there is no harm except that others may discover your site. If you want to limit visitation, then perhaps another mechanism is needed. I understand wanting to keep it open and not require an account. As of right now, I am not sure I have advice on limiting visitation. But also understand that rogue spiders will also discover your site and may create links regardless of your wishes. Think about how you may control access if and when this happens - and if control is important to you.



 

@Pope3001725

Use Header set X-Robots-Tag "noindex". This header prevents pages from being included in a search engine's index, and unlike the meta tag it also works for non-HTML resources such as PDFs and images. As with the meta tag, a crawler can only see it if robots.txt allows the resource to be fetched.

In Apache you could put this in your conf file or .htaccess file in your root directory:

Header set X-Robots-Tag "noindex"
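If you are not certain mod_headers is enabled on the server, a guarded sketch keeps Apache from failing to start when the module is absent:

```apache
# Only set the header when mod_headers is loaded, so the configuration
# stays valid even if the module is not enabled.
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex"
</IfModule>
```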


