Robots denied by domain is still listed in search results

@Moriarity557

Posted in: #RobotsTxt #Seo

So, on all of our sites that are not search-facing we've applied a robots.txt file (per How to exclude a website from real-time Google search results?, or any other similar question).

However, if search terms are specific enough, the domain itself can still be found in results. An example of this can be found here. As you can see from the link, the domain itself is listed (the content is not cached, but the domain appears). Additionally, performing a search with site:hyundaidigitalmarketing.com returns 3 results. Checking backlinks turns up a few as well, but I obviously cannot prevent them (linking is allowed in context) or control how they are handled (I can't tell the linking host to add nofollow or noindex).

Now, I know this is a severe edge case; however, my company's clients are doing exactly this. In fact, our domains carry enough weight that even seemingly arbitrary searches are turning up relevant results. Now I have to write up a report on how and why this is happening.

So, I turn to the wonderful Stack Exchange network to help me either understand what I am missing or understand what is happening. Links to industry articles are extremely helpful, but I'm grateful for anything you can give. I do intend to offer bounties as best I can to make this an answer to turn to in the future.

Edit: I've opened a bounty on this question in hopes of getting some more responses on it. I've also provided the results of my own research below.


4 Comments


@Gonzalez347

I think your basic issue is the backlinks to the site, as these give the search engines an entry point to the site and make them aware of it. So although they will not display a description for the site, they may show the URL if they think it's the best match for the query.

Have a read of this article, linked from the one @joe posted: Matt Cutts on keeping Google out

The key bit is:


There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link. Sometimes we could even pull a description from the Open Directory Project, so that we could give a lot of info to users even without fetching the page.


The research you have done also covers things quite well, and the answers by @john and @joe are both relevant. I have included a link below which gives some further guidance on blocking search engines. The only way I can think of to completely block the site would be to add some form of password protection in front of it, which must be completed before any content is displayed.
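For what it's worth, a minimal sketch of that password-protection idea using Apache Basic auth in an .htaccess file might look like this (the AuthName text and the .htpasswd path are illustrative; the password file would be created separately with htpasswd):

# Require a login before anything is served
AuthType Basic
AuthName "Restricted site"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user

Because the server then responds with 401 Unauthorized before any content is sent, there is nothing for a crawler to fetch or index.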

SEOMoz tips on not appearing in search


@Moriarity557

Based on my research into the subject, I've found that there isn't a 100% guaranteed way to prevent indexing and caching of data, but you can come pretty darn close (assuming you want to deal with increased bot traffic). Here's how I've interpreted the information.

One would think that the robots.txt file is used to define robots directives site-wide, while meta tags are used for page-specific details. I think the spirit behind the two is exactly this, but it is not the case in practice.

Don't create a robots.txt file

This works with all major search providers to prevent content from appearing on the SERP, but it does not prevent indexing. It also prevents bots from crawling your pages, so any robots meta tags (see below) are ignored. Because of this you cannot use the two together, and this is why, if you want to prevent indexing, you should not use a robots.txt file.

Side note: Google does support the use of Noindex: / in robots.txt, but it is undocumented (who knows when it will break) and it is unknown whether any other engine honors it.
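To illustrate, a robots.txt combining the standard directive with that undocumented extension might look like this (and, as noted, the Noindex line is unofficial, Google-only behavior that could stop working at any time):

User-agent: *
Disallow: /

# Undocumented, Google-only; may break without notice
Noindex: /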

Use HTTP headers or HTML META tags to prevent everything

Unlike the robots.txt file, the robots meta tag (and HTTP header) is widely supported and, surprisingly, feature rich. It is designed to be set on each page individually, but the recent adoption of the X-Robots-Tag header makes it easy to set site-wide. The only downside of this method is that bots will still crawl your site. This can be limited by using nofollow, but not all bots truly respect nofollow.
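For reference, the per-page form is a meta tag in the page's head, carrying the same values discussed below:

<meta name="robots" content="noindex,nofollow,noodp,noydir">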

I found a ton of information in this older blog post. It was originally published in 2007 but, because a lot of the information in it covers features added since then, it appears to be updated regularly.

In summary, you should send an HTTP header of X-Robots-Tag: noindex,nofollow,noodp,noydir (a minimal Apache sketch follows the list below). Here's the breakdown of why:


nofollow should limit the number of pages crawled on your site, keeping bot traffic down.
noindex tells engines not to index the page.
Now, you might assume that noindex alone would be enough. However, I've found that even if you set noindex, your site might still be indexed because of other sites linking to it. The best way to suppress the common directory-sourced listings is to also block descriptions from the Y! Directory (noydir) and Open Directory (noodp).
Using the HTTP header also applies the robots data to files, images, and other non-HTML files! YAY!
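Here is the minimal Apache sketch of the site-wide header mentioned above (assuming mod_headers is enabled; the directive can go in the main config, a vhost, or an .htaccess file):

# Send the robots directives on every response, site-wide
Header set X-Robots-Tag "noindex,nofollow,noodp,noydir"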


This will work in 99% of cases. Keep in mind though that it is still possible to become indexed in some cases by some providers. Google claims to fully respect noindex, but I have my suspicions.

Finally, if you do get indexed, or have already been indexed, the only way to get your information de-indexed is to follow each provider's removal process and request that the site/URL be removed. Obviously this means that you will probably want to monitor the sites/pages using something like Google Alerts (thanks @Joe).


@Pope3001725

I'll have to go looking for the source of this information, but apparently robots.txt will not necessarily prevent a page from being indexed. The HTTP X-Robots-Tag header, however, apparently does work.

If you're using Apache, you can block pages in bulk by putting this line in an .htaccess file (it requires mod_headers):

Header set X-Robots-Tag "noindex"


Give that a try and see what happens.
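If you only want the header on certain file types rather than everything, one way to scope it is a FilesMatch block (again assuming mod_headers; the extension list here is purely illustrative):

# Apply the header only to matching non-HTML assets
<FilesMatch "\.(pdf|doc|png|jpe?g|gif)$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>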

Edit

(Found a source. Not the one I remember but it works).


@Debbie626

I think Matt Cutts talked about this. If my memory is correct, it had to do with linking.
Here is more: www.google.com/support/forum/p/Webmasters/thread?tid=2720810fa226e9c8&hl=en
You can remove them with the Google removal tool.
