
Page blocked by robots.txt showing up in site: search results with a description that is a mix of Chinese, English, and German

@Deb1703797

Posted in: #GoogleSearch #RobotsTxt #Seo

I found a strange search result for a resource blocked by robots.txt.
Why is there Chinese (I'm guessing) text followed by the text Hello nighthawk!? Is this a Google Easter egg?



Yesterday I tried to remove the URL from Google with Webmaster Tools.
At that time there was no Hello Nighthawk!, only the 'blocked by robots.txt' message. The issue was reported by a co-worker.



This is the content of the robots.txt:

User-agent: *
Disallow: /en


The domain gets redirected in the following way:
domain.com/en -> (301) domain.com/en

The page domain.com/en shows the normal page with the correct page title.

The title of domain.com/en does not contain any of these words.
I have searched the whole project to find the word 'nighthawk'. It is not included. And we never had any Chinese translations.


2 Comments


@Pope3001725

robots.txt prevents pages from being explicitly marked unindexable

You read that right.

Make the page crawlable and unindexable

To make sure a page does not appear in Google search results, make sure robots.txt allows it to be crawled, and mark the page explicitly as unindexable.

It’s common practice to use robots.txt in an effort to keep pages out of search engine indexes. However, to ensure a page does not get indexed, it must be crawlable.

Google (and Bing) will exclude a page from the index if instructed to by the page. This can be an X-Robots-Tag HTTP header, or a noindex meta tag in the HTML.
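As a rough illustration, here is a minimal Python sketch that checks a response for either signal. The function name `has_noindex` and the shape of its inputs are assumptions made for this example, and the substring match is a simplification of how search engines actually parse these directives.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attr.get("name") or "").lower() == "robots":
            self.directives.append((attr.get("content") or "").lower())


def has_noindex(headers, html):
    """Return True if the response asks search engines not to index it.

    headers: dict of HTTP response headers (hypothetical input shape).
    html: the response body as a string.
    """
    # Signal 1: the X-Robots-Tag HTTP header (simplified substring match).
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    # Signal 2: a robots meta tag in the HTML.
    parser = _RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return any("noindex" in d for d in parser.directives)
```

Either signal alone is enough, but both only work if the crawler is allowed to fetch the page in the first place.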

But Googlebot can’t read those instructions if robots.txt forbids it from fetching the page. So Google gives itself the benefit of the doubt and may place the page in the index anyway.
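To see what the robots.txt from the question blocks, one option is Python's standard `urllib.robotparser`. This is only a sketch: the helper name `can_crawl` is made up for this example, and the standard-library parser approximates the matching rules rather than reproducing Googlebot's exact behavior.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /en
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `can_crawl(ROBOTS_TXT, "Googlebot", "http://domain.com/en")` returns `False`, so a noindex instruction on that page would never be seen. Because `Disallow` matches by path prefix, a path such as `/english` is blocked as well, while `/` stays crawlable.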

Here’s how Google explains it:


After the robots.txt file (or the absence of one) has given permission to crawl a page, by default pages are treated as crawlable, indexable, archivable, and their content is approved for use in snippets that show up in the search results, unless permission is specifically denied in a robots meta tag or X-Robots-Tag.


Google’s half-fixes

You can use Google Webmaster Tools to temporarily remove a page from the Google index. But there’s no set time on how long the removal is good for. It's not really a solution.

Google also has an experimental no-index feature in robots.txt that is designed to allow webmasters to make pages both un-crawlable and un-indexable. As Google makes no guarantee about its functionality, use it at your own risk.

Also, be aware other search engines don’t support no-index directives inside robots.txt. Bing webmaster documentation states:


To remove a URL from their own site from the Bing index…Bingbot needs to be able to access the URL, so you should not block the URL from being re-crawled through robots.txt.


What is robots.txt for, then?

robots.txt is intended as a solution for ensuring search engine bots don’t inflict unwanted spidering traffic on websites — that traffic might incur fees from the web host, or (if your website is fragile) might cause performance or stability issues.

These are (ostensibly) separate concerns from not wanting your pages to be findable by users searching on Google.

About the gibberish associated with your page in the SERPs

The incorrect content associated with your page in the search results may come from the anchor text of pages linking to your site. Since the page is uncrawlable, this second-hand information can be the best available information Google has about your page’s content.

It would seem that some of the content getting associated with your site is from shadier areas of the web. These places might be linking to your site for any number of reasons, most of which involve attempts to associate themselves with your good reputation.


@Megan663

Google includes uncrawlable pages in the index when they are linked from other sites.

That means that a link to the website such as <a href="domain.com/en">[CHINESE] - Hey nighthawk</a> can have its anchor text show up in the search results for that page.
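To make the mechanism concrete, here is a small Python sketch that pulls the anchor text out of links pointing at a given URL, which is roughly the second-hand information a search engine has left when it cannot fetch the page itself. The class and function names are invented for this example, and how Google actually collects and weighs anchor text is not public.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the visible text of every link pointing at target_href."""

    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.texts = []
        self._buffer = None  # text chunks while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_href:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None


def anchor_texts(html, target_href):
    """Return the anchor texts of all links in html that point at target_href."""
    collector = AnchorTextCollector(target_href)
    collector.feed(html)
    collector.close()
    return collector.texts
```

Run against a third-party page containing the link above, `anchor_texts(page_html, "domain.com/en")` would return the link text, not anything taken from the blocked page itself.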

Some have suggested that such occurrences are temporary. They aren't always. Google indexes uncrawlable pages because sometimes important pages are blocked by robots.txt. Matt Cutts explains:


You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file. There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link.


You are unlikely to see this page in search results except through the site: query that you ran. Otherwise somebody would have to search for [CHINESE] Hey nighthawk or some portion thereof.
