
Google indexing page with parameters but page is Disallowed in robots.txt

@Holmes151

Posted in: #Google #RobotsTxt #Seo

I have the following in robots.txt:

User-agent: *
Disallow: /refer.php

User-agent: NinjaBot
Allow: /

Sitemap: http://www.mysite.com/sitemap.xml

The refer.php file does various things depending on what GET parameters are passed to it.

When I do a Google search, I see tons of results for pages like this:
http://www.mysite.com/refer.php?o=23945
http://www.mysite.com/refer.php?o=39858
http://www.mysite.com/refer.php?o=9683
http://www.mysite.com/refer.php?o=10569
http://www.mysite.com/refer.php?o=58304
http://www.mysite.com/refer.php?o=69604


Is Google indexing these because I don't have an asterisk (*) after refer.php in robots.txt? Would changing the rule to Disallow: /refer.php* fix the problem?




3 Comments


 

@Mendez628

Your robots.txt is fine as it stands. However, it may not be enough to prevent indexing entirely: a Disallow directive in robots.txt blocks crawling, but in some cases the URLs themselves will still be indexed because of links or other factors.

Robots.txt is not meant to prevent the indexing of URLs; its purpose is to prevent crawling.
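To double-check that the rule in the question already covers the parameterized URLs, here is a minimal sketch using Python's standard urllib.robotparser. It feeds in the robots.txt from the question and asks whether two crawlers may fetch the example URLs. The helper name can_crawl is made up for illustration, and this parser is only an approximation of how Googlebot evaluates rules (it does plain prefix matching and does not understand wildcards).

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the question, reproduced as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /refer.php

User-agent: NinjaBot
Allow: /

Sitemap: http://www.mysite.com/sitemap.xml
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Hypothetical helper: may this user agent crawl the URL under ROBOTS_TXT?"""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Googlebot", "http://www.mysite.com/refer.php?o=23945"),
        ("NinjaBot", "http://www.mysite.com/refer.php?o=23945"),
        ("Googlebot", "http://www.mysite.com/index.php"),
    ]:
        print(agent, url, can_crawl(agent, url))
```

If the first check reports that crawling is blocked, the rule is doing its job, and the search results are URL-only listings of the kind described in the Google Help quote rather than a sign that the Disallow line is broken.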

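Google can only see a noindex tag if it is allowed to crawl the page, so the tag has no effect on a URL that is still disallowed in robots.txt. The best way to prevent Google from indexing a URL is to use this in the document head: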

<meta name="robots" content="noindex" />
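Since refer.php builds its output dynamically, it may be worth confirming that the tag really ends up in the HTML that is served. The following is a rough sketch using Python's standard html.parser; the class and function names are invented for this example, and it only looks at the robots meta tag, nothing else.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Hypothetical helper: True if the page carries a robots noindex meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex" /></head></html>'
    print(has_noindex(page))
```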


Google Help:


While Google won't crawl or index the content of pages blocked by
robots.txt, we may still index the URLs if we find them on other pages
on the web. As a result, the URL of the page and, potentially, other
publicly available information such as anchor text in links to the
site, or the title from the Open Directory Project (www.dmoz.org), can
appear in Google search results.



 

@Welton855

Add:

Disallow: /refer.php?*


to your robots.txt. Googlebot understands the wildcard, and this is the most explicit way to tell it which URLs you want it to stay away from.

To work with all robots, try it without the trailing *, but do test the rule with the robots.txt tester in Google Webmaster Tools to make sure Googlebot will be blocked.



 

@Megan663

You shouldn't need a trailing asterisk: a path left open, without a dollar sign at the end, should match anything that follows it. Maybe the fact that the path ends in .php is causing an issue. In that case I might try:

Disallow: /*refer.php?
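To compare the rule variants suggested in this thread, here is a small sketch of wildcard matching as Google documents it: a rule is a prefix match, * stands for any run of characters, and a trailing $ anchors the end of the URL. The function names are made up for illustration and this is not Googlebot's actual code, just an approximation of the documented behavior.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex (illustrative)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(pattern: str, url_path: str) -> bool:
    """True if the pattern matches the URL path (path plus query string)."""
    # re.match anchors at the start, which gives the prefix-match behavior.
    return robots_pattern_to_regex(pattern).match(url_path) is not None


if __name__ == "__main__":
    target = "/refer.php?o=23945"
    for rule in ["/refer.php", "/refer.php*", "/refer.php?*", "/*refer.php?", "/refer.php$"]:
        print(rule, rule_matches(rule, target))
```

Under these assumptions the original rule, the trailing-asterisk variant, and both wildcard variants all match the parameterized URLs; only a rule ending in $ would let them through. The one practical difference is that /refer.php?* and /*refer.php? no longer match a bare /refer.php with no query string.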


Also, this may be obvious, but how long has the robots.txt been in place? I have seen Google take a couple of weeks or more to update the SERPs to reflect robots.txt changes.


