How do I disallow a specific query string in robots.txt?
I have the URL
www.example.com/shopping/books/?b=9
and the following robots.txt file:
User-agent: *
Disallow: /?b=9
But when I test this in the robots.txt tester in Google Webmaster Tools, the URL shows as allowed when it should be disallowed.
While ?b=9 is fixed, the /shopping/books part changes with each category, and I need to block them all.
Please tell me what's wrong with my robots.txt.
More posts by @Bryan171
4 Comments
Doesn't a self-altering text configuration file suggest an issue with your directories and your actual ability to reach and edit that file? Not to cause panic, but the input you entered changed, so I don't think it's a text file issue.
I don't think there's a way to do this in robots.txt. Also, whatever you list in robots.txt is advertised to hackers as well, because robots.txt is a file accessible to all.
What I would suggest is to use your scripting language to detect the query string you don't want people to access and, if it matches, either redirect to a relevant page they are allowed to access or return a page with a 410 HTTP status code.
For example, in PHP, you can use either of these to stop the b=9 parameter from being accessible:
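<?php
// Return 410 Gone when the unwanted query string (?b=9) is requested.
// isset() avoids an "undefined index" notice when b is absent.
if (isset($_GET['b']) && $_GET['b'] === "9") {
    header("HTTP/1.1 410 Gone", true, 410);
    echo "This page is gone.";
    exit();
}
?>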
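<?php
// Permanently redirect requests for ?b=9 to a page that may be accessed.
if (isset($_GET['b']) && $_GET['b'] === "9") {
    header("HTTP/1.1 301 Moved Permanently", true);
    // The Location header should carry an absolute URL.
    header("Location: http://example.com/newpage", true, 301);
    echo 'This page moved <a href="http://example.com/newpage">here</a>';
    exit();
}
?>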
If you are looking to specifically block just robots and not real users, then you could make the parameters accessible via POST only. Here's the HTML and PHP you can use:
HTML:
<form action="phpscript.php" method="POST">
<input type="hidden" name="b" value="9">
<input type="submit" value="special page">
</form>
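PHP file named phpscript.php:
<?php
// Reject b=9 when it arrives in the query string of a non-POST request;
// the form above sends b in the POST body ($_POST['b']) instead.
if (isset($_GET['b']) && $_GET['b'] === "9" && strtoupper($_SERVER['REQUEST_METHOD']) !== "POST") {
    header("HTTP/1.1 410 Gone", true, 410);
    echo "This page is gone";
    exit();
}
?>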
The only problem with the POST method is that POST requests are generally not cacheable, since they are primarily meant for submitting user data.
The answer is on the link I posted:
Disallow: /shopping/*/*?b=9
* is a wildcard that matches any sequence of characters.
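Combined with the User-agent line from the question, the complete file would presumably look like the sketch below. It assumes every affected URL sits under /shopping/; a URL outside that directory would not be covered by this rule.

```text
User-agent: *
Disallow: /shopping/*/*?b=9
```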
robots.txt uses prefix matching, so a rule like Disallow: /?b=9 will block all URLs that start with /?b=9. Your URLs start with /shopp... so they are not blocked.
However, you can use a * (wildcard - 0 or more instances of any character) to represent the first part of the URL. This is an addition to the "standard", but the main search engine bots ("Google, Bing, Yahoo, and Ask") support it:
Disallow: /*/?b=9
The above should block /shopping/books/?b=9 and /<anything>/?b=9.
Reference: developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#url-matching-based-on-path-values
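To see why the original rule misses and the wildcard rule hits, here is a rough PHP sketch of prefix matching with the * and $ extensions. The function name robots_rule_matches is a hypothetical helper made up for illustration; it is not part of PHP or of any crawler, and real crawlers may differ in details such as percent-encoding and Allow/Disallow precedence.

```php
<?php
// Hypothetical helper (an assumption for illustration, not a real API):
// returns true when a Disallow pattern matches a URL's path plus query string.
function robots_rule_matches(string $rule, string $urlPath): bool
{
    // A trailing "$" anchors the rule to the end of the URL.
    $anchored = substr($rule, -1) === '$';
    if ($anchored) {
        $rule = substr($rule, 0, -1);
    }
    // Escape regex metacharacters, then turn each escaped "*" into ".*".
    $regex = str_replace('\*', '.*', preg_quote($rule, '#'));
    // Rules always match from the start of the path (prefix matching).
    return preg_match('#^' . $regex . ($anchored ? '\z' : '') . '#', $urlPath) === 1;
}

// The rule from the question does not match, because the URL starts with /shopping.
var_dump(robots_rule_matches('/?b=9', '/shopping/books/?b=9'));   // bool(false)
// The wildcard rule matches any leading directories.
var_dump(robots_rule_matches('/*/?b=9', '/shopping/books/?b=9')); // bool(true)
```

Two caveats fall out of the sketch. Because matching is by prefix, a rule ending in b=9 also matches b=90 or b=99; appending $ limits it to URLs that end exactly there, at the cost of no longer matching ?b=9 followed by further parameters. And /*/?b=9 needs at least one directory before the query string, so a bare /?b=9 at the site root is not matched by it.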