How to disallow robots from the first 185 pages?
I have a website where the first 185 pages are sample profiles for demonstration purposes:
http://example.com/profile/1 ... http://example.com/profile/185
I want to block these pages from Google, as they are somewhat similar in content, to avoid a penalty for duplicate content. Is there a better way to do this than listing them out in robots.txt like so:
User-agent: *
Disallow: /profile/1
Disallow: /profile/2
Disallow: /profile/3
...
It is not possible to use robots.txt (as defined by the original specification) in your case. A line like Disallow: /profile/1 blocks all URLs whose paths start with /profile/1. So it applies to profiles 1, 10-19, and 100-185 (as intended), but also to profiles 186-199, 1000-1999, 10000, … (not intended).
Workaround: add a character as a suffix, for example a /. Your profile URLs would then look like /profile/1/, /profile/2/, …, and you could specify Disallow: /profile/1/ etc.
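To see the over-matching in action, here is a small sketch (not from the original answer) using Python's urllib.robotparser, which implements the same prefix matching as the original robots.txt specification:

```python
from urllib import robotparser

# A single rule written for profile 1 also catches every profile whose
# number merely begins with "1" — including 186, which should stay crawlable.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /profile/1",
])

print(rp.can_fetch("*", "http://example.com/profile/185"))  # False (intended)
print(rp.can_fetch("*", "http://example.com/profile/186"))  # False (NOT intended)
print(rp.can_fetch("*", "http://example.com/profile/2"))    # True (only the 1* rule exists here)
```

The same prefix logic is what every spec-conforming crawler applies, which is why listing /profile/1 through /profile/185 cannot work without a suffix.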
That said, some robots.txt parsers support additional features which are not included in the original robots.txt specification. As you say you want to block the pages for Google, Google gives special meaning to the $ character:
To specify matching the end of a URL, use $
So for Google, you could write Disallow: /profile/1$. But other parsers that don’t support this feature will then index your profiles 1-185 as they only look for URL paths literally starting with /profile/1$.
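To sketch what such a Google-only rule set could look like, here is a short illustrative snippet (the User-agent value and range are assumptions based on the question's 185 profiles):

```python
# Build Disallow lines with "$" end anchors, which Googlebot honors but
# parsers limited to the original spec treat as literal characters.
lines = ["User-agent: Googlebot"]
lines += [f"Disallow: /profile/{i}$" for i in range(1, 186)]

print(lines[1])    # Disallow: /profile/1$
print(lines[-1])   # Disallow: /profile/185$
print(len(lines))  # 186
```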
So if you neither want to add a suffix (and list all 185 Disallow lines explicitly) nor want a Google-only solution (no suffix, but still listing all Disallow lines explicitly), robots.txt is not a solution for you.
Instead, you could use:
on the HTTP level: the HTTP header X-Robots-Tag
X-Robots-Tag: noindex
on the HTML level: meta element with the robots name
<meta name="robots" content="noindex" />
Both ways are supported by Google.
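Once the meta element is in place, you may want to verify that each rendered profile page actually carries it. A minimal sketch using Python's standard html.parser (the class name and sample HTML are illustrative, not part of the original answers):

```python
from html.parser import HTMLParser

class RobotsMetaCheck(HTMLParser):
    """Detect a <meta name="robots"> tag whose content includes noindex."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.noindex = "noindex" in a.get("content", "").lower()

checker = RobotsMetaCheck()
checker.feed('<html><head><meta name="robots" content="noindex" /></head><body></body></html>')
print(checker.noindex)  # True
```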
You are creating a file to be read by a robot, so create it with a robot:
<?php ob_start(); ?>
User-agent: *
<?php
header("Content-Type: text/plain");
$limit = 185;
for ($i = 1; $i <= $limit; $i++)
    echo "Disallow: /profile/$i\n";
?>
# rest of robots.txt here
Or, if your URLs use leading zeros (better sorting), replace the echo line with:
printf("Disallow: /profile/%03d\n", $i);
Of course, robots.php doesn't work, but that's what mod_rewrite is for:
In .htaccess:
RewriteRule ^robots\.txt$ robots.php [L]
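The same generation logic can be sketched in Python for a quick local sanity check of the output before wiring up mod_rewrite (this is an illustrative cross-check, not part of the PHP answer):

```python
# Reproduce the generated robots.txt body: one header line plus 185 Disallow lines.
limit = 185
body = "User-agent: *\n"
body += "".join(f"Disallow: /profile/{i}\n" for i in range(1, limit + 1))

print(body.splitlines()[1])    # Disallow: /profile/1
print(body.splitlines()[-1])   # Disallow: /profile/185
print(len(body.splitlines()))  # 186
```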
You could put the robots meta tag in all of those pages: <meta name="robots" content="noindex, nofollow">