How to disallow robots from the first 185 pages?

@Kevin317

Posted in: #RobotsTxt

I have a website where the first 185 pages are sample profiles for demonstration purposes:
example.com/profile/1 ... example.com/profile/185

I want to block these pages from Google because they are somewhat similar in content, and I want to avoid being penalized for duplicate content. Is there a better way to do this than listing them all out in robots.txt like so:

User-agent: *
Disallow: /profile/1
Disallow: /profile/2
Disallow: /profile/3
...


3 Comments


 

@Ann8826881

It is not possible to use robots.txt (as defined by the original specification) for your case. A line like Disallow: /profile/1 blocks all URLs whose paths start with /profile/1. So it applies to profiles 1, 10-19, and 100-185 (as intended), but also to profiles 186-199, 1000-1999, 10000, … (not intended).

Workaround: Add a character as a suffix, for example a /. Your profile URLs would then look like /profile/1/, /profile/2/, …, and you could specify Disallow: /profile/1/ etc.
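With that suffix in place, the robots.txt would still list one line per profile, but without blocking anything beyond 185:

User-agent: *
Disallow: /profile/1/
Disallow: /profile/2/
...
Disallow: /profile/185/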

That said, some robots.txt parsers support additional features that are not part of the original robots.txt specification. Since you specifically want to block the pages for Google: Google gives special meaning to the $ character:


To specify matching the end of a URL, use $


So for Google, you could write Disallow: /profile/1$. But other parsers that don't support this feature would then crawl (and potentially index) your profiles 1-185, because they treat the rule as matching URL paths that literally start with /profile/1$.
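A sketch of that Google-only variant, addressed to Googlebot so that other parsers ignore the group entirely:

User-agent: Googlebot
Disallow: /profile/1$
Disallow: /profile/2$
...
Disallow: /profile/185$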

So if you don't want to add a suffix (while still listing all 185 Disallow lines explicitly), and you don't want a Google-only solution (no suffix, but still listing every Disallow line explicitly), robots.txt is not a solution for you.

Instead, you could use:

on the HTTP level: the X-Robots-Tag HTTP header

X-Robots-Tag: noindex

on the HTML level: a meta element with the robots name

<meta name="robots" content="noindex" />
Both ways are supported by Google.
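For example, on Apache you could send that header for exactly the profiles 1-185 from .htaccess. A minimal sketch, assuming mod_setenvif and mod_headers are enabled:

# mark requests for the 185 sample profiles ...
SetEnvIf Request_URI "^/profile/([1-9][0-9]?|1[0-7][0-9]|18[0-5])$" NOINDEX_PROFILE
# ... and send the noindex header only for those
Header set X-Robots-Tag "noindex" env=NOINDEX_PROFILE

The regular expression matches the numbers 1-185 exactly, so profile 186 and above are unaffected.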



 

@Si4351233

You are creating a file to be read by a robot, so create it with a robot:

<?php ob_start(); // buffer output so header() can still be sent below ?>
User-agent: *
<?php
header("Content-Type: text/plain");
$limit = 185;

for ($i = 1; $i <= $limit; $i++)
    echo "Disallow: /profile/$i\n";
?>
# rest of robots.txt here


Or, if you are using leading zeros (for better sorting), replace the echo line with:

printf("Disallow: /profile/%03d\n", $i);


Of course, crawlers request robots.txt, not robots.php, but that's what mod_rewrite is for. In .htaccess:

RewriteEngine On
RewriteRule ^robots\.txt$ robots.php [L]
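With that rule in place, a request for example.com/robots.txt is answered by robots.php, so crawlers still see a plain-text robots.txt at the usual URL.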



 

@Welton855

You could put the robots meta tag on all of those pages: <meta name="robots" content="noindex, nofollow">
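Since the profiles are numbered, the page template can decide this per profile. A minimal sketch in PHP, assuming the current profile number is available in a (hypothetical) variable $profileId:

<?php
// Sample profiles 1-185: keep them out of search indexes
if ($profileId >= 1 && $profileId <= 185) {
    echo '<meta name="robots" content="noindex, nofollow">';
}
?>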


