Remove a large number of useless pages from the Google index using robots.txt
A large number of auto-generated pages from my site's search have been indexed by Google due to an error.
I'm trying to clean this up so that only high-quality content remains in Google's index.
So I added a line to robots.txt to disallow this directory, so Googlebot doesn't waste time crawling it:
Disallow: /search/
After some months it worked as expected, and most of these pages now appear in Google only as bare URLs with this message:
A description for this result is not available because of this site's robots.txt
Is this enough to prevent future penalties such as Panda? Or, since the pages are still in the index (even if only as URLs due to the disallow), will Google still have them in its cache from the last crawl, and could that cause problems?
What I would suggest is to create a brand new set of URLs for search result pages and then for anyone requesting the old URLs, produce an error page with an HTTP 410 status code. Also, make your search pages only accessible via the POST request method.
Google won't crawl pages that are only reachable via the POST request method as a result of filling out a proper form. An example is logging in to a restricted section of a website.
For example:
If your current search result URLs are in the form of:
example.com/results.php?query=abc
then requests to that URL need to return a 410 (Gone) status code along with an error page indicating the page is no longer available.
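As a minimal sketch of that idea, here is an illustrative handler in Python's standard-library `http.server` (the answer assumes a PHP site, so adapt the approach to your own stack; the `/results.php` path is taken from the example above):

```python
# Sketch: answer any request to the old search-result URL with 410 Gone,
# which tells crawlers the page has been removed permanently (vs. 404).
from http.server import BaseHTTPRequestHandler, HTTPServer


class GoneHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/results.php"):
            body = b"This page is no longer available."
            self.send_response(410)  # 410 Gone
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


# To try it locally:
# HTTPServer(("", 8000), GoneHandler).serve_forever()
```

On a real deployment you would usually configure this in the web server (e.g. a `Gone` rule) rather than in application code, but the status code is the important part.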
What you need to do is create a proper search form. In HTML, this will work:
<form action="searchfor.php" method="post">
Query: <input type="text" name="query">
<input type="submit" value="search">
</form>
When the user clicks search, the browser requests searchfor.php and POSTs the data query=whatever (with whatever replaced by the text the user entered); the value can then be extracted in the server-side script. Across all search result pages, the URL in the address bar stays the same, but the page content changes based on the query.
For best results, make sure the form action points to a different script name than the old search URLs. The script must exist on the server; you can use the full URL to the script if that makes things easier for you.
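A hedged sketch of the server side of that POST-only form, again in Python for illustration (the answer's searchfor.php name and the query field from the form above are kept; a PHP script would read `$_POST['query']` instead):

```python
# Sketch: handle the POSTed form data from the search form.
# parse_qs decodes the urlencoded body (e.g. b"query=abc") into a dict.
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs


class SearchHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/searchfor.php":
            length = int(self.headers.get("Content-Length", 0))
            form = parse_qs(self.rfile.read(length).decode())
            query = form.get("query", [""])[0]  # the <input name="query"> value
            body = f"You searched for: {query}".encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)
```

Because the page is only reachable by POST, a crawler following a plain GET link never sees the result content.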
Just make sure there is no other way for people to access the search result pages, and Google will not try to access them. If you must allow users to reach them via a hyperlink, include rel="nofollow" in the anchor tag, and make the result page itself non-indexable. See the other answers in this thread for instructions on how to make a page non-indexable.
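For example, such a nofollowed link would look like this (the URL here is purely illustrative):

```html
<!-- rel="nofollow" asks crawlers not to follow this link;
     the href is an example path, not one from the question -->
<a href="/search/?query=abc" rel="nofollow">See results for "abc"</a>
```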
Google supports Noindex: in robots.txt as an experimental feature. It sounds like this would be the perfect case for using it:
User-Agent: *
Disallow: /search/
Noindex: /search/
Google won't crawl the disallowed path specified in robots.txt, but robots.txt cannot control references to your site-search results from other sites.
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Google Search results completely by using other URL blocking methods, such as password-protecting the files on your server or using the noindex meta tag or response header.
Source: support.google.com/webmasters/answer/6062607
Block search indexing with meta tags
You can prevent a page from appearing in Google Search by including a noindex meta tag in the page's HTML code. When Googlebot next crawls that page, Googlebot will see the noindex meta tag and will drop that page entirely from Google Search results, regardless of whether other sites link to it.
To prevent most search engine web crawlers from indexing a page on your site, place the following meta tag into the <head> section of your page:
<meta name="robots" content="noindex">
To prevent only Google web crawlers from indexing a page:
<meta name="googlebot" content="noindex">
Source: support.google.com/webmasters/answer/93710
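The "response header" alternative that the quoted documentation mentions is the X-Robots-Tag HTTP header, which is useful when you can't edit the HTML (or for non-HTML files). A minimal illustrative sketch in Python (in PHP this would be a header() call before any output):

```python
# Sketch: serve a page with an "X-Robots-Tag: noindex" response header,
# which has the same effect as the noindex meta tag.
from http.server import BaseHTTPRequestHandler


class NoindexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"search results page"
        self.send_response(200)
        self.send_header("X-Robots-Tag", "noindex")  # keep out of the index
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```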