
HTTP: How to be removed from search engines at a certain point in the future?

@Mendez628

Posted in: #Http #SearchEngines

Is there a way to tell search engines that a page they crawl should be included in the search results now, but must be removed at a certain time in the future?

I have a website where hundreds of publications are posted each day, and I want them to be crawled and searchable. However, I am legally required to remove the information after a while (with an individual date for each page).

After that date, the page will no longer be visible on my website (HTTP response 410 Gone), but it will linger in e.g. the Google cache for a while, which could cause legal issues for me. Obviously, it is not viable to issue hundreds of content removal requests to Google by hand. On the other hand, the individual pages do not get modified for some months before they have to be discarded, so Googlebot won't check in often.

From what I understand, the HTTP Expires header is a marker for minimum freshness, not for maximum lifetime, correct? I am sending Last-Modified and ETag headers, but they don't help here. Is there any way to say "cache, but only until 2011-08-15"?
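For context, a minimal sketch of the setup described above, assuming a Flask app and a hypothetical in-memory table of per-page removal dates (REMOVAL_DATES and the route name are illustrative, not from the original post):

from datetime import datetime, timezone
from flask import Flask, abort

app = Flask(__name__)

# Hypothetical per-page removal dates; in practice these would come from a database.
REMOVAL_DATES = {
    "pub-123": datetime(2011, 8, 15, tzinfo=timezone.utc),
}

@app.route("/publications/<page_id>")
def publication(page_id):
    removal = REMOVAL_DATES.get(page_id)
    if removal is None or datetime.now(timezone.utc) >= removal:
        abort(410)  # 410 Gone: the page was removed on purpose
    return f"Publication {page_id}"  # placeholder for the real page body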


2 Comments


@Mendez628

For Google there is a meta tag called unavailable_after, which does exactly what I was looking for: it tells Google to remove a certain page at a specific time in the future.

It is the only way to achieve what I was hoping to accomplish: getting the pages removed automatically, at the right time, without relying on the crawler to come back and notice the 410 Gone response, which can take weeks after the content has been removed.

Example:


<META NAME="GOOGLEBOT" CONTENT="unavailable_after: 25-Aug-2007 15:00:00 EST" />


Or with an HTTP header, for PDFs etc.:


X-Robots-Tag: unavailable_after: 23 Jul 2007 15:00:00 PST
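
Since each page has its own removal date, the tag and header would presumably be generated per page. A rough sketch of one way to do that (my addition, assuming Flask; only the tag and header formats come from the examples above):

from datetime import datetime, timezone
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/publications/<page_id>")
def publication(page_id):
    # Hypothetical per-page removal date; would normally be looked up by page_id.
    removal_date = datetime(2011, 8, 15, 15, 0, 0, tzinfo=timezone.utc)
    stamp = removal_date.strftime("%d-%b-%Y %H:%M:%S") + " GMT"
    html = ('<meta name="googlebot" content="unavailable_after: %s">' % stamp
            + "<p>publication body</p>")
    resp = make_response(html)
    # The same signal as an HTTP header, for PDFs and other non-HTML resources:
    resp.headers["X-Robots-Tag"] = "unavailable_after: " + stamp
    return resp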


Sources: googleblog.blogspot.com/2007/07/robots-exclusion-protocol-now-with-even.html and www.google.com/support/webmasters/bin/answer.py?answer=79812
I could not find out whether Bing, Yahoo, and others have adopted this Google-specific tag.


@Merenda212

First of all, you don't have control over what search engines crawl and what they put in their index.

BUT Google, for instance, takes your information about the lifetime of your pages very seriously. So if you add the correct HTTP header, it will consider that information. You can also add rules to your robots.txt about which pages should no longer be crawled.

There are also the Google Webmaster Tools, where you can ask Google to remove pages from the index.

On the official Google Webmaster blog you will find very helpful information about removing URLs from the index and how to reinclude content. There they say you can remove URLs by the following (a combined sketch follows the list):


using 410 Gone,
robots.txt, or
the noindex meta tag
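
A compact sketch (my illustration, not from the blog post) of what those three signals can look like in one hypothetical Flask app:

from flask import Flask, abort

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    # robots.txt: tell crawlers to stay away from a whole section.
    body = "User-agent: *\nDisallow: /private/\n"
    return body, 200, {"Content-Type": "text/plain"}

@app.route("/expired/<page_id>")
def expired(page_id):
    # 410 Gone: the page existed but was removed on purpose.
    # (Don't also Disallow these URLs, or the crawler never sees the 410.)
    abort(410)

@app.route("/publications/<page_id>")
def noindex(page_id):
    # noindex: the page stays reachable but asks engines to drop it from results.
    return '<meta name="robots" content="noindex"><p>publication body</p>'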
