
Making Google forgive us right away when accidentally publishing incorrect links (noarchive doesn't work)

@Harper822

Posted in: #Google #Googlebot #Links #Noarchive #WebCrawlers

So this is what happened.

One day, I changed some code on my site to try to make things more compliant with AdSense. Several hours later, I checked the crawl errors report in GSC, only to see a string of HTTP 400 errors (I deliberately return a 400 status for requests containing a ' character) for URLs of the following format (replace # with an actual number):
example.com/album/gallery/('+#+');

But in reality the URL should be in this form:
example.com/album/gallery/#

On each of my pages I used the noarchive meta tag as follows:

<meta name="GOOGLEBOT" content="NOARCHIVE">
<meta name="ROBOTS" content="NOARCHIVE">


Since this incident, the maximum crawl rate Google allows for my site is only 2 requests per second, and almost every day a couple of new entries similar to the malformed links above show up in my GSC crawl errors report.

It's as if Google has cached all the HTML output related to my site and relies solely on that HTML as if it were the truth, while also ignoring my "noarchive", which is supposed to prevent Google from archiving content in the first place.

Even though the malformed links on my site had already been replaced with valid links, the only way I could solve this problem was to modify my Apache configuration file to include a rewrite rule placed before the odd-character filtering rules (the ones that produce the HTTP 400 errors). This rewrite rule redirects all malformed URLs (shown above) to the valid URLs.
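Something along these lines (not my exact rule, just a sketch of the idea, assuming the quote and parenthesis characters reach Apache literally and using my /album/gallery/ path from above):

RewriteEngine On
# Send the malformed gallery URLs to the clean numeric form,
# before the filtering rules that would otherwise answer them with 400.
RewriteRule "^/album/gallery/\('\+(\d+)\+'\);$" "/album/gallery/$1" [R=301,L]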

The problem here is that I disabled the use of .htaccess for speed and security reasons, so making this change required an Apache graceful restart. I hate resorting to that, and I hate lowering my security.

What I want to know is: is there some way to explicitly tell Google to discard its cached version of my pages (in other words, forgive my accidental mistake) and re-crawl everything when I make a mistake, rather than having it rely on its own cached data for several days? I used the noarchive meta tag and that does not seem to work.

If I accidentally create faulty links on my site in the future, I don't want to have to modify the Apache configuration to redirect them, gracefully restart Apache, and/or lower the security of my site.

Heck, if there's a special URL I can use to reset web crawler scanning of my site, I'd use it right now.


1 Comment


@Megan663

There is no way to tell Googlebot to forget about something it has crawled.

Your only recourse is to:


1. Fix the problem with your HTML.
2. Redirect any faulty URLs that were caused by the problem.
3. Wait until Googlebot has recrawled all the pages with faulty HTML and all the bad links that those pages might have generated.


NOARCHIVE prevents Google from showing a cached copy of the page to users. It has no effect on whether Google crawls the page again, remembers it internally, or uses its links to crawl other pages. Google will always crawl the links in a page unless that page has a NOFOLLOW directive, and NOFOLLOW cannot be applied retroactively.
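For example, if you wanted Googlebot to neither cache a page nor follow its links, you would need something like the following (just an illustration; NOFOLLOW would also stop Google from following your valid links, so it probably isn't what you actually want here):

<meta name="GOOGLEBOT" content="NOARCHIVE, NOFOLLOW">
<meta name="ROBOTS" content="NOARCHIVE, NOFOLLOW">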

I tend not to put much in my .htaccess files either, but I don't usually disable them entirely. I find that redirects are often best implemented in the programming logic that powers the web application rather than in .htaccess. You might consider moving your redirects into your software, although that isn't an option if your site is static.


