Google isn't indexing URLs after a redirect that differs only in percent-encoding/decoding?
Will Google's crawler refuse to follow redirects when the difference between the redirected-from and redirected-to URLs is solely whether specific characters are percent-encoded or not? For example:
splunkbase.com/apps/All/4.x/Add-On/app:PDF+Report+Server+%28install+on+Linux+only%29
www.splunkbase.com/apps/All/4.x/Add-On/app:PDF+Report+Server+(install+on+Linux+only)
Both of these are, per the HTTP specs, valid and equivalent URIs, but our site's code always redirects to a "canonical" URL for each content page-- which in this case is the first URL listed.
Google clearly isn't indexing this page (in either URL variant). Neither of the URLs above shows up when I search for "PDF Report Server (install on Linux only)".
Google webmaster tools reports a "redirect error" for the "decoded" variant of the URL: splunkbase.com/apps/All/4.x/Add-On/app:PDF+Report+Server+(install+on+Linux+only)
Another problem is that we're currently using a 302 instead of a 301 redirect to handle canonicalization-- we're switching to 301s soon for canonicalizing redirects.
But I'm wondering if the 302 vs. 301 issue may be a red herring-- that the actual underlying issue may be that, in Google's eyes, we're redirecting a URL to itself, since, per the HTTP specs, a percent-encoded and a non-percent-encoded URL should be treated the same by clients and servers.
I found a related thread here. It's not the same issue-- in their case the only difference between redirected URLs was the upper/lower case of the percent-encoded hex values. But it's suspiciously similar to our issue.
Finally, my question: Has anyone run into this percent-encoding-plus-redirect issue, and if so can you discuss how you worked around it? Did switching to a 301 fix it, or was more needed?
For workarounds beyond 301-ing, we're looking at a variety of options from using REL=CANONICAL and turning off redirection in this case, to modifying our escaping to turn off escaping of apostrophes, parentheses, and other not-usually-percent-escaped characters.
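The "turn off escaping" option could look roughly like the sketch below. It is only an illustration in Python, not our site's actual code: the helper name escape_path and the exact set of characters left unescaped are assumptions.

```python
from urllib.parse import quote

# Assumption: characters we choose to leave unescaped in canonical paths.
# By default, quote() percent-encodes apostrophes and parentheses.
SAFE_CHARS = "/:+'()"


def escape_path(path: str) -> str:
    """Percent-encode a path, leaving SAFE_CHARS untouched (hypothetical helper)."""
    return quote(path, safe=SAFE_CHARS)


title_path = "/apps/All/4.x/Add-On/app:PDF+Report+Server+(install+on+Linux+only)"

# With the widened safe set the parentheses survive, so the generated
# canonical URL matches what most people would paste into a link.
print(escape_path(title_path))

# With the default safe set ("/"), the same path gains %28 and %29.
print(quote(title_path))
```

With this kind of escaping the canonical URL and the commonly linked URL are the same string, so no encoding-only redirect would be triggered for these characters.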
For long-term fixes, we're looking at:
like this site does, using a numeric ID as a key, adding REL=CANONICAL to handle changes in SEO text after the title, and not doing any redirection
like many blogs do, continuing to use the title as the canonical URL and continuing to redirect, but replacing all problematic characters with dashes so we don't have to worry about encoding/decoding
More posts by @Courtney195
3 Comments
If you are using Apache, I would strongly recommend using mod_rewrite so that it will handle canonical URLs by simply serving the page requested rather than sending a redirect.
Your root problem really is the fact that you are using URLs throughout your website that require percent-encoding. While limiting your character set may not look as pretty as the original, it really does help down the road once other sites start linking to yours. Other sites are most likely going to end up re-encoding the URL, and before you know it search engines will be trying to access a differently encoded variant of the link.
My best solution is to change the link to one that needs no percent-encoding, then send a 410 GONE or 404 NOT FOUND response for the old URL and add a 5-second redirect in the header to the corrected URL. Wait a couple of weeks and it will correct itself.
Of course, if the page is older (more than 1 year), this may never correct itself.
(edit tidbit: 410 GONE seemed to work best on URLs that absolutely should have never been spidered. For example temporary files and URLs with session data in the $_GET.)
Firstly, the fact that Google has not indexed either URL variant does not indicate a problem with the URL itself. It's more likely a different reason, like Googlebot hasn't crawled that page yet or doesn't think it's interesting enough.
I would suggest a few steps:
Remove the redirection entirely. As you say the URLs are treated as the same anyway. In fact Google Chrome automatically converts %28 into (. You may find some browsers do the opposite - ( into %28 - which may cause problems.
Link to the 'canonical' version. In other words, make sure all links that you have control over point to the correct, parenthesised, version.
Put the canonical version in your sitemap. If you don't already have an XML sitemap, create one and submit it to Google Webmaster Tools.
Use rel=canonical to set the correct URL. If you put the parenthesised version in there, Google should show that in search results instead of the other one.
One final suggestion would be to remove special characters from URLs where possible. Your URLs seem to be based on file names. Perhaps when you upload a file, generate a 'slug' for use on the website, e.g. pdf-report-server-install-on-linux-only, and use that in the URL instead.
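As a rough illustration of the slug idea, here is a minimal Python sketch. The function name slugify and the rule "every run of non-alphanumeric characters becomes one dash" are my assumptions; adapt them to whatever your upload code does.

```python
import re


def slugify(title: str) -> str:
    """Turn a display title into a lowercase, dash-separated slug (hypothetical helper)."""
    # Replace every run of characters other than a-z and 0-9 with a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Drop dashes left over at either end, e.g. from a trailing ")".
    return slug.strip("-")


print(slugify("PDF Report Server (install on Linux only)"))
# pdf-report-server-install-on-linux-only
```

Because the slug only contains letters, digits and dashes, there is nothing left for a browser, a forum or a crawler to encode differently. Two different titles can still collapse to the same slug, so you would probably want a uniqueness check or a numeric ID alongside it.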
These URLs are theoretically equivalent, so a redirect is likely being seen as a redirect to itself, which would be a reason for a crawl error. If that is the case (and I'm assuming that it is), then I would recommend not trying to canonicalize the URL on that level: I would not redirect a URL to an alternate representation of the same URL.
Similarly, there's no need to use the rel=canonical link element for these two URLs. It's fine to use if there are alternate versions, such as different capitalization in the path or URL parameters, but just for these two URLs it will not have any effect.
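To make the "don't redirect a URL to an alternate representation of itself" advice concrete, here is a minimal Python sketch of how server-side canonicalization might skip encoding-only differences. The function name redirect_target is hypothetical, and comparing fully decoded paths is a simplifying assumption on my part, not a description of how Googlebot compares URLs.

```python
from typing import Optional
from urllib.parse import unquote


def redirect_target(requested_path: str, canonical_path: str) -> Optional[str]:
    """Return the path to redirect to, or None to serve the page directly."""
    if requested_path == canonical_path:
        return None
    # Assumption: paths that decode to the same string are the same resource,
    # so redirecting between them would look like a redirect to itself.
    if unquote(requested_path) == unquote(canonical_path):
        return None
    # A real difference (capitalization, a renamed page, ...): redirect.
    return canonical_path


canonical = "/apps/All/4.x/Add-On/app:PDF+Report+Server+%28install+on+Linux+only%29"
decoded = "/apps/All/4.x/Add-On/app:PDF+Report+Server+(install+on+Linux+only)"

print(redirect_target(decoded, canonical))            # None: serve the page as requested
print(redirect_target(decoded.lower(), canonical))    # the canonical path: redirect
```

Note that unquote() leaves "+" as it is, so the plus signs in these paths are compared literally.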
For what it's worth, an easy way to test how Google sees URLs like these is to use the Fetch as Googlebot function in Webmaster Tools. It will not follow redirects, so you should be able to see exactly which URLs are fetched, allowing you to try the different variations and see how it reacts.
On a related note, using URLs such as the one you mentioned seems a bit problematic to me, since it may be difficult for users to link to URLs that use spaces (e.g. when copying and pasting a URL into a forum). Given a proper link, Google will be able to follow it, but if the server-side software does not recognize the full URL, then that link may end up being broken.