How can I fix Google indexing URLs that are not part of my sitemap?

Using site:example.com in Google is returning many results with the following format: https://www.w.example.com/services/edison/16mm-to-2k
Obviously this is not what I submitted and is not part of my sitemap. What are some solutions for dealing with this kind of problem?
This is particularly a problem since Google indexed the HTTPS versions of these URLs, and because the certificate does not cover those hostnames, all of those links show a security warning before visitors reach the site.
Getting wildcard SSL certificates for *.w.example.com and *.ww.example.com seems like a bad idea.
The site's DNS runs through AWS Route 53 and the site is running on an Ubuntu 12.04 EC2 with Apache.
The question focuses a lot on what Google is doing, but it appears to me that your fundamental problem is not really Google-specific at all.
Why do these names, which you clearly don't seem to want people to use, even exist in DNS?
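If they exist because of a wildcard record, you can confirm that from a shell; for example (these hostnames are just illustrations):

dig +short www.w.example.com
dig +short some-random-label.example.com

If both queries return the same record, a wildcard entry such as *.example.com in Route 53 is the likely culprit.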
If it is intentional that these names exist and resolve, why are you serving your actual site when people (and Googlebot) connect using these names? If you want to lead people to the site, it would be much better to redirect them (permanent redirect / 301) to the real site under its canonical name, instead of leaving them to navigate your site under an incorrect name.
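On Apache, a minimal sketch of such a redirect (assuming www.example.com is your canonical name; adjust to match your site) could go in the virtual host or .htaccess:

RewriteEngine On
# Any request whose Host header is not the canonical name gets a 301 to the same path on the canonical host
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]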
Google follows not only links made by other content writers; it also heuristically interprets your JavaScript and even tries to "simplify" your URLs by stripping them of wrappers, for example /index.php?page=news.php => /news.php. One option would be to disallow those mangled URLs in your robots.txt, but that would (1) grow your robots.txt and make it messy, and (2) throw away any rank those links have earned. You must either implement a 301 Moved Permanently redirect or add a canonical URL tag
<link rel="canonical" href="http://moz.com/blog" />
pointing to the most basic address of the same content. Beware: most "Chinese" bots won't obey this, so you might consider a server-side conditional that redirects everything except Googlebot and user browsers, leaving Googlebot and users with the canonical metadata.
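A hedged sketch of that conditional in mod_rewrite (the User-Agent patterns are illustrative, not a complete bot list):

RewriteEngine On
# Redirect requests on non-canonical hostnames, except for Googlebot and ordinary browsers,
# which are left on the page so they can see the canonical tag
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Mozilla) [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]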
Sitemaps serve to include content, not to limit what Google indexes. If you want to exclude some files, use a robots.txt file as mentioned, or set up redirects.
The reason this URL is included is most likely that Google found a link pointing to it somewhere. It could be on your own site (which you can fix) or on a third-party site as an incoming link. To figure that out, you can use the link syntax link:https://www.w.example.com/services/edison/16mm-to-2k, which will tell you which page(s) link there.
Do you have a Google Webmaster Tools account? If you create a free account with them and verify that you are the actual site owner, then Google will allow you to request removal of a folder or of specific URLs.
My personal experience is that search engines take the liberty of not following instructions, but this step would at least remove your pages from their index.
Before you create an account, please change your robots.txt to disallow access to the specific areas. As soon as you verify, Google will check the robots.txt file and update itself.
www.google.com/webmasters/tools
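Also note that robots.txt is fetched per hostname, so if the unwanted URLs all live under the stray hostnames, you could serve a deny-all robots.txt on those hostnames only. A minimal deny-all file looks like this:

User-agent: *
Disallow: /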
Most likely some part of your web site generated links like that, and that is how Google started to crawl the URLs.
You should check the links in your web pages to see where these incorrect URLs are, and you should fix them.
Also, you could change your Apache configuration so that requests for any virtual host other than www.example.com or example.com 301-redirect to the correct URL at www.example.com. This way Google will eventually index the correct versions.
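A sketch of that configuration (hostnames and paths are placeholders; with name-based virtual hosts, the first block acts as the default for any unmatched hostname):

# Default catch-all: any Host header not matched below lands here and is redirected
<VirtualHost *:80>
    ServerName catchall.example.com
    Redirect permanent / http://www.example.com/
</VirtualHost>

# The real site, served only under its canonical name
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example
</VirtualHost>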