Odd/faulty page links being served to Googlebot

@Samaraweera270

Posted in: #Googlebot #Indexing #Tumblr

Yahoo recently bought Tumblr, where I write fairly long form scientific pieces. I noticed recently that a new script had been added to the end of pages, and it seems to be interacting badly with the Google search bots.

A hit came in today at /post/:id/:summary, which is clearly just some sort of site scheme (a URL template) rather than a real page. It came from a post titled "What is a gene, post-ENCODE?", and when I searched for that title specifically against my site I also pulled up a second, seemingly identical hit at

/page/:page


Searching for site:<siteURL>/:page comes up blank, presumably due to the punctuation, but site:<siteURL>/post/:id/:summary confirms that it is just this one article page that is affected.

I went to use the 'remove URL' page in Google Webmaster Tools, but it says the page is still live; that is, it is seeing either my article or my entire homepage (when you go to /post/:id/:summary you actually just get the root URL, and I don't want to remove that from Google!), and it advises getting the content removed before requesting it be cleared from their cache.
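
As a quick diagnostic, you can check what the server actually returns for the template path. This is only a sketch (it assumes Node 18+ for the built-in fetch, with <siteURL> as a placeholder for the real domain): if it logs a 200 with homepage markup, or a redirect to the root URL, that would explain why the removal tool still sees the page as live.

// Diagnostic sketch: see what the server does with the literal template path.
// 'redirect: manual' stops fetch from silently following a redirect to the root URL.
fetch('https://<siteURL>/post/:id/:summary', { redirect: 'manual' })
    .then(res => {
        console.log('status:', res.status);
        console.log('redirects to:', res.headers.get('location') || '(none)');
    });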

I can't see where this site scheme is being generated (at least /sitemap1.xml doesn't list it like that), but searching for "/post/:id/:summary" brought up code similar to what Yahoo are adding to sites:

(function() {
    // Inject Yahoo's "Rapid" analytics script asynchronously, matching
    // the protocol of the current page.
    var s = document.createElement('script');
    var el = document.getElementsByTagName('script')[0];
    s.src = ('https:' == document.location.protocol ? 'https://s' : 'http://l') + '.yimg.com/ss/rapid-3.14.js';
    s.onload = function() {
        var YAHOO = window.YAHOO;
        if (YAHOO) {
            YAHOO.i13n.beacon_server = 'nol.yahoo.com';
            // Note the literal string '/post/:id/:summary' below: this is
            // the route template that Googlebot appears to be picking up
            // as if it were a real URL.
            var keys = { pd:'/post/:id/:summary', _li:0, i_rad:0, i_strm:0, b_id:66209497 };
            var conf = {
                spaceid: 1197719230,
                client_only: 1,
                yql_enabled: false,
                keys: keys
            };
            // Start the tracker once the script has loaded.
            YAHOO.rapid = new YAHOO.i13n.Rapid(conf);
        }
    };
    el.parentNode.insertBefore(s, el);
})();


It's pretty hard to see what this script is doing, and all I'm wondering is:

1. Have I done something wrong in my page JavaScript that is interfering with these variables in some way?
2. If so, how might I fix it?
3. Should I contact some part of Google or Yahoo? If so, where would I even start?


I was going to put this on Stack Overflow, thinking it was a JavaScript issue, but assuming 1. isn't the case I'll leave it here unless anyone suggests otherwise. Likewise, point me elsewhere if this isn't suitable for a "pro webmasters" forum (it's a personal/professional-development blog, albeit on a custom domain).


1 Comment


@Megan663

Googlebot uses heuristics to parse JavaScript for things that look like URLs. It then follows those URLs. Even when they return 404, it reports them in Webmaster Tools.

Google knows that this will sometimes result in downloading things that were never meant to be URLs. Google doesn't view this as a big problem: they find enough content this way that they wouldn't otherwise be able to reach that it is worth it to them, because it lets them index the web more deeply.
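
As a toy illustration (this is not Google's actual heuristic, just a sketch of the general idea), even a crude scan of string literals in the snippet above surfaces the pd value as a candidate URL path:

// Toy sketch of URL discovery in JavaScript source: pull out string
// literals that look like absolute paths and treat them as crawlable URLs.
const source = "var keys = { pd:'/post/:id/:summary', _li:0 };";
const candidates = [...source.matchAll(/'(\/[^']+)'/g)].map(m => m[1]);
console.log(candidates); // [ '/post/:id/:summary' ]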

Here is what Google's John Mueller has to say about these 404 errors (especially the third point):



1. 404 errors on invalid URLs do not harm your site's indexing or ranking in any way. It doesn't matter if there are 100 or 10 million, they won't harm your site's ranking. googlewebmastercentral.blogspot.ch/2011/05/do-404s-hurt-my-site.html
2. In some cases, crawl errors may come from a legitimate structural issue within your website or CMS. How do you tell? Double-check the origin of the crawl error. If there's a broken link on your site, in your page's static HTML, then that's always worth fixing. (thanks +Martino Mosna)
3. What about the funky URLs that are "clearly broken"? When our algorithms like your site, they may try to find more great content on it, for example by trying to discover new URLs in JavaScript. If we try those "URLs" and find a 404, that's great and expected. We just don't want to miss anything important (insert overly-attached Googlebot meme here). support.google.com/webmasters/bin/answer.py?answer=1154698
4. You don't need to fix crawl errors in Webmaster Tools. The "mark as fixed" feature is only to help you, if you want to keep track of your progress there; it does not change anything in our web-search pipeline, so feel free to ignore it if you don't need it. support.google.com/webmasters/bin/answer.py?answer=2467403
5. We list crawl errors in Webmaster Tools by priority, which is based on several factors. If the first page of crawl errors is clearly irrelevant, you probably won't find important crawl errors on further pages. googlewebmastercentral.blogspot.ch/2012/03/crawl-errors-next-generation.html
6. There's no need to "fix" crawl errors on your website. Finding 404's is normal and expected of a healthy, well-configured website. If you have an equivalent new URL, then redirecting to it is a good practice. Otherwise, you should not create fake content, you should not redirect to your homepage, you shouldn't robots.txt disallow those URLs -- all of these things make it harder for us to recognize your site's structure and process it properly. We call these "soft 404" errors. support.google.com/webmasters/bin/answer.py?answer=181708
7. Obviously - if these crawl errors are showing up for URLs that you care about, perhaps URLs in your Sitemap file, then that's something you should take action on immediately. If Googlebot can't crawl your important URLs, then they may get dropped from our search results, and users might not be able to access them either.
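
For context on the "soft 404" point above: a Tumblr user can't change what the server returns, but for a site you do control, the difference looks roughly like this (a minimal Express sketch, purely illustrative):

const express = require('express');
const app = express();

// Returning a genuine 404 status tells Googlebot the URL really doesn't exist.
app.get('/post/:id/:summary', (req, res) => {
    res.status(404).send('Not found');
});

// A "soft 404" would be serving the homepage (or redirecting to it) with a
// 200 status for a bogus URL -- effectively what the asker's site does --
// which the quoted advice above warns against.

app.listen(3000);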
