How to tell Googlebot to crawl just URLs ending in .html

@Gloria169

Posted in: #CrawlErrors #GoogleSearchConsole

I have a big problem in Google Webmaster Tools. The number of 404 pages in the error report is increasing so fast that I now have over 1000. When I check the errors, I see that for every page Googlebot tries to crawl the URL without .html, which produces a 404 error each time.

I have tried to find the source of this error. Here is an example: ermagazin.com/najgora-nuklearna-katastrofa-u-americkoj-povijesti-za-koju-nikad-niste-culi It has 3 sources, all of which are correct links. One of them is ermagazin.com/najgora-nuklearna-katastrofa-u-americkoj-povijesti-za-koju-nikad-niste-culi.html which is the correct URL that Googlebot should be crawling instead of the first one without .html.

Check screenshot:



Is there something I can add in robots.txt to prevent Googlebot from crawling the URLs without .html?

My robots.txt is:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
Disallow: /readme.html

Sitemap: ermagazin.com/sitemap_index.xml

ermagazin.com/post-sitemap.xml 2016-02-11 08:57
ermagazin.com/page-sitemap.xml 2016-01-14 14:45
ermagazin.com/category-sitemap.xml 2016-02-11 08:57
ermagazin.com/post_tag-sitemap1.xml 2016-02-11 08:57
ermagazin.com/post_tag-sitemap2.xml 2016-02-11 08:57


1 Comments


 

@Eichhorn148

When Googlebot is crawling such a large number of bad URLs, it is almost always because your site is misconfigured and you are linking to the URLs incorrectly somewhere.

In your case it is the "show all articles" link. For example on this page I see the following in the HTML source code:

<a href="http://ermagazin.com/zakopao-zivu-djevojku-8-mjeseci-zbog-vjerovanja-da-ce-to-donijeti-bogatstvo-tanzanija-u-soku" class="more-articles-button">show all articles</a>

It appears that when I click on that button in a browser, I don't get to the 404 page. You must have some JavaScript that intercepts the click and causes browsers to do something else. However, Googlebot scans the HTML source code and finds that link. When it tries to follow it, it gets a 404 for the version of the URL without .html, and that happens for each and every article on your site.

You need to fix that link, and look for others like it.
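
To look for other links like it, one option is to scan the HTML source of a page for same-site links that have no file extension. Here is a rough sketch in Python using only the standard library. The function name `find_suspect_links` and the rule "article URLs always end in .html" are my assumptions, so treat it as a heuristic: it will also flag legitimate extensionless URLs (category pages, for example) if they don't end in a slash.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_suspect_links(html, site_host):
    """Return same-site links whose last path segment has no extension.

    Hypothetical helper: assumes article URLs on the site end in .html.
    """
    collector = LinkCollector()
    collector.feed(html)
    suspects = []
    for href in collector.hrefs:
        parsed = urlparse(href)
        if parsed.scheme not in ("", "http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc and parsed.netloc != site_host:
            continue  # skip links to other sites
        path = parsed.path
        if not path or path.endswith("/"):
            continue  # skip the home page, fragments and directory-style URLs
        last_segment = path.rsplit("/", 1)[-1]
        if "." not in last_segment:
            suspects.append(href)
    return suspects
```

Feed it the source of one article page and your host name, e.g. `find_suspect_links(page_source, "ermagazin.com")`, and check each link it returns against what Googlebot reports as a 404.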



Another thing that you can do is redirect requests for the URLs without .html to the correct versions. Since you are using WordPress, you might want to use a WordPress 404 plugin that allows you to monitor and redirect 404 errors. I used to use one called "True Google 404" that ran the words in not-found URLs through site search and automatically redirected to the proper page. Sadly that plugin appears not to be available anymore. I did a quick search but I didn't find any plugins that allow redirects based on patterns from WordPress 404s.
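
The redirect rule itself is simple. Here it is sketched in Python just to show the logic, not as WordPress code. The function name `html_redirect_target` and the `known_paths` set are hypothetical: in practice the set of valid article paths would come from your sitemap or database, and the server or a plugin would issue the actual 301 redirect.

```python
from urllib.parse import urlsplit, urlunsplit


def html_redirect_target(url, known_paths):
    """Return the .html URL to redirect to, or None if no redirect applies.

    known_paths is an assumed set of valid article paths, for example
    collected from the site's sitemap.
    """
    parts = urlsplit(url)
    path = parts.path
    # Nothing to do for the home page, directory-style URLs,
    # or URLs that already end in .html.
    if not path or path.endswith("/") or path.endswith(".html"):
        return None
    candidate = path + ".html"
    # Only redirect when the .html version really exists; otherwise the
    # request should keep returning a 404.
    if candidate not in known_paths:
        return None
    return urlunsplit(
        (parts.scheme, parts.netloc, candidate, parts.query, parts.fragment)
    )
```

Checking against the known paths matters: redirecting every extensionless URL blindly would just turn one 404 into a redirect followed by another 404.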
