Google not crawling my site (robots.txt error)

@Jamie184

Posted in: #Googlebot #RobotsTxt #Seo #SeoAudit

I'm currently performing SEO for my client's project. I'm kind of new to this, so please bear with me.

I've read a lot of mixed opinions on including a robots.txt file (some say it's good to have one even if you have no URLs to block; others say you shouldn't have one at all).

Also, a lot of online tools kept pointing out that my client's site did not have a robots.txt, which is why I decided to add one to the site.

However, my developers deployed the robots.txt containing these items:

User-agent: *
Disallow: /


I understand that by adding that backslash to the Disallow line, it tells Google not to crawl anything on my site.

31st Jan: Wrong robots.txt was deployed

6th Feb: I realized that I couldn't find my website in the SERPs and discovered the robots.txt error, which I told my developers to fix immediately.

14th Feb: Correct robots.txt was deployed

User-agent: *
Disallow:


9th March: To date, none of my pages (except the homepage) can be found in Google.

I just can't seem to figure out what the problem is. My best guess is that because of the disallow backslash, Google kind of "blacklisted" all my webpages. After I changed the robots.txt to the correct one, Google has yet to recrawl my site, and hence my webpages are still on their "blacklist".

What should I do now?

====================================================

Edited information:

I thought it could be because of the shift from HTTP to HTTPS, since Google Webmaster Tools treats http and https as separate sites. I've read here (https://webmasters.stackexchange.com/questions/68435/moving-from-http-to-https-google-search-console) that we need to have sitemaps for both the old and new properties in GWT.

In my GWT I only had the http property, so I added the https one recently. However, the sitemap.xml for both my http and https properties points to the same thing. Could that be a problem?
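From what I understand, the sitemap should list only the canonical https URLs, so roughly something like this (example.com is just a placeholder, not the real domain):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- each <loc> points at the https (canonical) version of the page -->
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>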


3 Comments


 

@Murphy175

FIRST
By this:

User-agent: *
Disallow: /


You are telling all crawlers not to crawl (and effectively not to index) your site, which means your entire site won't appear in search results. For example, if you have a directory called 'test' and you only want to block the pages inside it, you could do something like:

User-agent: *
Disallow: /test/


robots.txt applies to every search engine that interprets the file, which means not only Google but also Yahoo and Bing (and probably many more minor search engines).

SECOND
If you have both 'http' and 'https', be careful with duplicate content. One of your directories should be empty except for the .htaccess file that redirects to the other site (HTTP to HTTPS, or HTTPS to HTTP); a sketch is shown below.
In your domain registrar's settings, check that you don't have any redirect configured, so that the only thing controlling redirections is your .htaccess files.
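For example, a minimal redirect from HTTP to HTTPS in .htaccess could look like this (assuming Apache with mod_rewrite enabled; adapt it to your hosting setup):

# Send every HTTP request to the HTTPS version of the same URL
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]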

THIRD
Set up a Google Search Console account for that site. Once you have your site verified (which will be fast if you already have Analytics set up), you will see a number of options to check, such as:



Crawl errors.
Crawl stats.
robots.txt tester.
sitemaps.xml tester (which we didn't talk about but is VERY important as well).
Index status.
Blocked resources.
Much more.



 

@Merenda212

I think Disallow: / will prevent Google's bots from crawling (and therefore indexing) your entire domain.


Disallow: [the URL path you want to block]
Allow: [the URL path of a subdirectory, within a blocked parent directory, that you want to unblock]
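For example (these paths are just placeholders), this blocks everything under /private/ except one subdirectory:

User-agent: *
# block the whole directory...
Disallow: /private/
# ...but allow this subdirectory back in
Allow: /private/public/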


Did you try the robots.txt Tester? support.google.com/webmasters/answer/6062598



 

@Ann8826881

some say it's good to include even if you have no URLs to block


This simply prevents your logs from being polluted with a lot of unnecessary 404s, since the bots will request the file anyway. But this isn't necessarily a problem - it just depends on how your stats software reports it. (The request is logged regardless of whether the file exists - either with a "200 OK" if it exists or with a "404 Not Found" if it doesn't.)

If you specify a robots.txt file at all and you want the bots to crawl all pages, then it should either be empty or contain just the minimal:

User-agent: *
Disallow:


(Note there is no slash in the URL-path of the Disallow directive.)

You need to verify your site with Google Search Console (formerly Google Webmaster Tools), if you haven't already, and use the Crawl > "robots.txt Tester" and "Fetch as Google" tools to confirm which robots.txt Google is seeing, when it was last accessed, and whether your pages are accessible.


Check your server logs - has Googlebot visited your site?
What does a site: search return in the SERPs? (Quick examples of both checks are sketched below.)
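A rough sketch of both checks (the log path and domain are placeholders for your own):

# look for recent Googlebot requests in an Apache-style access log
grep -i "Googlebot" /var/log/apache2/access.log | tail -n 20

# and in the Google search box (not a shell command), list what is indexed:
# site:example.com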



because of the disallow backslash, Google kind of "blacklisted" all my webpages.


Google doesn't "blacklist" your pages in this way. Simply "correcting" your robots.txt file should be sufficient. Btw, this is a (forward) slash, not a backslash.

In fact, it is not uncommon for a site to be blocked with robots.txt whilst it is being developed and this block is only removed when the site goes live.

There can be many reasons why your site is not appearing in the SERPs yet. One is that your site is new and indexing takes time - you may simply not have given it enough time. And deploying a blocking robots.txt file may only have slowed things down further.

For more information:


Why isn't my website in Google search results?
How long did it take for your new website to be indexed by search engines


