
Google is reporting only 5% of our site's pages as indexed after one year

@Miguel251

Posted in: #Google #Indexing

Our sitemap contains roughly 750,000 paths to discrete pages, yet only around 30,000 are reported by Google as being indexed. I am trying to figure out why this is.

The content of our pages is not what I understand to be "thin": they contain a large amount of unique text, images, and links, so I'm hoping this is not a duplicate question. The URLs are broken up into several XML files of 30,000 to 50,000 URLs each, and our robots.txt points to a sitemap index that references these files.
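
For reference, the sitemap index is just a small file that lists each of those chunked XML files, and robots.txt points at it with a Sitemap: line. A minimal sketch of how such an index can be generated (the domain and file names below are placeholders, not our real ones):

    # Minimal sketch: write a sitemap index that points to several chunked
    # sitemap files (domain and file names are placeholders).
    from xml.sax.saxutils import escape

    sitemap_files = [
        "https://example.com/sitemaps/pages-1.xml",
        "https://example.com/sitemaps/pages-2.xml",
        "https://example.com/sitemaps/pages-3.xml",
    ]

    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url in sitemap_files:
        lines.append("  <sitemap><loc>%s</loc></sitemap>" % escape(url))
    lines.append("</sitemapindex>")

    with open("sitemap_index.xml", "w") as f:
        f.write("\n".join(lines))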

I understand that this is a broad question, so I have a few theories:

Theory 1: Our site is viewed as a link farm. Our overall site contains about 5 million pages, 99% of which contain links to three domains: "www.outbound1.com", "www.outbound2.com", and our site itself (example.com). Some of our pages have 500 or more links. My assumption is that Google views this negatively and declines to index our pages as a result.

Theory 2: Google only indexes pages that others have searched for. Much of our data is incredibly specific: we have pages unique to an individual, or pages showing all content related to a single topic. My assumption is that users are simply not visiting the majority of our pages and that this relates directly to our low indexing.

I'm hoping someone can confirm or dismiss my assumptions and maybe fill me in on something obvious I'm missing.





3 Comments


 

@Heady270

The number of unique pages that Google is willing to index on any particular site is tied to the reputation of that site.

When you start a brand new site, Google may only be willing to index 1,000 pages. A year in, Google is willing to index nearly 40,000 pages on your site. That is an indication that your site's reputation has grown with Google within your first year.

Even for a very high quality site, it may end up taking two or three years to get enough reputation to get Google to index 3/4 of a million pages. I don't think that there is an indication of any sort of problem other than that your site is still not very old.

There are two things that you can do:

Improve your site's reputation

Google still uses backlinks as the primary indication of reputation. If you want to work on increasing the reputation of your site, make sure you are taking advantage of every possible inbound link.

It has long been blackhat to spam links to your site. Recently, Google has even cracked down on "link building." If you do decide to do link building, make sure that you do so from topical sites, with non-keyword-rich anchor text, and in a way that doesn't look unnatural to users.

Make sure your best content gets indexed

Google may not index everything on your site, but you can make sure that it indexes your best content. Link prominently and often to your top content. Your home page likely has the most link juice of any page on your site. Use it well. Any content you feature on the home page with a link will get indexed easily. From those content pages, link to your second tier. From there, link to a third tier. Anything more than three clicks from the home page is not very likely to be indexed.
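
To make the "three clicks" rule concrete, click depth can be estimated with a breadth-first crawl of internal links. This is only an illustrative sketch (the home page URL is a placeholder, and it assumes the third-party requests library), not part of the original advice:

    # Rough sketch: estimate how many clicks each internal page sits from the
    # home page using a breadth-first crawl (URL below is a placeholder).
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    import requests  # assumes the third-party requests library is installed

    class LinkParser(HTMLParser):
        """Collect href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def click_depths(home, max_pages=500):
        site = urlparse(home).netloc
        depth = {home: 0}          # URL -> clicks from the home page
        queue = deque([home])
        while queue and len(depth) < max_pages:
            url = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                target = urljoin(url, href)
                if urlparse(target).netloc == site and target not in depth:
                    depth[target] = depth[url] + 1
                    queue.append(target)
        return depth

    if __name__ == "__main__":
        for url, d in sorted(click_depths("https://example.com/").items(), key=lambda x: x[1]):
            print(d, url)

Pages that show up deeper than three clicks in that listing are the ones least likely to be indexed.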



 

@Nimeshi995

What can prevent Google from Indexing Pages?


Low quality (low interaction, thin or duplicate)
Robots.txt
Incorrect header responses
Blocked resources


It's about the quality of articles

While an article made up of 250-500 unique words may not be treated by Google as 'thin', that doesn't necessarily mean the content is high quality. It used to be that size mattered, but SEO has shifted and is now more about quality than ever. I very much doubt you have 5 million quality articles!

What is a quality page or article?

A quality page is not merely one that is unique, but one that people expect to find and want to see; a page that receives little or no interaction will likely be poorly ranked. Sites that have an insane amount of content with little natural interaction make Google believe that the site is not important, and it will allocate less crawling time per visit.

It's about quality not quantity

Nowadays it is considered better SEO to publish less often and with higher quality content, the type of content people want to see, as interaction is key for rankings, authority, and crawl time allocation.

Crawl time allocation

If you've read the previous paragraphs, you will notice that I've mentioned crawl time allocation; this is key, and most likely the cause of your indexing issues.

You have an insane number of pages, while Google has limited resources: it will only ever crawl a site for a certain amount of time before it stops and moves on to another site. The time each site is allocated varies depending on how important Google believes your site is. You also need to factor in the 'return' rate, i.e. how often Googlebot decides to revisit your site.

What affects Google's crawl time and how often Googlebot will return to the site

Domain and site authority is a huge factor in how often Google returns to your site and how long it stays. To put this into perspective, a site like Stack Overflow will have Googlebot visiting pretty much every few minutes, maybe even more often, while a site with just as much content but low interaction and low authority will be crawled at best a few times a day.

Ensuring Googlebot can crawl as much as possible every visit


Website Speed


Not only does Google reward SEO value to sites that are fast, you can also have more pages crawled per visit. Use website speed tests from multiple locations, and ensure that your website responds quickly for both Google and your main target region. I recommend WebPageTest; aim for below 1.5 seconds on first visit as a good guide.
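
As a rough supplement to a full WebPageTest run, you can spot-check server response time from a script. This is only a sketch; the URL is a placeholder and it assumes the third-party requests library:

    # Minimal sketch (placeholder URL): measure how quickly the server starts
    # responding; stream=True stops the request once the headers have arrived.
    import time
    import requests

    url = "https://example.com/"
    start = time.monotonic()
    response = requests.get(url, timeout=10, stream=True)
    elapsed = time.monotonic() - start
    print("%s responded with %s in %.2fs" % (url, response.status_code, elapsed))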

Server Uptime Availability


If your DNS or server fails to respond for just a few seconds each day, that can mean you miss your return crawl, so it's important to monitor your website and ensure that it's performing well. Pingdom and other providers can provide this service for you.
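
If you want a simple check of your own alongside a service like Pingdom, something like the following sketch will flag missed responses (the URL and interval are assumptions, and it uses the third-party requests library):

    # Minimal sketch (placeholder URL): log whenever the site fails to respond.
    import time
    import requests

    URL = "https://example.com/"
    CHECK_INTERVAL = 60  # seconds between checks

    while True:
        try:
            status = requests.get(URL, timeout=5).status_code
            if status >= 500:
                print(time.ctime(), "server error:", status)
        except requests.RequestException as exc:
            print(time.ctime(), "no response:", exc)
        time.sleep(CHECK_INTERVAL)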

Robots.txt


Ensure that you have a good robots.txt. Most sites don't need to worry about robots.txt since they only have a low volume of URLs, but on a site with a high volume of pages and low authority, the time Googlebot spends on your site is critical, and letting it crawl things like login pages and pages marked noindex only prevents Google from spending that time on pages it can index. Use robots.txt in conjunction with noindex header responses.
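
One way to sanity-check what your robots.txt actually blocks for Googlebot is Python's built-in robotparser; the domain and paths below are placeholder examples only:

    # Minimal sketch (placeholder domain and paths): verify what robots.txt
    # allows and disallows for Googlebot, using only the standard library.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser("https://example.com/robots.txt")
    parser.read()

    for url in ["https://example.com/login", "https://example.com/some-article"]:
        allowed = parser.can_fetch("Googlebot", url)
        print(("ALLOWED " if allowed else "BLOCKED ") + url)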

Clean up 404's with 301 or 410's


Google loves to crawl and re-crawl pages that return a 404 status; for most webmasters this is not an issue, but since you have a high volume of pages, it again comes back to crawl time being critical. Ensure that your 404s either redirect (301) to pages that are on topic or return the 410 Gone status. Google will learn from this and should stop attempting to crawl those pages, leaving more crawl time for pages it hasn't yet indexed.
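
A quick way to find such pages is to run your sitemap URLs through a script and list the ones that still return 404; the input file name here is an assumption, and it uses the third-party requests library:

    # Minimal sketch (placeholder file name): report which URLs return 404 so
    # they can be 301-redirected or switched to 410 Gone.
    import requests

    with open("urls.txt") as f:  # one URL per line
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        try:
            status = requests.head(url, timeout=10, allow_redirects=False).status_code
        except requests.RequestException:
            status = None
        if status == 404:
            print("404:", url)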

Remove duplicate pages and avoid canonicals


Nowadays most SEO-savvy webmasters use canonical links to avoid duplication. For most webmasters this is awesome: it lets Google know what is duplicate and what is not. But the major problem with canonicals is that the duplicates are still pages, and those pages need to be crawled. If a page can be accessed via www, non-www, tag pages, and every other kind of URL, then you're simply losing crawl time that could be spent discovering new pages. So, something to bear in mind.
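
You can see how much of this is happening by requesting the duplicate variants of a page and checking whether they redirect or serve separate content; the URLs below are placeholders, and the sketch assumes the third-party requests library:

    # Minimal sketch (placeholder URLs): variants that return 200 instead of
    # redirecting are separate crawlable pages and eat into crawl time.
    import requests

    variants = [
        "https://example.com/some-article",
        "https://www.example.com/some-article",
        "http://example.com/some-article",
    ]

    for url in variants:
        response = requests.get(url, timeout=10, allow_redirects=False)
        location = response.headers.get("Location", "-")
        print(response.status_code, url, "->", location)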

Compile pages


If you have a high volume of pages then it's extremely likely that you have similar pages, or pages that can be merged. Google loves long pages, and so do users: if you have a page split into parts 1 through 5, merge them; if you have pages that are closely related to one another, merge them.

Crawl errors


Actively monitor your crawl errors; they use up crawl time, and you need to keep on top of them.

Tracking Google


Record when Google visits your site and how often; keep track of it and see if you can improve it. Doing the above will certainly help.
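
A simple way to track this is to count Googlebot hits per day in your server access log; the log path below is a placeholder and the script assumes the common combined log format:

    # Minimal sketch (placeholder log path): count Googlebot hits per day from
    # a combined-format access log.
    from collections import Counter

    hits = Counter()
    with open("/var/log/nginx/access.log") as log:
        for line in log:
            if "Googlebot" in line:
                # e.g. ... [21/Mar/2016:10:15:32 +0000] ... -> keep the date part
                try:
                    day = line.split("[", 1)[1].split(":", 1)[0]
                except IndexError:
                    continue
                hits[day] += 1

    for day, count in hits.items():
        print(day, count)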



 

@Smith883

I'm betting everything you said in your question is the problem, though I don't quite understand number two.

But here is the real issue. Five million pages? Is your site the authority for the content displayed on all those pages? If not, then that's your problem.

I wonder if a five million page web site puts yours into the category of "world's largest web site"?


