How to get tens of millions of pages indexed by Googlebot?
We are developing a site that currently has 8 million unique pages; it will grow to about 20 million almost immediately, and eventually to about 50 million or more.
Before you criticize: yes, it provides unique, useful content. We continually process raw data from public records, and through data scrubbing, entity rollups, and relationship mapping we generate quality content, resulting in a site that is quite useful and also unique, in part because of the breadth of the data.
Its PR is 0 (new domain, no links), and we're being spidered at a rate of about 500 pages per day, which puts us at about 30,000 pages indexed so far. At this rate, it would take over 400 years to index all of our data.
I have two questions:
Is the indexing rate directly correlated with PR? More specifically, is it correlated strongly enough that purchasing an old domain with good PR would get us to a workable indexing rate (in the neighborhood of 100,000 pages per day)?
Are there any SEO consultants who specialize in aiding the indexing process itself? We're otherwise doing very well with SEO, on-page SEO especially. Besides, the competition for our "long-tail" keyword phrases is fairly low, so our success hinges mostly on the number of pages indexed.
Our main competitor has achieved approximately 20 million pages indexed in just over a year, along with an Alexa ranking around 2,000.
Noteworthy qualities we have in place:
page download speed is pretty good (250-500 ms)
no errors (no 404 or 500 errors when getting spidered)
we use Google Webmaster Tools and log in daily
friendly URLs in place
I'm afraid to submit sitemaps. Some SEO community postings suggest that a new site with millions of pages and no PR looks suspicious. There is also a Google video of Matt Cutts recommending a staged on-boarding of large sites in order to avoid increased scrutiny (at approximately 2:30 in the video).
Clickable site links reach every page, no more than four clicks deep, with typically no more than 250 or so internal links per page.
Anchor text for internal links is logical and adds relevance hierarchically to the data on the detail pages.
We had previously set the crawl rate to the highest setting in Webmaster Tools (only about a page every two seconds, max). I recently turned it back to "let Google decide," which is what is advised.
There are two possible options I know of that may be of some assistance.
One: A little trick I tried with a website that had three million pages, which worked surprisingly well, was what my colleague coined a "crawl loop." You may have to adapt the idea a bit to make it fit your site.
Basically, we picked a day when we didn't expect much traffic (Christmas), copied a list of every single link on our site, and pasted every single one into a PHP file that was included on every webpage (the sidebar PHP file).
We then went to Google Search Console (formerly Google Webmaster Tools) and told Google to fetch a URL and crawl every single link on that URL's page.
Since you have so many links, and the pages they point to also have an abundance of links, Google goes into a bit of a loop and crawls the site much more quickly. I was skeptical at first, but it worked like a charm.
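For illustration only, here is a rough sketch of the kind of sidebar include being described. The pages table, its url and title columns, and the connection details are all assumptions; adjust them to your own schema, and at this scale you would pre-generate or cache the list rather than query it on every request.

<?php
// sidebar_links.php -- sketch of a "crawl loop" sidebar that prints every
// internal link on every page. Table/column names are assumptions.
$pdo  = new PDO('mysql:host=localhost;dbname=site;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->query('SELECT url, title FROM pages ORDER BY url');

echo "<ul class=\"sidebar-links\">\n";
foreach ($stmt as $row) {
    printf(
        "  <li><a href=\"%s\">%s</a></li>\n",
        htmlspecialchars($row['url'], ENT_QUOTES),
        htmlspecialchars($row['title'], ENT_QUOTES)
    );
}
echo "</ul>\n";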
Before you do this, make sure you have an extremely efficient database setup and a very powerful server; otherwise it could either overload the server or hurt your SEO due to slow page load times.
If that isn't an option for you, you can always look into Google's Cloud Console APIs. They have a Search Console API, so you could write a script either to add each webpage as its own website instance in Search Console or to have Google fetch every single one of your URLs.
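For what it's worth, the classic Search Console API covers sites, sitemaps, and search analytics but does not expose a public "fetch this URL" call; the nearest thing I know of is Google's separate Indexing API, which is officially limited to job-posting and livestream pages and carries daily quotas. A rough sketch, assuming you already hold a service-account access token in GOOGLE_ACCESS_TOKEN and a urls.txt list of pages (both assumptions):

<?php
// Sketch: push URLs to the Indexing API endpoint
// https://indexing.googleapis.com/v3/urlNotifications:publish
$accessToken = getenv('GOOGLE_ACCESS_TOKEN');   // assumed: minted via a service account
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($urls as $url) {
    $ch = curl_init('https://indexing.googleapis.com/v3/urlNotifications:publish');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => [
            'Authorization: Bearer ' . $accessToken,
            'Content-Type: application/json',
        ],
        CURLOPT_POSTFIELDS     => json_encode(['url' => $url, 'type' => 'URL_UPDATED']),
    ]);
    echo $url . ' => ' . curl_exec($ch) . "\n";
    curl_close($ch);
}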
The APIs can get complicated extremely quickly, but they are an amazing tool when used right.
Good luck!
One thing I notice with Google Webmaster Tools is that it starts off by allowing a maximum crawl rate of about two requests per second. Then, about a week or so later, if it finds that the website is frequently accessed, it will allow you to increase your limit.
I co-run a website that hosts over 500,000 original images, and at times my maximum limit is 10 requests per second because I get at least 700 to 1,000 hits a day, if not more.
So what you might want to do is check Webmaster Tools every week to see if you can increase the crawl limit. When you change the crawl limit, Google will reset it back to its preferred setting after a certain date (which the interface shows you). When that date passes, raise the limit again.
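For rough scale: two requests per second sustained for a full day is 2 × 86,400 ≈ 172,800 fetches, and ten per second is roughly 864,000, so the 100,000-pages-per-day target in the question fits within even the lower ceiling; the harder part is getting Googlebot to actually sustain that rate.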
Gaming the system is never a good idea if you're running a legitimate business that values its online reputation. Also, if your site genuinely provides value, then the longer it's around (I assume you're doing some form of marketing?) the more backlinks it will accrue, so your PR will go up and your crawl rate will go up.
Also, if you have a good link structure on your site (all of your pages are discoverable in a reasonable number of clicks/links), then you only need to submit the main indexes via sitemap. Once those pages are indexed by Google, they will be crawled by Google, and Google will index the rest of the pages on its own.
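To make the sitemap route concrete, here is a minimal sketch of splitting a large URL list into protocol-compliant files (at most 50,000 URLs each) plus a sitemap index; the urls.txt input, the file names, and the example.com host are placeholders. You could then submit only the index, or a subset of the files, to stage the roll-out the way the question describes.

<?php
// Sketch: generate sitemap-0000.xml, sitemap-0001.xml, ... plus sitemap-index.xml
$urls   = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$chunks = array_chunk($urls, 50000);            // protocol limit per sitemap file
$index  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        . "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";

foreach ($chunks as $i => $chunk) {
    $name = sprintf('sitemap-%04d.xml', $i);
    $xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          . "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
    foreach ($chunk as $url) {
        $xml .= '  <url><loc>' . htmlspecialchars($url, ENT_QUOTES) . "</loc></url>\n";
    }
    file_put_contents($name, $xml . "</urlset>\n");
    $index .= '  <sitemap><loc>https://www.example.com/' . $name . "</loc></sitemap>\n";
}

file_put_contents('sitemap-index.xml', $index . "</sitemapindex>\n");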
How to get tens of millions of pages indexed by Googlebot?
It won't happen overnight. However, I guarantee you would see more of your pages spidered sooner if inbound links to deep content (particularly sitemap pages or directory indexes that point to yet deeper content) were added from similarly large sites that have been around for a while.
Will an older domain be sufficient to get 100,000 pages indexed per day?
Doubtful, unless you're talking about an older domain that has had a significant amount of activity on it (i.e. accumulated content and inbound links) over the years.
Are there any SEO consultants who specialize in aiding the indexing process itself?
When you pose the question that way, I'm sure you'll find plenty of SEOs who loudly proclaim "yes!" But, at the end of the day, Virtuosi Media's suggestions are as good as the advice you'll get from any of them (to say nothing of the potentially bad advice).
From the sound of it, you should consider using business development and public relations channels to build your site's ranking at this point: get more links to your content (preferably by partnering with an existing site that offers regionally targeted content to link into your regionally divided content, for example); get more people browsing your site (some will have the Google toolbar installed, so their traffic may contribute to page discovery); and, if possible, get your business talked about in the news or in communities of people who have a need for it (if you plan to charge for certain services, consider advertising a free trial period to draw interest).
Some potential strategies:
Google Webmaster Tools allows you to request an increased crawl rate. Try doing that if you haven't already.
Take another look at your navigation architecture to see if you can't improve access to more of your content. Look at it from a user's perspective: if it's hard for a user to find a specific piece of information, it may be hard for search engines as well.
Make sure you don't have duplicate content because of inconsistent URL parameters or improper use of slashes. By eliminating duplicate content, you cut down on the time Googlebot spends crawling something it has already indexed (a URL-normalization sketch follows this list).
Use related-content links and in-site linking within your content whenever possible.
Randomize some of your links. A sidebar with random internal content is a great pattern to use (see the sketch after this list).
Use dates and other microformats.
Use RSS feeds wherever possible. RSS feeds will function much the same as a sitemap (in fact, Webmaster Tools allows you to submit a feed as a sitemap); a minimal feed is sketched after this list.
Regarding sitemaps, see this question.
Find ways to get external links to your content. This may accelerate the process of it getting indexed. If it's appropriate to the type of content, making it easy to share socially or through email will help with this.
Provide an API to incentivize use of your data and external links to your data. You can require an attribution link as a condition of using the data.
Embrace the community. If you reach out to the right people in the right way, you'll get external links via blogs and Twitter.
Look for ways to create a community around your data. Find a way to make it social. APIs, mashups, and social widgets all help, but so do a blog, community showcases, forums, and gaming mechanics (also, see this video).
Prioritize which content gets indexed. With that much data, not all of it is going to be absolutely vital. Make a strategic decision as to which content is most important (e.g., it will be the most popular, it has the best chance at ROI, it will be the most useful) and make sure that content is indexed first.
Do a detailed analysis of what your competitor is doing to get their content indexed. Look at their site architecture, their navigation, their external links, etc.
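On the duplicate-content point above, here is one way to normalize URLs so that parameter order, tracking parameters, and trailing slashes don't create duplicates; the stripped parameter names are assumptions.

<?php
// Sketch: 301-redirect any non-canonical request to a normalized URL.
function canonical_url(string $host, string $requestUri): string
{
    $parts = parse_url('https://' . strtolower($host) . $requestUri);
    $path  = rtrim($parts['path'] ?? '/', '/');
    $path  = ($path === '') ? '/' : $path;                 // single trailing-slash policy

    parse_str($parts['query'] ?? '', $params);
    unset($params['utm_source'], $params['utm_medium'], $params['sessionid']);
    ksort($params);                                        // stable parameter order

    $query = $params ? '?' . http_build_query($params) : '';
    return 'https://' . $parts['host'] . $path . $query;
}

$requested = 'https://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
$canonical = canonical_url($_SERVER['HTTP_HOST'], $_SERVER['REQUEST_URI']);
if ($canonical !== $requested) {
    header('Location: ' . $canonical, true, 301);          // or emit <link rel="canonical">
    exit;
}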
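For the randomized-sidebar idea, a small sketch; the table and column names are assumptions, and at tens of millions of rows you would sample from a cached subset rather than load and shuffle everything.

<?php
// Sketch: print ten random internal links so deep pages rotate into the sidebar.
$pdo  = new PDO('mysql:host=localhost;dbname=site;charset=utf8mb4', 'user', 'pass');
$rows = $pdo->query('SELECT url, title FROM pages LIMIT 5000')->fetchAll(PDO::FETCH_ASSOC);

shuffle($rows);
foreach (array_slice($rows, 0, 10) as $row) {
    printf(
        "<a href=\"%s\">%s</a><br>\n",
        htmlspecialchars($row['url'], ENT_QUOTES),
        htmlspecialchars($row['title'], ENT_QUOTES)
    );
}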
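And for the RSS suggestion, a minimal feed of recently added pages that could be submitted as a sitemap in Webmaster Tools; again, the table and column names are assumptions.

<?php
// Sketch: RSS 2.0 feed of the newest pages, suitable for submission as a sitemap.
header('Content-Type: application/rss+xml; charset=UTF-8');

$pdo  = new PDO('mysql:host=localhost;dbname=site;charset=utf8mb4', 'user', 'pass');
$rows = $pdo->query('SELECT url, title, updated_at FROM pages ORDER BY updated_at DESC LIMIT 500')
            ->fetchAll(PDO::FETCH_ASSOC);

echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
echo "<rss version=\"2.0\"><channel>\n";
echo "<title>Recently added records</title>\n";
echo "<link>https://www.example.com/</link>\n";
echo "<description>Newest pages on the site</description>\n";
foreach ($rows as $row) {
    echo "<item>\n";
    echo '  <title>' . htmlspecialchars($row['title'], ENT_QUOTES) . "</title>\n";
    echo '  <link>' . htmlspecialchars($row['url'], ENT_QUOTES) . "</link>\n";
    echo '  <pubDate>' . date(DATE_RSS, strtotime($row['updated_at'])) . "</pubDate>\n";
    echo "</item>\n";
}
echo "</channel></rss>\n";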
Finally, I should say this. SEO and indexing are only small parts of running a business site. Don't lose focus on ROI for the sake of SEO. Even if you have a lot of traffic from Google, it doesn't matter if you can't convert it. SEO is important, but it needs to be kept in perspective.
Edit:
As an addendum to your use case: you might consider offering reviews or testimonials for each person or business. Also, giving out user badges like StackOverflow does could entice at least some people to link to their own profiles on your site. That would encourage outside linking to your deep pages, which could mean getting indexed more quickly.
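In that spirit, here is a tiny sketch of a badge-embed generator (the profile URL scheme and image path are made up): the site hands each user a copy-and-paste HTML snippet that links back to their own profile page, which is exactly the kind of deep inbound link that helps crawling.

<?php
// Sketch: build the embeddable "badge" HTML a user can paste on their own site.
function badge_embed_html(int $userId, string $displayName): string
{
    $profileUrl = 'https://www.example.com/profile/' . $userId;   // assumed URL scheme
    return sprintf(
        '<a href="%s" title="%s"><img src="%s/badge.png" alt="%s" width="208" height="58"></a>',
        htmlspecialchars($profileUrl, ENT_QUOTES),
        htmlspecialchars($displayName, ENT_QUOTES),
        htmlspecialchars($profileUrl, ENT_QUOTES),
        htmlspecialchars($displayName, ENT_QUOTES)
    );
}

// Show it to the user (e.g. in a <textarea>) so they can copy it.
echo badge_embed_html(12345, 'Jane Example');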