
5 Comments


@BetL925

In December 2015, Google had only indexed the first nine chapters of Pride and Prejudice on Project Gutenberg's website. Chapter 10 begins right around the 100 KB mark (a tenth of a megabyte) into the file.

[Screenshot: Google search result for a phrase from Chapter 9]

[Screenshot: Google search result for a phrase from Chapter 10]

However, repeating this search in March 2017 did show the result from Chapter 10.

If you have a very large amount of text in a document, Google won't index all of it. In fact, it appears that Google only indexed approximately the first 100 KB in 2015, although this limit has since been raised, at least for some sites.

I believe that Googlebot is willing to download more data than that; it just may not index the text in the document past a certain point.

It is also not clear from this experiment whether Google counts markup that is not visible to the user towards that 100 KB. My guess is that it does not: many pages have far more than 100 KB of markup, with text near the bottom that Google would still want to index.
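For anyone who wants to repeat this kind of check, here is a minimal Python sketch that reports how far into a page's raw HTML a given phrase first appears. The URL and phrase are placeholders standing in for the Gutenberg page and a late-chapter phrase, not the exact ones used in the test above.

```python
# Sketch: estimate how many bytes into a page a phrase first appears,
# to see whether it falls past a suspected indexing cutoff.
# The URL and phrase below are placeholders, not the exact ones from the test.
import urllib.request

URL = "https://www.gutenberg.org/files/1342/1342-h/1342-h.htm"  # assumed Pride and Prejudice HTML
PHRASE = b"Chapter 10"  # any distinctive phrase from late in the document

html = urllib.request.urlopen(URL).read()
offset = html.find(PHRASE)

if offset == -1:
    print("Phrase not found in the downloaded page.")
else:
    print(f"Page size: {len(html) / 1024:.0f} KB")
    print(f"Phrase first appears {offset / 1024:.0f} KB into the raw HTML.")
    print("Past ~100 KB?", "yes" if offset > 100_000 else "no")
```

Searching Google for a phrase that sits past the reported offset, and seeing whether it returns the page, is the same test described above.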



 

@Eichhorn148

As suggested by Mr. Lavalamp, I went through Google's documentation and found the same information, even after checking several other sites. But as for Googlebot, I think it is far smarter than a simple size cutoff:

Googlebot will not reject a page just because it is too large; rather, it will first crawl the title, URL, images, headings, subheadings, and anchor text throughout the page.

That subset will probably be quite small, and Google can find most of the page's metadata in it. After that, Googlebot can decide whether it needs to crawl more of the document, but most of your page's relevant content will already have been indexed.
So you should take care to break a long document into sections using headings and subheadings, as in the sketch below.
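As a rough illustration of that advice, here is a minimal Python sketch that assembles a long text as a series of headed sections so that heading structure appears early and throughout the page. The function name, section titles, and text are purely illustrative.

```python
# Sketch: wrap long text into sections with headings/subheadings so that
# structural elements (h1/h2) appear throughout the page.
from html import escape

def build_sectioned_page(title, sections):
    """sections: list of (heading, [paragraphs]) tuples -- illustrative helper."""
    parts = [f"<!DOCTYPE html><html><head><title>{escape(title)}</title></head><body>",
             f"<h1>{escape(title)}</h1>"]
    for heading, paragraphs in sections:
        parts.append(f"<h2>{escape(heading)}</h2>")
        parts.extend(f"<p>{escape(p)}</p>" for p in paragraphs)
    parts.append("</body></html>")
    return "\n".join(parts)

page = build_sectioned_page(
    "Long Document",
    [("Section 1", ["First chunk of the long text..."]),
     ("Section 2", ["Second chunk of the long text..."])],
)
print(page[:200])
```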



 

@Alves908

I don't have an answer, and it isn't something that shows up on Google.

However, a suggestion:

Create HTML files of varying sizes, ranging from 50 KB up to 10 MB, and link to all of them. Trigger a crawl of the site and see whether any Webmaster Tools errors come up. That will tell you the upper limit on file size before Google starts reporting errors. For the files below that limit, after some time, you'll also see how much of each file gets indexed.
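A minimal Python sketch of that setup is below; the directory name, file sizes, and filler text are illustrative. The numbered marker phrases make it easy to search later and see how deep into each file Google actually indexed.

```python
# Sketch: generate HTML test files of increasing size (50 KB .. 10 MB)
# plus an index page linking to them, so a crawl can be triggered and
# indexing depth checked per file. Names and filler text are illustrative.
import os

SIZES_KB = [50, 100, 250, 500, 1024, 2560, 5120, 10240]  # 50 KB up to 10 MB

os.makedirs("sizetest", exist_ok=True)
links = []
for size_kb in SIZES_KB:
    name = f"test-{size_kb}kb.html"
    with open(os.path.join("sizetest", name), "w") as f:
        f.write(f"<!DOCTYPE html><html><head><title>{name}</title></head><body>\n")
        written, i = 0, 0
        while written < size_kb * 1024:
            # Numbered markers: searching for a late marker later shows
            # how deep into the file Google indexed.
            line = f"<p>marker-{size_kb}kb-{i} lorem ipsum dolor sit amet.</p>\n"
            f.write(line)
            written += len(line)
            i += 1
        f.write("</body></html>\n")
    links.append(f'<li><a href="{name}">{name}</a></li>')

# Index page linking to every test file so a single crawl reaches them all.
with open(os.path.join("sizetest", "index.html"), "w") as f:
    f.write("<ul>\n" + "\n".join(links) + "\n</ul>\n")
```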



 

@Holmes151

Short Answer:
Google will index up to 2.5MB of an HTML file.

Long Answer:

According to Google's Documentation:

All files larger than 30MB will be completely ignored.

They will index up to 2.5MB of an HTML file.

Non-HTML files will be converted to HTML. If the files exceed 4,000,000 bytes they will be completely ignored. Otherwise, the first 2MB will be cached.
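As a quick sanity check against those figures, the sketch below compares a page's raw HTML size with the 2.5 MB limit quoted above. The URL is a placeholder, and the limit is taken from this answer rather than verified independently.

```python
# Sketch: compare a page's raw HTML size against the 2.5 MB figure quoted above.
# The URL is a placeholder; the limit comes from the answer, not from testing.
import urllib.request

HTML_INDEX_LIMIT = int(2.5 * 1024 * 1024)  # 2.5 MB, per the answer above

url = "https://example.com/very-long-page.html"  # placeholder
size = len(urllib.request.urlopen(url).read())

print(f"Raw HTML size: {size / (1024 * 1024):.2f} MB")
if size > HTML_INDEX_LIMIT:
    print("Content beyond ~2.5 MB may not be indexed, per the limit quoted above.")
else:
    print("Within the quoted 2.5 MB HTML indexing limit.")
```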



 

@Si4351233

The only post I found seems to have data from 2008, so I wouldn't trust it.

You'll have to test this yourself. Go to Webmaster Tools and click Health > Fetch as Google.
You can trust that what it fetches is exactly what the crawler fetches.
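As a rough local approximation before reaching for Fetch as Google, you can request the page with a Googlebot user-agent string and look at what the server returns. This only shows what is served to that user-agent, not what gets indexed, and the URL below is a placeholder.

```python
# Sketch: fetch a page with a Googlebot user-agent and report what comes back.
# This only approximates the Webmaster Tools fetch; it proves nothing about indexing.
import urllib.request

URL = "https://example.com/very-long-page.html"  # placeholder
UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

req = urllib.request.Request(URL, headers={"User-Agent": UA})
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    print("Status:", resp.status)
    print("Content-Type:", resp.headers.get("Content-Type"))
    print(f"Bytes fetched: {len(body)}")
```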


