Does Google index portions of the page that are unique and ignore the duplicate content?
When a page has content on it that is also on other pages on the site but also some unique content, how does Google handle it? Does Google:
Index the entire page (including the duplicate content)
Index just the unique text on the page
Index none of the page (not even the unique content)
For clarification, I'm only talking about content duplicated within a website, not content copied from other sites.
I ask because I have answered several questions here assuming that Google will index unique content even when there is duplicate content near it on the same page. However, I realized I don't have any evidence that this is actually true.
This is a duplicate content scenario that is not addressed by our catch-all question on duplicate content: What is duplicate content and how can I avoid being penalized for it on my site?
Okay. I will try to explain what I know as best and as quickly as I can. Perhaps just explaining some of this will make things clear.
In the early days of Google, a term index was, in effect, a relational or leaf table that tied terms (in both forward and reverse indexes) to a document using a docID and wordID, along with other metrics. Part of the semantic tradition is to track the position of a term (word) relative to points within the document. Google, when researched, maintained only a single position metric: the offset in bytes from the start of the document (0). This did not include HTML markup of course, but in the early days it did include HTML header, footer, and sidebar content.
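To make the docID/position idea concrete, here is a minimal sketch of a positional index in Python. The structure (plain word positions instead of byte offsets, and no separate wordID table) is purely illustrative, not Google's actual schema:

```python
# Minimal sketch of a positional inverted index: each term maps to a list of
# (doc_id, position) postings. Illustrative only, not Google's real layout.
from collections import defaultdict

def build_index(docs):
    """Map each term to a list of (doc_id, word_position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index

docs = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "the lazy dog sleeps while the quick brown fox jumps",
}
index = build_index(docs)
print(index["fox"])  # [(1, 3), (2, 8)] -> same term, different positions
```

Comparing the relative positions of terms across documents is then enough to spot large stretches of repeated text, which is the pattern matching described above.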
In this way, Google would be able to look for patterns of terms in relationship with each other. This means that while a document did not have to be completely duplicated, it was fairly easy to determine that a document was a duplicate within a certain set of metric guidelines, be it a percentage, ratio, or whatever.
The problem with this method is that rearranging a document or using a spinner could easily defeat it.
Given that semantics involves more than term relationships from a single point, and with the use of ontologies that relate similar terms, plural forms, etc., duplicate content was more easily found, though detection was still not complete when taken as a relatively linear comparative model.
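As a rough illustration of how synonym-aware comparison can catch spun or rearranged text, here is a sketch; the tiny synonym table and the shingle size are assumptions for the example, not anything Google has documented:

```python
# Illustrative sketch: normalize synonyms via a tiny "ontology", then compare
# word-level shingles with Jaccard similarity to catch spun or rearranged text.
SYNONYMS = {"automobile": "car", "vehicle": "car", "cars": "car"}

def normalize(text):
    return [SYNONYMS.get(w, w) for w in text.lower().split()]

def shingles(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    a, b = shingles(normalize(a)), shingles(normalize(b))
    return len(a & b) / len(a | b) if a | b else 0.0

original = "the automobile dealership sells every vehicle at a discount"
spun     = "the car dealership sells every car at a discount"
print(round(jaccard(original, spun), 2))  # high overlap despite the word swaps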
Enter the DOM.
Using the HTML DOM model, sections of repeated content can more easily be compared to extract templated sections such as headers, footers, sidebars, etc. This is a given these days, since it has been in place for a long time with excellent results. Content is now the page content that people would recognize. These templated sections are indexed, of course (I base this on a Google flaw that evidenced the fact even in 2015), but largely ignored for search matches.
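A rough sketch of the idea of stripping templated chrome via the DOM, using BeautifulSoup purely as a stand-in (Google's actual pipeline is, of course, not public):

```python
# Rough sketch of separating templated page chrome from the main content using
# the DOM. Requires the beautifulsoup4 package; purely illustrative.
from bs4 import BeautifulSoup

html = """
<html><body>
  <header>Site name and navigation repeated on every page</header>
  <nav>Home | About | Contact</nav>
  <main><h1>Unique article title</h1><p>Unique article text.</p></main>
  <footer>Copyright notice repeated on every page</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["header", "nav", "aside", "footer"]):
    tag.decompose()          # drop the templated sections

main_content = soup.get_text(" ", strip=True)
print(main_content)          # "Unique article title Unique article text."
```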
Okay, we understand this. But what about actual content?
The HTML DOM model is still used. Each content DOM element, largely header tags, paragraphs, tables, etc., is semantically weighed using a variety of semantic algorithms, some singular and some in combination, to create a matrix which you can think of as a spreadsheet or table of sorts. This lists each term with the algorithm weights. Since semantics is not a direct comparison of terms, meaning that car, automobile, vehicle, etc., are all treated as the same, along with plural versions of these terms, any algorithm can fairly easily find content that has been spun, reorganized, etc. The key is that a matrix can cover varying sizes of content by overlapping several matrices into a matrix of matrices.
A matrix will represent content segments (as defined in semantics). For HTML, this would be a header tag and the paragraphs following it up to the next header, taken both as singular paragraphs and as a group. A content segment can also be a single sentence, but we will get into this in a bit. Using term position from the beginning of a header, the beginning of a paragraph, the beginning of a group of paragraphs between header tags, etc., the original patterns of term relationships can still be used. More importantly, within the matrices, patterns can also be seen quite easily. It does not take a rocket scientist to recognize them. The semantic scores give duplication away.
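To illustrate the "matrix of weights per segment" idea, here is a sketch that scores each segment with TF-IDF and compares segments across pages with cosine similarity. The weighting scheme here is just a stand-in; in reality it would be a variety of semantic algorithms:

```python
# Sketch: weigh each content segment (header plus its paragraphs) and compare
# segments across pages. TF-IDF and cosine similarity are stand-ins for the
# unspecified "variety of semantic algorithms" described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

page_a_segments = [
    "Choosing a family car. Safety and boot space matter most.",
    "Financing options. Leasing versus buying outright.",
]
page_b_segments = [
    "Picking a family automobile. Boot space and safety matter most.",
    "Contact our sales team for a quote.",
]

segments = page_a_segments + page_b_segments
weights = TfidfVectorizer().fit_transform(segments)   # one row per segment
similarity = cosine_similarity(weights)               # segment-by-segment matrix

# similarity[0, 2] is relatively high: those two segments are near-duplicates
print(similarity.round(2))
```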
Knowing that a content segment can be as small as a single sentence, there is something new going on. Content segments are also being looked at in new ways to recognize content that is being created by filling in variables from a programming language. This is still rather easy to discover, though as of right now, I am still figuring this out. It is still semantics based, but how it varies may only mean a more granular semantic analysis. Be that as it may, header tags, paragraphs, and sentences have been analyzed since 2015 for automated content creation that may otherwise escape other duplicate content analysis. The result of this analysis is penalizing sites as we speak.
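To show how easy fill-in-the-variables content is to spot, here is a sketch that diffs two generated sentences and measures how much of the text is shared template. The Whois-style sentences are made up for the example, and how Google actually detects this is, as said above, still speculation:

```python
# Sketch of spotting "fill in the variables" content: diff two generated
# sentences and see that only the variable slots differ.
from difflib import SequenceMatcher

a = "The domain example.com was registered on 2014-03-02 and expires on 2024-03-02."
b = "The domain example.org was registered on 2015-07-19 and expires on 2025-07-19."

matcher = SequenceMatcher(None, a, b)
shared = sum(block.size for block in matcher.get_matching_blocks())
print(f"{shared / max(len(a), len(b)):.0%} of the characters are shared template")
```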
Okay. Back to which duplication is affected.
The first thing to remember is that once Google fetches a page, the entire HTML code is stored for reference. This is used to build the cache of a page, but is really used to allow Google to go back and reapply new or updated analysis to content without re-fetching the page.
Obviously, HTML templated content is completely ignored when a search query is made though there are some extremely minor exceptions that seem to have escaped Google until recently. You will find that it is extremely rare that Google will match a search query to a header, footer, sidebar, etc. Good.
Google has stated that replicated portions of content are indexed and weighted normally assuming that spam is not an issue. This is because for most sites, it is nearly impossible to not replicate portions of one page on another for a site of a certain size or greater. As well, this would cover quoted sections of content as a citation. Still good.
Google, as stated, is looking at smaller content segments for variable-based content creation. This is where it gets tricky, and not all of this is figured out yet. If you look at some automated sites, some are being hit while others are not. Clearly, these sites are programmatically generated and extremely similar, but what is the difference? Looking at Whois sites as an example, it is still fuzzy. I believe that other factors we all know come into play, such as the velocity of page creation, link velocity, site and page authority as defined by linking patterns, social engagement, etc., and they continue to play a role, but in a different way. A site with a good reputation and solid metrics will be forgiven if content is driven by filling in variables, while others with poor metrics will more strongly be seen as spam. This means that the bar for content quality and value is measured more by users than by the content itself, thus raising the bar of acceptability. One savior from this effect is unique content. Is the site adding value that is significant over others? How this is measured is still unclear; however, it seems that for now, uniqueness of a portion of content within a field of comparable sites is a metric, though likely weighted less than the others listed above.
Clear as mud?? Did I do a good job here?
I assume Google decides about indexing by measuring the duplication (or similarity) rate of a given page on a per-URL basis, and indexes all pages containing less than 100% (or 90%, or X%; only Google knows the exact number) duplicate content (if nothing, like noindex, prevents it).
Finding duplicated content isn't a trivial task and is error-prone because of page chrome. That is why I think Google indexes pretty much all pages and kicks out only unquestionably duplicated pages.
An interesting thing is that pages containing some internal duplicated content (again: less than 100%) can cannibalize the rankings of their internal competitors.
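Here is a sketch of that threshold idea; the cutoff and the word-overlap measure are both guesses, since, again, only Google knows the exact number:

```python
# Sketch of a duplication-rate threshold: measure how much of a page's text
# also appears elsewhere on the site and only index pages below some cutoff.
def duplication_rate(page_text, site_text):
    """Share of this page's distinct words that also appear on other pages."""
    page, site = set(page_text.split()), set(site_text.split())
    return len(page & site) / len(page) if page else 0.0

THRESHOLD = 0.90   # guessed cutoff; the real value is unknown

def should_index(page_text, other_pages_text):
    return duplication_rate(page_text, other_pages_text) < THRESHOLD

print(should_index("shared intro plus a unique product description",
                   "shared intro plus different text on every other page"))
```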
Just about every website you visit will have at least a certain percentage of duplicate content. A perfect example of this is a logo specific to the website that appears on all content pages to indicate that the content is part of the website itself. This kind of thing is something Google will index in its entirety (provided, of course, the rest of the content is original and unique to the site and not copied verbatim from another site). Heck, if Google did not index this, then thousands of legitimate online companies would make mass complaints about their sites not being indexed.
If, on the other hand, you have pages that mostly consist of duplicate content, where the difference between each page is less than a few words of text, then Google could very well treat that as duplicate content and decide which one of the duplicate pages to index, if any.
What I would suggest is to try to keep the duplication level between two pages below 60% (ideally), or at least under 80% at a bare minimum.
Using tools such as the Similar Page Checker at www.webconfs.com/similar-page-checker.php can give you an idea of how similar two pages are. Never aim for 100% with this tool.
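If you want a rough idea of what such a checker does, here is a sketch that reports a similarity percentage from word-frequency overlap; the tool's exact algorithm isn't published, so this is only an approximation:

```python
# Rough stand-in for a similar-page checker: report how similar two pages'
# visible text is as a percentage, using word-frequency overlap.
from collections import Counter

def similarity_percent(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    shared = sum((a & b).values())            # words in common, with counts
    total = max(sum(a.values()), sum(b.values()))
    return 100 * shared / total if total else 0.0

page_1 = "blue widget product page with shared site description and unique specs"
page_2 = "red widget product page with shared site description and unique photos"

print(f"{similarity_percent(page_1, page_2):.0f}% similar")  # aim well below 100%
```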