What are the most common "normal" HTML errors?

I'm trying to find real-world HTML pages that are actually valid (in the sense that they are displayed by modern browsers), but are not compliant with HTML specification and are considered invalid.
Any thoughts, links etc. are very welcome.

P. S. I've tried using validator.w3.org/, and according to it many pages I've tried (randomly) contain tens to hundreds of errors. Are there really so many broken pages on the Web, or is this validator unrealistically strict?

P. P. S. The reason for this question is that I'm writing a simple HTML parser, and I want to test it on extreme real-world pages to ensure it withstands the most common errors.


@BetL925


@Angie530

[There's an actual suggestion at the end of this, but there are also a bunch of problems with your entire question that I think have to be worked through, so bear with me.]

There's no such thing as "unrealistically strict" here. The validator is not intended to be a liberal parser; it's a validator (obviously), and "valid" has a fixed technical definition. It tests against the rules set out by the specs, and pages that don't follow those rules are invalid, end of discussion. Strictness is the entire point. It's also why, at least until recently with HTML5, we specified a DTD: the rules change sometimes, so it was necessary to point out which set your page was following. See my answer to a previous question for some notes on that.

So, yes. There really are that many messed up pages on the web, as far as validation is concerned.

Whether those pages display is a separate issue altogether; you're mixing concepts in your question (which is why your first paragraph doesn't actually make sense). They generally do display, because browsers tend to have extremely liberal parsers that bend over backwards to make do with anything thrown at them. That's because it's generally not considered fair to penalize the user for bad code on the site builder's part.
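To make the "liberal parser" point concrete, here is a minimal sketch using Python's standard-library `html.parser`. It is only a tokenizer, not a browser-grade parser, and the `EventCollector` class, the `tag_events` helper, and the `BROKEN` sample are made up for illustration. The sample has mixed-case tags, misnested inline elements, and nothing closed properly, yet it is processed without any exception.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records start/end tag events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag names already converted to lower case.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to the tokenizer and return the recorded tag events."""
    collector = EventCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


# Invalid by the validator's standards: <i> is never closed, </b> is
# misnested, and neither <p> has an end tag.
BROKEN = "<P>one <b>two <i>three</b> four<p>five"

for event in tag_events(BROKEN):
    print(event)
```

The tokenizer simply reports what it sees and leaves recovery to the caller. A browser goes much further and builds a complete document tree from the same input.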

Beyond a few pretty trivial ones like case differences between matching tags (<p>...</P>, which strictly speaking is only an error under XHTML, since HTML tag names are case-insensitive), I don't personally think a list of "common" errors is going to be very helpful, or even really exists. That list is going to be different depending upon who you're looking at, and if you're planning on building a simple parser, sorry, but it's doomed if you plan on pointing it at an arbitrary set of "extreme real-world pages." HTML coders, at varying skill levels, don't, for example, make the same errors as content producers. The writers for a site I manage occasionally have to insert bits of markup and make mistakes that I can't figure out how they even managed to type.

If your parser really has to be simple and custom, you would do better to tailor what it handles to what's common in the data you actually plan on feeding it. If you do have to parse arbitrary content from around the web, you should probably just use one of the existing parsers. Either way, though, the best source of errors to look for would probably be one of those parsers' test suites: examine it and crib from it. They've obviously already done the research. (But at that point, I'd again lean toward just going ahead and using that parser rather than writing your own.)
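One way to find out what is common in your own data is to tally nesting problems across a sample of it. The following is a rough sketch, again assuming Python's standard-library `html.parser`. The `NestingSurvey` class, the `survey` function, and the four problem categories are illustrative choices, not anything the validator defines. Note that some of what it counts is perfectly legal HTML (an omitted `</p>`, for example), so treat the output as a survey of patterns your parser must cope with, not as a list of spec violations.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class NestingSurvey(HTMLParser):
    """Hypothetical helper: counts nesting problems in one document."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of currently open elements
        self.problems = Counter()  # problem category -> occurrences

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # In HTML a trailing slash does not close an element,
        # so <div/> is treated like a plain <div>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            self.problems["end tag for void element"] += 1
        elif tag not in self.open_tags:
            self.problems["stray end tag"] += 1
        else:
            # Pop back to the matching element; everything skipped
            # on the way was closed by somebody else's end tag.
            while self.open_tags:
                if self.open_tags.pop() == tag:
                    break
                self.problems["closed by a different end tag"] += 1


def survey(pages):
    """Return problem counts summed over an iterable of markup strings."""
    totals = Counter()
    for markup in pages:
        parser = NestingSurvey()
        parser.feed(markup)
        parser.close()
        parser.problems["never closed"] += len(parser.open_tags)
        totals.update(parser.problems)
    return totals


SAMPLE_PAGES = ["<p>one<p>two</p>", "<b><i>x</b></i>", "<br></br><img/>"]
print(survey(SAMPLE_PAGES))
```

Pointing `survey` at pages saved from the sites you actually intend to parse gives a ranked list of the cases worth handling first.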
