BetL925

What are the most common "normal" HTML errors?

@BetL925

Posted in: #Html #Validation

I'm trying to find real-world HTML pages that are actually valid (in the sense that they are displayed by modern browsers), but are not compliant with the HTML specification and are considered invalid.
Any thoughts, links etc. are very welcome.

P. S. I've tried using validator.w3.org/, and according to it many pages I've tried (randomly) contain tens to hundreds of errors. Are there really so many broken pages on the Web, or is this validator unrealistically strict?

P. P. S. The reason for this question is that I'm writing a simple HTML parser, and I want to test it on extreme real-world pages to ensure it withstands the most common errors.


1 Comments


@Angie530

[There's an actual suggestion at the end of this, but there are also a bunch of problems with your entire question that I think have to be worked through, so bear with me.]

There's no such thing as "unrealistically strict" here. The validator is not intended to be a liberal parser; it's a validator (obviously), and "valid" has a fixed technical definition. It tests against the rules set out by the specs, and pages that don't follow those rules are invalid, end of discussion. Strictness is the entire point. It's also why, at least until recently with HTML5, we specified a DTD: the rules change sometimes, so it was necessary to point out which set your page was following. See my answer to a previous question for some notes on that.

So, yes. There really are that many messed up pages on the web, as far as validation is concerned.

Whether those pages display is a separate issue altogether; you're mixing concepts in your question (which is why your first paragraph doesn't actually make sense). They generally do display, because browsers tend to have extremely liberal parsers that bend over backwards to make do with anything thrown at them. That's because it's generally not considered fair to penalize the user for bad code on the site builder's part.
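To make the "liberal parser" point concrete, here is a minimal sketch using Python's standard-library `html.parser`. It is far simpler than a browser's parser and is not a validator; it just reports tags and text as it meets them. The `EventCollector` class, the `collect_events` helper, and the broken sample string are illustrative names made up for this example.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Records start tags, end tags and text without validating anything."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, whatever case the page used.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


def collect_events(markup):
    """Feed markup to the lenient parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# Hypothetical broken input: mismatched case, an unclosed <b>, a stray </i>.
broken = "<p>Hello <b>world</P></i>"
events = collect_events(broken)
for event in events:
    print(event)
```

Nothing is raised for the unclosed `<b>` or the stray `</i>`; the parser simply passes along what it sees, and it is up to the code on top of it to decide what to do with the mess.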

Beyond a few pretty trivial ones like case differences between matching tags (<p>...</P>, which is an error in XHTML, although HTML itself treats tag names case-insensitively), I don't personally think a list of "common" errors is going to be very helpful, or that one even really exists. That list is going to be different depending upon who you're looking at, and if you're planning on building a simple parser, sorry, but it's doomed if you plan on pointing it at an arbitrary set of "extreme real-world pages." HTML coders, at varying skill levels, don't, for example, make the same errors as content producers. The writers for a site I manage occasionally have to insert bits of markup, and they make mistakes that I can't figure out how they even managed to type.

If your parser really has to be simple and custom, you would do better to tailor what it handles to what's common in the data you actually plan on feeding it. If you do have to parse arbitrary content from around the web, you should probably just use one of the existing parsers. Either way, though, the best source of errors to look for would probably be one of those parsers' test suites: examine it and crib from it. They've obviously already done the research. (But at that point, I'd again lean toward just going ahead and using that parser rather than writing your own.)
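If you go the "tailor it to your own data" route, one rough way to find out what is common in that data is to tally structural oddities across a sample of it. The sketch below again leans on Python's standard-library `html.parser`; the `ProblemTally` class, the `tally_problems` helper, the category labels, and the sample strings are all hypothetical. It is a heuristic, not a validator: for instance, it also counts end tags that HTML allows you to omit (such as `</li>`).

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never take an end tag in HTML (void elements).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ProblemTally(HTMLParser):
    """Tracks open tags on a stack and counts structural oddities."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            self.problems["end tag on void element"] += 1
            return
        if tag not in self.open_tags:
            self.problems["stray end tag"] += 1
            return
        # Anything opened after the matching start tag was closed implicitly.
        while self.open_tags:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.problems["implicitly closed <%s>" % top] += 1


def tally_problems(documents):
    """Return a Counter of oddities summed over all documents."""
    total = Counter()
    for markup in documents:
        parser = ProblemTally()
        parser.feed(markup)
        parser.close()
        for tag in parser.open_tags:
            parser.problems["never closed <%s>" % tag] += 1
        total.update(parser.problems)
    return total


# Hypothetical samples; in practice, read these from your own pages.
samples = [
    "<p>Hello <b>world</p>",
    "<div>text</span></div>",
    "<ul><li>one<li>two</ul>",
]
report = tally_problems(samples)
for label, count in report.most_common():
    print(count, label)
```

Run over a few hundred of your real documents, a tally like this shows which cases your parser needs to survive first and which are rare enough to ignore.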
