Converting PDFs into HTML with pdf2htmlEX: is the output SEO friendly?
My team and I are implementing the pdf2htmlEX conversion to convert and display over 200k pdf documents (available in our database) on our website.
The HTML produced by pdf2htmlEX will be placed "in page" and will be crawlable by search engines for up to 3 pages of each document. For context, we currently display a collection of PNGs instead of the PDFs themselves.
The pdf2htmlEX library works great in terms of UX, but the HTML it produces is full of <span> and <div> tags and might be difficult for Google to understand.
Like this:
21. The model of perfect competition is more useful for analy <span class="_ _0"> <span>zing situations in which firms <span class="_ _1"></span> </div><div class="t m0 x5 h2 y35 ff2 fs1 fc0 sc0 ls1 ws0">a. engage in price wars in order to secure a position in the market </div>
My questions are:
Will this cause problems with our ranking in Google? Is it, in your opinion, better than having a list of PNGs?
Or will this "dirty" HTML look like a bad SEO technique and put us at risk of being penalized?
3 Comments
If you are worried about the divs and spans, they can be removed. I would recommend passing the HTML through a tool like Pandoc.
Pandoc is a command-line file converter. Once pdf2htmlEX has produced the HTML, you can use Pandoc to convert that HTML into Markdown and then back into HTML. This should remove most of the unnecessary tags and clean up the markup dramatically.
If you're using bash, then this line should do it.
cat example.html | pandoc --from=html --to=markdown | pandoc --from=markdown --to=html
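With around 200k documents, the same two-stage pipeline could also be driven from a script instead of typed by hand. Here is a minimal Python sketch, assuming pandoc is installed and on the PATH. The helper names `pandoc_stage` and `clean_html` are made up for illustration and are not part of Pandoc.

```python
import shutil
import subprocess


def pandoc_stage(src: str, dst: str) -> list[str]:
    """Build the argv for one pandoc conversion stage (same flags as the shell line)."""
    return ["pandoc", f"--from={src}", f"--to={dst}"]


def clean_html(html: str) -> str:
    """Round-trip HTML -> Markdown -> HTML through pandoc (hypothetical helper)."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    # Stage 1: HTML to Markdown, which drops most presentational <div>/<span> wrappers.
    markdown = subprocess.run(
        pandoc_stage("html", "markdown"),
        input=html, capture_output=True, encoding="utf-8", check=True,
    ).stdout
    # Stage 2: Markdown back to HTML.
    return subprocess.run(
        pandoc_stage("markdown", "html"),
        input=markdown, capture_output=True, encoding="utf-8", check=True,
    ).stdout
```

Calling `clean_html(open("example.html", encoding="utf-8").read())` should be equivalent to the shell pipeline above. It is worth spot-checking a few converted documents by eye, since the round trip also discards the layout that pdf2htmlEX worked to preserve.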
As far as SEO goes, I am not sure it will matter much. What matters more is semantic markup and quality of content. The quality will be as good as the PDF, and I doubt pdf2htmlEX will give you very semantic markup anyway. What matters most is that your text is machine readable and, regardless of the span and div tags, it should be.
I'm not going to say "go for it" since I have not used that specific library -- so I will let you infer instead :)
We use DOMpdf for a similar reason (it renders a simple product view) and Google indexes it fine. We don't opt for the forced download option; instead the PDF is displayed in the browser's built-in viewer, and users can choose to save it. Like pdf2htmlEX, the markup is very "messy" and cryptic, but Google does not seem to have a problem with it.
Actually, we recently added nofollows and robots.txt denies to the PDF generation area because bots were hitting it too much, and Google had started to rank a few of the PDF views higher than the actual product itself. True, there is a tiny icon in the SERP hinting that it's a PDF, and searchers may often find the real product below. But the issue is what happens when a human drops in: they see the PDF in the browser, and there is no obvious navigation or "back to product" button unless you bake one into the generated PDF.
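A robots.txt deny of this kind can be sanity-checked before it goes live. The sketch below uses Python's standard `urllib.robotparser`; the `/pdf/` prefix and the sample paths are hypothetical placeholders, not the actual rules described above, and the nofollow part is not covered.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all crawlers out of the PDF generation area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdf/
"""


def is_crawlable(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if the given rules allow `agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/pdf/product-123", "/products/123"):
        print(path, is_crawlable(ROBOTS_TXT, path))
```

This only tests the rules as the standard-library parser reads them. Real crawlers may interpret edge cases differently, so it is a quick check rather than a guarantee.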
SEO-wise, any text is better than nothing (the PNGs). The pdf2htmlEX output does look horrible to humans, but to bots (such as the Google crawler) it's just a heavily marked-up page, and most of the time the bots ignore the markup (except for things like text color, visibility and font size, which affect readability).
The bigger problem, however, is not the sheer number of tags but the way important keywords often end up being broken across tags.
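One rough way to see how often this happens is to flatten the markup to plain text and check whether a keyword is still intact. This is a sketch using Python's standard `html.parser`, not a model of how Google actually tokenizes pages. The sample fragments are simplified from the excerpt in the question, and the helper names are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes, ignoring every tag and attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the text content with runs of whitespace collapsed to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def keyword_survives(html: str, keyword: str) -> bool:
    """True if `keyword` appears unbroken in the flattened text."""
    return keyword.lower() in visible_text(html).lower()


# A spacing span that contains whitespace splits the word in the flattened text...
SPLIT = '<div class="t m0">more useful for analy<span class="_ _0"> </span>zing situations</div>'
# ...while an empty spacing span leaves the word in one piece.
JOINED = '<div class="t m0">more useful for analy<span class="_ _0"></span>zing situations</div>'

if __name__ == "__main__":
    print(keyword_survives(SPLIT, "analyzing"))
    print(keyword_survives(JOINED, "analyzing"))
```

Running a check like this over a sample of converted pages, with the keywords you care about, gives a feel for how much of the text is affected before committing to one converter.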
Having said that, there is one other alternative, pdf.js, which renders the text in a separate layer and may address some of your concerns. Try the output from both pdf2htmlEX and pdf.js and see which one fares better.