Displaying pages that rely on iframes in a crawler-friendly way
THE PREAMBLE
I've created a site that re-displays the text of new legislation, screen-scraped from multiple pages on the UK Parliament website. The pages are joined together and tweaked to make them more readable and to allow people to comment and vote on each Bill.
It displays information about each Bill before Parliament on its own page:
public-scrutiny-office.org/bills/2013-2014/offender-rehabilitation
The actual content of each Bill page is an iframe, with a URL like this:
public-scrutiny-office.org/bills/2013-2014/offender-rehabilitation/content
Note: The latter URL redirects to the former with JS if it detects that it's not in an iframe. This is an attempt to deal with Bill pages being indexed by search engines (I intentionally don't want crawlers to ignore them, at least not for now, because it's the only way they can get indexed right now).
I had to start using an iframe like this because of the way the original content is formatted (using HTML Tidy routines doesn't help, see notes below). Raw Bill HTML is sometimes so mangled that it otherwise breaks the page it's in.
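A minimal sketch of the kind of redirect described above, assuming the content URL is always the Bill page URL plus a trailing `/content` segment. The function names `parentUrlFor` and `redirectIfNotFramed` are hypothetical and not taken from the site's actual code.

```typescript
// Hypothetical helper: derive the outer Bill page URL from the iframe content URL.
// Assumes the content URL is the Bill URL followed by a trailing "/content" segment.
function parentUrlFor(contentUrl: string): string {
  return contentUrl.replace(/\/content\/?$/, "");
}

// Minimal shape of the window object that the check needs.
interface FrameAwareWindow {
  self: unknown;
  top: unknown;
  location: { href: string; replace(url: string): void };
}

// Redirect to the outer Bill page only when this document is the top-level
// page, i.e. it was opened directly rather than inside an iframe.
// Returns true if a redirect was issued.
function redirectIfNotFramed(win: FrameAwareWindow): boolean {
  if (win.self !== win.top) {
    return false; // framed: leave the page alone
  }
  const target = parentUrlFor(win.location.href);
  if (target === win.location.href) {
    return false; // not a "/content" URL, nothing to do
  }
  win.location.replace(target);
  return true;
}

// In the browser this would be invoked as: redirectIfNotFramed(window);
```

Using `location.replace` rather than assigning to `location.href` keeps the `/content` URL out of the browser history, so the back button does not bounce the visitor straight into the redirect again.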
THE PROBLEM
The problem with having the main content in an iframe like this is that it's lousy for SEO, which means people searching for a Bill are less likely to find it (except in its raw form on the official Parliament site, where Bills are almost unreadable).
THE QUESTION
How can I play nicely with search engines, particularly Google, and get them to index the contents of Bills, while having that content in an iframe, in a way that search engines approve of?
I'm thinking of stuff like canonicalization, how I should structure a Sitemap file, what meta data is actually useful, etc. Displaying different content to search engines seems to be a no-no, even with honest motives - they don't really allow for corner cases like this.
Note that the Bill name and description are not visible at the iframe URL, which is another hurdle to good indexing (not indexing the description isn't a huge loss, but the title is pretty important).
Ultimately, I just want search engines to index the text of the page as users will see it, but I can't see a way to do that when I have to deal with broken HTML I need to contain in an iframe and if I'm not allowed to show different content to crawlers. I'm really not up on canonicalization in this context.
Should I have just one URL for both versions of the page and rely on a query parameter and some JavaScript hackery to re-render the page, for example (blergh!)?
PS: You may wish to read the notes before venturing to point out why this is a bad idea and how I'm doing it all wrong.
This is a project for the public good, is open source and on GitHub - so if you think there is a better way to do this, please feel free to fork it.
To address some inevitable questions:
Yes, re-displaying content this way is explicitly permitted by Parliament.
No, there are no APIs to get it in a better format. There are PDFs but, for complicated reasons I don't want to go into, even amazing PDF parsers fail at parsing them well.
tl;dr version: It's to do with how Bills are marked up. At least in the HTML, the metadata added to them is marked up with classes, so it's easy to find and hide or remove. Doing that with the PDFs would mean creating a custom PDF parser, which isn't going to happen.
The HTML is sometimes very broken and I've tried using several different parsing engines to fix it but it's too broken for them to be able to work out how to "correct it" (webkit, gecko & trident all fail that too), which is why it's in an iframe.
Yes, I could convert it to plain text and back into HTML again, but then I'd lose tabular info, which some Bills don't make sense without.
Yes, it's possible for Parliament to punk me and commit an XSS attack, but no that's not really a concern here (not even if the Parliament site itself was to be hacked; you can't even log into public-scrutiny-office.org).
More posts by @Karen161
1 Comment
You could dynamically write the IFrame into the page using JavaScript. I see that you are already using JavaScript to resize the IFrame.
If you were to do that, you could then use <noscript> tags in the page to include the full text of the bill (maybe with no formatting) in a way that search engine spiders would be able to see the words and users with JavaScript would get the pretty version in the IFrame.
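As a rough sketch of that suggestion, the page fragment could be generated server-side along these lines. The helper names `escapeHtml` and `renderBillFragment`, the `bill-frame` element id, and the idea of passing in a pre-stripped plain-text copy of the Bill are all assumptions made for illustration, not part of the existing site.

```typescript
// Escape text so it is safe inside HTML element content and attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical server-side helper: build the body fragment for a Bill page.
// contentUrl: the iframe URL; plainText: the Bill text with markup stripped.
function renderBillFragment(contentUrl: string, plainText: string): string {
  // JSON.stringify yields a quoted JavaScript string literal; escaping "<"
  // stops a "</script>" inside the URL from closing the script element early.
  const urlLiteral = JSON.stringify(contentUrl).replace(/</g, "\\u003c");
  return [
    '<div id="bill-frame"></div>',
    "<script>",
    "  (function () {",
    "    // Only browsers running JavaScript ever create the iframe.",
    '    var frame = document.createElement("iframe");',
    `    frame.src = ${urlLiteral};`,
    '    document.getElementById("bill-frame").appendChild(frame);',
    "  })();",
    "</script>",
    "<noscript>",
    "  <!-- Unformatted fallback text for clients that do not run JavaScript. -->",
    `  <pre>${escapeHtml(plainText)}</pre>`,
    "</noscript>",
  ].join("\n");
}
```

Escaping the fallback text matters here because the raw Bill HTML is the very thing that breaks the surrounding page; only a plain-text copy goes inside the `<noscript>` block, while the formatted version stays contained in the iframe.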