Recovering a lost website with no backup?
Unfortunately, our hosting provider experienced 100% data loss, so I've lost all content for two hosted blog websites: http://blog.stackoverflow.com http://www.codinghorror.com
(Yes, yes, I absolutely should have done complete offsite backups. Unfortunately, all my backups were on the server itself. So save the lecture; you're 100% absolutely right, but that doesn't help me at the moment. Let's stay focused on the question here!)
I am beginning the slow, painful process of recovering the website from web crawler caches.
There are a few automated tools for recovering a website from internet web spider (Yahoo, Bing, Google, etc.) caches, like Warrick, but I had some bad results using this:
My IP address was quickly banned from Google for using it
I get lots of 500 and 503 errors and "waiting 5 minutes…"
Ultimately, I can recover the text content faster by hand
I've had much better luck by using a list of all blog posts, clicking through to the Google cache and saving each individual file as HTML. While there are a lot of blog posts, there aren't that many, and I figure I deserve some self-flagellation for not having a better backup strategy. Anyway, the important thing is that I've had good luck getting the blog post text this way, and I am definitely able to get the text of the web pages out of the Internet caches. Based on what I've done so far, I am confident I can recover all the lost blog post text and comments.
However, the images that go with each blog post are proving…more difficult.
Any general tips for recovering website pages from Internet caches, and in particular, places to recover archived images from website pages?
(And, again, please, no backup lectures. You're totally, completely, utterly right! But being right isn't solving my immediate problem… Unless you have a time machine…)
This is my Python script. It scrapes through the Google cache and downloads the content of your website, and it keeps running through 503, 504, and 404 errors (Google blocks IPs that send many requests): gist.github.com/3787790
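For anyone writing their own scraper, here is a minimal sketch of the general idea of surviving temporary errors by waiting longer between retries. It is not the code from the gist above. The names `fetch_with_backoff` and `TemporaryFailure` are hypothetical, and `fetch` stands for whatever function you use to download a page (assumed to raise `TemporaryFailure` on a 503 or similar).

```python
import time


class TemporaryFailure(Exception):
    """Raised by the (hypothetical) fetch callable on a retryable status such as 503."""

    def __init__(self, status):
        super().__init__(f"temporary failure: HTTP {status}")
        self.status = status


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after each temporary failure.

    Returns whatever fetch returns, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except TemporaryFailure:
            if attempt < max_attempts - 1:
                # Delay doubles each time: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    return None
```

The delays here are guesses; a cache that is already banning your IP will likely need much longer pauses between requests.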
At the risk of pointing out the obvious, try mining your own computer's backups for the images. I know my backup strategy is haphazard enough that I have multiple copies of a lot of files hanging around on external drives, burned discs, and in zip/tar files. Good luck!
A suggestion for the future: I use Windows Live Writer for blogging and it saves local copies of posts on my machine, in addition to publishing them out to the blog.
About five years ago, an early incarnation of an external hard drive on which I was storing all my digital photos failed badly. I made an image of the hard drive using dd and wrote a rudimentary tool to recover anything that looked like a JPEG image. Got most of my photos out of that.
So, the question is, can you get a copy of the virtual machine disk image which held the images?
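In case a raw disk image does turn up, this is roughly what a rudimentary "anything that looks like a JPEG" tool can look like. It is a sketch written for illustration, not the tool I used back then. The function name `carve_jpegs` and the `max_size` limit are my own assumptions. It relies only on the fact that JPEG data starts with the bytes FF D8 FF and ends with FF D9.

```python
SOI = b"\xff\xd8\xff"  # JPEG start-of-image marker plus the first byte of the next marker
EOI = b"\xff\xd9"      # JPEG end-of-image marker


def carve_jpegs(data, max_size=20 * 1024 * 1024):
    """Return byte strings from data that start with SOI and end at the next EOI."""
    found = []
    pos = 0
    while True:
        start = data.find(SOI, pos)
        if start == -1:
            break
        end = data.find(EOI, start + len(SOI))
        if end == -1:
            break
        end += len(EOI)
        if end - start <= max_size:
            found.append(data[start:end])
            pos = end
        else:
            # Implausibly large candidate: skip this start marker and keep scanning.
            pos = start + 1
    return found
```

You would call it on the bytes of the image file, for example `carve_jpegs(open("disk.img", "rb").read())`, where `disk.img` is a placeholder name. It will miss fragmented files and can cut an image short when an embedded thumbnail carries its own end marker, so treat the output as candidates to check by eye.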
If your images were stored on an external service such as Flickr or a CDN (as mentioned in one of your podcasts), you may still have the image resources there.
Some of the images could be found by searching on Google Images and clicking "Find similar images"; maybe there are copies on other sites.
Very sorry to hear this and I am very annoyed for you, and the timing - I wanted an offline copy of a few of your posts and did HTTrack on your entire site but had to go out (this was a couple of weeks ago) and I stopped it.
If the host is half decent (and I am guessing you are a good customer), I would ask them to either send you the hard drives (as I am guessing they should be using RAID) or do some recovery themselves.
Whilst this may not be a fast process, I did this with one host for a client and was able to recover entire databases intact (... basically, the host tried an upgrade for the control panel they were using and messed it up.. but nothing was overwritten).
Whatever happens - Good luck from all your fans on the SO sites!
Here's my wild stab in the dark: configure your web server to return 304 for every image request, then crowd-source the recovery by posting a list of URLs somewhere and asking on the podcast for all your readers to load each URL and harvest any images that load from their local caches. (This can only work after you restore the HTML pages themselves, complete with the <img ...> tags, which your question seems to imply that you will be able to do.)
This is basically a fancy way of saying, "get it from your readers' web browser caches." You have many readers and podcast listeners, so you can effectively mobilize a large number of people who are likely to have viewed your web site recently. But manually finding and extracting images from various web browsers' caches is difficult, and the entire approach works best if it's easy enough that many people will try it and be successful. Thus the 304 approach. All it requires of readers is that they click on a series of links and drag off any images that do load in their web browser (or right-click and save-as, etc.) and then email them to you or upload them to a central location you set up, or whatever. The main drawback of this approach is that web browser caches don't go back that far in time. But it only takes one reader who happened to load a post from 2006 in the past few days to rescue even a very old image. With a big enough audience, anything is possible.
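To make the 304 idea concrete, here is a small standalone sketch using Python's built-in `http.server`. On a real site you would more likely do this with a rule in your existing web server. The handler name, the `status_for` helper, the extension list, and the port are all assumptions for illustration, and whether a given browser actually shows its cached copy in response is exactly the untested part of this stab in the dark.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed set of image extensions; adjust to whatever the blog actually used.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")


def status_for(path):
    """Return 304 for paths that look like images, 404 for everything else."""
    clean = path.split("?", 1)[0].lower()
    return 304 if clean.endswith(IMAGE_EXTENSIONS) else 404


class NotModifiedForImages(BaseHTTPRequestHandler):
    def do_GET(self):
        if status_for(self.path) == 304:
            # "Not Modified": tells the browser to use its cached copy; no body is sent.
            self.send_response(304)
            self.end_headers()
        else:
            self.send_error(404, "Only image URLs are handled in this sketch")


def make_server(port=8000):
    """Build a server bound to localhost; call serve_forever() on it to run."""
    return HTTPServer(("127.0.0.1", port), NotModifiedForImages)
```

Running `make_server().serve_forever()` answers every image request with 304 and nothing else, which is all the approach needs once the restored HTML pages are being served from somewhere.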
Sorry to hear about the blogs. Not going to lecture. But I did find what appear to be your images on Imageshack. Are they really yours, or has somebody been keeping a copy of them around?
They seem to have what looks like 456 images that are full size. This might be the best bet for recovering everything. Maybe they can even provide you a dump.
Some of us follow you with an RSS reader and don't clear caches. I have blog posts that appear to go back to 2006. No images, from what I can see, but might be better than what you're doing now.
By going to Google Image search and typing site:codinghorror.com you can at least find the thumbnailed versions of all of your images. No, it doesn't necessarily help, but it gives you a starting point for retrieving those thousands of images.
It looks like Google stores a larger thumbnail in some cases:
Google is on the left, Bing on the right.