
Serious 404 problem, suggestions for hunting them all down

@Kaufman445

Posted in: #301Redirect

I have a bit of a situation coming up. Due to a complete website structure redesign that is basically inevitable, I expect to have the following:


About 90-95% of the roughly 12,000 URLs in our sitemap will change
Out of those 12,000, I expect around 5000-6000 internal links to go dead in the process.
No external links to this site yet, as it is still in development.


Is there a tool out there that can do the following:


Accept the sitemap.xml I feed it after the restructuring
Crawl each page listed in it and check every link on that page for 404 errors
Report only the errors, ideally with just the URL of the page the link is on, the broken URL, and the anchor text


I have found a few tools, but all of them seem to be limited to 100 pages.
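Conceptually, what I'm picturing is something along the lines of the rough sketch below. Python with the requests and beautifulsoup4 packages is just my assumption for illustration, not a tool I've actually found; the URLs and names are placeholders.

```python
# Rough sketch only: read the sitemap, crawl each listed page, check every
# link with a headers-only request, and print page URL, broken URL, anchor text.
# Assumes Python 3 plus the third-party "requests" and "beautifulsoup4" packages.
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def pages_from_sitemap(sitemap_url):
    """Yield every <loc> URL listed in the sitemap."""
    xml = requests.get(sitemap_url, timeout=30).text
    for loc in ET.fromstring(xml).findall(".//sm:loc", NS):
        yield loc.text.strip()


def broken_links(page_url):
    """Yield (broken_url, anchor_text) for every dead link on one page."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        target = urljoin(page_url, a["href"])
        try:
            # HEAD keeps it cheap; some servers dislike HEAD, so a real tool
            # would fall back to GET on odd responses.
            status = requests.head(target, allow_redirects=True, timeout=15).status_code
        except requests.RequestException:
            status = 0
        if status != 200:
            yield target, a.get_text(strip=True)


for page in pages_from_sitemap(SITEMAP_URL):
    for bad_url, anchor in broken_links(page):
        print(f"{page}\t{bad_url}\t{anchor}")
```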

Any advice for an intermediate webmaster? 301 redirects are not viable here.


2 Comments


 

@Pierce454

It might be more than you need, but one of my favorite tools for this is the open-source webcheck, written in Python and currently maintained by Arthur de Jong. Check out the source and downloads at arthurdejong.org/webcheck/
It is a very versatile tool that can crawl any given website. It runs from the command line, has a ton of switches and options (see the man page after installing), and when it finishes it generates reports like this: arthurdejong.org/webcheck/demo/



 

@Rambettina238

I've finished what I can on the script so far (read the comments on the original question for details and context). Source:
www.ionfish.org/projects/xml-spider/
Features:


ability to start from any point (resume crashed attempt?) since it tells you which number it's processing.
scans any publicly accessible sitemap
finds all the links it has to scan BEFORE actually scanning (uses nearly zero memory or CPU)
orders output by page scanned, with the links in order in which they appear in the HTML
filters out bad links and removes duplicates
uses cURL to get the headers of links ONLY (not wasting time/bandwidth)
tells you exactly what type of error it encounters (if not 200) and highlights the output RED, and puts it in a <span> with the class being the error code
has no output buffering (on most servers) so the output is "live" unless your browser likes to chunk it
technically could run forever, so VERY large sitemaps (50k+ links) are possible
option to show ONLY errors and hide the "successful" links
option to debug, which dumps three arrays: first, every link in the sitemap; second, the cURL variables for each of those links; and third, every link found on every page in the sitemap


This is officially distributed under the MIT license (included in the source).
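If you just want the core idea without downloading the source, here is the headers-only check re-sketched in Python with the requests package. The real script uses cURL; everything below (the names and example URLs) is only an illustration of the approach, not the script itself.

```python
# Illustration of the headers-only check: request headers, record the status
# code, and wrap anything that isn't 200 in a <span> whose class is the code.
import requests


def check_status(url):
    """Return the HTTP status code from a headers-only (HEAD) request."""
    try:
        return requests.head(url, allow_redirects=True, timeout=15).status_code
    except requests.RequestException:
        return 0  # network error or timeout


def report_line(url):
    """Build one line of 'live' HTML output, highlighting errors in red."""
    status = check_status(url)
    if status == 200:
        return f'<span class="ok">{status} {url}</span>'
    # The error code becomes the span's class, as described in the list above.
    return f'<span class="{status}" style="color:red">{status} {url}</span>'


if __name__ == "__main__":
    for link in ["https://example.com/", "https://example.com/missing-page"]:
        print(report_line(link), flush=True)  # flush keeps the output unbuffered
```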


