
Is there a spider / link checker that can start deep inside a login-protected site?

@Goswami781

Posted in: #DeadLinks #Links #WebCrawlers

We use vendor-hosted Blackboard for our distance education courses, but host course multimedia on our own servers. The multimedia server has been moved and its domain has changed. The Blackboard DBAs have run queries to update the links in the database, but we need to make sure they caught them all. There are hundreds of thousands of links to check.

I need to be able to log in to the Blackboard administrator interface, navigate to the courses section, and run a search to bring up the course list before running the link checker on the links in the search results.

Is there a product or service that does this? I've never used Selenium, but I wonder if a scripting solution might be more appropriate. All advice welcome.


@BetL925

Yes, there are crawlers that can crawl a site that requires login. You log in to the site with your web browser and export your cookies, then start the crawler with those cookies. The crawler then crawls the site as your logged-in user.

To export your cookies, use Firefox with the Export Cookies add-on. Log in to your site and then export your cookies using "Tools" -> "Export Cookies". Save the file as cookies.txt.
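
Before handing the file to a crawler, it can help to confirm that the export actually contains cookies for your site. The sketch below assumes the export is in the tab-separated Netscape cookies.txt layout (domain, subdomain flag, path, secure flag, expiry, name, value) and treats every line starting with # as a comment. The function name list_cookie_names is made up for illustration.

```shell
# Print the names of cookies in a Netscape-format cookies.txt that apply to a host.
# Hypothetical helper. Usage: list_cookie_names cookies.txt mysite.example.com
list_cookie_names() {
    awk -F '\t' -v host="$2" '
        /^#/ || NF < 7 { next }          # skip comments, blank and malformed lines
        {
            domain = $1
            sub(/^\./, "", domain)       # drop a leading dot before comparing
            n = length(domain)
            # match the host itself or any subdomain of the cookie domain
            if (host == domain ||
                substr(host, length(host) - n) == "." domain)
                print $6
        }
    ' "$1"
}
```

If this prints nothing for your host, the login session was probably not captured in the export, and a crawl would most likely see only the login page.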

The wget command line crawler can use your cookies.txt file to start crawling.

wget -r --load-cookies=cookies.txt mysite.example.com/

wget will save the website locally in a directory structure like mysite.example.com/pages/index.html. You can then run a link checker against these locally saved files.
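
As a rough first pass before a full link checker, you could simply search the saved files for any remaining mention of the old multimedia host. This is only a sketch: old-media.example.com is a placeholder (the question does not name the real domains), the function names are invented, and it assumes a grep that supports -r, as GNU and BSD grep do. It finds leftover references to the old host but does not verify that the updated links actually resolve.

```shell
# List every saved file and line that still mentions the old host.
# Hypothetical helper. Usage: find_old_links mysite.example.com old-media.example.com
find_old_links() {
    # -r: recurse into the directory, -n: show line numbers,
    # -F: treat the host name as a literal string, not a regex
    grep -rnF -- "$2" "$1"
}

# Count how many saved files still contain the old host.
count_files_with_old_links() {
    # -l: print each matching file name once
    grep -rlF -- "$2" "$1" | wc -l | tr -d ' '
}
```

A count of 0 suggests the database update caught every link in the pages that were crawled; anything else gives you a list of pages to follow up on.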
