Why doesn't httrack follow robots.txt?
I'm trying to use httrack to mirror my blog, which is currently hosted on blogger. Problem: in spite of the robots.txt file, httrack tries to download everything in the /search subdirectory. This leads to an infinite regress of searches on searches.
Here's the robots.txt file (I've replaced my blog name with "myblog"):
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: myblog.blogspot.com/feeds/posts/default?orderby=updated
I can limit the crawl to depth 3 or 4, but I still get tons of search*.html and search/label/*.html files in the mirrored directory.
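For reference, the command I'm running is roughly this (URL and output path are illustrative; -r3 caps the mirror depth at 3):

httrack "https://myblog.blogspot.com/" -O ./myblog-mirror -r3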
httrack claims to follow robots.txt. Why doesn't it work here? What can I do to fix it?
I don't know for sure, but httrack may be reading your 'Allow' rule as overriding the 'Disallow' rule. 'Allow' was never part of the original robots.txt specification, so parsers differ on how they resolve a conflict between 'Allow: /' and 'Disallow: /search'; a naive parser can end up treating 'Allow: /' as permitting everything.
You should remove the "Allow" rule regardless; it's redundant. User agents crawl everything by default, and you've already disallowed /search, which is all that's required.
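A trimmed version would look like this (Sitemap line kept exactly as in the question):

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search

Sitemap: myblog.blogspot.com/feeds/posts/default?orderby=updated

If httrack still descends into /search after that, you can take robots.txt out of the equation entirely by adding a scan rule that excludes those URLs (the pattern below is illustrative; adjust it to your blog's hostname):

httrack "https://myblog.blogspot.com/" -O ./myblog-mirror "-*myblog.blogspot.com/search*"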