Why doesn't httrack follow robots.txt?

@Jamie184

Posted in: #Mirror

I'm trying to use httrack to mirror my blog, which is currently hosted on blogger. Problem: in spite of the robots.txt file, httrack tries to download everything in the /search subdirectory. This leads to an infinite regress of searches on searches.

Here's the robots.txt file (I've replaced my blog name with "myblog"):

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: myblog.blogspot.com/feeds/posts/default?orderby=updated

I can limit the crawl to depth 3 or 4, but I still get tons of search*.html and search/label/*.html files in the mirrored directory.

httrack claims to follow robots.txt. Why doesn't it work here? What can I do to fix it?
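Regardless of why the robots.txt rule is being ignored, HTTrack's own scan filters can exclude the search URLs explicitly. This is a sketch, not a tested invocation: "myblog" is a placeholder, and the filter pattern assumes HTTrack's standard `+`/`-` wildcard filter syntax.

```shell
# Mirror the blog, but exclude every URL under /search with a negative scan filter.
# "myblog" is a placeholder -- substitute your actual blog address.
httrack "http://myblog.blogspot.com/" -O ./myblog-mirror "-*myblog.blogspot.com/search*"
```

Because filters match full URLs, this drops both search*.html pages and search/label/*.html pages without needing a depth limit.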





1 Answer

@Yeniel560

I don't know for sure, but httrack may be interpreting your Allow: / rule as overriding the Disallow: /search rule.

You should remove the Allow rule regardless, as it is redundant: crawlers are allowed to fetch everything by default, so disallowing the search directory is all that's required.
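For what it's worth, a conforming parser should still block /search under this file. As an illustration (Python's standard library parser, not httrack's; Python applies rules in file order with the first match winning):

```python
import urllib.robotparser

# The robots.txt from the question, with the blog name as a placeholder.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Anything under /search is blocked for a generic agent,
# because "Disallow: /search" matches before "Allow: /".
print(parser.can_fetch("HTTrack", "http://myblog.blogspot.com/search/label/foo"))

# Ordinary pages remain crawlable.
print(parser.can_fetch("HTTrack", "http://myblog.blogspot.com/2017/01/post.html"))
```

If httrack instead gives Allow rules precedence, or mishandles the Mediapartners-Google block above the wildcard entry, that would explain the behavior you're seeing.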


