Why doesn't httrack follow robots.txt?

@Jamie184

Posted in: #Mirror

I'm trying to use httrack to mirror my blog, which is currently hosted on blogger. Problem: in spite of the robots.txt file, httrack tries to download everything in the /search subdirectory. This leads to an infinite regress of searches on searches.

Here's the robots.txt file (I've replaced my blog name with "myblog"):

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: myblog.blogspot.com/feeds/posts/default?orderby=updated

I can limit the crawl to depth 3 or 4, but I still get tons of search*.html and search/label/*.html files in the mirrored directory.

httrack claims to follow robots.txt. Why doesn't it work here? What can I do to fix it?
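Regardless of why the robots.txt rule is being ignored, HTTrack's own scan filters can exclude the search URLs explicitly. This is a sketch, not a tested invocation: "myblog" is a placeholder, and the filter pattern assumes HTTrack's standard `+`/`-` wildcard filter syntax.

```shell
# Mirror the blog, but exclude every URL under /search with a negative scan filter.
# "myblog" is a placeholder -- substitute your actual blog address.
httrack "http://myblog.blogspot.com/" -O ./myblog-mirror "-*myblog.blogspot.com/search*"
```

Because filters match full URLs, this drops both search*.html pages and search/label/*.html pages without needing a depth limit.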





1 Answer

@Yeniel560

I don't know for sure, but httrack may be interpreting your Allow: / rule as overriding the Disallow: /search rule.

You should remove the Allow rule regardless, as it is redundant: crawlers are allowed to fetch everything by default, so disallowing the search directory is all that's required.
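For what it's worth, a conforming parser should still block /search under this file. As an illustration (Python's standard library parser, not httrack's; Python applies rules in file order with the first match winning):

```python
import urllib.robotparser

# The robots.txt from the question, with the blog name as a placeholder.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Anything under /search is blocked for a generic agent,
# because "Disallow: /search" matches before "Allow: /".
print(parser.can_fetch("HTTrack", "http://myblog.blogspot.com/search/label/foo"))

# Ordinary pages remain crawlable.
print(parser.can_fetch("HTTrack", "http://myblog.blogspot.com/2017/01/post.html"))
```

If httrack instead gives Allow rules precedence, or mishandles the Mediapartners-Google block above the wildcard entry, that would explain the behavior you're seeing.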


