Optimizing and securing robots.txt

@Kimberly868

Posted in: #RobotsTxt

I have a couple of doubts / questions / ideas related to robots.txt:


Can we deny the website to all bots except for a few chosen ones, in order
to tell the other bots not to crawl the site:

User-agent: *
Disallow: /

User-Agent: Googlebot
User-Agent: bingbot
User-Agent: Slurp
Allow: /person/
Allow: /products/

Can we deny the whole site and then just specify the list of pages we
want to allow for indexing, as in the example above? I do not want to
give away the URLs that I want to exclude from crawling, as this
information can be used against me.
Note that we do not have Allow: /, so will the
bots still be able to access the home page?
In the example above, will Allow be
applied to all three user agents on the list, or does Allow need to be
repeated under each user agent?

User-Agent: Googlebot
Allow: /person/
Allow: /products/

User-Agent: bingbot
Allow: /person/
Allow: /products/

User-Agent: Slurp
Allow: /person/
Allow: /products/

5 Comments

 

@Kimberly868

Based on your suggestions I made some changes. Let me note that, in
addition to not wanting to expose the areas we do not want crawled
(by disallowing / and then only specifying what we allow), we have a
big issue: over 500 bots come in and destroy our caching system. We do
not care about any bots except Google, Bing and Yahoo.
Anyhow, based on your suggestions (thank you all for the very good input),
this is what I came up with:

User-agent: *
Disallow: /

# www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0
User-Agent: bingbot
User-Agent: Adidxbot
User-Agent: msnbot
User-Agent: msnbot-*
User-Agent: BingPreview
# support.google.com/webmasters/answer/1061943?hl=en
User-Agent: Googlebot
User-Agent: Googlebot-News
User-Agent: Googlebot-Image
User-Agent: Googlebot-Video
User-Agent: Googlebot-Mobile
#User-Agent: Mediapartners-Google
#User-Agent: AdsBot-Google
# help.yahoo.com/l/uk/yahoo/mobile/onesearch/onesearchmb/search-69.html
User-Agent: Slurp
User-Agent: YahooSeeker/M1A1-R2D2
Disallow: /
Allow: /index.php
Allow: /person/
Allow: /archive/
Allow: /forkids/

Sitemap: yourdomain.com/sitemap.xml
Sitemap: yourdomain.com/sitemap.xml?page=1
Sitemap: yourdomain.com/sitemap.xml?page=2
Crawl-delay: 10
# GMT (8:00pm pst to 4:00am pst)
Visit-time: 0400-1200


Let me break down my new thinking here.

1)
User-agent: *
Disallow: /

This gets all cooperating bots off our site.

2)

# www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0
User-Agent: bingbot
User-Agent: Adidxbot
User-Agent: msnbot
User-Agent: msnbot-*
User-Agent: BingPreview
# support.google.com/webmasters/answer/1061943?hl=en
User-Agent: Googlebot
User-Agent: Googlebot-News
User-Agent: Googlebot-Image
User-Agent: Googlebot-Video
User-Agent: Googlebot-Mobile
#User-Agent: Mediapartners-Google
#User-Agent: AdsBot-Google
#http://help.yahoo.com/l/uk/yahoo/mobile/onesearch/onesearchmb/search-69.html
User-Agent: Slurp
User-Agent: YahooSeeker/M1A1-R2D2


Now that I have banned all bots, I want to make special provisions for
the Bing, Google and Yahoo bots. Just for the bots listed above, do the following:

Disallow: /
Allow: /index.php
Allow: /person/
Allow: /archive/
Allow: /forkids/


With "Disallow: /" I am disallowing all links from my site, except for these:
/index.php
/person/
/archive/
/forkids/

Allow: /index.php - I have added this because with "Disallow: /" (from what
I understood) I am also blocking my home page, which is not what I want.
Therefore I have added "Allow: /index.php", which redirects to the
homepage (just domainname.com).
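
An alternative I have not tested: if the crawler supports the $ end-of-URL wildcard (a Googlebot/Bingbot extension, not part of the original robots.txt spec), the bare root URL could be allowed explicitly instead of relying on the /index.php redirect. A minimal sketch, assuming that extension:

# sketch only: "$" anchors the end of the URL, so "Allow: /$" matches just the bare homepage
User-agent: Googlebot
Disallow: /
Allow: /$
Allow: /index.php
Allow: /person/
Allow: /archive/
Allow: /forkids/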

I have added these lines to inform crawlers of the sitemap.

Sitemap: yourdomain.com/sitemap.xml
Sitemap: yourdomain.com/sitemap.xml?page=1
Sitemap: yourdomain.com/sitemap.xml?page=2

The next two might not have a big impact:
Crawl-delay: 10
# GMT (8:00pm pst to 4:00am pst)
Visit-time: 0400-1200

Google (and others) ignore "Crawl-delay", and Google does not recognize
the "Visit-time" directive.

Let me add that I have used Google Webmaster Tools to check the above rules,
and besides Google disregarding Crawl-delay and Visit-time, I got:
1) The sitemap is recognized and loaded (I am not sure if I need to have
just the first line or if all 3 are needed). But this is ok.
2) Google respected the first disallow for all bots, and recognized the allows
that I specified in its bot section.
3) Furthermore, it respected the Disallow: / as well.

The only thing I could not verify is
Allow: /index.php
to see whether I am eliminating my home page.



 

@Shanna517

First thing: If the knowledge of "private" URLs "can be used against" you, robots.txt is the wrong tool for the job. It would be safe to assume that only a small part of all bots honour your robots.txt rules.

Second thing: Note that Allow is not part of the original robots.txt specification. Some bots support it, others do not. Those that don't support Allow should simply ignore lines with this field name.



This doesn’t make sense:

User-agent: *
Disallow: /

User-Agent: Googlebot
User-Agent: bingbot
User-Agent: Slurp
Allow: /person/
Allow: /products/


Every bot follows one record only. If there is no User-agent match, it uses the User-agent: * record as "fallback".

So this snippet means:


For every bot not matching Googlebot/bingbot/Slurp: everything is disallowed.
For every bot matching Googlebot/bingbot/Slurp: everything is allowed.


So there is no need to specify these two Allow lines, as everything is allowed anyway.

If you want to disallow everything except URLs starting with person/ and products/ for the matching bots, you have to repeat the Disallow line:

User-agent: *
Disallow: /

User-agent: Googlebot
User-agent: bingbot
User-agent: Slurp
Disallow: /
Allow: /person/
Allow: /products/





Note that we do not have Allow: /, so will the bots still be able to access the home page?


Yes, the default is: everything is allowed.

If you want to make this explicit, you can use:

User-agent: …
Disallow:
# this Disallow line means: everything is allowed


It’s likely that Allow: / would mean the same (but again, Allow is not part of the specification, so every bot may implement it differently).


[…] will Allow be applied to all three user agents on the list, or does Allow need to be repeated under each user agent?


It’s correct to have several User-agent lines in one record.



 

@Jamie184

Wow. There is a lot here. I appreciate your ingenuity but I suspect that you may be putting too much energy into what should be a rather simple effort.

Of the two, your first example is best, though either should work the same. However, I question the wisdom of blocking the home page of your site. Without knowing more about your site configuration, I am at a disadvantage.

I am assuming a few things here: that you do have a home page at /; that the bulk of your site is in /person/ and /products/; that you want to protect some directories that may or may not be linked to, by excluding them in the robots.txt file; that these possible links could specify nofollow and noindex; and that links exist from the home page to the restricted areas.

First things first. I have no experience, and I am sure not too many people do, as to what happens if you block your home page in robots.txt. Would the rest of the site be spidered? I doubt it, unless you submit a sitemap that is actually read. Sitemaps are not always taken into account, especially if the site is small. I find that to be too much of a risk. I would not block the home page, and I am not sure why you would do this except that the links to the restricted areas are made on the home page.

Second, according to my assumptions, you have not told anyone that the restricted areas are restricted other than through the use of noindex or nofollow, so they are not actually restricted. I think you are forgetting that an inbound link (backlink) can be made to these areas, and without them being restricted they will eventually be discovered and spidered. This could especially be true with social media. The use of noindex and nofollow is not the same as restricting something in robots.txt.

One thought must always be considered: robots.txt is not a security tool. It is designed for honest robots and cannot block accesses made with malicious intent. Other tools and tactics must be applied to protect these areas.

The use of noindex and nofollow assumes that your link is not followed. As stated before, it may be that other links are found that point to these areas. Any noindex or nofollow found on a link is not stored as a restricted area by a spider; the link is simply not followed. This means that, unless you restrict areas of your site specifically within the robots.txt file, they will likely be found and spidered at some point.

I have found that spiders come (generally) in three primary flavors: honest, scraper, and hacker. The honest bot is obviously nothing to worry about. Scraper bots, for the most part, do read and follow robots.txt, with a rather small minority ignoring robots.txt altogether. Because of these minority spiders, I have rarely, if ever, found links to my restricted areas. The final kind is the hacker bot, likely one of two types: landscaping or script-kiddie. Neither of these will ever read or follow robots.txt, nor will they create links.

In other words, restricting a directory in robots.txt rarely exposes it any more than it already is. You are far better off restricting these areas in your robots.txt file; failing to do so, you have only yourself to blame if the area gets spidered, scraped, and linked to. And finally, following the traditions of the robots.txt file is far more beneficial than experimenting. Robots.txt is not part of a security scheme except to inform honest bots. If you need to secure areas of your site, you had better find another way.



 

@Pope3001725

Well, the first thing you should realize is that robots.txt is a standard, not a security protocol. Anything on your site that isn't secured can be crawled by a crawler/robot. The only thing robots.txt will do is tell well-behaved crawlers (e.g. GoogleBot) what you would like them to ignore.

Second, I'd recommend running any robots.txt you come up with against Google's Webmaster Tools to help you optimize: support.google.com/webmasters/topic/4617736
Lastly, on to your specific questions:

1) Pretty sure the robots spec is like an ACL in that it goes in order. Put your specified Allow rules before your Disallow rules. I've also never seen multiple User-Agent lines in a row before. Even if it works, it might not work for all bots; I'd suggest making a separate entry for each (see the sketch after this list).

2) Yes, but you should specify which ones you allow first, then deny the rest.

3) They have access to everything that's not behind security. However, if you do that, you have told them you don't want them to index your home page.

4) The second version will work; again, even if the first version works, it likely will not work universally.
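
A minimal sketch of the separate-entry layout suggested in 1), assuming the same paths from the question and that everything else should stay blocked:

User-agent: Googlebot
Allow: /person/
Allow: /products/
Disallow: /

User-agent: bingbot
Allow: /person/
Allow: /products/
Disallow: /

User-agent: Slurp
Allow: /person/
Allow: /products/
Disallow: /

User-agent: *
Disallow: /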



 

@Shanna517

The robots.txt spec says prefix matching is used. This means if you don't want the excluded URLs to be visible in robots.txt, you can simply abbreviate them just enough to not match any allowed URLs.

For example, if you want to disallow /very/secret, you could simply use:

Disallow: /ve


in your robots.txt file. As long as you don't have other URLs starting with /ve that you do want crawled, this will work as intended.

You may see bots reading robots.txt and then trying to fetch /ve from your site. Those bots will get a 404, and you'll get a useful signal in your logfile about which bots are trying to find secret stuff through robots.txt.
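
A sketch of that idea, assuming a hypothetical public /verify/ section exists alongside the secret /very/secret area; the prefix just needs to be long enough to miss the public URLs:

# hypothetical layout: /very/secret must stay hidden, /verify/ must stay crawlable
User-agent: *
Disallow: /very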


