
How can an exception be created for Facebook on robots.txt?

@Sent6035632

Posted in: #Facebook #RobotsTxt

I have a directory that I don't want Google to index because it could confuse people who don't know the site. However, I want everyone else to get updates when I post them on our Facebook page. The problem is that the general exclusion of the directory is also blocking Facebook's access, which makes it impossible to post links to this directory.

Please ignore the part about not allowing Google. I really just want to know how to create an exception for Facebook, but I thought I'd give the whole story to explain it better.


2 Comments


 

@Hamm4606531

I suspect that by Facebook you mean the facebookexternalhit user-agent string that appears in your access logs? This is not a crawler and as such doesn't respect (nor, arguably, needs to respect) restrictions in robots.txt. This answer from Jeff Sherlock pretty much explains Facebook's position on it. With that in mind, you can simply create a robots.txt rule that denies crawling of the directory in question to all robots.txt-respecting crawlers:

User-agent: *
Disallow: /[directory]/


Replace [directory] with the name of the directory you don't want Google and other crawlers to crawl and index. Alternatively, if you have access to your web server's configuration, you can serve different X-Robots-Tag header values depending on the user agent your web server detects. On Apache, the configuration could look something like this:

# Requires mod_headers; applies to everything under /[directory]/
<Location "/[directory]">
    Header set X-Robots-Tag "noindex, nofollow, noarchive"
</Location>
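
If you only want the header sent for particular crawlers, a minimal sketch using mod_setenvif together with mod_headers might look like the following (the is_search_bot variable name is just an illustration, not anything from the question):

# Set an environment variable when Googlebot's user agent is detected,
# then send the X-Robots-Tag header only on those requests
SetEnvIfNoCase User-Agent "Googlebot" is_search_bot
Header set X-Robots-Tag "noindex, nofollow, noarchive" env=is_search_bot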


Or you can add the same robots directives to the HTML served from your [directory]:

<!DOCTYPE html>
<html><head>
<meta name="robots" content="noindex, nofollow, noarchive" />
(…)
</head>
<body>(…)</body>
</html>


As far as I'm aware, none of these directives is respected by Facebook, so this should match your requirements. Facebook's rationale for not respecting robots.txt or these headers is that the links are shared, liked, and so on by its human users, so robots rules don't apply. Needless to say, there is no equivalent humans.txt restriction we could use in cases like this.

If blocking the facebookexternalhit user agent is desirable, you'd have to either block it in your web server's configuration or detect that user-agent string in the web application that serves your pages.
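
For instance, a rough sketch of blocking it at the web server level, assuming Apache 2.4 with mod_setenvif and mod_authz_core (the deny_fb variable and the path are placeholders):

# Flag requests coming from facebookexternalhit
SetEnvIfNoCase User-Agent "facebookexternalhit" deny_fb
# Deny those requests access to the directory in question
<Directory "/var/www/html/[directory]">
    <RequireAll>
        Require all granted
        Require not env deny_fb
    </RequireAll>
</Directory>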



 

@Shelley277

You seem to be suffering from an overly broad rule. To target Google web search explicitly, add rules that prevent Googlebot from crawling whichever directory you want blocked. With a targeted disallow rule, there is no need to do anything for Facebook.
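
For example, using a placeholder directory name, a group scoped only to Google's crawler would look like:

User-agent: Googlebot
Disallow: /confusing-directory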

Now, if the general case is to block everyone but Facebook, you need a rule that disallows all user agents and then another group for the Facebook user agent, which is facebookexternalhit. Something like:

User-agent: *
Disallow: /confusing-directory

User-agent: facebookexternalhit
Disallow: /bogus-directory-or-filename


Because crawlers follow only the most specific group that matches their user agent, specifying a group for facebookexternalhit means the general disallow no longer applies to it. If you have an actual directory or file you don't want Facebook to fetch, use that instead of a bogus entry.


