
How can an exception be created for Facebook on robots.txt?

@Sent6035632

Posted in: #Facebook #RobotsTxt

I have a directory that I don't want Google to index because it could confuse people who don't know the site. However, I want everyone else to get updates when I post them on our Facebook page. The problem is that the general exclusion of the directory is also blocking Facebook's access, which makes it impossible to post links to this directory.

Please ignore the part about not allowing Google. I really just want to know how to create an exception for Facebook, but I thought I'd give the whole story to explain it better.


2 Comments


 

@Hamm4606531

I suspect that by Facebook you mean the facebookexternalhit user-agent string that appears in your access logs? This is not a crawler and as such doesn't respect (nor, arguably, needs to respect) restrictions in robots.txt. This answer from Jeff Sherlock pretty much explains Facebook's position on it. With that in mind, you can simply create a robots.txt rule that denies crawling of the directory in question to all robots.txt-respecting crawlers:

User-agent: *
Disallow: /[directory]/


Replace [directory] with the name of the directory you don't want Google and other crawlers to crawl and index. Alternatively, if you have access to your web server's configuration, you can serve different X-Robots-Tag header values depending on the user agent your web server detects. On Apache, the configuration could look something like this:

# Requires mod_headers; applies to everything under /[directory]/
<Location "/[directory]">
    Header set X-Robots-Tag "noindex, nofollow, noarchive"
</Location>
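
If you only want the header sent for particular crawlers, a minimal sketch using mod_setenvif together with mod_headers might look like the following (the is_search_bot variable name is just an illustration, not anything from the question):

# Set an environment variable when Googlebot's user agent is detected,
# then send the X-Robots-Tag header only on those requests
SetEnvIfNoCase User-Agent "Googlebot" is_search_bot
Header set X-Robots-Tag "noindex, nofollow, noarchive" env=is_search_bot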


Or you can add the same robots directives to the HTML served from your [directory]:

<!DOCTYPE html>
<html><head>
<meta name="robots" content="noindex, nofollow, noarchive" />
(…)
</head>
<body>(…)</body>
</html>


As far as I'm aware, none of these directives is respected by Facebook, so this should match your requirements. Facebook's rationale for not respecting robots.txt or these headers is that the links are shared, liked, and so on by its human users, so robots rules don't apply. Needless to say, there is no equivalent humans.txt restriction we could use in cases like this.

If blocking the facebookexternalhit user agent is desirable, you'd have to either block it in your web server's configuration or detect that user-agent string in the web application that serves your pages.
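
For instance, a rough sketch of blocking it at the web server level, assuming Apache 2.4 with mod_setenvif and mod_authz_core (the deny_fb variable and the path are placeholders):

# Flag requests coming from facebookexternalhit
SetEnvIfNoCase User-Agent "facebookexternalhit" deny_fb
# Deny those requests access to the directory in question
<Directory "/var/www/html/[directory]">
    <RequireAll>
        Require all granted
        Require not env deny_fb
    </RequireAll>
</Directory>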



 

@Shelley277

You seem to be suffering from an overly broad rule. To target Google web search explicitly, add rules that prevent Googlebot from crawling whichever directory you want blocked. With a targeted disallow rule, there is no need to do anything for Facebook.
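
For example, using a placeholder directory name, a group scoped only to Google's crawler would look like:

User-agent: Googlebot
Disallow: /confusing-directory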

Now, if the general case is to block everyone but Facebook, you need a rule that disallows all user agents and then another group for the Facebook user agent, which is facebookexternalhit. Something like:

User-agent: *
Disallow: /confusing-directory

User-agent: facebookexternalhit
Disallow: /bogus-directory-or-filename


Because crawlers follow only the most specific group that matches their user agent, specifying a group for facebookexternalhit means the general disallow no longer applies to it. If you have an actual directory or file you don't want Facebook to fetch, use that instead of a bogus entry.


