Prevent search bots from indexing server (sub)domain name
A web application I wrote is hosted on an in-house server with the name myserver, which is under my university's domain (department.uni.edu), resulting in the server's address being myserver.department.uni.edu. When I Google myserver, the first result is that exact server hosting the web application.
I have a robots.txt file in the application's root directory with the following contents:
User-agent: *
Disallow: /
It's the actual server domain name that was indexed, and not anything in the web application.
I know that I can remove search results with Google Webmaster Tools, but how do I prevent Google from indexing a server's domain name (or address)? I believe the server is running Nginx on Ubuntu 14.10 (I am not the person in charge of the server, just coding the web application).
The goal is to prevent the server from being indexed by search engines such as Google, Bing, and Yahoo: in other words, to block all known search engine crawlers.
Perhaps a solution is to block all crawlers at the subdomain's root (myserver.department.uni.edu) using an Nginx user-agent map plus a location block, such as:
# The map block goes in the http context; it sets $limit_bots to 1
# whenever the User-Agent header matches one of the patterns below.
map $http_user_agent $limit_bots {
    default 0;
    ~*(google|bing|yandex|msnbot) 1;
    ~*(AltaVista|Googlebot|Slurp|BlackWidow|Bot|ChinaClaw|Custo|DISCo|Download|Demon|eCatch|EirGrabber|EmailSiphon|EmailWolf|SuperHTTP|Surfbot|WebWhacker) 1;
    ~*(Express|WebPictures|ExtractorPro|EyeNetIE|FlashGet|GetRight|GetWeb!|Go!Zilla|Go-Ahead-Got-It|GrabNet|Grafula|HMView) 1;
    ~*(HTTrack|Stripper|Sucker|Indy|InterGET|Ninja|JetCar|Spider|larbin|LeechFTP|Downloader|tool|Navroad|NearSite|NetAnts|tAkeOut|WWWOFFLE) 1;
    ~*(NetSpider|Vampire|NetZIP|Octopus|Offline|PageGrabber|Foto|pavuk|pcBrowser|RealDownload|ReGet|SiteSnagger|SmartDownload|SuperBot|WebSpider) 1;
    ~*(Teleport|VoidEYE|Collector|WebAuto|WebCopier|WebFetch|WebGo|WebLeacher|WebReaper|WebSauger|eXtractor|Quester|WebStripper|WebZIP|Wget|Widow|Zeus) 1;
    ~*(Twengabot|htmlparser|libwww|Python|perl|urllib|scan|Curl|email|PycURL|Pyth|PyQ|WebCollector|WebCopy|webcraw) 1;
}

# Inside the server block for myserver.department.uni.edu:
location / {
    # Return 403 Forbidden to any matched crawler.
    if ($limit_bots = 1) {
        return 403;
    }
}
(borrowed from GD Hussle)
but would this be sufficient, or would something more sophisticated be necessary?
With robots.txt you can control crawling, not indexing. If a search engine is not allowed to crawl a document on your host, it might still index its URL, e.g. if it found the link on an external site.
You can control indexing with the robots meta element or the X-Robots-Tag HTTP header.
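For example, here is a minimal sketch of how that header could be sent site-wide from Nginx, assuming you or the server administrator can edit the server block for myserver.department.uni.edu (the listen and server_name lines below are placeholders for whatever is already configured):

server {
    listen 80;
    server_name myserver.department.uni.edu;

    # Ask compliant search engines not to index any page on this host
    # and not to follow links found on it.
    add_header X-Robots-Tag "noindex, nofollow";

    location / {
        # existing application configuration goes here
    }
}

If you cannot touch the server configuration, the per-page equivalent is a robots meta element in each page's head, e.g. <meta name="robots" content="noindex, nofollow">. Note that by default Nginx only attaches add_header headers to 2xx and 3xx responses.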
You have to decide whether you want to allow search engines to crawl but not index, or to index but not crawl, because if you disallow crawling in robots.txt, search engines can't reach your documents and will never see that you don't want them indexed.
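Concretely, if you take the noindex route, the robots.txt on the subdomain would need to permit crawling so that crawlers can actually fetch the pages and see the header or meta element, e.g.:

User-agent: *
Disallow:

An empty Disallow allows everything to be crawled; the Disallow: / file shown in the question, by contrast, keeps crawlers away from the content entirely but does not, on its own, stop the bare URL from being indexed.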