
Prevent search bots from indexing server (sub)domain name

@Shakeerah822

Posted in: #Nginx #SearchEngines #Subdomain #Webserver

A web application I wrote is hosted on an in-house server with the name myserver, which is under my university's domain (department.uni.edu), resulting in the server's address being myserver.department.uni.edu. When I Google myserver, the first result is that exact server hosting the web application.

I have a robots.txt file in the application's root directory with the following contents:

User-agent: *
Disallow: /


It's the actual server domain name that was indexed, and not anything in the web application.

I know that I can remove search results with Google Webmaster Tools, but how do I prevent Google from indexing a server's domain name (or address)? I believe the server is running Nginx on Ubuntu 14.10 (I am not the person in charge of the server, just coding the web application).

The goal here is to prevent the server from being indexed by search engines such as Google, Bing, Yahoo, etc. - basically, to block all known search engine crawlers.

Perhaps a solution is to block all crawlers at the subdomain's root (myserver.department.uni.edu) with an Nginx configuration such as:

# The map must be declared in the http {} context (e.g. nginx.conf or a file included from it).
# It flags any request whose User-Agent matches a known crawler or site-downloader.
map $http_user_agent $limit_bots {
    default 0;
    ~*(google|bing|yandex|msnbot) 1;
    ~*(AltaVista|Googlebot|Slurp|BlackWidow|Bot|ChinaClaw|Custo|DISCo|Download|Demon|eCatch|EirGrabber|EmailSiphon|EmailWolf|SuperHTTP|Surfbot|WebWhacker) 1;
    ~*(Express|WebPictures|ExtractorPro|EyeNetIE|FlashGet|GetRight|GetWeb!|Go!Zilla|Go-Ahead-Got-It|GrabNet|Grafula|HMView) 1;
    ~*(HTTrack|Stripper|Sucker|Indy|InterGET|Ninja|JetCar|Spider|larbin|LeechFTP|Downloader|tool|Navroad|NearSite|NetAnts|tAkeOut|WWWOFFLE) 1;
    ~*(NetSpider|Vampire|NetZIP|Octopus|Offline|PageGrabber|Foto|pavuk|pcBrowser|RealDownload|ReGet|SiteSnagger|SmartDownload|SuperBot|WebSpider) 1;
    ~*(Teleport|VoidEYE|Collector|WebAuto|WebCopier|WebFetch|WebGo|WebLeacher|WebReaper|WebSauger|eXtractor|Quester|WebStripper|WebZIP|Wget|Widow|Zeus) 1;
    ~*(Twengabot|htmlparser|libwww|Python|perl|urllib|scan|Curl|email|PycURL|Pyth|PyQ|WebCollector|WebCopy|webcraw) 1;
}

# The location block goes inside the server {} block for myserver.department.uni.edu.
location / {
    # Refuse any request flagged above.
    if ($limit_bots = 1) {
        return 403;
    }
}


(borrowed from GD Hussle)

but would this be sufficient, or would something more sophisticated be necessary?
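
For reference, a rule like this could be checked from the command line by spoofing a bot User-Agent (this assumes curl is available and that the hostname resolves from where you run it); a matched agent should get a 403, while an ordinary browser string should get the page:

# Pretend to be Googlebot; expect "HTTP/1.1 403 Forbidden" if the map matches.
curl -I -A "Googlebot/2.1 (+http://www.google.com/bot.html)" http://myserver.department.uni.edu/

# Ordinary browser string; expect a normal "HTTP/1.1 200 OK".
curl -I -A "Mozilla/5.0" http://myserver.department.uni.edu/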


1 Comment


@Ann8826881

With robots.txt you can control crawling, not indexing. If a search engine is not allowed to crawl a document on your host, it might still index its URL, e.g. if it found the link on an external site.

You can control indexing with the robots meta element or the X-Robots-Tag HTTP header (see the sketch below).
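
For example, a minimal sketch of the header variant, assuming you (or whoever administers the server) can edit the Nginx configuration for myserver.department.uni.edu:

# In the server {} block for the hostname you want kept out of search results:
# ask crawlers not to index (or follow links from) anything served by this host.
add_header X-Robots-Tag "noindex, nofollow" always;
# ("always" also attaches the header to error responses, but requires Nginx 1.7.5+;
# drop it if the installed Nginx is older.)

The per-page equivalent is a <meta name="robots" content="noindex"> element in each HTML document's head; the header has the advantage of also covering non-HTML resources such as PDFs or images.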

You have to decide whether you want to allow search engines to crawl but not index, or to index but not crawl: if you disallow crawling in robots.txt, search engines won't be able to reach your documents, so they'll never see that you don't want them indexed.
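
Put together, the crawl-but-don't-index route for this host would look something like this (a sketch; it assumes the X-Robots-Tag header above, or the meta element, is already in place):

# robots.txt at the web root: an empty Disallow permits crawling,
# so crawlers can actually fetch the pages and see the noindex signal.
User-agent: *
Disallow: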
