Using robots.txt to deny access to MediaWiki special pages using substring matching
I am running a MediaWiki at someurl.com/wiki/. Unfortunately it generates a bunch of automatically created Special pages, which are mostly of low quality but are nevertheless scanned massively by search engines with queries like:
/index.php/Special:Whatlinkshere/some_topic
or, alternatively,
/index.php?title=Special:Whatlinkshere&target=some_topic
where some_topic is an article in the wiki.
These requests seem to be of very low benefit, but they consume a good deal of bandwidth, and in addition I fear that those automatically generated pages are not good for my site's quality reputation in the search engines' evaluation.
As the requests are mostly made by 'good' engines such as Google or Bing, I am quite sure they would obey robots.txt. So I added the following robots.txt to the folder of the base URL someurl.com (I added the whole robots.txt, even though only lines 1 and 6 are relevant for the queries named above):
User-agent: *
Disallow: User:
Disallow: Discussion:
Disallow: MediaWiki:
Disallow: Special:
Disallow: /login.php
Disallow: /profile.php
Disallow: /author/
Disallow: /category/
Disallow: /tag/
This robots.txt has been active for about two days now and has already been crawled, but there are still many requests to URLs like those above, which I thought were blocked.
So I have the following questions now:
1) Is the above logic correct and capable of denying access (to well-behaved bots)? In particular, I wonder whether Disallow: Special: correctly works as a wildcard to deny all requests that have "Special:" in the URL path or in a parameter. I also wonder whether the ":" in "Special:" might be a problem.
2) If so, why is there no effect yet? Do I just have to allow more time to see it?
3) Will disallowing these pages in robots.txt lead to them being de-indexed from the search engine results? If not, how can I get this huge number of automatically generated URLs de-indexed?
robots.txt disallow rules are all "starts with" rules, not substring rules.
MediaWiki suggests using this in robots.txt for a case like yours:
User-agent: *
Disallow: /index.php?
Disallow: /index.php/Help
Disallow: /index.php/MediaWiki
Disallow: /index.php/Special:
Disallow: /index.php/Template
Disallow: /skins/
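If you want to sanity-check rules like these before deploying them, Python's standard urllib.robotparser is one option. Note that, like the original robots.txt standard, it only does prefix matching and does not understand the wildcard extensions mentioned below, so it is only a rough approximation of what any given crawler will do. A minimal sketch, using the example URLs from the question:

from urllib.robotparser import RobotFileParser

def blocked(rules, url, agent="*"):
    # True if `url` is disallowed for `agent` under the given robots.txt text.
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return not rp.can_fetch(agent, url)

# Only the rule relevant to the Special: URLs, as in the question.
original = "User-agent: *\nDisallow: Special:\n"

# The MediaWiki-suggested, path-anchored rules.
suggested = (
    "User-agent: *\n"
    "Disallow: /index.php?\n"
    "Disallow: /index.php/Special:\n"
)

urls = [
    "/index.php/Special:Whatlinkshere/some_topic",
    "/index.php?title=Special:Whatlinkshere&target=some_topic",
]

for url in urls:
    # "Disallow: Special:" never applies, because the path does not *start* with it.
    print(url, "| original rules block it:", blocked(original, url))
    # The path-anchored rules catch both URL forms.
    print(url, "| suggested rules block it:", blocked(suggested, url))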
Google says that it, along with some other major search engines, supports more advanced syntax:
Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:
* designates 0 or more instances of any valid character
$ designates the end of the URL
For those user agents you could use rules like:
Disallow: *Help
Disallow: *MediaWiki
Disallow: *Special:
Disallow: *Template
Other crawlers will simply ignore those wildcard rules, because none of your URLs literally starts with any of those patterns.
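For the wildcard-aware crawlers, the matching roughly works like this: the pattern is applied from the start of the URL path (plus query string), '*' matches any run of characters, and a trailing '$' anchors the end. Here is a rough illustrative sketch in Python; the helper wildcard_rule_matches and the third URL are made up for this example, and this is only an approximation of what Googlebot actually does:

import re

def wildcard_rule_matches(rule_path, url):
    # Approximate Google-style matching: the rule is anchored at the start
    # of the URL, '*' matches any run of characters, and a trailing '$'
    # anchors the end of the URL.
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, url) is not None

rules = ["*Help", "*MediaWiki", "*Special:", "*Template"]
urls = [
    "/index.php/Special:Whatlinkshere/some_topic",
    "/index.php?title=Special:Whatlinkshere&target=some_topic",
    "/index.php/Some_ordinary_article",   # hypothetical article URL
]

for url in urls:
    hits = [r for r in rules if wildcard_rule_matches(r, url)]
    print(url, "->", "blocked by " + ", ".join(hits) if hits else "allowed")

Thanks to the leading '*', both URL forms from the question match *Special:, while an ordinary article URL matches none of the rules and stays crawlable.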