
Using robots.txt to deny access to MediaWiki special pages using substring matching

@Berryessa370

Posted in: #Indexing #Mediawiki #RobotsTxt #WebCrawlers

I am running a MediaWiki at the domain someurl.com/wiki/. Unfortunately it generates a bunch of automatically generated Special pages which are mostly of low quality, but which are nevertheless crawled heavily by search engines with requests like:

/index.php/Special:Whatlinkshere/some_topic
or equivalently
/index.php?title=Special:Whatlinkshere&target=some_topic
where some_topic is an article in the wiki.

These requests seem to be of very low benefit, but they consume a good deal of bandwidth, and in addition I fear those automatically generated pages are not good for my site's quality reputation in the search engines' evaluation.

As the requests are mostly made by 'good' engines such as Google or Bing, I am quite sure they would obey robots.txt. So I added the following robots.txt to the folder of the base URL someurl.com (I include the whole robots.txt, even though only lines 1 and 6 are relevant for the queries named above):

User-agent: *

Disallow: User:
Disallow: Discussion:
Disallow: MediaWiki:
Disallow: Special:

Disallow: /login.php
Disallow: /profile.php

Disallow: /author/
Disallow: /category/
Disallow: /tag/


This robots.txt has been active for about two days now and has already been crawled, but there are still many requests to URLs like those above, which I thought were blocked.

So I have the following questions now:

1) Is the above logic correct and capable of denying access (to well-behaved bots)? In particular, I wonder whether Disallow: Special: really works as a wildcard that denies every request containing "Special:" in the URL or in a parameter. I also wonder whether the ":" in "Special:" might be a problem.

2) If so, why is there no effect yet? Do I simply need to allow more time for it to take effect?

3) Will disallowing these URLs in robots.txt lead to the pages being de-indexed from the search engine results? If not, how can I get this huge number of automatically generated URLs de-indexed?




1 Comment


@BetL925

robots.txt disallow rules are all "starts with" rules, not substring rules.

MediaWiki suggests using this in robots.txt for a case like yours:

User-agent: *
Disallow: /index.php?
Disallow: /index.php/Help
Disallow: /index.php/MediaWiki
Disallow: /index.php/Special:
Disallow: /index.php/Template
Disallow: /skins/
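
If you want to convince yourself of the "starts with" behaviour, here is a small sketch using Python's urllib.robotparser (as far as I know it implements only the original prefix matching, with no wildcard support; the domain and URL are just the examples from your question):

# Sketch: compare the original "Disallow: Special:" rule with the suggested
# "Disallow: /index.php/Special:" rule. urllib.robotparser does plain
# prefix matching on the request path, like the original robots.txt spec.
from urllib.robotparser import RobotFileParser

ORIGINAL_RULES = """\
User-agent: *
Disallow: Special:
"""

SUGGESTED_RULES = """\
User-agent: *
Disallow: /index.php/Special:
"""

TEST_URL = "http://someurl.com/index.php/Special:Whatlinkshere/some_topic"

for name, rules in [("original", ORIGINAL_RULES), ("suggested", SUGGESTED_RULES)]:
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    allowed = parser.can_fetch("*", TEST_URL)
    print(f"{name} rules: {'allowed' if allowed else 'blocked'}")

# Expected output:
#   original rules: allowed     <- "Special:" is not a prefix of the path
#   suggested rules: blocked    <- "/index.php/Special:" is a prefix of the path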


Google says that it, along with some other major search engines, supports a more advanced syntax:


Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:


* designates 0 or more instances of any valid character
$ designates the end of the URL



For those user agents you could use rules like:

Disallow: *Help
Disallow: *MediaWiki
Disallow: *Special:
Disallow: *Template


Other crawlers will simply ignore those rules, because they read them as literal prefixes and none of your URLs start with them.
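
To make those wildcard semantics concrete, here is a rough sketch in Python of how * and $ are meant to match (an illustration only, not Google's actual matching code; it also ignores the precedence rules between Allow and Disallow):

# Sketch of the documented wildcard semantics: "*" matches any run of
# characters, "$" anchors the end of the URL. Illustration only.
import re

def rule_matches(rule, path):
    # Translate the robots.txt rule into a regular expression.
    pattern = ""
    for ch in rule:
        if ch == "*":
            pattern += ".*"
        elif ch == "$":
            pattern += "$"
        else:
            pattern += re.escape(ch)
    # re.match anchors at the start of the path, as robots.txt rules do.
    return re.match(pattern, path) is not None

rules = ["*Help", "*MediaWiki", "*Special:", "*Template"]
urls = [
    "/index.php/Special:Whatlinkshere/some_topic",
    "/index.php?title=Special:Whatlinkshere&target=some_topic",
    "/index.php/Some_article",
]

for url in urls:
    blocked = any(rule_matches(rule, url) for rule in rules)
    print(f"{url} -> {'blocked' if blocked else 'allowed'}")

# Expected output:
#   /index.php/Special:Whatlinkshere/some_topic -> blocked
#   /index.php?title=Special:Whatlinkshere&target=some_topic -> blocked
#   /index.php/Some_article -> allowed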


