Google references subfolders
Last week I launched a website and created a robots.txt:
User-agent: *
Disallow:
Sitemap: http://mywebsite.tld/sitemap_robots.xml
Now I've checked the indexed pages (Google -> site:mywebsite.tld) and found links to subfolders and non-public/internal files that are placed outside of the web root of the website, e.g.: mywebsite.tld/vendor/, mywebsite.tld/module/, mywebsite.tld, mywebsite.tld/dev/, mywebsite.tld/vendor/zendframework/zendframework/tests/ZendTest/Test/PHPUnit etc.
Why are such folders being indexed? How is that possible, if they are placed outside of the website root and there are no links to them? How can I avoid having such paths indexed?
EDIT
The folder structure of my website is:
/composer.json
/composer.lock
/composer.phar
/config/
/data/
/dev/
/init_autoloader.php
/LICENSE.txt
/module/
/public/ <-- web root
/README.md
/temp/
/vendor/
It's the common structure of a Zend Framework 2 project (with some additional folders). The document root is the folder public (set in Plesk).
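Roughly, this is what that setting boils down to on the web-server side (a minimal Apache virtual-host sketch, assuming Plesk runs Apache here; the /var/www/.../ part is a placeholder for the real path):
<VirtualHost *:80>
    ServerName mywebsite.tld
    # Only public/ is exposed as the document root;
    # composer.json, config/, module/, vendor/, dev/ etc. stay outside of it
    DocumentRoot "/var/www/.../mywebsite.tld/public"
    <Directory "/var/www/.../mywebsite.tld/public">
        AllowOverride All
        Require all granted
    </Directory>
</VirtualHost>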
2 Comments
OK, I've finally found a plausible explanation for why Google has indexed folders and files outside of the document root:
The website I launched last week is the second version of a project. The first version was Joomla! based and its document root pointed to the project root:
/var/www/.../mywebsite.tld
The second version is based on Zend Framework 2 and uses its common project structure, so the web root is
/var/www/.../mywebsite.tld/public
We had some technical troubles, so the initial deployment of the second version took several days, and only on the last day was the document root of the website changed to .../public. During all this time the project root was still being served as the website root.
So the Googlebot indexed all these links, which are broken now but were returning status 200 a few days ago.
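As a quick check (example shell commands, assuming curl is available; the URLs are taken from the question), you can verify that those indexed paths no longer answer with 200 now that the document root points to public/:
# HEAD-request two of the indexed URLs and print only the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" -I http://mywebsite.tld/vendor/
curl -s -o /dev/null -w "%{http_code}\n" -I http://mywebsite.tld/module/
# Expected after switching the document root to .../public: 404 (or 403), not 200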
In order to block all robots on your website, the correct code is:
User-agent: *
Disallow: /
Don't forget the / (slash) after Disallow:.
To block only specific subdirectories and their internal webpages with robots.txt, you have to list them one by one, so you need to know their names:
User-agent: *
Disallow: /vendor/
Disallow: /module/
...
To understand how robots.txt works, you can read this or this.
Moreover, for your information, when you buy a domain name, several websites are notified and will put links to your website even if you don't want them to. That's why you need to block all robots if you don't want your webpages to be indexed.