Google references subfolders

@Sent6035632

Posted in: #Google #GoogleIndex #Indexing

Last week I launched a website and created a robots.txt:

User-agent: *
Disallow:
Sitemap: http://mywebsite.tld/sitemap_robots.xml

Now I've checked the indexed pages (Google -> site:mywebsite.tld) and found links to subfolders and non-public/internal files that are placed outside of the web root of the website. E.g.: mywebsite.tld/vendor/, mywebsite.tld/module/, mywebsite.tld/dev/, mywebsite.tld/vendor/zendframework/zendframework/tests/ZendTest/Test/PHPUnit etc.

Why are such folders being indexed? How is this possible if they are placed outside of the website root and there are no links to them? And how can I avoid having such paths indexed?



EDIT

The folder structure of my website is:

/composer.json
/composer.lock
/composer.phar
/config/
/data/
/dev/
/init_autoloader.php
/LICENSE.txt
/module/
/public/ <-- web root
/README.md
/temp/
/vendor/


It's a common structure of a Zend Framework 2 project (with some additional folders). The document root is the folder public (set in Plesk).
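For illustration, this is roughly what the corresponding Apache virtual host looks like once Plesk applies that document root setting; the vhost paths below are placeholders, not taken from the actual server:

# Hypothetical Apache vhost as generated by Plesk for this setup.
# Only files under public/ are reachable over HTTP; vendor/,
# module/, config/ etc. sit one level above and stay private.
<VirtualHost *:80>
    ServerName mywebsite.tld
    DocumentRoot /var/www/vhosts/mywebsite.tld/public

    <Directory /var/www/vhosts/mywebsite.tld/public>
        AllowOverride All
        Require all granted
    </Directory>
</VirtualHost>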


2 Comments


@Karen161

OK, I've finally found a plausible explanation of why Google has indexed folders and files outside of the document root:



The website I launched last week is the second version of a project. The first version was Joomla! based and its document root was the project root:


/var/www/.../mywebsite.tld


The second version is based on Zend Framework 2 and uses its common project structure. So the web root is


/var/www/.../mywebsite.tld/public


We had some technical troubles, so the initial deployment of the second version took several days, and only on the last day was the document root of the website changed to .../public. During all this time the index page was a plain directory listing of the project root.



So Googlebot indexed all these links, which are broken now but were returning status 200 a few days ago.
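A belt-and-braces measure against this failure mode is to switch off directory listings at the server level, so that a misconfigured document root at least doesn't expose the folder structure. A minimal sketch for Apache (which Plesk uses by default); the path is again a placeholder:

# With -Indexes, Apache returns 403 Forbidden instead of an
# auto-generated file listing when no index document exists.
<Directory /var/www/vhosts/mywebsite.tld>
    Options -Indexes
</Directory>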


@YK1175434

In order to block all robots from your entire website, the correct code is:

User-agent: *
Disallow: /


Don't forget the / (slash) after Disallow:.

To block only specific subdirectories and the webpages inside them with robots.txt, you have to list them one by one, so you need to know their names:

User-agent: *
Disallow: /vendor/
Disallow: /module/
...


To understand how robots.txt works, you can read the official documentation.
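One caveat: Disallow only stops crawling; URLs Google has already discovered may stay in the index. If such a path is still being served, sending a noindex header is a more reliable way to get it dropped, and the path must then remain crawlable so the header can be seen. A minimal sketch for Apache, assuming mod_headers is enabled:

# .htaccess placed inside a directory that should be de-indexed
# (requires mod_headers); applies to everything served from it.
Header set X-Robots-Tag "noindex"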



Moreover, for your information: when you buy a domain name, several websites are notified and put links to your website even if you don't want them to. That's why you need to block all robots if you don't want your webpages to be indexed.
