Google references subfolders
Last week I launched a website and created a robots.txt:
User-agent: *
Disallow:
Sitemap: http://mywebsite.tld/sitemap_robots.xml
Now I've checked the indexed pages (Google -> site:mywebsite.tld) and found links to subfolders and non-public/internal files that are placed outside of the web root of the website, e.g.: mywebsite.tld/vendor/, mywebsite.tld/module/, mywebsite.tld, mywebsite.tld/dev/, mywebsite.tld/vendor/zendframework/zendframework/tests/ZendTest/Test/PHPUnit etc.
Why are such folders being indexed? How is that possible, if they are placed outside of the website root and there are no links to them? How can I avoid having such paths indexed?
EDIT
The folder structure of my website is:
/composer.json
/composer.lock
/composer.phar
/config/
/data/
/dev/
/init_autoloader.php
/LICENSE.txt
/module/
/public/ <-- web root
/README.md
/temp/
/vendor/
It's the common structure of a Zend Framework 2 project (with some additional folders). The document root is the folder public (set in Plesk).
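Roughly, this is what that setting boils down to on the web-server side (a minimal Apache virtual-host sketch, assuming Plesk runs Apache here; the /var/www/.../ part is a placeholder for the real path):
<VirtualHost *:80>
    ServerName mywebsite.tld
    # Only public/ is exposed as the document root;
    # composer.json, config/, module/, vendor/, dev/ etc. stay outside of it
    DocumentRoot "/var/www/.../mywebsite.tld/public"
    <Directory "/var/www/.../mywebsite.tld/public">
        AllowOverride All
        Require all granted
    </Directory>
</VirtualHost>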
2 Comments
OK, I've finally found a plausible explanation for why Google has indexed folders and files outside of the document root:
The website I launched last week is the second version of a project. The first version was Joomla! based and its document root pointed to the project root:
/var/www/.../mywebsite.tld
The second version is based on Zend Framework 2 and uses its common project structure, so the web root is
/var/www/.../mywebsite.tld/public
We had some technical troubles, so the initial deployment of the second version took several days, and only on the last day was the document root of the website changed to .../public. During all this time the project root was still being served as the website root.
So the Googlebot indexed all these links, which are broken now but were returning status 200 a few days ago.
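As a quick check (example shell commands, assuming curl is available; the URLs are taken from the question), you can verify that those indexed paths no longer answer with 200 now that the document root points to public/:
# HEAD-request two of the indexed URLs and print only the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" -I http://mywebsite.tld/vendor/
curl -s -o /dev/null -w "%{http_code}\n" -I http://mywebsite.tld/module/
# Expected after switching the document root to .../public: 404 (or 403), not 200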
In order to block all robots on your website, the correct code is:
User-agent: *
Disallow: /
Don't forget the / (slash) after Disallow:.
To block only specific subdirectories and their internal webpages with robots.txt, you have to list them one by one, so you need to know their names:
User-agent: *
Disallow: /vendor/
Disallow: /module/
...
To understand how robots.txt works, you can read this or this.
Moreover, for your information, when you buy a domain name, several websites are notified and will put links to your website even if you don't want them to. That's why you need to block all robots if you don't want your webpages to be indexed.