How can I figure out how a search engine is finding hidden pages?
We have a system that hosts many websites for our customers, and inside that system there is a way for customers whose sites are not yet live to view them before we turn them on. Say the link is something like this:
ourbigcompany.com/customer/domain=thisisanewsiteurl
Those links are not linked to anywhere outside a secure login - they are only sent to the customer via email. They are publicly viewable, as they have to be, but that's not the real problem. The real problem is that somehow Bing is getting hold of them and trying to crawl the sites. I know how to stop the crawling, but that would be like treating the symptoms without fixing the problem.
We log the traffic and there is no referrer - so that is not helpful.
If I change the querystring value for a site, Bing has it within hours. I need to figure out where Bing is getting the links from so that I can close what is obviously a security hole, but I am not sure how. Any ideas on how to figure that out?
1 Comment
You won't be able to know for sure how search engines got the URL. They don't tell you that information. There are several possible ways that it could have happened:
The user shares or publishes the link themselves
The site has a link to another site. When that link is clicked, the secret URL is sent as a referrer. Some sites publish referrer URLs where search engines can find them.
Some browsers send information about every page you visit directly to the companies that run search engines. Google, at least, says it does not rely on any data sent this way to feed its crawler. Some browser features that rely on this are:
Safe browsing features that flag malware pages as you surf
PageRank indicator toolbars
Usage of social buttons on the page such as Google +1 buttons
Usage of analytics software
Inclusion of advertisements on the site
Any 3rd party JavaScript, CSS, or image usage
The email you send with the link passes through an email server owned by the search engine (Gmail, Hotmail). Links in such an email could be harvested for crawling.
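Several of the possibilities above come down to a preview page loading or linking to something on another host. One way to narrow things down is to list every off-site URL a preview page references. The following is only a minimal sketch using Python's standard library; the class name `ThirdPartyFinder`, the helper `third_party_hosts`, and the sample HTML are hypothetical, and it only looks at `href` and `src` attributes, so it will miss resources pulled in by scripts or CSS.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Attributes that make a browser fetch from, or navigate to, another URL.
URL_ATTRS = {"href", "src"}


class ThirdPartyFinder(HTMLParser):
    """Collect (tag, host) pairs for URLs that point off the page's own host."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.own_host = urlparse(page_url).hostname
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative URLs against the page before comparing hosts.
                host = urlparse(urljoin(self.page_url, value)).hostname
                if host and host != self.own_host:
                    self.found.append((tag, host))


def third_party_hosts(page_url, html):
    """Return every (tag, host) in the HTML that refers to a different host."""
    finder = ThirdPartyFinder(page_url)
    finder.feed(html)
    return finder.found


# Hypothetical preview page, for illustration only.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="https://analytics.example.net/tracker.js"></script>
</head><body>
<img src="logo.png">
<a href="https://partner.example.org/">Partner</a>
</body></html>
"""

PREVIEW_URL = "https://ourbigcompany.com/customer/domain=thisisanewsiteurl"

for tag, host in third_party_hosts(PREVIEW_URL, SAMPLE_HTML):
    print(tag, host)
```

Each host this reports is a party that could see the preview URL as a referrer, so it is a candidate to investigate, not proof of where the leak is.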
As Google says:
It's almost impossible to keep a web server secret by not publishing links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log...
If you want to prevent Googlebot from crawling content on your site, you have a number of options, including using robots.txt to block access to files and directories on your server.
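As a rough illustration of the robots.txt option mentioned in the quote, the sketch below checks a candidate rule set with Python's standard `urllib.robotparser` before it is deployed. The `Disallow: /customer/` path is an assumption taken from the example link in the question, and `is_allowed` is a hypothetical helper. Note that this only asks well-behaved crawlers not to fetch the pages; it does not explain how the URLs leaked.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: keep all crawlers out of the preview path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /customer/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


preview = "https://ourbigcompany.com/customer/domain=thisisanewsiteurl"
home = "https://ourbigcompany.com/"

print(is_allowed(ROBOTS_TXT, "bingbot", preview))  # preview path is disallowed
print(is_allowed(ROBOTS_TXT, "bingbot", home))     # rest of the site is not
```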