How to hide my XML Sitemap from competitors but not from search engines
I want to hide my sitemap XML file from everyone, but allow access to it from search engines.
What is the way to do it?
I want to hide the depth of my site's content from competitors.
6 Comments
The crafty solution is to generate two sitemaps. The first is for the benefit of your competitors; the second is for the benefit of your preferred search engines. In military parlance, the first sitemap is a feint.
The 'feint' contains your basic website structure: home page, contact us, about us, main categories. It looks like the real deal, it will work fine in the obscure search engines you do not care about, and it will be of no use to your competitors. Allow it to be indexed so that they find it, and give it an obvious name like sitemap.xml.
Now create your real sitemap with code. Give it a name such as 'product-information-sitemap.xml' so that it is a sensible name but not actually any easier to guess than your password.
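A minimal sketch of what that generation code might look like, assuming Python and a hypothetical get_product_urls() helper that stands in for however you enumerate your products (database query, CMS API, and so on):

# Sketch: write product-information-sitemap.xml from your product catalogue.
from datetime import date
from xml.sax.saxutils import escape

def get_product_urls():
    # Hypothetical placeholder data; in reality this comes from your database.
    yield "http://www.example.com/products/widget-a", date(2024, 1, 15)
    yield "http://www.example.com/products/widget-b", date(2024, 2, 3)

def write_sitemap(path="product-information-sitemap.xml"):
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url, lastmod in get_product_urls():
            f.write("  <url>\n")
            f.write(f"    <loc>{escape(url)}</loc>\n")
            f.write(f"    <lastmod>{lastmod.isoformat()}</lastmod>\n")
            f.write("  </url>\n")
        f.write("</urlset>\n")

if __name__ == "__main__":
    write_sitemap()

The feint sitemap.xml can be produced the same way from a hand-picked handful of top-level pages.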
In your Apache config for the sitemap folder, put something in place so that this second sitemap can be accessed by search engines but not indexed:
<IfModule mod_headers.c>
<Files product-information-sitemap.xml>
Header set X-Robots-Tag "noindex"
</Files>
</IfModule>
Now create the code to keep the real sitemap updated, and consider a third sitemap for images. Downgrade the real sitemap as required to create the 'feint'. Pay attention to the timestamps too: Google does pay attention to those, and this matters if your sitemap is a big one.
Now create a 'cron' job to submit your product sitemap to Google on a regular basis. In your crontab, add something like this to submit your real sitemap every week:
0 0 * * 0 wget -q -O /dev/null "https://www.google.com/webmasters/tools/ping?sitemap=http\%3A\%2F\%2Fwww.example.com\%2Fsitemaps\%2Fproduct-information-sitemap.xml"
Note that the sitemap URL is URL-encoded, and that the % signs are escaped with backslashes because cron would otherwise treat a bare % in the command as a newline.
You can also gzip your sitemap if size is an issue, although your web server should serve it gzipped anyway if you have compression enabled.
Your robots.txt does not have to be anything special; as long as it does not bar entry to your sitemaps it should be fine. There really is no need to send different robots.txt files out based on user agent strings or anything so complicated. Just pull your precious content out into a supplementary, non-advertised file and submit it to Google on a cron job (rather than waiting for the bot). Simple.
One way you can try: in a usual crawling session, Google's bots access robots.txt first and then go to the sitemap file. Set a cookie on all servings of robots.txt and allow access to the sitemap only to visitors with that cookie. The problem is that Google's bots don't accept cookies, so do the opposite: set a cookie when a user accesses any page other than robots.txt, and deny access to the sitemap to visitors who carry the cookie.
Also, give your sitemap a scrambled name, something that changes over time and is unguessable. If your competitors have cookies enabled in their browser, it will be extremely difficult for them to access the sitemap unless they follow the exact path a search engine follows.
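A rough sketch of that cookie logic, assuming a Python/Flask front end; the scrambled route name and file paths are placeholders, and a real site would implement the same idea in whatever already serves its pages:

# Sketch: ordinary page views set a 'visited' cookie, robots.txt does not,
# and the sitemap is refused to any client carrying the cookie.
from flask import Flask, abort, make_response, request, send_file

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    # Served without setting the cookie, so a crawler that goes
    # robots.txt -> sitemap never picks it up.
    return send_file("robots.txt")

@app.route("/sitemap-scrambled-name.xml")  # placeholder for the unguessable name
def sitemap():
    if "visited" in request.cookies:
        abort(403)  # a human who has browsed the site carries the cookie
    return send_file("product-information-sitemap.xml")

@app.route("/")
def home():
    resp = make_response("...your page...")
    resp.set_cookie("visited", "1")  # every normal page view marks the browser
    return resp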
The first step would be to detect the User-Agent of the bots you want to allow, and serve a different file to any User-Agent you do not want to allow.
For example, you could have two versions of robots.txt, one with and one without a reference to the sitemap, so your competitors won't find the sitemap if they look inside your robots.txt.
Then, you could detect visits to your sitemap URL and serve the site map only when the UA is correct. If you serve a generic 404 page otherwise, your competitors may not even know your sitemap exists.
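A rough sketch of that idea, again assuming Python/Flask; the robots file names are placeholders, and the User-Agent substring check is exactly the kind of spoofable test discussed next:

# Sketch: mention the sitemap in robots.txt only for known crawlers,
# and answer 404 for the sitemap URL to everyone else.
from flask import Flask, abort, request, send_file

app = Flask(__name__)
ALLOWED_BOT_TOKENS = ("Googlebot", "bingbot")  # simple substring match, easily spoofed

def looks_like_allowed_bot():
    ua = request.headers.get("User-Agent", "")
    return any(token in ua for token in ALLOWED_BOT_TOKENS)

@app.route("/robots.txt")
def robots():
    # robots-with-sitemap.txt / robots-plain.txt are hypothetical file names.
    name = "robots-with-sitemap.txt" if looks_like_allowed_bot() else "robots-plain.txt"
    return send_file(name)

@app.route("/product-information-sitemap.xml")
def sitemap():
    if not looks_like_allowed_bot():
        abort(404)  # pretend the sitemap does not exist
    return send_file("product-information-sitemap.xml")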
However, all measures described up to this point are merely security through obscurity. A User-Agent can easily be spoofed.
Therefore, Google recommends that, to detect the real GoogleBot, you:
Perform a reverse DNS lookup for the IP address claiming to be GoogleBot.
Check if the host is a sub-domain of googlebot.com.
Perform a normal DNS lookup for the sub-domain.
Check if the sub-domain points to the IP address of the bot crawling your site.
Microsoft advises using the same procedure to detect their crawler, and this trick works for Yahoo! as well.
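A minimal sketch of that forward-confirmed reverse DNS check, assuming Python and only its standard socket module; the hostname suffixes shown are the commonly documented ones, so check each vendor's current documentation:

# Sketch: reverse-resolve the claimed crawler IP, check the hostname suffix,
# then forward-resolve the hostname and confirm it points back to that IP.
import socket

ALLOWED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(ip):
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # reverse DNS lookup
    except socket.herror:
        return False
    if not host.endswith(ALLOWED_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]   # normal DNS lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                             # must point back to the caller

Because DNS lookups are slow, you would normally cache the result per IP address rather than repeating the check on every request.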
Note
You don't need to use a 404 error if you use the DNS-lookup based spider detection.
The purpose of using the 404 error page is to conceal that your sitemap exists at all. However, if you are using the more advanced technique, which does not rely solely on User-Agent headers, it should not be possible to circumvent, so you can safely use a different error code, such as 403 Forbidden, which is the correct code to use here.
If you have the IP addresses of the bots you want to allow:
<Limit GET POST PUT>
Order Deny,Allow
Deny from all
# Apache comments must sit on their own line, so list the bot IPs here:
Allow from 192.168.1.1
Allow from 192.168.1.2
Allow from 192.168.1.3
</Limit>
If you want it based on the user agent string:
# Set good_bot_* when the User-Agent matches (spoofable, as noted above).
BrowserMatchNoCase Googlebot good_bot_1
BrowserMatchNoCase bingbot good_bot_2
Order Allow,Deny
Allow from env=good_bot_1
Allow from env=good_bot_2
I am assuming I have understood your requirement correctly, so I will venture an answer.
Give an image link to your sitemap just before your closing </body> tag, using a transparent 1px GIF file:
<a href="sitemap.xml"><img src="transparent.gif" alt="" height="1" width="1" /></a>
In the page which carries the link to your sitemap, set the related meta tag:
<meta name="robots" content="{index or noindex},follow">
Check the visual state when you press Ctrl+A to select the whole page: is the 1px link visible, and is that a risk for you?
If you say yes, maybe another option is:
create a link to your sitemap: <a href="sitemap.xml"> </a>
change the font color to match the background color
use CSS techniques to hide this link behind an image
This way an incurious normal user won't notice your link, but search engines will be aware of it.
But please be aware that the inherent nature of your question involves an impossibility.
I say impossibility because if a user searches in Google, for example, with the term
site:www.yoursite.com
the whole world can see all of your links, if they don't get tired of clicking through the results.
I hope this helps.
The problem is that if you (quite rightly) want your content to be indexed by search engines, anyone who performs a site: search in one of the search engines will be able to see what URLs are indexed.
If you want to "hide" your sitemap you could have it on a URL with a "secret" name so it's not obvious to anyone who may be looking for it, but seeing as it's best practice to include a sitemap in a robots.txt file and upload it to one of the search engines' webmaster tools profiles, as other people have said, it's hard to see why you'd need to do this.