
Limit Sitemap Access to Search Engines

@Si4351233

Posted in: #AccessControl #XmlSitemap

I have a sitemap file which generates the sitemap in real time and lists every page on the site. Eventually the site will have tens of thousands of pages, so generating the sitemap will take a fair amount of resources, and I don't want just anyone to be able to trigger it. What I am trying to achieve is similar to the way Stack Exchange does it: search engines requesting the sitemap can access it normally, but if a person tries to access the sitemap file directly they are served a 404 Not Found error.

Could anyone point me in the right direction for how to do this? I already have a rewrite rule in place to rewrite sitemap.xml to sitemap.php for the generating script, so now I just need to limit access to the file. I would prefer to do this through a .htaccess file or through the vhost file, but if it has to be done in PHP, so be it.

Thanks




1 Comment

@Megan663

One way to limit access to search engines is to publish the sitemap at a secret URL like /sitemap-poakunmecruight.xml. Instead of directing /sitemap.xml to that, or publishing that URL in robots.txt, only submit the sitemap URL to search engines via their webmaster tools. That way only search engines know where your sitemap is.

It also occurs to me that generating the sitemap in real time may not be the best solution. If it takes resources to generate (like database queries), it would be better to generate it just once a day. You could fairly easily write a cron job that runs something like curl -s example.com/sitemap.php > /var/www/example.com/sitemap.xml
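As a sketch, a daily crontab entry for that could look like the following (the domain and web-root path are illustrative and should match your own setup):

```
# Regenerate the sitemap once a day at 03:00 (illustrative paths)
0 3 * * * curl -s https://example.com/sitemap.php > /var/www/example.com/sitemap.xml
```

Since the file is then served statically, visitors hitting the cached sitemap.xml no longer cost you database queries.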

It sounds like you are leaning towards user agent sniffing to only allow bots. You can do that through .htaccess. I took the list of search engine bots from here.

BrowserMatchNoCase adsbot-google search_engine_bot
BrowserMatchNoCase aolbuild search_engine_bot
BrowserMatchNoCase baidu search_engine_bot
BrowserMatchNoCase bingbot search_engine_bot
BrowserMatchNoCase bingpreview search_engine_bot
BrowserMatchNoCase duckduckgo search_engine_bot
BrowserMatchNoCase googlebot search_engine_bot
BrowserMatchNoCase mediapartners-google search_engine_bot
BrowserMatchNoCase msnbot search_engine_bot
BrowserMatchNoCase slurp search_engine_bot
BrowserMatchNoCase teoma search_engine_bot
BrowserMatchNoCase yandex search_engine_bot

<Files "sitemap.php">
Order Deny,Allow
Deny from all
Allow from env=search_engine_bot
</Files>
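Note that Order/Deny/Allow is Apache 2.2 syntax (available in 2.4 only via mod_access_compat). On Apache 2.4 the equivalent, using mod_authz_core's Require directive, would be something like:

```
<Files "sitemap.php">
    # Allow only requests where the search_engine_bot env var was set above
    Require env search_engine_bot
</Files>
```

Keep in mind that user agent strings are trivial to spoof, and that blocked visitors will receive a 403 Forbidden by default rather than the 404 you mentioned.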
