
Why is my site getting requests for URLs converted to lowercase?

@Bryan171

Posted in: #Apache #Browsers #Hacking #Url

On my sites I am seeing requests for what would be valid URLs, but with the path converted to lowercase.

For example, a valid URL is example.com/some-product-CAT12P0.html.

In my Apache logs I'm seeing example.com/some-product-cat12p0.html.

This is happening on several sites I manage, and I cannot see any pattern in the user agent.

An example log entry:


45.55.65.212 - - [24/Jan/2017:06:36:57 +0000] "GET /educational-assessments-cat12p0.html HTTP/1.1" 404 6011 "http://www.example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14"


The user agent can be Windows, OS X, iOS, Android, etc., across many different browsers.

The sites all run on a LAMP stack. I use mod_rewrite to convert the CAT12P0.html part of the URL into a query string that is passed to a PHP file.

I have of course checked my source HTML and sitemaps and all links are uppercase at the end as mod_rewrite expects.

Is this a bad bot, or could I be doing something to tell UAs to convert my links to lowercase?


2 Comments


@Megan663

These are usually hits from bad bots. Unfortunately, it is very common for bots to attempt to lowercase the entire URL. I have a website with mixed case URLs. I get thousands of hits per day for URLs that have been incorrectly lowercased. Here are the top user agents that did so yesterday:

20494 Mozilla/5.0 (compatible; Gluten Free Crawler/1.0; +http://glutenfreepleasure.com/)
312 Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)
281 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
252 Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)
77 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
55 Mozilla/5.0 (iPhone; CPU iPhone OS 10_2 like Mac OS X) AppleWebKit/602.3.12 (KHTML, like Gecko) Version/10.0 Mobile/14C92 Safari/602.1
20 YisouSpider
15 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0
14 ADmantX Platform Semantic Analyzer US - Turn - ADmantX Inc. - admantx.com - support@admantx.com
13 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36


As you can see, there are usually a couple of big offenders, but I had 120 distinct user agents that hit all-lowercase URLs on my site yesterday.
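If you want to produce a similar tally for your own site, here is a minimal Python sketch. It assumes the Apache "combined" log format shown in the question and that you can supply a set of your real, mixed-case paths. The names `count_lowercased_hits` and `canonical_paths` are made up for illustration and are not part of any tool mentioned here.

```python
import re
from collections import Counter

# Apache "combined" log format (assumed):
# host ident user [time] "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def count_lowercased_hits(log_lines, canonical_paths):
    """Count requests per user agent whose path is a lowercased copy of a known path.

    canonical_paths is a hypothetical set of the site's real, mixed-case paths.
    """
    # Only paths that actually contain uppercase letters can be "wrongly lowercased".
    lowered = {p.lower() for p in canonical_paths if p != p.lower()}
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        path = match.group("path").split("?", 1)[0]  # ignore any query string
        if path in lowered and path not in canonical_paths:
            counts[match.group("agent")] += 1
    return counts
```

Calling `count_lowercased_hits(open("access.log"), canonical_paths).most_common(10)` would give a top-ten list like the one above, where `access.log` is a placeholder for your own log file.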

Even Googlebot has gotten some requests in. That isn't because Googlebot itself has this problem, but because it finds all lowercase links somewhere on the web. It isn't exclusively a bot problem. Some people lowercase the whole URL before they link. Some scraper sites post lower case URLs. There is even an occasional content management software package that won't allow posting of mixed case URLs.

In short, while mixed case URLs are allowed by the spec, in practice it takes extra work to support them. Because it is such a common problem, you should put "301 permanent" redirects in place from the all-lowercase versions to the mixed case versions.
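The redirect logic itself is small. Here is a rough Python sketch of the decision, assuming you have a list of your canonical mixed-case paths. The function names `build_redirect_map` and `resolve` are hypothetical, and on a real LAMP site you would implement the same lookup in your PHP handler or rewrite rules rather than in Python.

```python
def build_redirect_map(canonical_paths):
    """Map the lowercased form of each canonical path back to the canonical path.

    Assumes no two canonical paths differ only by case; if they did,
    the later one would silently win in this sketch.
    """
    return {path.lower(): path for path in canonical_paths}


def resolve(path, canonical_paths, redirects):
    """Return (status, location) for a requested path."""
    if path in canonical_paths:
        return 200, path  # exact match, serve normally
    target = redirects.get(path.lower())
    if target is not None:
        return 301, target  # wrong case, redirect permanently to the canonical URL
    return 404, None  # genuinely unknown path
```

Because the lookup key is lowercased, this also catches partially lowercased variants, not only the fully lowercased ones.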


@YK1175434

Using capitals in URLs is allowed and it works, but the problem is that /a and /A are two different URLs (just as in your case).

Because this case sensitivity is standard behaviour, it's not hard to imagine a crawler, a bot, or anything else that indexes pages requesting the lowercased URL, or simply trying it to see what happens.

To avoid the situation you're in now, and to simplify URLs, it's good practice to always make URLs lowercase. A good rule of thumb is that you should be able to tell a URL to another person in the simplest manner possible, without exchanges like "No no, uppercase C. No, not the whole word".

Even if you want to use uppercase characters, which is easily done, you should internally redirect to lowercase so that /aaa and /AaA are treated the same (unless you have a good reason not to).
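As an illustration of that normalisation, here is a small Python sketch that only lowercases the path and leaves the query string alone, since query values may be case-sensitive. The function name `lowercase_redirect_target` is made up for this example, and whether you issue an external redirect or an internal rewrite with the result is up to your setup.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_redirect_target(url):
    """Return the URL with a lowercased path, or None if the path is already lowercase."""
    parts = urlsplit(url)
    lowered_path = parts.path.lower()
    if lowered_path == parts.path:
        return None  # nothing to do, avoids redirect loops
    # Keep scheme, host, query and fragment unchanged; only the path is normalised.
    return urlunsplit(
        (parts.scheme, parts.netloc, lowered_path, parts.query, parts.fragment)
    )
```

For example, a request for /AaA would be sent to /aaa, while a request for /aaa would be served as-is.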

moz.com/blog/15-seo-best-practices-for-structuring-urls
wiredimpact.com/blog/never-use-capital-letters-urls/
