Why are URLs case-sensitive?
My question: When URLs were first designed, why was case-sensitivity made a feature? I ask this because it seems to me (i.e., a layperson) that case-insensitivity would be preferred to prevent needless errors and simplify an already complicated string of text.
Also, is there a real purpose/advantage to having a case-sensitive URL (as opposed to the vast majority of URLs that point to the same page no matter the capitalization)?
Wikipedia, for example, is a website that is sensitive to letter case (except for the first character):
en.wikipedia.org/wiki/StAck_Exchange is DOA.
Case sensitivity does have value.
If there are 26 letters, each of which can also be capitalized, that's 52 characters.
A string of 4 characters has 52*52*52*52 possible combinations, equaling 7,311,616 combinations.
If you cannot capitalize the characters, the number of combinations is 26*26*26*26 = 456,976.
There are 16 times as many combinations for 52 characters as there are for 26. So for storing data, URLs can be shorter, and more information can be passed over networks with less data transferred.
This is why you see YouTube using short, mixed-case video IDs in its URLs.
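A minimal sketch to verify that arithmetic (plain Python, nothing assumed beyond the numbers above):

    # Number of 4-character identifiers with and without case sensitivity.
    lower_only = 26 ** 4   # case-insensitive alphabet
    mixed_case = 52 ** 4   # case-sensitive alphabet

    print(lower_only)                # 456976
    print(mixed_case)                # 7311616
    print(mixed_case // lower_only)  # 16 -- doubling the alphabet multiplies
                                     # the count by 2**4 for 4 characters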
How should one read a "why was it designed this way?" question? Are you asking for a historically-accurate account of the decision-making process, or are you asking "why would anyone design it this way?"?
It's very rarely possible to get a historically-accurate account. Sometimes when decisions are made in standards committees there is a documentary trail of how the debate was conducted, but in the early days of the web decisions were made hastily by a few individuals - in this case probably by TimBL himself - and the rationale is unlikely to have been written down. But TimBL has admitted that he made mistakes in the design of URLs - see www.dailymail.co.uk/sciencetech/article-1220286/Sir-Tim-Berners-Lee-admits-forward-slashes-web-address-mistake.html
In the early days URLs mapped very directly to filenames, and the files were generally on Unix-like machines, and Unix-like machines have case-sensitive filenames. So my guess is that it just happened that way for implementation convenience, and usability (for end-users) was never even considered. Again, in the early days the users were all Unix programmers anyway.
This has nothing to do with where you bought your domain; DNS is not case-sensitive. But the file system on the server you are using for hosting is.
This isn't really an issue, and it's fairly common on *nix hosts. Just make sure all links you write on your pages are correct and you won't have a problem. To make it easier, I recommend always naming your pages in all lower case; then you never need to double-check the name when writing a link.
I stole from the blog The Old New Thing the habit of approaching questions of the form "why is something the case?" with the counter-question "what would the world be like, if it were not the case?"
Say I set up a web server to serve my document files from a folder so I could read them on the phone when I was out of the office. Now, in my documents folder, I have three files, todo.txt, ToDo.txt and TODO.TXT (I know, but it made sense to me when I made the files).
What URL would I like to be able to use, to access these files? I would like to access them in an intuitive way, using www.example.com/docs/filename.
Say I have a script which lets me add a contact to my addressbook, which I can also do over the web. How should that take its parameters? Well, I'd like to use it like: www.example.com/addcontact.php?name=Tom McHenry von der O'Reilly. But if there were no way for me to specify the name by case, how would I do that?
How would I differentiate the wiki pages for Cat and CAT, Text and TEXT, latex and LaTeX? Disambig pages, I guess, but I prefer just getting the thing I asked for.
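As a small aside on that contact example: percent-encoding preserves case exactly, which is what lets the server receive the name as typed. A quick sketch using the hypothetical addcontact.php endpoint from above:

    from urllib.parse import urlencode

    # The mixed-case name survives URL encoding untouched;
    # only the spaces and the apostrophe are escaped.
    query = urlencode({"name": "Tom McHenry von der O'Reilly"})
    print("http://www.example.com/addcontact.php?" + query)
    # http://www.example.com/addcontact.php?name=Tom+McHenry+von+der+O%27Reilly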
But all that feels like it's answering the wrong question, anyway.
The question I think you were really asking is "Why do web servers 404 you just for a case difference, when they are computers, designed to make life simpler, and they are perfectly capable of finding at least the most obvious case-variations in the URL I typed that would work?"
The answer to which is that while some sites have done this (and better, they check for other typos too), nobody's thought it worthwhile to change a webserver's default 404 error page to do that... but maybe they should?
Why wouldn't the URL be case sensitive?
I understand that may look like a provocative (and "devil's advocate") type of rhetorical question, but I think it's useful to consider. The design of HTTP is that a "client", which we commonly call a "web browser", asks the "web server" for data.
There are many, many different web servers out there. Microsoft has released IIS with Windows Server operating systems (and others, including Windows XP Professional). The Unix world has heavyweights like nginx and Apache, not to mention smaller offerings like OpenBSD's internal httpd, or thttpd, or lighttpd. Additionally, many network-capable devices have built-in web servers that can be used to configure the device, including devices with purposes specific to networks, like routers (including many Wi-Fi access points and DSL modems) and other devices like printers or UPSs (battery-backed uninterruptible power supply units) which may have network connectivity.
So the question, "Why are URLs case-sensitive?", is asking, "Why do the web servers treat the URL as being case sensitive?" And the actual answer is: they don't all do that. At least one web server, which is fairly popular, is typically NOT case sensitive. (The web server is IIS.)
A key reason for the different behavior between web servers probably boils down to a matter of simplicity. The simple way to make a web server is to do things the same way the computer/device's operating system locates files; many times, web servers locate a file in order to provide a response. Unix was designed around higher-end computers, and so Unix provided the desirable functionality of allowing uppercase and lowercase letters. Unix decided to treat uppercase and lowercase as different because, well, they are different; that's the straightforward, natural thing to do. Windows has a history of being case-insensitive due to a desire to support already-created software, and this history goes back to DOS, which stored filenames in uppercase only, possibly in an effort to simplify things on less powerful computers that used less memory. Since these operating systems are different, the result is that simply-designed (early versions of) web servers reflect the same differences.
Now, with all that background, here are some specific answers to the specific questions:
When URLs were first designed, why was case-sensitivity made a feature?
Why not? If all standard web servers were case-insensitive, that would indicate that the web servers were following a set of rules specified by the standard. There was simply no rule that says that case needs to be ignored. The reason that there is no rule is simply that there was no reason for there to be such a rule. Why bother to make up unnecessary rules?
I ask this because it seems to me (i.e., a layperson) that case-insensitivity would be preferred to prevent needless errors and simplify an already complicated string of text.
URLs were designed for machines to process. Although a person can type a full URL into an address bar, that wasn't a major part of the intended design. The intended design is that people would follow ("click on") hyperlinks. If average laypeople are doing that, then they really don't care whether the invisible URL is simple or complicated.
Also, is there a real purpose/advantage to having a case-sensitive URL (as opposed to the vast majority of URLs that point to the same page no matter the capitalization)?
The fifth numbered point of William Hay's answer mentions one technical advantage: URLs can be an effective way for a web browser to send a bit of information to a web server, and more information can be included if there are fewer restrictions, so a case-sensitivity restriction would reduce how much information can be included.
However, in many cases there isn't a super-compelling benefit to case sensitivity, as shown by the fact that IIS typically doesn't bother with it.
In summary, the most compelling reason is likely just simplicity for those who designed the web server software, particularly on a case-sensitive platform like Unix. (HTTP wasn't something that influenced the original design of Unix, since Unix is notably older than HTTP.)
URLs are not case-sensitive, only parts of them.
For example, nothing is case-sensitive in the URL google.com.
With reference to RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
First, from Wikipedia, a URL looks like:
scheme:[//host[:port]][/]path[?query][#fragment]
(I've removed the user:password part because it is not interesting and rarely used)
scheme:
schemes are case-insensitive
host:
The host subcomponent is case-insensitive.
path:
The path component contains data...
query:
The query component contains non-hierarchical data...
fragment:
Individual media types may define their own restrictions on or structures within the fragment identifier syntax for specifying different types of subsets, views, or external references
So, the scheme and host are case-insensitive.
The rest of the URL is case-sensitive.
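You can see this split in practice with Python's standard library: urlsplit normalizes the scheme while parsing, the hostname accessor normalizes the host, and everything after the host comes back exactly as given (a small sketch, using a made-up URL):

    from urllib.parse import urlsplit

    parts = urlsplit("HTTP://WWW.Example.COM/Wiki/Case_Sensitivity?q=LaTeX#Top")

    print(parts.scheme)    # 'http' -- scheme is normalized to lowercase
    print(parts.hostname)  # 'www.example.com' -- host is normalized too
    print(parts.path)      # '/Wiki/Case_Sensitivity' -- preserved as-is
    print(parts.query)     # 'q=LaTeX' -- preserved as-is
    print(parts.fragment)  # 'Top' -- preserved as-is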
Why is the path case-sensitive?
This seems to be the main question.
It is difficult to answer "why" something was done if it was not documented, but we can make a very good guess.
I've picked very specific quotes from the spec, with emphasis on data.
Let's look at the URL again:
scheme:[//host[:port]][/]path[?query][#fragment]
\____________________/\________________________/
        Location                 Data
Location - The location has a canonical form, and is case-insensitive.
Why? Probably so you could buy a domain name without having to buy thousands of variants.
Data - the data is used by the target server, and the application can choose what it means.
It wouldn't make any sense to make the data case-insensitive. The application should have more options, and defining case-insensitivity in the spec would limit those options.
This is also a useful distinction for HTTPS: the data is encrypted, but the host is visible.
Is it useful?
Case-sensitivity has its pitfalls when it comes to caching and canonical URLs, but it is certainly useful.
Some examples:
Base64, which is used in Data URIs. Sites can encode Base64 data in the URL, for example: tryroslyn.azurewebsites.net/#f:r/A4VwRgNglgxgBDCBDAziuBhOBvGB7AOxQBc4SAnKAgczLgF44AiAUQPwBMBTDuKuYgAsucAKoAlADIBCJgG4AvkA
URL shorteners utilize case sensitivity: /a5B might be different than /a5b
As you've mentioned, Wikipedia can differentiate "AIDS" from "Aids".
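To make that capacity point concrete, here is a small comparison (my own sketch, arbitrary payload) of Base64, which needs case-sensitive URLs, against Base32, which survives case folding:

    import base64

    payload = b"some binary data to pack into a URL"

    b64 = base64.urlsafe_b64encode(payload)  # mixed case: needs case sensitivity
    b32 = base64.b32encode(payload)          # single case: survives case folding

    print(len(payload), "bytes of data")  # 35 bytes of data
    print(len(b64), "chars as Base64")    # 48 chars as Base64
    print(len(b32), "chars as Base32")    # 56 chars as Base32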
1. URLs claim to be a UNIFORM Resource Locator and can point to resources that predate the web. Some of these are case-sensitive (e.g., many FTP servers), and URLs need to be able to represent these resources in a reasonably intuitive fashion.
2. Case insensitivity requires more work when looking for a match (either in the OS or above it).
3. If you define URLs as case-sensitive, individual servers can implement them as case-insensitive if they want. The reverse is not true.
4. Case insensitivity can be non-trivial in international contexts: en.wikipedia.org/wiki/Dotted_and_dotless_I (see the illustration after this list). Also, RFC 1738 allowed for the use of characters outside the ASCII range provided they were encoded, but didn't specify a charset. This is fairly important for something calling itself the WORLD wide web. Defining URLs as case-insensitive would open up a lot of scope for bugs.
5. If you are trying to pack a lot of data into a URI (e.g., a Data URI), you can pack more in if upper and lower case are distinct.
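An illustration of the dotted/dotless-I problem from point 4, using Python's built-in (locale-unaware) case mapping:

    # Unicode case mapping is not a clean round trip, which is one reason
    # a case-insensitive URL spec would have been hard to get right.
    dotless = "\u0131"              # 'ı', LATIN SMALL LETTER DOTLESS I (Turkish)
    print(dotless.upper())          # 'I' -- uppercases to plain ASCII I...
    print(dotless.upper().lower())  # 'i' -- ...which lowercases to dotted i

    dotted_cap = "\u0130"           # 'İ', LATIN CAPITAL LETTER I WITH DOT ABOVE
    print(dotted_cap.lower())       # 'i' plus a combining dot above
    print(len(dotted_cap.lower()))  # 2 -- one character became two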
Closetnoc is right about the OS. Some file systems treat the same name with different casing as different files.
Also, is there a real purpose/advantage to having a case-sensitive URL (as opposed to the vast majority of URLs that point to the same page no matter the capitalization)?
Yes, to avoid duplicate content issues.
If you had for example the following URLs:
http://example.com/page-1
http://example.com/Page-1
http://example.com/paGe-1
http://example.com/PAGE-1
http://example.com/pAGE-1
and they all pointed to the exact same page with the exact same content, then you would have duplicate content, and I'm sure if you have a Google search console (webmaster tools) account, Google will indicate this to you.
What I would suggest doing if you are in that situation is to use all lower-case URLs, then redirect the URLs with at least one capital letter in them to the lower-case version (a sketch of one way to do this follows). So in the list of URLs above, redirect all the URLs to the first URL.
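Here's a minimal sketch of such a redirect using only Python's standard library; in practice you would more likely do this with a rewrite rule in your web server's configuration (the wrapper, the stand-in site, and port 8000 are all assumptions for illustration):

    from wsgiref.simple_server import make_server

    def lowercase_redirect(app):
        """Wrap a WSGI app: 301-redirect any path containing capital letters."""
        def wrapper(environ, start_response):
            path = environ.get("PATH_INFO", "")
            if path != path.lower():
                start_response("301 Moved Permanently",
                               [("Location", path.lower())])
                return [b""]
            return app(environ, start_response)
        return wrapper

    def site(environ, start_response):
        # Stand-in for the real site being served.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"canonical page: " + environ["PATH_INFO"].encode()]

    if __name__ == "__main__":
        # GET /Page-1 now answers 301 with Location: /page-1
        make_server("", 8000, lowercase_redirect(site)).serve_forever()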
Though the above answer is correct and good, I would like to add some more points.
To understand this better, one should understand the basic difference between a Unix (Linux) and a Windows server: Unix is case-sensitive and Windows is a non-case-sensitive OS.
The HTTP protocol started getting implemented around 1990. It was designed by engineers working at CERN, and in those days most scientists used Unix machines, not Windows.
Most of those scientists were familiar with Unix, so they might have been influenced by the Unix-style file system.
Windows-based servers became popular much later; well before that, the HTTP protocol was mature and the spec was complete.
This could be the reason.
Simple. The OS is case-sensitive. Web servers generally do not care unless they have to hit the file system at some point. This is where Linux and other Unix-based operating systems enforce the rules of the file system, in which case sensitivity is a major part. This is why IIS has never been case-sensitive: Windows was never case-sensitive.
[Update]
There have been some strong arguments in the comments (since deleted) about whether URLs have any relationship with the file system as I have stated. These arguments have become heated. It is extremely short-sighted to believe that there is not a relationship. There absolutely is! Let me explain further.
Application programmers are not generally systems-internals programmers. I am not being insulting. They are two separate disciplines, and systems-internals knowledge is not required to write applications when applications can simply make calls to the OS. Since application programmers are not systems-internals programmers, they do not bypass the OS services. I say this because these are two separate camps and they rarely cross over. Applications are written to use OS services as a rule. There are some rare exceptions, of course.
Back when web servers began to appear, application developers did not attempt to bypass OS services. There were several reasons for this. One, it was not necessary. Two, application programmers generally did not know how to do it. Three, most OSes were either extremely stable and robust, or extremely simple and lightweight, so bypassing them was not worth the cost.
Keep in mind that the early web servers either ran on expensive computers such as DEC VAX/VMS servers and the Unix of the day (Berkeley and Ultrix as well as others) on mainframe or mid-range computers, and then soon after on lightweight computers such as PCs running Windows 3.1. When more modern search engines began to appear, such as Google in 1997/8, Windows had moved on to Windows NT, and other OSes such as Novell and Linux had also begun to run web servers. Apache was the dominant web server, though there were others such as IIS and O'Reilly's which were also very popular. None of them at the time bypassed OS services. It is likely that none of the web servers do even today.
Early web servers were quite simple. They still are today. Any HTTP request for a resource that exists on a hard drive was, and is, made by the web server through the OS file system.
File systems are rather simple mechanisms. As a request is made for access to a file, if that file exists, the request is passed to the authorization sub-system and, if granted, the original request is satisfied. If the resource does not exist or is not authorized, an exception is thrown by the system. When an application makes a request, a trigger is set and the application waits. When the request is answered, the trigger is thrown and the application processes the response. It still works that way today. If the application sees that the request has been satisfied it continues; if it has failed, the application executes an error condition within its code or dies if not handled. Simple.
In the case of a web server, assuming that a URL request for a path/file is made, the web server takes the path/file portion of the URL request (URI) and makes a request to the file system and it is either satisfied or throws an exception. The web server then processes the response. If, for example, the path and file requested is found and access granted by the authorization sub-system, then the web server processes that I/O request as normal. If the file system throws an exception, then the web server returns a 404 error if the file is Not Found or a 403 Forbidden if the reason code is unauthorized.
Since some OSes are case sensitive and file systems of this type require exact matches, the path/file that is requested of the web server must match what exists on the hard drive exactly. The reason for this is simple. Web servers do not guess what you mean. No computer does so without being programmed to. Web servers simply process requests as they receive them. If the path/file portion of the URL request being passed directly to the file system does not match what is on the hard drive, then the file system throws an exception and the web server returns a 404 Not Found error.
It is really that simple, folks. It is not rocket science. There is an absolute relationship between the path/file portion of a URL and the file system.
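To make that relationship concrete, here is a deliberately naive sketch (the document root and file names are assumptions, and a real server adds security checks such as path-traversal protection): the URL path is handed to the file system verbatim, and a miss on a case-sensitive file system becomes a 404:

    import os

    DOCROOT = "/var/www/html"  # assumed document root for this sketch

    def serve(url_path):
        """Map a URL path to the file system exactly as received."""
        # No case guessing: the path is joined to the docroot verbatim.
        fs_path = os.path.join(DOCROOT, url_path.lstrip("/"))
        try:
            with open(fs_path, "rb") as f:  # the OS does an exact-match lookup
                return 200, f.read()
        except FileNotFoundError:
            return 404, b"Not Found"
        except PermissionError:
            return 403, b"Forbidden"

    # On a case-sensitive file system (e.g. ext4 on Linux), these are
    # two different lookups, so only the exact name can succeed:
    print(serve("/index.html")[0])  # 200 if index.html exists
    print(serve("/Index.html")[0])  # 404 unless Index.html also exists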