How do you remove invalid characters when creating a friendly url (ie how do you create a slug)?

@LarsenBagley505

Posted in: #AspNet #Url #UrlRewriting

Say I have this webpage: ww.xyz.com/Product.aspx?CategoryId=1
If the name of CategoryId=1 is "Dogs" I would like to convert the URL into something like this: ww.xyz.com/Products/Dogs
The problem is if the category name contains foreign (or invalid for a url) characters. If the name of CategoryId=2 is "Göra äldre", what should be the new URL?

Logically it should be: ww.xyz.com/Products/Göra äldre but it will not work.

Firstly because of the space (which I can easily replace by a dash for example) but what about the foreign characters? In Asp.net I could use the URLEncode function which would give something like this: ww.xyz.com/Products/G%c3%b6ra+%c3%a4ldre but I can't really say it's better than the original URL (http://ww.xyz.com/Product.aspx?CategoryId=2).

Ideally I would like to generate this one, but how can I do this automatically (ie converting foreign characters to 'safe' URL characters): ww.xyz.com/Products/Gora-aldre.


7 Comments


 

@BetL925

Wikipedia often uses non-Latin-1 characters in its URLs. There is no reason (beyond your webserver not supporting them) that you shouldn't use these URLs.

However, if you have to avoid these characters, I have found that replacing them with their non-diacritic form works well. Most people who read these can tell (from context) what the word is supposed to be even though the diacritics have been removed.



 

@Miguel251

Since your post is tagged ASP.Net: look at this site, it contains sample code to replace (most) characters with diacritics (invalid characters, as you call them) with their base character.

As Kris has mentioned, use a unique ID in your URL, like this site does. If you have no control over the IDs provided to you, you should create a translation table that maps your own unique ID to the external unique IDs. That way your internal references stay valid when the external IDs change. Together with your unique ID, you store your "search and human optimized ID", the one that is not so unique, but looks good.



 

@Cofer257

The best method IMO is to whitelist characters rather than trying to look for invalid characters. However, accented characters like é are fairly common (and your URL will be odd without them) so you could convert these first.

In PHP you can use the strtr function, but you should be able to modify this for your needs on asp.net:

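// $name holds the input string. strtr() replaces byte by byte, so this
// form assumes a single-byte encoding rather than UTF-8.
$name = strtr(
$name,
'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿŔŕ',
'aaaaaaaceeeeiiiidnoooooouuuuybsaaaaaaaceeeeiiiidnoooooouuuuybyrr'
);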


Now here's your process:


1. [optional] Convert the string to lowercase (usually recommended for URLs).
2. [optional] Convert the accented characters using the above mapping.
3. Run through your input string character by character. It may be faster to do #1 and #2 per character instead of on the whole string, depending on what built-in functions you have.
4. If the character is in the range a-z or 0-9, add it to your new string; otherwise:
   a) If you already have a hyphen on the end of your new string, ignore the character.
   b) If not, add a hyphen to the end of the string.
5. When you get to the end, remove any leading or trailing hyphens and you're done!
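
The process above can be sketched roughly as follows. This is a minimal Python illustration, not a port of the PHP code: the names ACCENT_GROUPS, ACCENT_MAP and make_slug are made up for the example, and the accent table is deliberately short, so extend it for your own data.

```python
# Hypothetical, deliberately small accent table: base letter -> accented forms.
ACCENT_GROUPS = {
    "a": "àáâãäå",
    "c": "ç",
    "e": "èéêë",
    "i": "ìíîï",
    "n": "ñ",
    "o": "òóôõöø",
    "u": "ùúûü",
    "y": "ýÿ",
}
# str.translate() accepts a dict keyed by code point.
ACCENT_MAP = {ord(ch): base for base, chars in ACCENT_GROUPS.items() for ch in chars}


def make_slug(text: str) -> str:
    # Lowercase first, then map accented letters to their base letter.
    text = text.lower().translate(ACCENT_MAP)
    out = []
    for ch in text:
        if "a" <= ch <= "z" or "0" <= ch <= "9":
            # Whitelisted character: keep it.
            out.append(ch)
        elif out and out[-1] != "-":
            # Anything else becomes a single hyphen; consecutive
            # separators and leading separators are skipped.
            out.append("-")
    # Remove a trailing hyphen if one is left over.
    return "".join(out).strip("-")


print(make_slug("Göra äldre"))        # gora-aldre
print(make_slug("  Dogs & Cats!! "))  # dogs-cats
```

Characters that are neither whitelisted nor in the table (for example Cyrillic or CJK text) simply collapse into hyphens here, so a name made up entirely of such characters yields an empty slug and needs a fallback.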



 

@Odierno851

Two things to keep in mind:


URL rewriting generally does not have a positive effect on search engines (and frequently a negative one) -- so you should only do it if you know of a measurable positive effect on user satisfaction (and accordingly: make your URLs useful for the users).
If you do decide to do URL rewriting, you must have the technical details down perfectly. For instance, you should never have more than one unique URL showing the same content. Make sure you use UTF-8 for the encoding of non-ASCII content, use escaped links within your content, and generally test on various browsers to make sure things work as planned. If any of this is foreign to you, then I would strongly recommend not doing URL rewriting for the moment.


FWIW Some of the search engine side issues are covered at googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html
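
To illustrate the point about UTF-8 and escaped links, here is a minimal Python sketch using the standard library's urllib.parse.quote. The path is the example from the question; the variable names are only for illustration, and the same idea applies to whatever escaping function your platform provides.

```python
from urllib.parse import quote, unquote

path = "/Products/Göra äldre"

# quote() percent-encodes using UTF-8 by default and leaves "/" untouched.
escaped = quote(path)
print(escaped)           # /Products/G%C3%B6ra%20%C3%A4ldre

# Decoding the escaped form gives back the original path.
print(unquote(escaped))  # /Products/Göra äldre
```

The escaped form is what should appear in href attributes; browsers generally show the decoded characters in the address bar.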



 

@Harper822

I've come up with the 2 following extension methods (asp.net / C#):

// Extension methods must be declared inside a static class.
public static string RemoveAccent(this string txt)
{
    // Relies on the Cyrillic encoder's best-fit fallback to map accented
    // Latin letters to unaccented ones; the bytes are then read as ASCII.
    byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt);
    return System.Text.Encoding.ASCII.GetString(bytes);
}

public static string Slugify(this string phrase)
{
    string str = phrase.RemoveAccent().ToLower();
    str = System.Text.RegularExpressions.Regex.Replace(str, @"[^a-z0-9\s-]", ""); // remove all invalid characters
    str = System.Text.RegularExpressions.Regex.Replace(str, @"\s+", " ").Trim(); // collapse runs of whitespace into one space
    str = System.Text.RegularExpressions.Regex.Replace(str, @"\s", "-"); // replace spaces with dashes
    return str;
}



 

@Heady270

It depends on the language you are using and the technique you want to use. Take a look at this snippet of JavaScript from the Django source, it does exactly what you need. You can easily port it to the language of your choice I guess.

This is the Python snippet used in the Django slugify function, it's a lot shorter:

import re
import unicodedata

def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    # Decompose accented characters, then drop the non-ASCII combining marks.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


I think every language has a port of this, since it's a common problem. Just Google for slugify + your language.



 

@Angela700

You could add a new field to the Products table that contains a URL-safe and unique name for each product. This could probably be automatically generated initially (substituting non-safe characters with the closest safe equivalent - gora-aldre?) and then fine-tuned as needed.

Since the replacement of non-safe characters is not (always) reversible, it isn't entirely feasible to do this kind of thing on the fly.
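
A rough Python sketch of why the stored name needs a uniqueness check: two different product names can collapse to the same safe string. The helpers base_slug and unique_slug and the "-2" suffix scheme are assumptions made up for this example, and the set stands in for the unique column in the Products table.

```python
import re
import unicodedata


def base_slug(name: str) -> str:
    # Decompose accents, drop anything non-ASCII, then hyphenate the rest.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def unique_slug(name: str, taken: set) -> str:
    # "item" is an arbitrary fallback for names with no usable characters.
    base = base_slug(name) or "item"
    slug = base
    suffix = 2
    # Append -2, -3, ... until the slug is not already in use.
    while slug in taken:
        slug = f"{base}-{suffix}"
        suffix += 1
    taken.add(slug)
    return slug


taken = set()
print(unique_slug("Göra äldre", taken))  # gora-aldre
print(unique_slug("Gora aldre", taken))  # gora-aldre-2
```

Because "gora-aldre" cannot be mapped back to a single original name, the slug has to be stored next to the product rather than recomputed and reversed on each request.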

Alternatively, you build the URL thusly:
example.com/products/1234/safe-string

Where safe-string is created on the fly replacing unsafe characters as needed. The number 1234 is the product key. You use the key to look up the product, the 'safe-string' is there more for the user and search engines.
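
A minimal Python sketch of that URL scheme, assuming a hypothetical PRODUCTS dictionary in place of the real table and made-up helper names (slugify, product_url, resolve). Only the numeric key is used for the lookup; the text after it is ignored when resolving.

```python
import re
import unicodedata

# Hypothetical data standing in for the Products table.
PRODUCTS = {1234: "Göra äldre"}


def slugify(name: str) -> str:
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def product_url(product_id: int, name: str) -> str:
    # The safe string is generated on the fly from the current name.
    return f"/products/{product_id}/{slugify(name)}"


def resolve(path: str):
    """Return (product_id, canonical_path), or None if nothing matches."""
    match = re.fullmatch(r"/products/([0-9]+)(?:/[^/]*)?", path)
    if match is None:
        return None
    product_id = int(match.group(1))
    name = PRODUCTS.get(product_id)
    if name is None:
        return None
    return product_id, product_url(product_id, name)


print(resolve("/products/1234/gora-aldre"))  # (1234, '/products/1234/gora-aldre')
print(resolve("/products/1234/old-name"))    # (1234, '/products/1234/gora-aldre')
print(resolve("/products/9999/whatever"))    # None
```

Returning the canonical path lets the caller redirect when the requested path differs from it, which keeps a single URL per product even if the name changes later.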
