RJPawlick198

Language+region value of the HTML5 lang attribute

@RJPawlick198

Posted in: #Html5 #Internationalization #LangAttribute #Language

I'm working on a website which will offer localized content following the language+region approach as described on this W3.org page (e.g. fr-CA for Canadian French content, and fr-FR for "French French" content). As we consider content for each language+region to be unique, it is crucial to us that search engines properly identify and serve the content accordingly.

From searching the Internet (e.g. this question), it appears that most people recommend using an ISO 639 language code in the HTML lang attribute to describe the content language. Following this recommendation, we would end up using <html lang="fr">, which wouldn't let us differentiate between the aforementioned language+region combinations.

When reviewing the HTML4 specification, it seems that using language+region as a language code would be perfectly OK, as the en-US example is given as one possible value. However, I couldn't find any confirmation of this in the HTML5 specification, which doesn't seem to provide any examples of allowed values.

From there I tried to get a de facto answer by looking at what the web giants are doing. I looked at Facebook: they offer Canadian French and French French versions of their website with (slightly) different content, whilst the HTML lang value remains the same:

fr-CA
URL: fr-ca.facebook.com
HTML lang attribute: <html lang="fr">
Translation of the word 'email': courriel

fr-FR
URL: fr-fr.facebook.com/
HTML lang attribute: <html lang="fr">
Translation of the word 'email': Adresse électronique

What is the recommended/standard way of describing content that was localized using the language+region approach in HTML5?


3 Comments


 

@Sims2060225

The W3C provides this very long guide on choosing language tags/subtags.

The important bits:


Language tag syntax is defined by the IETF's BCP 47. In the past it
was necessary to consult lists of codes in various ISO standards to
find the right subtags, but now you only need to look in the IANA
Language Subtag Registry. We will describe the new registry below.

This article provides advice on how to choose the components of a
language tag. For an overview of the concepts defined in BCP 47, see
Language tags in HTML and XML.


...


There are tools available which provide additional help while
searching the registry, such as Richard Ishida's Language Subtag
Lookup tool.


...


Ensure you have the right language. Sometimes, it pays to check a few
alternatives. Mark Davis, co-author of BCP47, writes "Often it is not
clear which language identifier to use. For example, what most people
call Punjabi in Pakistan actually has the code 'lah', and formal name
'Lahnda'. There are many other cases where the same name is used for
different languages, or where the name that people search for is not
listed in the IANA registry."

You could look up language information in the SIL Ethnologue and
cross-reference that information with Wikipedia. The Ethnologue uses
the same three-letter codes as BCP47, but you'll need to convert BCP47
2-letter codes to their ISO 639-3 counterpart to look up a language by
code. (Richard Ishida's tool does this for you.)

There are a small number of cases where different language codes are
available for what many people would regard as the same language, eg.
Filipino and Tagalog, or Twi and Akan. There is no indication in the
registry as to which you should use, but you should try to ensure that
within a single application or context you are consistent.


(Emphasis mine.)

It should be noted that the IANA Language Subtag Registry is kinda hard to use. With the exception of grandfathered-in tags (like en-GB-oed), you have to look up the primary language subtag and the region/variant subtags separately. And the subtags are organized by type rather than by hierarchy. So just save yourself the time and trouble and use Richard Ishida's awesome lookup tool.



 

@Odierno851

Using <html lang="fr-FR"> and <html lang="fr-CA"> is fine, if they correspond to the actual content. But they are ignored by search engines, just as <html lang="fr"> is.

HTML5 is not meant to change how language codes are used. The system of codes defined in BCP 47 and its extensions is very elaborate and lets you specify a language variant with painful accuracy. The state of the art is at a much, much simpler level, and fr-FR and fr-CA represent the best granularity you can achieve these days in software; quite often, just the main code (here, fr) matters.
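
To illustrate the point about the main code, here is a minimal Python sketch that splits a tag into its language and region subtags. The split_tag helper and its regular expression are hypothetical and deliberately simplified: they only handle the "language" and "language-region" shapes discussed here (fr, fr-CA, fr-FR), not the full BCP 47 grammar with script, variant, or extension subtags.

```python
import re
from typing import Optional, Tuple

# Simplified pattern (an assumption for illustration, not the full BCP 47 grammar):
# a 2-3 letter language subtag, optionally followed by a 2-letter or 3-digit region.
SIMPLE_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Return (language, region) with conventional casing; region may be None."""
    match = SIMPLE_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple language or language-region tag: {tag!r}")
    language = match.group("language").lower()
    region = match.group("region")
    return language, (region.upper() if region else None)


for tag in ("fr-CA", "fr-FR", "fr"):
    language, region = split_tag(tag)
    print(tag, "-> language:", language, "region:", region)
```

Software that only looks at the main code would keep just the first element of the pair, so fr-CA and fr-FR both end up being treated as plain fr.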

There is no evidence of search engines actually paying any attention to any declarations of language code, such as lang attributes. Other software, such as hyphenators, spelling checkers, speech synthesizers, and default font selection algorithms may take lang attributes into account. But search engines perform their heuristic analyses based on actual content.

It is difficult to blame them for this, since this produces better results than trusting the lang attributes. For example, many authoring tools automatically generate lang="en" irrespective of the actual content, without telling the author.



 

@Angie530

[This isn't my strongest area, so I'm just citing documentation here, but it seems you've overlooked something.]

The HTML5 spec requires that the lang value be a valid BCP 47 tag. In that document, the relevant bit seems to be in section 3.4:


For example, an implementation could map the extended language ranges to basic ranges. Another possibility would be for an implementation to return the matching tag that is first in ASCII-order. If the language range were "*-CH" ('CH' represents Switzerland) and the set of tags included "de-CH" (German as used in Switzerland), "fr-CH" (French, Switzerland), and "it-CH" (Italian, Switzerland), then the tag "de-CH" would be returned.


...which, when you look at it, is basically what you got from the HTML 4 spec citing RFC 1766, just in much greater detail.
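
For what it's worth, the "*-CH" example in that quote can be sketched in a few lines of Python. This is my reading of the extended filtering steps in RFC 4647 (the matching half of BCP 47), followed by the "first in ASCII order" choice the quote mentions as one possibility. The function names matches_extended_range and first_match are made up for illustration, and no validation of the tags is done.

```python
from typing import Iterable, Optional


def matches_extended_range(language_range: str, tag: str) -> bool:
    """Compare a tag against an extended language range such as '*-CH'."""
    range_parts = language_range.lower().split("-")
    tag_parts = tag.lower().split("-")

    # The first subtags must match, unless the range starts with a wildcard.
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False

    r, t = 1, 1
    while r < len(range_parts):
        if range_parts[r] == "*":
            r += 1                      # a wildcard matches any run of subtags
        elif t >= len(tag_parts):
            return False                # range wants more, tag is exhausted
        elif range_parts[r] == tag_parts[t]:
            r += 1
            t += 1
        elif len(tag_parts[t]) == 1:
            return False                # never skip over a singleton like 'x'
        else:
            t += 1                      # skip a non-matching subtag in the tag
    return True


def first_match(language_range: str, tags: Iterable[str]) -> Optional[str]:
    """Return the matching tag that sorts first, or None if nothing matches."""
    matches = sorted(t for t in tags if matches_extended_range(language_range, t))
    return matches[0] if matches else None


print(first_match("*-CH", ["it-CH", "fr-CH", "de-CH", "fr-FR"]))  # de-CH
```

Note that fr-FR is filtered out because its region is not CH, and de-CH wins only because it sorts first among the three Swiss tags.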


