
Output valid language codes in interwikis HTML rather than lang="simple"
Closed, ResolvedPublic

Description

'simple' isn't a valid language code, though we're outputting it for interlanguage links.

We 'could' add in a simple hack here that will make 'simple' output lang="en" instead.

Though I do have a more interesting idea. Instead of that, how about we swap simple for en-x-Simple and add code that lets us create aliases for language codes, so that simple: will still be equivalent to en-x-Simple?

Going by BCP 47 (https://www.rfc-editor.org/rfc/bcp/bcp47.txt), the code en-x-Simple is valid: it's the 'en' language code followed by a private-use subtag 'Simple'. BCP 47 reserves x-* for private use, i.e. for things that wouldn't be registered, which is essentially what we're talking about here.
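As a rough illustration of the distinction drawn above, here is a minimal Python sketch (not MediaWiki code). It assumes a deliberately simplified tag shape: a primary language subtag optionally followed by a private-use section introduced by "x". Script, region and variant subtags are ignored. The names is_acceptable_tag and REGISTERED_PRIMARY are hypothetical, and the tiny set stands in for the real IANA subtag registry.

```python
import re

# Simplified shape only: primary subtag, then an optional "-x-..." private-use part.
# Real BCP 47 tags may also carry script, region and variant subtags (not handled here).
_TAG_SHAPE = re.compile(
    r"^(?P<lang>[a-z]{2,8})(?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?$",
    re.IGNORECASE,
)

# Hypothetical stand-in for the IANA language subtag registry.
REGISTERED_PRIMARY = {"en", "de", "fr"}


def is_acceptable_tag(tag: str) -> bool:
    """Return True if the tag has the simplified shape and a registered primary subtag."""
    m = _TAG_SHAPE.match(tag)
    return m is not None and m.group("lang").lower() in REGISTERED_PRIMARY


print(is_acceptable_tag("en-x-Simple"))  # private-use subtag after a registered language
print(is_acceptable_tag("simple"))       # well-shaped, but "simple" is not registered
```

Under these assumptions 'simple' is rejected only because it is missing from the registry stand-in, while en-x-Simple passes because everything after "x" is private use.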


Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 11:59 PM
bzimport set Reference to bz32483.
bzimport added a subscriber: Unknown Object (MLST).

This issue is actually much wider. It applies to all languages listed here: http://en.wikipedia.org/wiki/List_of_Wikipedias#Wikipedia_edition_codes

The solution: we need to add language tag normalization in core, through which we can pass the ll_lang value from the langlinks table of the database before we actually generate a 'real' lang tag.

Such a normalization table (language_mapping?) would have:

  • ll_lang: the wiki-defined interlanguage code
  • wiki_variant: a wiki language variant code
  • ISO 639-1 code
  • ISO 639-2 code
  • BCP 47 code (includes private codes, variant names, sign, transliteration etc.)

Could probably be built upon 'Extension:CLDR'
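To make the proposed table concrete, here is a small Python sketch of what one row and a lookup might look like. It is only an illustration: the class name LanguageMapping, the helper lang_attribute and the sample rows are assumptions, and 'en-x-simple' is just one of the candidate values discussed in this task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageMapping:
    """One row of the hypothetical language_mapping table."""
    ll_lang: str                 # wiki-defined interlanguage code
    wiki_variant: Optional[str]  # wiki language variant code, if any
    iso639_1: Optional[str]
    iso639_2: Optional[str]
    bcp47: str                   # tag to emit in lang="" attributes


# Illustrative sample rows only, not real table contents.
ROWS = [
    LanguageMapping("en", None, "en", "eng", "en"),
    LanguageMapping("simple", None, "en", "eng", "en-x-simple"),
]
LANGUAGE_MAPPING = {row.ll_lang: row for row in ROWS}


def lang_attribute(ll_lang: str) -> str:
    """Map an ll_lang value to its BCP 47 tag; unknown codes pass through unchanged."""
    row = LANGUAGE_MAPPING.get(ll_lang)
    return row.bcp47 if row else ll_lang
```

Passing unknown codes through unchanged is a design choice of the sketch: most interlanguage codes are already valid tags, so only the exceptions need rows.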

Created attachment 9591
Use $wgDummyLanguageCodes for getting the right language code

I think it is sufficient to use $wgDummyLanguageCodes (per r103640) for this, since it will contain all code mappings relevant for MediaWiki/Wikimedia. A database like you propose seems overkill to me.


This is weird, those attributes were only added just today: r104778. So, what did this bug report refer to? The class="interwiki-simple" on the <li> element?

Since 'simple' is in Language's list of language names, I think it'd be cleaner to have the logic for this living in Language.

Maybe Language::normalizeCode( $code ) ?

That could also normalize a number of fuzzy old things that we still have in our list for compatibility:

  • simple -> en or en-x-simple
  • bat-smg -> sgs
  • roa-rup -> rup
  • fiu-vro -> vro

etc
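A minimal sketch of the idea in Python rather than MediaWiki's PHP: the four mappings are the ones listed above (picking en-x-simple for 'simple'), while the function names normalize_code and interlanguage_item and the exact markup are assumptions made for illustration. The class keeps the wiki's own code and only the lang attribute gets the normalized one.

```python
from html import escape

# Legacy interlanguage codes and their standards-compatible replacements,
# taken from the list in this comment ('simple' could also map to plain 'en').
LEGACY_TO_STANDARD = {
    "simple": "en-x-simple",
    "bat-smg": "sgs",
    "roa-rup": "rup",
    "fiu-vro": "vro",
}


def normalize_code(code: str) -> str:
    """Return the replacement for a legacy code, or the code itself if none is known."""
    return LEGACY_TO_STANDARD.get(code, code)


def interlanguage_item(code: str, url: str, label: str) -> str:
    """Build a sidebar list item (hypothetical markup shape)."""
    return '<li class="interwiki-{cls}"><a href="{href}" lang="{lang}">{text}</a></li>'.format(
        cls=escape(code, quote=True),
        href=escape(url, quote=True),
        lang=escape(normalize_code(code), quote=True),
        text=escape(label),
    )
```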

Note that there are manual language links on [[en:Main_Page]] at the bottom (not in the sidebar) which have 'lang' attributes on spans surrounding the links. The one for 'Simple English' does use 'simple' as the value here, but this can be changed by editing the page or template.

'normalize' would be ambiguous in this function name. It should be something that refers to getting a standards-compatible language code.

I was thinking about a Language function as well. Maybe getCorrectCode() or getActualCode()?

We might also use it for other lang="" attributes, like on the html tag.
I see that wgLanguageCode has been changed for several wikis (like 'alswiki' => 'gsw') but not for all of them (e.g. not for fiu-vro).

Created attachment 9656
Language.php patch, including first go at a mapping table...


some comments:

1: We should probably have getBCP47LanguageTag( $code, [$variant] )
2: My patch maps getCode() to use getBCP47LanguageTag(), but that was just to get some quick testing done of course.
3: The table... I'm not entirely sure we want to use wgDummyLanguageCodes, or alternatively, whether that table should contain qqq and qqz the way it does now. Perhaps adapt wgDummyLanguageCodes into wgLanguageTagConversion()=wgDummyLanguageCodes ++ qqq+ qqz; or something similar.

The other way around, of course: wgDummyLanguageCodes=wgLanguageTagConversionTable ++ qqq+ qqz;
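The split suggested here could look roughly like the following Python sketch: a conversion table holding only real language mappings, with the dummy-code list derived from it by adding the pseudo-codes. All names, the sample entries and the self-mapping values for qqq and qqz are assumptions for illustration, not the actual MediaWiki configuration.

```python
# Real language-tag conversions only (illustrative entries).
LANGUAGE_TAG_CONVERSION = {
    "simple": "en-x-simple",
    "fiu-vro": "vro",
}

# Pseudo-codes named in the comment; they are not languages,
# so they stay out of the conversion table.
PSEUDO_CODES = {"qqq": "qqq", "qqz": "qqz"}

# Dummy codes = conversion table plus the pseudo-codes.
DUMMY_LANGUAGE_CODES = {**LANGUAGE_TAG_CONVERSION, **PSEUDO_CODES}


def get_bcp47_language_tag(code: str) -> str:
    """Convert using the conversion table only; other codes pass through unchanged."""
    return LANGUAGE_TAG_CONVERSION.get(code, code)
```

Deriving the larger table from the smaller one keeps a single source for the mappings, so the two cannot drift apart.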

See also r105812 and friends.

Nemo_bis renamed this task from en.wp uses lang="simple" for simple: interlang links. to Output valid language codes in interwikis HTML rather than lang="simple". Jul 21 2015, 9:02 AM
Nemo_bis removed a project: Patch-For-Review.
Nemo_bis set Security to None.

Change 226040 had a related patch set uploaded (by Nemo bis):
GlobalFunctions.php: Generate BCP 47 conform language codes

https://gerrit.wikimedia.org/r/226040

Change 442200 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Ensure LanguageCode::bcp47() returns a valid BCP-47 language code

https://gerrit.wikimedia.org/r/442200

Change 442200 merged by jenkins-bot:
[mediawiki/core@master] Ensure LanguageCode::bcp47() returns a valid BCP 47 language code

https://gerrit.wikimedia.org/r/442200

Change 226040 abandoned by Fomafix:
Substitute language codes that are not conform to BCP 47

Reason:
Superseded by I807dd55d49e9bd19443329231326a5b0d3e6c453 with a bijective mapping.

https://gerrit.wikimedia.org/r/226040

Change 460038 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Ensure LanguageCode::bcp47() returns a valid BCP 47 language code

https://gerrit.wikimedia.org/r/460038

Change 460038 merged by jenkins-bot:
[mediawiki/core@master] Ensure LanguageCode::bcp47() returns a valid BCP 47 language code

https://gerrit.wikimedia.org/r/460038