WMHack19: Add Saami + Romani languages to Wikidata
Closed, Resolved, Public

Description

= this works. In the case of ULS, it means that it is in the ULS.
blank space = this doesn't work. In the case of ULS, it means that it is not in the ULS.
? = in principle, it should work, but doesn't for some reason.
<not known right now> = I don't know it. Doesn't mean it doesn't exist.

term = Wikidata labels, descriptions, or aliases;
mono = Monolingual fields in Wikidata
uls = Universal Language Selector
auto = Autocompletion suggestions in Wikidata
sdc = Language choice in Structured Data on Commons

Saami languages:

| lcode | langname | autonym | term | mono | uls | auto | sdc | note |
| sma | Southern Saami | åarjelsaemien gïele | | | | | | |
| sju | Ume | ubmejesámiengiälla | | | | | | |
| sje | Pite | bidumsámegiella | | | | | | |
| smj | Lule | julevsámegiella | | | | | | |
| se | Northern Saami | davvisámegiella | | | | | | |
| sjk | Kemi | <not known right now> | | | | | | |
| smn | Inari | anarâškielâ | | | | | | |
| sms | Skolt | nuõʹrttsääʹmǩiõll, sääʹmǩiõll | | | | | | |
| sia | Akkala | sia-cyrl: а̄кь са̄мь кӣлл, а̄ххькэль са̄мь кӣлл, а̄кьяввьр са̄мь кӣлл. sia-ipa: ahʲkel kiːlː, ahʲkel sa:mʲ kiːlː | | | | | | sia-cyrl: Cyrillic, sia-ipa: IPA, sia-UPA: UPA |
| sjd | Kildin | кӣллт са̄мь кӣлл, кӣлтса̄мь кӣлл | | | | | | extended Cyrillic |
| sjt | Ter | sjt-cyrl: таррь са̄мь кӣлл. sjt-ipa: tarje kiːlː, tarje sa:mʲ kiːlː | | | | | | sjt-cyrl: Cyrillic, sjt-ipa: IPA, sjt-UPA: UPA |

Romani languages:

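| lcode | langname | autonym | term | mono | uls | auto | sdc | note |
| rmn | Balkan Romani | | | | | | | |
| rml | Baltic Romani | | | | | | | |
| rmc | Carpathian Romani | | | | | | | |
| rmf | Finnish Kalo | kaalengo tšimb | | | | | | (supported by ULS but doesn't work as monolingual text or label) |
| rmo | Sinte Romani | | | | | | | |
| rmw | Welsh-Romani | | | | | | | |
| rmg | Traveller Norwegian | | | | | | | |
| rmy | Vlax Romani | romani čhib | | | | | | (randomly named as "Romani" in ULS; own name according to en-wiki is řomani čhib) |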

Event Timeline

Question number 1: why does Southern Sami work in all places, but Inari Sami does not?

Zache renamed this task from Add Saami + Romani languages to Wikidata by the end of Wikimedia Hackathon 2019 to WMHACK2019: Add Saami + Romani languages to Wikidata. May 17 2019, 10:48 AM
Zache renamed this task from WMHACK2019: Add Saami + Romani languages to Wikidata to WMHack19: Add Saami + Romani languages to Wikidata.

Having the languages show up in Structured Data on Commons is an additional issue. Should that be added to the deficiency table?

I decided that this would be out of scope for this hackathon, because it is unlikely that we can do anything about it here in this timeframe.

Adding language to labels

  • wmgExtraLanguageNames = labels, descriptions
  • Example: T220118

Adding language to Monolingual allowed values

  • add language code to WikibaseRepo.php
  • Example: T174229
  • broken autocomplete ticket T124758

I will try to advance getting them enabled in Structured Data on Commons. :-) Which languages will we focus on?

> Question number 1: why does Southern Sami work in all places, but Inari Sami does not?

grep sje languages/data/Names.php

grep sma languages/data/Names.php 
                'sma' => 'Åarjelsaemien', # Southern Sami

sma has been added to Names.php because it has reached sufficient level of user interface translations.
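The same check as the two grep calls above can be sketched in Python. This is only an illustration: it assumes Names.php keeps one `'code' => 'autonym'` entry per line, as in the excerpt, and the function name `autonyms_in` is made up for this example. Unlike a bare `grep sje`, it matches the array key exactly rather than any substring.

```python
import re

# Matches entries shaped like:  'sma' => 'Åarjelsaemien', # Southern Sami
# Assumption: one entry per line and no escaped quotes inside the strings.
ENTRY = re.compile(r"^\s*'([^']+)'\s*=>\s*'([^']*)'")


def autonyms_in(names_php_text):
    """Return {language code: autonym} for every entry-like line."""
    found = {}
    for line in names_php_text.splitlines():
        match = ENTRY.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


# The single line that grep printed above, used as sample input.
sample = "                'sma' => 'Åarjelsaemien', # Southern Sami\n"
names = autonyms_in(sample)
print("sma" in names, "sje" in names)
```

To run it against a real checkout, read languages/data/Names.php with `open(..., encoding="utf-8").read()` and pass the text to `autonyms_in`.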

> Having the languages show up in Structured Data on Commons is an additional issue. Should that be added to the deficiency table?

Please add them all. I would hope we could get this issue over and done with for all these languages in one go.

Notes related to structured data on commons labels/descriptions

"sma" which works on Structured data on Commons is defined in three places on commons.wikimedia.org:

  1. jquery.uls.data.js
  2. wbTermsLanguages-variable (in page source)
  3. window.wpAvailableLanguages -variable (in page source)

smj, smn, sms ... are already included to

wbTermsLanguages seems to come from Language::fetchLanguageNames()

Shows wgExtraLanguageNames languages with labels defined in wgExtraLanguageNames

Shows wgExtraLanguageNames languages with labels defined somewhere else

Edit: Misunderstood whether X means works or not. Maybe use something less ambiguous. like

Testing request for somebody: what does Structured Data on Commons do if wgExtraLanguageNames is set?

$wgExtraLanguageNames = [
    'smn' => 'smn language', // T220118
    'sms' => 'sms language', // T220118
];

*Question*: Are the languages available in the following API queries (query1 and query), and are they also available in a photo's structured data file information box if a user tries to fill it in?
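One way to script such a check is sketched below. It is not necessarily the same as the queries linked above: it assumes the Wikibase `wbcontentlanguages` meta module with `wbclcontext` set to `term` or `monolingualtext`, and it assumes the JSON reply is a map keyed by language code under `query.wbcontentlanguages`. The function names, the default endpoint and the User-Agent string are placeholders, and the sample reply at the end is hand-written for illustration, not real API output.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; the Commons API URL could be passed instead.
DEFAULT_API = "https://www.wikidata.org/w/api.php"


def build_url(context, api=DEFAULT_API):
    """Build a wbcontentlanguages query URL for one context."""
    params = {
        "action": "query",
        "meta": "wbcontentlanguages",
        "wbclcontext": context,  # "term" or "monolingualtext"
        "wbclprop": "code",
        "format": "json",
    }
    return api + "?" + urllib.parse.urlencode(params)


def missing_codes(response, wanted):
    """Return the wanted codes that are absent from a decoded reply."""
    available = response.get("query", {}).get("wbcontentlanguages", {})
    return [code for code in wanted if code not in available]


def check(context, wanted, api=DEFAULT_API):
    """Fetch the live list (network access needed) and report gaps."""
    request = urllib.request.Request(
        build_url(context, api),
        headers={"User-Agent": "lang-check-sketch/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request) as reply:
        return missing_codes(json.load(reply), wanted)


# Hand-written sample reply, only to show the assumed shape.
sample = {"query": {"wbcontentlanguages": {"sma": {"code": "sma"}}}}
print(missing_codes(sample, ["sma", "smn", "sms"]))
```

Calling `check("term", ["smn", "sms"])` and `check("monolingualtext", ["smn", "sms"])` would then show which codes are still missing in each context. Whether the file information box on Commons reads from the same list is something this sketch does not answer.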

Submitted https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/cldr/+/511048/ to complete adding the language codes mentioned in this ticket to the CLDR extension.

Change 511054 had a related patch set uploaded (by Siebrand; owner: Siebrand):
[mediawiki/core@master] Correct autonym for rmy (Vlax Romani)

https://gerrit.wikimedia.org/r/511054

Change 511065 had a related patch set uploaded (by Siebrand; owner: Siebrand):
[mediawiki/extensions/Wikibase@master] Add Saami and Romani language codes

https://gerrit.wikimedia.org/r/511065

Just a reminder: I discussed the missing autonyms with @Nikerabbit, and we agreed that we shouldn't add incomplete language data. Please ensure there are verifiable/sourced autonyms for all language codes you would like added. At the moment, they are missing for rmc, rml, rmn, rmo.

Ok, thanks. I've got that on my list of things to do.

The autonym for rmy could also use a better reference, as I noted in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/511054/ .

Other than that, thanks a lot for these efforts—I totally support the general idea.

The whole autonym requirement doesn't work for dead languages or for languages we only have an oral source for though. So we unfortunately end up with unusable data from languages that we will never be able to find an autonym for in the case of dead languages or a reliable source for both cases.

Change 511140 had a related patch set uploaded (by Zache-tool; owner: Zache-tool):
[mediawiki/extensions/Wikibase@master] T223524 fetching supported monolingual texts with wbContentLanguage API call. Just proof-of-concept for seeing if this should be fixed using http API-calls or internal function calls. Tested only with wikibase docker.

https://gerrit.wikimedia.org/r/511140

> The whole autonym requirement doesn't work for dead languages or for languages we only have an oral source for though. So we unfortunately end up with unusable data from languages that we will never be able to find an autonym for in the case of dead languages or a reliable source for both cases.

It's possible to find a solution for dead languages. This is not a blocker. If any of these languages are dead, don't have an autonym that can be found, or are somehow problematic in any other way, just tell me and we'll find a way.

For example, we can decide on some compromise name that will be the most useful for the people who work with this language in any way, as was done with Jewish Babylonian Aramaic: https://github.com/wikimedia/jquery.uls/pull/244

This is great, thanks! A similar workaround was used for the three Saami languages we don't have autonyms for right now (and in the case of Kemi Saami, probably ever.)

Change 511054 merged by jenkins-bot:
[mediawiki/core@master] Correct autonym for rmy (Vlax Romani)

https://gerrit.wikimedia.org/r/511054

I've retracted the patch to the language data on GitHub. There is too much debate on autonyms, and that is not the right place. All proposed autonyms here should first be fully sourced and approved by @Nikerabbit or @Amire80 before I'll create a patch.

Which languages have now been removed?

I didn't remove any, but I had questions about the autonyms. See my comments at https://github.com/wikimedia/language-data/pull/53

Again, I don't really want to block it just because of the autonyms. I understand that these are very small, dead, or poorly documented languages and sources about them are difficult to find. However, if entering information about them into Wikidata or translating into them in translatewiki is realistic and useful, then it should also be possible to find translators or academics who can suggest an autonym. If it's definitely impossible to find a true autonym, that is, a name of the language in the language itself, it should be some kind of compromise title that is most useful for the linguists who will work with it.

Ok, thanks.

If I had realized the issues with the Romani languages last week already, it could have been as simple as going to the Department of Linguistics in Prague and asking for help, but they weren't around on the weekend.

I'll contact people later on this week about these languages.

Thank you. An email from linguists who are familiar with these languages will be perfect.

E-mail confirmation from the Helsinki University lecturer on Romani languages that kaalengo tšimb is indeed correct for rmf. Still working on the other languages:

kaalengo tšimb is the correct name for Finnish Romani in Finnish Romani.

Yupik updated the task description.

I've added the autonyms/endonyms for sia and sjt in IPA, as provided by an expert linguist. The suggestion is to use IPA, since these languages have no official orthography of their own; using UPA (Uralic Phonetic Alphabet) could also be a good option.

And two other linguists recommended using Cyrillic:

Under Akkala Saami you can write:
а̄кь са̄мь кӣлл, а̄ххькэль са̄мь кӣлл, а̄кьяввьр са̄мь кӣлл

I got these designations from the Akkala Saami speaker I worked with in 2011.
Akkala Saami has no orthography; I have written the words using the Kildin Saami orthography.

For Ter Saami I only have the Kildin Saami designations found in Aleksandra Antonava's dictionary:
таррь са̄мь кӣлл, нуҏҏьт са̄мь кӣлл

So I think that for these two languages, it would be necessary to have all three possible ways of writing the languages (sia-cyrl, sia-ipa, sia-upa) and the same for sjt (sjt-cyrl, sjt-ipa, sjt-upa) since this is what we have data in.

Here's the same in UPA:

Screen Shot 2019-05-24 at 20.21.12.png (94×380 px, 26 KB)

Just a note for myself: Wikimedia Commons also requires local mediawiki:lang/langcode pages for https://commons.wikimedia.org/wiki/Module:Languages

Note: The tables for this ticket need to be updated.

Aklapper subscribed.

@Yupik: Hi! This task has been assigned to you a while ago. Could you maybe share an update? Do you still plan to work on this task, or do you need any help?

We have AFAIK the Saami languages, but the Romani languages/dialects are still not available because the ISO codes don't match the divisions of these.

Yupik removed Yupik as the assignee of this task. May 30 2021, 3:27 PM

Maybe this can be closed and continued in T217430 for Sami and a new ticket for the others.

Change 511140 abandoned by Addshore:

[mediawiki/extensions/Wikibase@master] T223524 fetching supported monolingual texts with wbContentLanguage API call. Just proof-of-concept for seeing if this should be fixed using http API-calls or internal function calls. Tested only with wikibase docker.

Reason:

https://gerrit.wikimedia.org/r/511140

closing as it seems to work in all places where I tested it.

Zache moved this task from Patch for Review to Done on the WMFI board.