The entity selector doesn't work well for many classifying statements. One example is "sex or gender: male": the item for male does not show up on the selector's first page because it has no sitelinks and is therefore ranked low.
We can fix this by also taking the number of labels of an item into account when ranking. The item for male would then be ranked considerably higher because it has labels in many languages.
Related Objects
Mentioned In
- rMEXT1733c61efcfb: Updated mediawiki/extensions Project: mediawiki/extensions/Wikibase…
- rEWBAb4ca7b4f04b4: Use max( |sitelinks|, |labels| ) for term weight.
- rMEXTcb9b74a66c66: Updated mediawiki/extensions Project: mediawiki/extensions/Wikibase…
- rEWBAffd5de3b47d7: Introduce TermSqlIndex::supportsSearchKeys
- T86530: Replace wb_terms table with more specialized mechanisms for terms (tracking)
Mentioned Here
- T86530: Replace wb_terms table with more specialized mechanisms for terms (tracking)
Event Timeline
Using incoming links is something I think is feasible once we use Elasticsearch as a backend. In the short term, considering the number of labels might help?
Hm, I don't think there is much difference between the number of sitelinks and the number of labels for the known queries that cause problems.
@Sjoerddebruin https://www.wikidata.org/wiki/Q6581097 has numerous labels but no site links
For things like male there is quite a difference. https://www.wikidata.org/wiki/Q6581097 has 0 sitelinks but ~90 labels.
Yeah, but that's English. I'm talking about when I search in Dutch for "man" (the Dutch word for male): will it appear on top, or will the island be on top?
The island will probably continue to be on top because it has so many sitelinks and roughly as many labels. But at least the item for male will now show up on the first page.
- I support the max( number of labels, number of sitelinks ) approach discussed in the sprint start meeting.
- What about considering the number of incoming links (a.k.a. WhatLinksHere) in the internal ranking algorithm? I understand that such a ranking will only be re-calculated when the entity is edited. But this should not be a big problem if the algorithm favors sitelinks and labels and uses backlinks as a minor aspect in the calculation.
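To make the two proposals above concrete, here is a minimal sketch in Python (not the actual Wikibase code, which is PHP). The `Candidate` class, the function names, the `backlink_factor` parameter, and the log-damped form of the backlink bonus are all assumptions made for illustration. The only count taken from this thread is that Q6581097 has 0 sitelinks and roughly 90 labels; the other items are made up.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    item_id: str
    labels: int         # number of languages that have a label
    sitelinks: int      # number of sitelinks
    backlinks: int = 0  # incoming links (hypothetical input)


def term_weight(c: Candidate) -> int:
    # Proposal 1: the larger of the two counts decides the weight.
    return max(c.sitelinks, c.labels)


def term_weight_with_backlinks(c: Candidate, backlink_factor: float = 0.1) -> float:
    # Proposal 2 (assumed form): backlinks only add a small, log-damped
    # bonus, so sitelinks and labels still dominate the ranking.
    return term_weight(c) + backlink_factor * math.log1p(c.backlinks)


def rank(candidates, weight=term_weight):
    # Highest weight first; sorted() is stable, so ties keep input order.
    return sorted(candidates, key=weight, reverse=True)


# Q6581097 (male): counts as reported in this thread.
male = Candidate("Q6581097", labels=90, sitelinks=0)
# Hypothetical items with invented counts.
well_linked = Candidate("Q_EXAMPLE_A", labels=120, sitelinks=150)
obscure = Candidate("Q_EXAMPLE_B", labels=3, sitelinks=5)

by_max = rank([obscure, male, well_linked])
by_sitelinks_only = rank([obscure, male, well_linked], weight=lambda c: c.sitelinks)

print([c.item_id for c in by_max])
print([c.item_id for c in by_sitelinks_only])
```

With sitelinks alone, the item for male sorts last among these three; with the max-based weight it moves ahead of the sparsely described item while a heavily linked item still stays on top.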
Change 202399 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Introduce TermSqlIndex::supportsSearchKeys
https://gerrit.wikimedia.org/r/202399
Change 202399 merged by jenkins-bot:
Introduce TermSqlIndex::supportsSearchKeys
https://gerrit.wikimedia.org/r/202399
@thiemowmde The number of incoming links would be the best indicator, since it directly correlates with the probability of the user wanting to link to the entity. But calculating it is too expensive, even on edit; CirrusSearch has a similar problem, and a solution (I don't remember the details, ask Nik). Once we move term lookup to Elastic, we can use it.
Change 202456 had a related patch set uploaded (by Daniel Kinzler):
Use max( |sitelinks|, |labels| ) for term weight.
https://gerrit.wikimedia.org/r/202456
What if we stored the most used values for each property and updated it once a week? That wouldn't be expensive to calculate, would it? Even if we stored the top fifty values?
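A rough sketch of what such a periodic job could compute, in Python. This is only an illustration of the idea in the comment above; the function name, the input shape (an iterable of `(property_id, value_id)` pairs), and the sample data are assumptions, not an existing Wikibase feature.

```python
from collections import Counter, defaultdict


def top_values_per_property(statements, limit=50):
    """Return the `limit` most used values for each property.

    `statements` is assumed to be an iterable of (property_id, value_id)
    pairs, e.g. produced by a weekly pass over a dump.
    """
    counts = defaultdict(Counter)
    for property_id, value_id in statements:
        counts[property_id][value_id] += 1
    return {
        property_id: [value for value, _ in counter.most_common(limit)]
        for property_id, counter in counts.items()
    }


# Hypothetical sample data: P21 is "sex or gender", Q6581097 is male,
# Q_OTHER is a placeholder value.
sample = [
    ("P21", "Q6581097"),
    ("P21", "Q6581097"),
    ("P21", "Q_OTHER"),
]
top = top_values_per_property(sample, limit=50)
print(top)
```

The selector could then show these stored values first for a given property, before falling back to the general term ranking.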
If we can do this once we move to Elasticsearch, should we just shelve this bug until then? Basing the recommendations on which values are used elsewhere with the property is definitely the way to go.
This is a relatively simple fix and already implemented - just needs review. Elastic will take quite some work and time. I want us to have an improvement now because the current situation is really bad.
I don't think we need to run a maintenance script for this. We can just purge items like male and female.
If for any reason we want to update term_weight in the entire table, we still have the rebuildTermsSearchKey.php script. It still works and now uses wfWaitForSlaves, so it is probably safe to use :)
Change 202456 merged by jenkins-bot:
Use max( |sitelinks|, |labels| ) for term weight.
https://gerrit.wikimedia.org/r/202456