Wikidata:Requests for permissions/Bot/BboberBot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Withdrawn. --Lymantria (talk) 07:07, 31 August 2022 (UTC)[reply]
BboberBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Bbober (talk • contribs • logs)
Tasks: The bot will browse the latest VIAF dump, select the lines that link an IdRef ID (P269) to a Wikidata item, and add a P269 statement when it doesn't already exist in Wikidata
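For illustration, the cluster selection step could be sketched as below — a minimal Python sketch, assuming the tab-separated viaf-links dump format (`http://viaf.org/viaf/<cluster>` followed by a tab and `<SOURCE>|<local-id>`, with `WKP` for Wikidata and `SUDOC` for IdRef). The function name is hypothetical; the actual task was run through OpenRefine:

```python
from collections import defaultdict

def idref_candidates(dump_lines):
    """Group VIAF link lines by cluster and yield (QID, IdRef ID) pairs
    for clusters that contain both a Wikidata item (WKP source) and a
    SUDOC/IdRef identifier. Assumes the tab-separated viaf-links format:
    http://viaf.org/viaf/<cluster>\t<SOURCE>|<local-id>
    """
    clusters = defaultdict(dict)
    for line in dump_lines:
        cluster, _, ref = line.rstrip("\n").partition("\t")
        source, _, local_id = ref.partition("|")
        clusters[cluster][source] = local_id
    for ids in clusters.values():
        if "WKP" in ids and "SUDOC" in ids:
            yield ids["WKP"], ids["SUDOC"]
```

Each yielded pair is a candidate "add P269 to this item" edit, still subject to the checks discussed later in the thread.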
Code:
Function details: The task will be completed using OpenRefine in several batches. 345,000 items will be modified --BboberBot (talk) 10:07, 13 July 2022 (UTC)[reply]
- Support but all VIAF IDs that are present in Wikidata as VIAF ID (P214) and deprecated must be excluded. You can do the 50 sample edits, of course, and link them here. --Epìdosis 17:30, 25 July 2022 (UTC)[reply]
- If the sample edits are those of July 7th (e.g. https://www.wikidata.org/w/index.php?title=Q549726&diff=prev&oldid=1672901839), I think the reference format should be changed: reference URL (P854) with the IdRef URL is not very useful; the reference IMHO should be like https://www.wikidata.org/w/index.php?title=Q113280193&diff=prev&oldid=1687994245: stated in (P248)Virtual International Authority File (Q54919) + VIAF ID (P214)ID of source cluster + retrieved (P813)date of the dump. Could you do a few edits with this reference format? --Epìdosis 17:34, 25 July 2022 (UTC)[reply]
- Hi, thanks for your comments. I've just made 50 test edits following your recommendations. It seems OK to me. If it's also OK to you, I will do the others next week. BboberBot (talk) 07:31, 18 August 2022 (UTC)[reply]
- Not sure how I can link the edits, though... BboberBot (talk) 07:35, 18 August 2022 (UTC)[reply]
- @BboberBot: Thanks! Nearly OK, I would include in the references also the single VIAF ID of the cluster (e.g.). Could you make some edits containing it? Thanks again, --Epìdosis 14:48, 18 August 2022 (UTC)[reply]
- @Epìdosis Great, thanks again! A new test edit batch is ready to be reviewed. Do you think there is an easy way to update my first test edits to add the P214, or will I have to do it by hand? I can't figure out how I could do this with OpenRefine. BboberBot (talk) 05:58, 19 August 2022 (UTC)[reply]
- Support and I agree with the remarks of Epìdosis. Cheers, VIGNERON (talk) 18:48, 25 July 2022 (UTC)[reply]
- Please let the bot perform 50-250 test edits, incorporating the suggestions by Epìdosis. Lymantria (talk) 10:45, 27 July 2022 (UTC)[reply]
- Support, if the changes proposed by Epìdosis are taken into account. Cheers, — Envlh (talk) 18:04, 29 July 2022 (UTC)[reply]
- Comment @BboberBot: I noticed that the reference to the VIAF cluster is added also when it is already present (with a different retrieved (P813), of course); I tend to think it is not necessary. --Epìdosis 08:34, 20 August 2022 (UTC)[reply]
- Yes. I read in the OpenRefine documentation that "If you upload edits that are redundant (that is, all the statements you want to make have already been made), nothing will happen." In this particular case, the dates are not the same, hence the pseudo-duplicate. BboberBot (talk) 07:37, 22 August 2022 (UTC)[reply]
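The pseudo-duplicate problem can be made concrete with a small sketch: two references count as duplicates in the relevant sense when they agree on everything except retrieved (P813). The data shape and function name below are hypothetical, not OpenRefine internals:

```python
def same_reference_ignoring_retrieved(ref_a, ref_b):
    """References modelled as dicts of property -> value.
    Treat two references as duplicates when they agree on every
    property except retrieved (P813), whose value is the dump date."""
    def strip_retrieved(ref):
        # Drop P813 before comparing, so differing dump dates
        # don't make an otherwise identical reference look new.
        return {prop: val for prop, val in ref.items() if prop != "P813"}
    return strip_retrieved(ref_a) == strip_retrieved(ref_b)
```

Under this comparison, the bot would skip adding a reference that only "refreshes" the retrieved date of one already present.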
- Comment besides the problem of references differing only in retrieved (P813), which could be avoided by excluding items already having an IdRef ID (P269) value referenced with VIAF ID (P214), I noticed that @Emu: correctly reported other problems in the bot edits in User talk:BboberBot#Problematic edits. These days I'm trying to solve cases of VIAF ID (P214) present in two human items (Property talk:P214/Duplicates/humans), which are often caused by imprecise VIAF clusterization, so I agree about the necessity of some cautions. I would suggest the following changes to the bot operations:
- if VIAF ID (P214) value in the item has deprecated rank, it must always be ignored (this should be easy)
- if IdRef ID (P269) value is already present in at least one other item, it must not be added (this may be less easy, I fear)
- of course, the two above changes don't prevent a third category of errors: if P214 is not deprecated and P269 is not yet present in another item but effectively refers to an entity different from that of the item (i.e. unnoticed imprecise VIAF clusterization), I think we have no machine-viable way to spot these cases. Of course, any import that relies on VIAF (or ISNI) for matching carries some percentage of mistakes, I think somewhere below 5%; if we consider that percentage unacceptable, then probably the entire import is impossible (however, we should take into account that on Mix'n'match we have plenty of catalogues where automatches are made on the basis of the external ID having a VIAF ID in itself ...); I tend to think that, with the two precautions above, the percentage of errors due to VIAF is acceptable and that, in sum, the addition even of imprecise IDs in this way can help users to notice such VIAF problems and manually fix them (removing wrong IDs from the items and deprecating VIAF IDs as conflations). So I still support the bot, with 3 caveats: avoid references differing only in retrieved (P813), avoid using deprecated VIAF ID (P214) values, and avoid adding IdRef ID (P269) values already present in other items (if possible, create a table somewhere with the IDs skipped in this way; it could be reviewed manually with good results). Thanks, --Epìdosis 12:12, 25 August 2022 (UTC)[reply]
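The three caveats can be sketched as a pre-filter over candidate additions — a minimal Python sketch with hypothetical data shapes, not the actual OpenRefine workflow:

```python
def should_add_p269(item, idref_id, idref_owners):
    """Apply the three caveats to one candidate (item, IdRef ID) pair.
    `item` is a dict with:
      'p214': list of (viaf_id, rank) pairs already on the item,
      'p269': list of IdRef IDs already on the item.
    `idref_owners` maps each IdRef ID already in use to the set of
    items holding it. Returns (False, reason) when the edit must be skipped."""
    # Caveat 2: ignore items whose VIAF ID (P214) has deprecated rank.
    if any(rank == "deprecated" for _viaf_id, rank in item["p214"]):
        return False, "deprecated P214"
    # Caveat 3: never add an IdRef ID already present on another item.
    if idref_owners.get(idref_id):
        return False, "P269 already used elsewhere"
    # Caveat 1: skip items that already carry a P269 value, which also
    # avoids references differing only in retrieved (P813).
    if item["p269"]:
        return False, "P269 already present"
    return True, "ok"
```

Skipped pairs could be logged to the review table suggested above instead of being silently dropped.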
- FYI Vladimir Alexiev Jonathan Groß Andy Mabbett Jneubert Sic19 Wikidelo ArthurPSmith PKM Ettorerizza Fuzheado Daniel Mietchen Iwan.Aucamp Epìdosis Sotho Tal Ker Bargioni Carlobia Pablo Busatto Matlin Msuicat Uomovariabile Silva Selva 1-Byte Alessandra.Moi CamelCaseNick Songceci moz AhavaCohen Kolja21 RShigapov Jason.nlw MasterRus21thCentury NGOgo Pierre Tribhou Ahatd JordanTimothyJames Silviafanti Back ache AfricanLibrarian M.roszkowski Rhagfyr 沈澄心 MrBenjo S.v.Mering Hiperterminal (talk) מקף Lovelano Ecravo Chado07 SoufiyounsNotified participants of WikiProject Authority control --Epìdosis 12:37, 25 August 2022 (UTC)[reply]
- FYI JakobVoss (talk) ClaudiaMuellerBirn (talk) Criscod (talk) Daniel Mietchen (talk) Ettorerizza (talk) Ls1g (talk) Pasleim (talk) Hjfocs (talk) 17:24, 21 January 2019 (UTC) PKM (talk) 2le2im-bdc (talk) 20:30, 24 January 2019 (UTC) Vladimir Alexiev (talk) 16:37, 21 March 2019 (UTC) ElanHR (talk) User:Epìdosis (talk) Tris T7 TT me UJung (talk) 11:43, 24 August 2019 (UTC) Envlh (talk) SixTwoEight (talk) User:SCIdude (talk) Will (Wiki Ed) (talk) Mathieu Kappler (talk) So9q (talk) 19:33, 8 September 2021 (UTC) Zwolfz (talk) عُثمان (talk) 16:31, 5 April 2023 (UTC) M2k~dewiki (talk) 12:28, 24 September 2023 (UTC) —Ismael Olea (talk) 18:18, 2 December 2023 (UTC) Andrea Westerinen (talk) 23:33, 2 December 2023 (UTC) Peter Patel-Schneider[reply]Notified participants of WikiProject Data Quality --Epìdosis 12:37, 25 August 2022 (UTC)[reply]
- I agree that the overall accuracy is probably higher than 95%. But there’s a catch: it’s not evenly spread. The accuracy for people with a hyphen in their last name and a birth year between 1930 and 1960 (so their career started after the advent of modern technology but before privacy concerns messed everything up) is probably next to 100%. But for others (say, minor 19th-century scholars whose first names are Jacques, Johann or John) the error rate is much, much higher. Unfortunately, errors are much more problematic for those people. But if OpenRefine is used anyway, why not check for matching DOB/DOD between Wikidata and IdRef? That would greatly reduce the error rate. --Emu (talk) 12:46, 25 August 2022 (UTC)[reply]
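The suggested DOB/DOD check could look like the following sketch (hypothetical data shape, years only, since full dates are often missing on one side): reject on any explicit disagreement, accept only with at least one confirmed agreement, and flag everything else so it can be skipped or sent to manual review:

```python
def dates_compatible(wd, idref):
    """Compare birth/death years between a Wikidata item and an IdRef
    record. `wd` and `idref` are dicts with optional 'dob'/'dod' year ints.
    Returns 'mismatch' on any explicit disagreement, 'match' when at
    least one year agrees on both sides, and 'unknown' otherwise."""
    confirmed = False
    for key in ("dob", "dod"):
        a, b = wd.get(key), idref.get(key)
        if a is not None and b is not None:
            if a != b:
                return "mismatch"  # hard evidence the match is wrong
            confirmed = True       # one year confirmed on both sides
    return "match" if confirmed else "unknown"
```

Treating "unknown" as skip-or-review, rather than accept, is the cautious reading of the suggestion above.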
- Hi, back again and trying to sort things out. The number of deprecated VIAF IDs in my project is very low (854 out of 300K, less than 0.3%), but I agree that they should be avoided if it's feasible (and, as User:Epìdosis writes, it quite is). It is not that difficult either to spot when a P269 is used elsewhere (a SPARQL query could do the trick). As for the first caveat about the pseudo-duplicates, all I see is not adding a P269 at all, even if the new value would have corrected a deprecated one. If User:Mahir256 accepts to remove the block I will be able to move on and propose another test edit. Bbober (talk) 07:01, 26 August 2022 (UTC)[reply]
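The SPARQL check mentioned above could be as simple as a batched query against the Wikidata Query Service. A sketch that only builds the query string (the helper name is hypothetical; actually sending the query to the endpoint is left out):

```python
def p269_conflict_query(idref_ids):
    """Build a WDQS SPARQL query listing every item that already holds
    any of the given IdRef IDs as a P269 value, so those candidate
    additions can be skipped."""
    # VALUES lets one query check a whole batch of IDs at once.
    values = " ".join('"{}"'.format(i) for i in idref_ids)
    return (
        "SELECT ?item ?idref WHERE {\n"
        "  VALUES ?idref { " + values + " }\n"
        "  ?item wdt:P269 ?idref .\n"
        "}"
    )
```

Any (?item, ?idref) row returned means the ID is already in use and the corresponding edit should not be uploaded.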
- Sorry but have you even read my response? Please do and explain which additional safeguards you will apply. --Emu (talk) 08:18, 26 August 2022 (UTC)[reply]
- Ok.
- 1. The bot will check whether a P214 is deprecated and will not load P269 for the corresponding items.
- 2. The bot will check whether the same P269 value it's about to load exists for another item, and if it does, it will not load a P269.
- 3. The bot will check whether a P269 value already exists (the original batch I made matched that condition, but things evolve on Wikidata), and won't do anything if it does.
- Checking matching dates of birth and death wouldn't be of great use. For minor 19th-century scholars, we won't have them in IdRef. As a matter of fact, out of the 6 examples you gave, only one has a birth date on IdRef (and sometimes the date doesn't exist on Wikidata either). Not including IdRef records without dates could be a solution.
- If the starting point is "VIAF is not relevant enough for what we want to achieve", well it's a pity and someone will have to find another way to import P269 Idref. Bbober (talk) 09:29, 26 August 2022 (UTC)[reply]
- Checking matching date of birth (P569) and/or date of death (P570) would be of great use if you strive for high accuracy. If you don’t do that, then I don’t think there is much use in your import, to be honest. I might be wrong but I think you are missing the point. You wrote on Twitter: “So my bot is maybe biaised [sic] because it uses VIAF data, but if VIAF data are wrong and mix different people, shouldn't they not be in Wikidata in the first place?” [1]
- No, ideally they shouldn’t. Wikidata should contain accurate information (or inaccurate information with a deprecated rank although this is disputed for authority control data). Mass-imports can of course be of great use and are done regularly, but they should strive for a maximum of accuracy.
- As for the six problematic matches I pointed out: It’s not like I tried to come up with them. The list is a result of me checking your edits for some (now most) of the items on my watchlist and cleaning up after you. It’s an incredibly time-consuming process. Your rather … let’s call it cavalier attitude towards data quality (“some VIAFs mix different pple (yeah it happens)”) makes me question if you do realize that you are causing a lot of cumbersome work for others. --Emu (talk) 11:46, 26 August 2022 (UTC)[reply]
- ok, let's call it off, it's not a big deal. Sorry for having wasted everybody's time and undermined the global quality of Wikidata. It's been an interesting ride Bbober (talk) 12:33, 26 August 2022 (UTC)[reply]
- I'm looking at this thread just now, and I am dismayed at the unfriendly and unconstructive attitude that the Wikidata community showed towards Bbober.
- He didn't try anything disruptive: just to import some Idref IDs. The discussion quickly degraded from "sure, just do these minor edits" to rejection of the whole batch.
- I think the assessment of "less than 5% errors" for VIAF is far too pessimistic. In my experience it's less than 0.1% errors.
- If we don't use VIAF links in WD, we'll never expose VIAF errors.
- @Emu seems to suggest that we shouldn't use VIAF in WD, and I quite disagree with that.
- I think we should focus on reporting errors to VIAF and pushing VIAF to fix those errors, rather than discouraging WD contributors from using VIAF.
- Cheers! Vladimir Alexiev (talk) 07:18, 28 August 2022 (UTC)[reply]
- @Vladimir Alexiev Couple of things:
- I am dismayed at the unfriendly and unconstructive attitude that the Wikidata community showed towards Bbober: I beg your pardon? Would you care to describe how we were “unfriendly and unconstructive”?
- He didn't try anything disruptive: Yes, he didn’t try to. What’s your point?
- seems to suggest that we shouldn't use VIAF in WD, and I quite disagree with that: This is complete nonsense. In fact, working on authority data including VIAF has always been one of my main activities on Wikidata.
- If we don't use VIAF links in WD, we'll never expose VIAF errors: You do realize that several users (including me) have been active for many years to deal with VIAF errors?
- I think we should focus on reporting errors to VIAF and pushing VIAF to fix those errors: Again, I have been doing this for years. But you don’t seem to realize how VIAF works: It’s automatic matching with very few manual corrections. It’s not authority data, it’s an automatic compilation of authority data. That’s no problem per se, it just shows that you can’t rely on their data. Quite the opposite: They rely heavily on our data for their matching. I have spent many, many hours of my life correcting countless faulty clusters, tracking them over months and sometimes years and coming up with new ways to disambiguate, including writing to dozens of libraries worldwide and creating hundreds or even thousands of items. I’m certainly not an expert but I think I know a thing or two about this.
- I’m very sorry that one well-intentioned new user had a rocky start. I did my best to address the problem before the block and pointed out possible solutions in this discussion. That’s all I can do. --Emu (talk) 12:04, 28 August 2022 (UTC)[reply]
- Hey @Emu, chill man! I don't mean to belittle your contributions to VIAF!
- I mean that the contribution of a well-meaning user (@Bbober) was effectively chased away rather than taken for its value, with experienced users like you then leading the clean-up of its quality problems (if it has significant ones).
- I know that VIAF operates with a very small team (like 4 people but not full time), and unfortunately they are very hard to reach or get any feedback from.
- Do you know how WD could write to the VIAF xA and xR files? That's the way VIAF records manual corrections, but I have no idea how we could affect it. I know that VIAF takes MARC (I've looked at it for Getty ULAN) but I don't know how to proceed organizationally. Vladimir Alexiev (talk) 16:21, 28 August 2022 (UTC)[reply]