Wikidata:Requests for comment/Duplicate References Data Model and UI
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Thanks for your supporting comments. Phabricator task T360224 has been opened; any further thoughts on this may be shared there. ArthurPSmith (talk) 18:57, 15 March 2024 (UTC)
An editor has requested the community to provide input on "Duplicate References Data Model and UI" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.
If you have an opinion regarding this issue, feel free to comment below. Thank you!
From the session on alternative reference models at Wikidata:Events/Data Modelling Days 2023 there emerged two proposals for developers to improve the treatment of duplicated references on Wikidata items: 1. Condense the internal JSON storage so that duplicate references are stored in full only once per item, and 2. Modify the Wikidata UI representation of items with duplicated references to allow editing of all copies of a reference with one edit, rather than separate edits for each statement the reference is on. This RFC is to allow the wider community to comment on these proposed changes prior to creation of detailed requests for the developers (anticipated to be made in January 2024).
1. Condense internal JSON storage for duplicate references
Duplicate references can significantly increase the size of items under the current storage format. As an example, see Q21481859, which has almost 3000 authors who (should) all have the same reference; the duplicated reference data accounts for over 1 MB of the 4.4 MB size of the item. Wikidata items have a maximum JSON file size of about 4.4 MB, so the reference duplication has made this and similar items almost uneditable. Each statement with a reference has a "references" attribute in the JSON that looks something like:
"references": [ { "hash": "51ae109329c13aebb6e83e53e1583cf93312f9e6", "snaks": { "P248": [{ ... }], "P813": [{ ... }], ... }, ... }, ... ]
It is proposed that these per-statement entries be replaced with an indirect reference using the hash:
"references": ["51ae109329c13aebb6e83e53e1583cf93312f9e6"]
(perhaps "references" should be replaced with "reference_hashes", or the entries maintained as JSON objects with attribute "ref_hash"). Then the item would have its own "references" attribute that contains the full duplicated reference entries (once per hash value):
"references": [ { "hash": "51ae109329c13aebb6e83e53e1583cf93312f9e6", "snaks": { "P248": [{ ... }], "P813": [{ ... }], ... }, ... }, ... ]
This could be implemented as a layer that translates between the current format and this storage format, and back (so that no change would be needed at higher levels such as the UI, APIs, etc.), or it could be a more integrated change that could also improve performance and size for the other layers.
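The translation-layer option can be illustrated with a minimal Python sketch. It is not Wikibase code: the function names `condense` and `expand` are hypothetical, the item-level "references" key follows the suggestion above, and the sketch assumes that two references with the same hash have identical content.

```python
import copy


def condense(item):
    """Return a copy of `item` in which each full reference is stored once.

    Statements keep only a list of hashes; the full entries move to a
    hypothetical item-level "references" list (one entry per hash).
    """
    out = copy.deepcopy(item)
    pool = {}  # hash -> full reference, in first-seen order
    for statements in out.get("claims", {}).values():
        for statement in statements:
            refs = statement.get("references")
            if refs is None:
                continue
            for ref in refs:
                # Assumption: equal hashes mean equal reference content.
                pool.setdefault(ref["hash"], ref)
            statement["references"] = [ref["hash"] for ref in refs]
    out["references"] = list(pool.values())
    return out


def expand(item):
    """Inverse of condense(): restore full references on every statement."""
    out = copy.deepcopy(item)
    pool = {ref["hash"]: ref for ref in out.pop("references", [])}
    for statements in out.get("claims", {}).values():
        for statement in statements:
            if "references" in statement:
                statement["references"] = [
                    copy.deepcopy(pool[h]) for h in statement["references"]
                ]
    return out


# Tiny made-up item: two statements sharing one reference.
shared = {"hash": "51ae109329c13aebb6e83e53e1583cf93312f9e6",
          "snaks": {"P248": [{}], "P813": [{}]}}
sample = {"claims": {"P50": [
    {"id": "stmt-1", "references": [copy.deepcopy(shared)]},
    {"id": "stmt-2", "references": [copy.deepcopy(shared)]},
]}}
condensed = condense(sample)
restored = expand(condensed)
```

Because `expand(condense(item))` gives back the original structure, higher layers could keep seeing the current format while only the stored form changes.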
Other techniques for shrinking the storage format should also be considered (removing unneeded whitespace characters, a compression solution like gzip, changing the storage format to be closer to the REST API format that is already more compact without loss of information) - but these may have other implications and would need to be considered separately.
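As a rough way to compare these alternatives on a given item, the following Python sketch measures one JSON document serialized with indentation, serialized compactly, and gzip-compressed. The helper name `storage_sizes` is hypothetical, and the indented form is only a stand-in for "unneeded whitespace"; how Wikibase actually lays out its stored JSON is not verified here.

```python
import gzip
import json


def storage_sizes(item):
    """Return byte sizes of `item` under three illustrative encodings."""
    pretty = json.dumps(item, indent=4, ensure_ascii=False).encode("utf-8")
    # separators=(",", ":") drops the spaces json.dumps adds by default.
    compact = json.dumps(item, separators=(",", ":"),
                         ensure_ascii=False).encode("utf-8")
    return {
        "pretty": len(pretty),
        "compact": len(compact),
        "gzip": len(gzip.compress(compact)),
    }
```

Running this on a downloaded copy of an item with many duplicated references would give a sense of how much each technique saves relative to the hash-based condensing proposed above.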
2. Modify the Wikidata UI for editing duplicated references
If a duplicated reference needs to be modified in some way, right now that requires a separate user interaction with each instance of the reference. Whether or not the above storage change is implemented, it would be helpful for the user interface to allow editing or deleting all duplicate copies at once. There may be other UI changes that would also be useful for handling duplicate references. Some suggestions follow:
- A. When editing a reference, highlight other statements that have the same reference. Add a checkbox in the reference editing area with label
apply changes to all copies of this reference □
and then update all matching reference entries with the "publish" action (if box checked).
- B. When adding a reference to a statement, allow adding an existing reference on the item (maybe the DuplicateReferences gadget is sufficient?)
- C. Add a new section "References" under "Statements" and "Identifiers", with each reference listed only once, and indicate references on the statements with something like the standard wikitext format.[1] Editing references on each statement would bring in the full reference text for editing, but editing references in the Reference section would make the change applicable to all copies of the reference at once.
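The "apply changes to all copies" behaviour in suggestion A could work roughly as in the Python sketch below, which operates on an item in the current JSON format. The function names `find_copies` and `apply_to_all_copies` are hypothetical, copies are matched by the "hash" attribute only, and the replacement reference is taken as given (computing its new hash is left to the server and is not modelled).

```python
import copy


def find_copies(item, ref_hash):
    """Return (statement, index) pairs for every copy of a reference.

    The statements returned are the ones a UI could highlight.
    """
    hits = []
    for statements in item.get("claims", {}).values():
        for statement in statements:
            for i, ref in enumerate(statement.get("references", [])):
                if ref.get("hash") == ref_hash:
                    hits.append((statement, i))
    return hits


def apply_to_all_copies(item, ref_hash, new_reference):
    """Replace every copy of `ref_hash` in place; return affected statement ids."""
    hits = find_copies(item, ref_hash)
    for statement, i in hits:
        # Each statement gets its own copy so later edits stay independent.
        statement["references"][i] = copy.deepcopy(new_reference)
    return [statement.get("id") for statement, _ in hits]
```

A single "publish" action with the box checked would then correspond to one call to `apply_to_all_copies`, and the same operation could be exposed to bots at the item level.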
References
- ↑ Like this
Officially created as an RFC - ArthurPSmith (talk) 20:39, 7 December 2023 (UTC)
Please indicate support for the above proposals, or add any comments on how you think they should be adjusted or other considerations that may need to be reviewed.
- Support in general. With regard to the use of the Duplicate References gadget, I often use User:Bargioni/UseAsRef.js multiple times when inserting a reference for many statements, so we should consider a solution for that process as well (which might be to recommend using UseAsRef once and then using Duplicate References for the rest). - PKM (talk) 20:48, 5 December 2023 (UTC)
- Support; it always bugs me when I see how inefficient this is implemented right now. However, a migration is not easy, and it might be worth to provide both current and a potential future serialization for quite some time for data users. Besides that, the essence of the second proposal (Modify the Wikidata UI for editing duplicated references) should be considered in the context of bot editing as well, i.e. some API functionality should be available that allows to modify an existing reference either per item or per claim. —MisterSynergy (talk) 20:59, 7 December 2023 (UTC)
- Support (to both proposals); I add that, when using the gadget Duplicate References and also when using User:Bargioni/UseAsRef.js, all the references are added with the same hash; so using UseAsRef for the first and Duplicate References for the others, or using UseAsRef for all the references, are two methods which produce exactly the same result. --Epìdosis 15:48, 11 December 2023 (UTC)
- Support in general too. I don't have a clear picture on how tools like Quickstatements and Wikidata Integrator would need to be changed to comply with the new schema, but I guess it is doable. Would be good to see a feasibility analysis of some kind to get an idea of the workload. TiagoLubiana (talk) 17:04, 11 December 2023 (UTC)
- Support I really like these proposals. I welcome this improvement.--So9q (talk) 08:35, 13 December 2023 (UTC)
- Support in general. I'm just now searching for how to use the same reference twice. I'm very much in favor of #1, and okay with #2 as well. KarenJoyce (talk) 02:36, 13 February 2024 (UTC)
- @KarenJoyce: for duplicating a reference on a given item, you can activate in your preferences the gadget DuplicateReferences. --Jahl de Vautban (talk) 15:18, 14 February 2024 (UTC)