User:Salgo60/ExternalIdentifiers


One way to design a system to be a good external identifier in Wikidata


A small attempt to write down a checklist / best practice - Salgo60 (talk) 10:38, 14 November 2020 (UTC)

  1. have persistent unique IDs for things like parishes, countries, places, people, events, genres and "instance of" values, i.e. containers that Wikidata can connect to as "same as"
    1. container objects should have landing pages, and the persistent unique ID should be visible to the user in the user interface; compare Alvin Söderala
    2. all landing pages should be supported by GET, i.e. you can address the page with a URL and don't need POST; we have that problem with the SCB Regina database, see T200700
    3. see a Swedish example of persistent identifiers in a book from 1750, Bautil
  2. link your data to other data to provide context - to be a good member and reach level 5 (5-star data), "same as" links to external authorities should be visible. A small first step is "same as" a Wikidata Q-number
    1. I hope museums will get better place identifiers. We are trying to connect to Gotlands museum entity ID (P7068) and Malmö Museer ID (P8773); what we see is that they have a place but don't say "same as", so we don't understand which administrative level is meant, e.g. if they say "Söderala" we don't know if it is Söderala (Q2673411), Söderala parish (Q10688474), Söderala church parish (Q10688470) or Söderala (Q21779139)
      1. identifiers for streets. My understanding is that en:Lantmäteriet has no persistent unique ID for objects like streets, see blog - Update: they have identifiers in a non-open product, but it looks like they don't have "same as" 5-star data
  3. have version history and support merges by redirecting from the old item to the "new" item
    1. this should also be supported by the API; compare Wikidata owl:sameAs and a query for merges
  4. have a SPARQL endpoint and/or JSON access - then we can easily check differences between Wikidata and the external system, see e.g. SKBL Notebook, Nobel prize notebook...
    1. good documentation of the API, e.g. using Swagger, see ISOF, JobTech, Nobelprize
    2. see the problems we get with KulturNav (Q16323066), which just returns HTTP codes and has no tombstone pages (in Swedish)
  5. have timestamps for created and changed
  6. link back to Wikipedia pages/Wikidata e.g. VIAF, Swedish Litteraturbanken
  7. deleted items should be easy to find; compare the problems Europeana has with Wikidata items that get deleted
  8. support for more languages
    1. SKBL supports Swedish and English by changing the URL, e.g. Greta Garbo json sv en --> we now support templates in both en:Wikipedia and sv:Wikipedia --> 9 million visitors to those Wikipedia pages this year sv / en
  9. describe your data with a schema
    1. Example of how we do that in Wikidata using schemas: EntitySchema:E305 Svenska badvatten, EntitySchema:E395 Swedish member of parlament - explained, EntitySchema:E290 Swedish Runestones at the Swedish Literature Bank, E408 Entity schema Svenska Grillplatser - explained
  10. set a Creative Commons license on your data so others understand if and how they can reuse the data
  11. create GitHub repositories with code examples of how to access your data
    1. add your open data portal to Wikidata as an object and connect it with open data portal (P8402)
    2. add GitHub topic (P9100) to your GitHub example repository
    3. create a discussion in the repository, see examples European bathwaters, Swedish Bath Waters and "issues" with Swedish Agency for Marine and Water Management (Q10518645) link
    4. support a query language like WDQS
  12. Public prioritized backlog: the best pattern for success is to have a prioritized backlog open for questions and subscription, see the usage of Phabricator for the Wikidata project - video about active product management
    1. have public standups where external stakeholders can take part, present input and also learn what scope/use cases will be supported
  13. nice to have
    1. a change API, see example API:Recent changes stream used in this app --> changes Wikidata - I guess Google uses it, see tweet about how they updated the Google search product 20 minutes after a change in Wikidata
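
Checklist items 3 and 7 (redirects after merges, and deleted items that are easy to find) can be sketched as a small resolver. This is only a minimal Python sketch under stated assumptions: the `Record` class, the `resolve` function and the in-memory `store` are hypothetical names made up for illustration, not the API of Wikidata or of any system mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Record:
    """One identifier in a hypothetical external system (illustration only)."""
    id: str
    label: str = ""
    redirect_to: Optional[str] = None  # set when this ID was merged into another
    deleted: bool = False              # set when the item was removed (tombstone)


def resolve(store: Dict[str, Record], ident: str) -> Tuple[str, Optional[Record]]:
    """Follow merge redirects and report what happened to an old ID.

    Returns (status, record) where status is one of
    "ok", "redirected", "tombstone", "unknown" or "loop".
    """
    seen = set()
    status = "ok"
    current = ident
    while True:
        if current in seen:            # broken data: the redirects form a cycle
            return "loop", None
        seen.add(current)
        record = store.get(current)
        if record is None:             # never issued, or vanished without a trace
            return "unknown", None
        if record.deleted:             # removed, but still answers with a tombstone
            return "tombstone", record
        if record.redirect_to is None:
            return status, record
        status = "redirected"
        current = record.redirect_to


# Made-up sample data: A2 and A3 were merged, A4 was withdrawn.
store = {
    "A1": Record("A1", "Söderala parish"),
    "A2": Record("A2", redirect_to="A1"),
    "A3": Record("A3", redirect_to="A2"),
    "A4": Record("A4", "withdrawn item", deleted=True),
}

for old_id in ["A1", "A3", "A4", "A9"]:
    status, record = resolve(store, old_id)
    print(old_id, status, record.id if record else None)
```

With this shape a consumer such as Wikidata could update a link when the status is "redirected", deprecate it on "tombstone", and treat "unknown" as the case where the ID turned out not to be persistent.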

Example of possibilities and problems we find

  1. Persistent
    1. Graves at www.svenskagravar.se. Quote svenskagravar: "they are persistent IF we don't reload all graves". They have now reloaded several times --> i.e. they are not persistent
    2. A Swedish site containing local history material, "Sveriges Hembygdsförbund", upgraded to a modern platform and also "upgraded all ids" --> we needed to delete all linked items, see T248875, and start from scratch
    3. Europeana was an external identifier in P727 (P727), but the lesson learned was that it was not persistent, so they implemented a new approach and I created Europeana entity (P7704). The lesson learned, as you see below, was that the new approach has quality problems
  2. Quality: Europeana copied 160 000 items from DBpedia/Wikidata for artists BUT they haven't done the homework of connecting the right objects to the right artists; instead they used text strings and guessed --> bad quality, see T243764
    1. one "solution" to bad quality is to have design patterns telling who is the Single Source of Truth (SSOT), see discussions/113 --> I guess we then get better quality and also semantic interoperability
    2. support the "observer pattern" --> so you get control over who is using your data and can inform them directly about changes - in Swedish about how to handle "Inactive PID / unpublished material" using tombstone pages - DataCite "Best Practices for Tombstone Pages"
  3. Things not strings: if we are to be able to link from Wikidata, we need things
    1. anti-pattern seen in the Swedish LIBRISXL project: they publish RDF but have a lot of strings inside the RDF --> less usable data; we need things
      1. We have tried to communicate with the project (examples in Swedish) and also went to SWIB18 in Bonn to see if they were interested. Lesson learned: they are not so interested... for me that is an early warning that this is an organisation not mature enough for doing Linked data
  4. Error reporting: when connecting two domains you find problems/errors, e.g. Wikidata has indications of many duplicates in the Uppsala University Alvin database, but we have no easy way to report errors and they don't use Wikidata / Phabricator, where we track issues, see list duplicates or Task T243764
    1. same problem with the coordinates in the bathing waters data from Swedish Agency for Marine and Water Management (Q10518645): they are not good enough for Wikidata.
      1. I have started to use issues in a GitHub repository to track errors, but today we have no way of reporting the errors we find to the Swedish Agency for Marine and Water Management (Q10518645)... I guess this is even more complex, as it looks like responsibility for the data lies with the 290 municipality of Sweden (Q127448)... The solution to this problem is that we get better design patterns for Open Data as Single Source of Truth (SSOT), see discussions/113, and use data from one source that has proven better quality...
      2. as Wikidata supports ranking a property value, we can also give errors a lower rank with a reason, reason for deprecated rank (P2241), and find them using SPARQL
  5. Uniqueness - the Swedish National Archives has NAD, i.e. IDs for archives. We have reported that they are not unique, and now we see some redesign using en:GUID to fix this, see also Task T200046
    1. disambiguation pages: if a namespace gets several items that can be described with the same name, create disambiguation pages. In the new design of NAD from the Swedish National Archives it looks like they skip this, which from a user perspective is a nightmare if you just have the old ID, see DataCite "Best Practices for Tombstone Pages"
  6. Dates and calendars: the Nobel prize people developed a SPARQL endpoint using a product from Metasolution. It didn't support calendar format, so Nobel prize winners born in Russia got odd dates, see SPARQL Federation Wikidata <-> Nobelprize. The Nobelprize people have stopped maintaining this project and focus on an API, see nobelprize.org developer-zone - see Wikidata date support in JSON
  7. Lack of a helpdesk where we get a unique helpdesk ID when we ask a question / report an issue. We have this problem with the Swedish National Archives (they are now on GitHub), SCB, ISOF, "Lantmäteriet"... Swedish "Naturvårdsverket" has unique numbers but no easy way to see the status, e.g. 2018NV38321
    1. workarounds
      1. be active on Wikipedia ==> then we can ping them, discuss issues, agree on how we solve things and get feedback on errors in Wikipedia/Wikidata
      2. GitHub: Litteraturbanken and SKBL are active on GitHub and have issue trackers we use
        1. Litteraturbanken spraakbanken/littb-frontend/issues
        2. SKBL spraakbanken/skbl-portal/issues
        3. Svenska badvatten: Swedish GitHub repository where we track errors in the data and reference the Wikidata object
      3. create tasks in Phabricator.wikimedia.org, see my backlog
        1. Phabricator task graph related to tasks with Europeana
  8. Easy way to ask questions and see what questions others have asked. A good example was Libris (they have closed this, so they are not a good example any more, see internet archive; I guess all the metadata debt was shown and they got scared, see answer about copyright in Swedish / bad semantics 2019 / poor, unprofessional release info 2018); most other institutions don't have this
  9. Easy way to subscribe to an issue and get a notification when it's moved to production. We have this in Phabricator, used by Wikidata; see also change stream
  10. Data roundtrip: as we now support linked data on pictures, it's getting more and more important to have a data roundtrip approach, i.e. changes in Wikidata need to be tracked and taken care of in both systems so we can keep both systems in sync. Today we try to fix that ad hoc, but it would be better if we agreed on a "framework"/"model". Examples of what we do today:
    1. JSON and structured data
      1. Nobelprize.org Notebook
      2. Swedish female biographies Notebook
      3. The Swedish Literature Bank Notebook
    2. Web pages with no API: we web scrape and compare with Wikidata
      1. Swedish National Archive SBL Notebook
      2. Graves Uppsala Notebook
      3. Swedish Academy Notebook
    3. WikiTree: a genealogy site with 22 million profiles and 180 000 connections to Wikidata via WikiTree person ID (P2949)
      1. they regularly check the quality of the family tree against > 250 rules, where Wikidata accounts for a number of the checks, see Data doctors report
    4. Good Examples
      1. Runestones: we connect runestones in Wikidata with the Swedish Literature Bank and use SPARQL federation; RAÄ etc. can easily retrieve pictures from Wikimedia Commons, see article "Structured data for GLAM-Wiki/Roundtripping/KMB"
      2. All "Swedish Official" bath locations are added to Wikidata, see project github/Svenskabadplatser; we now support same as and SPARQL federation with museum pictures
  11. GET/POST: we need an easy way of linking using a URL. E.g. SCB Regina is designed so that a record can only be accessed using POST, which doesn't work with Wikipedia, see T200700
  12. Clean URLs not using redirects: in a perfect world everything is Linked data and WEB 2.0 and data is presented as data. As a workaround to use the power of Wikidata, an Australian researcher has created d:Wikidata:Entity_Explosion --> we can get old platforms like SBL, LibrisXL... to use Wikidata for finding "same as". If we install this web browser extension (see video) we get the magic of Wikidata, and we also see how we get problems with e.g. Alvin, which has a redirect and a rather "noisy" URL
  13. Active agile product management and an easy way to discuss / get updates on changes (see video about agile product owner). In Wikidata we have
    1. Prioritized open backlog: everyone can register and ask questions / subscribe
    2. Weekly status updates Wikidata:Status_updates/2020_11_16 / all
    3. Telegram groups Wikidata and Wikidata Sweden.....
    4. Project chats Wikidata:Project_chat / Wikidata Swedish plus on all pages e.g. property Dictionary of Swedish National Biography ID (P3217) you have Property_talk:P3217
    5. A meeting every second year that is available online, e.g. Wikidata:WikidataCon_2019/Program; example featured talks 2017 / 2019
    6. We have more research oriented meetings like wikidataworkshop 2 nov 2020, key note
      1. Research papers about Wikidata
  14. Missing vision statements and sharing of your future development. Example: we try to connect to the Europeana network and we see a lack of quality; it would be of great help if they shared the next step they will take. Without information it looks like they have given up. We have the same "challenge" with the Swedish Riksdagen (blog/video), where we have no understanding of the vision for classification, or of small things like whether they will support who is the substitute for a position. Today we have heard they are moving in the direction of using Eurovoc, and we need to read documents to find who is the substitute for a specific position; is that the vision?
    1. Public prioritized backlog: the best pattern for success is to have a prioritized backlog open for questions and subscription, see the usage of Phabricator for the Wikidata project - video about active product management
    2. Epics: share your epics, example Wikidata
      1. Improve Search Suggestions with NLP
      2. Growth: Newcomer tasks 1.0
      3. Better support for References in Content Translation
      4. Structured data backlog
      5. Feedback processes and tools for data-providers
    3. Newsletter to inform about what's going on, compare Wikidata:Status_updates
  15. good tools for measuring the uptime and usage of the service, compare Wikidata Grafana Dashboard
    1. tools for measuring wiki pageviews, e.g. article Greta Garbo, sv:Wikipedia articles linking Svenskt kvinnobiografiskt lexikon, same for en:Wikipedia
    2. page views, number of active users, edits, new registered users
      1. All wikis 23 B views per month, en:Wikipedia, sv:Wikipedia, Wikidata
  16. New technologies need new skills
    1. The Swedish LIBRISXL project started in 2012 to build a linked data library system, see video
      1. In 2019 they reported that they see no gains from Linked data, see report "Leaving Comfort Behind: a National Union Catalogue Transition to Linked Data"
        1. I have tried to ask questions about why they have odd "same as" or whether they have a vision for keywords of books, and my feeling is that they haven't educated librarians about linked data. I think a better approach may be to hire people with semantic knowledge, see Getty Semantic Architect; skilled data scientists may also be needed... The odd thing with LIBRISXL is that they have one very skilled person, Niklas Lindström (see interview), but that is not enough...
        2. Now in 2020 it looks like they no longer develop the project and are also closing down forum tools...
    2. Europeana started with Linked data in 2012 and today, in 2021, has big problems with metadata quality
      1. Lesson learned is that quality Linked data should be created at the source. Europeana tried the approach of guessing "same as", see blogpost --> even if it's free, the quality is so bad that en:Wikipedia doesn't link to them, see T243764.
        1. The interesting approach they took was adding Wikidata Q-numbers to the artists in Europeana, but they have very big problems with connecting the right object with the correct artist... The root cause is that they lack en:Semantic interoperability and move things as "strings not things"
    3. The Nobel prize organisation has redesigned the API and now has Wikidata same as also for locations etc., see API, and see also how we redesign Wikipedia/Wikidata to get fewer link problems
      1. see also the Notebook on how we sync the winners with Wikidata and some tests with the API
    4. Svenskt kvinnobiografiskt lexikon (Q50395049): we have a good dialogue and use a Notebook to check same as, BUT moving SKBL to 5-star level seems to be a problem as they use **strings not things**, see API. They said they would give it a try in September 2019 but nothing has happened...
      1. Tasks done, see Phabricator graph and GitHub salgo60/SKBLWikidata
    5. Uppsala University Alvin has a property on Wikidata, Uppsala University Alvin ID (P6821), with which we have connected > 30 000 objects and found a lot of errors
      1. I have told them that three times and suggested we meet to see how they can gain from it, see task T226099 / T225522
      2. so far no reaction; I guess they are not focusing on having good metadata and/or don't understand the benefits of en:Linked data
  17. Organisations not understanding the value of Metadata and knowledge graphs
    1. Lesson learned is that organisations like Google, Amazon and Uber understand the value of good metadata and of having a knowledge graph
    2. SVT has Wikidata property SVT Play ID (P6817) see Phabricator T225394
      1. they decided to close down Öppet arkiv (Q20746676) and move it to another platform, SVT Play (Q3444523) --> they lose metadata about which persons take part in a program; instead it looks like they hope text retrieval is good enough - which is sad... My guess is that they don't see the importance of good metadata for the product they try to deliver
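
The error reporting, uniqueness and data roundtrip points above largely boil down to comparing two sets of identifiers. A minimal Python sketch of that comparison, assuming the Wikidata side has already been exported as (Q-number, external ID) pairs, for example from a SPARQL query, and that the external system can list its current IDs. The function name `compare_ids` and all sample values are hypothetical and are not taken from any of the Notebooks mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def compare_ids(
    wikidata_links: Iterable[Tuple[str, str]],
    external_ids: Set[str],
) -> Dict[str, object]:
    """Compare Wikidata links with the IDs the external system currently has."""
    items_by_external: Dict[str, List[str]] = defaultdict(list)
    externals_by_item: Dict[str, List[str]] = defaultdict(list)
    for qid, ext_id in wikidata_links:
        items_by_external[ext_id].append(qid)
        externals_by_item[qid].append(ext_id)

    linked = set(items_by_external)
    return {
        # IDs Wikidata points at that no longer exist: not persistent, or deleted
        "vanished": sorted(linked - external_ids),
        # IDs in the external system that no Wikidata item links to yet
        "not_linked": sorted(external_ids - linked),
        # one external ID used by several items: wrong match or Wikidata duplicate
        "shared": {e: sorted(q) for e, q in items_by_external.items() if len(q) > 1},
        # one item with several external IDs: possible duplicate in the external system
        "possible_duplicates": {
            q: sorted(e) for q, e in externals_by_item.items() if len(e) > 1
        },
    }


# Made-up sample data, not real Q-numbers or real external IDs.
links = [
    ("Q_A", "ext-1"), ("Q_A", "ext-2"),
    ("Q_B", "ext-3"), ("Q_C", "ext-3"),
    ("Q_D", "ext-9"),
]
external = {"ext-1", "ext-2", "ext-3", "ext-4"}

report = compare_ids(links, external)
for key, value in report.items():
    print(key, value)
```

Each bucket maps to one of the problems described above: "vanished" is the svenskagravar / Hembygdsförbund case, and "possible_duplicates" is the kind of list we would like an easy way to report back to the source.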

How to make Linked data possible


I suggest that we start to define the process maturity for handling en:linked data. I guess it will be the same as the Capability Maturity Model, see video

As documented above, we get no helpdesk tickets and there is no process for raising change requests ==> we are on the initial level, where we hope to find "heroes" in the organisation who will help us. My experience from doing things like this before, in organisations with a requirement to deliver a good result, is that once we agree that we have a problem and share a common vision, we can take step 1 together...

My personal belief is that with such poor change management as shown above, people will not trust you and will not invest time/money in building solutions with your data, see tweet

Characteristics of Capability Maturity Model