Wikidata:Property proposal/Internet Dictionary of Polish Surnames ID
Internet Dictionary of Polish Surnames ID
Originally proposed at Wikidata:Property proposal/Authority control
Description | identifier in the nazwiska.ijp.pan.pl dictionary of surnames used in Poland |
---|---|
Represents | Internet Dictionary of Polish Surnames (Q118130420) |
Data type | External identifier |
Domain | property, instance of family name (Q101352), masculine family name (Q18972245), and any other entity that is some form of a surname (male, female, toponymic and whatnot) |
Allowed values | [-A-Z]+ |
Example 1 | Kowalski (Q3199417) → KOWALSKI |
Example 2 | Nowak (Q15073902) → NOWAK |
Example 3 | Czeszejko-Sochacki (Q121880021) → CZESZEJKO-SOCHACKI |
Source | https://nazwiska.ijp.pan.pl/ |
Number of IDs in source | 29,997 (as of 2023-09-12) |
Expected completeness | eventually complete (Q21873974) |
Formatter URL | https://nazwiska.ijp.pan.pl/haslo/show/name/$1 |
Robot and gadget jobs | Yes, it would be perfect to let a bot simply populate this field with data I've collected. |
Motivation
https://nazwiska.ijp.pan.pl is an online dictionary of surnames used in Poland (chiefly "Polish" surnames). For each surname in the database (~30,000 items as of August 2023), there's information on the origins of the surname, its popularity, and alternative spelling variants. This is now a freely available "ground truth" version of what we know about surnames in use in Poland.
The URL formatter is pretty straightforward. However, several links resolve successfully for the same surname:
https://nazwiska.ijp.pan.pl/haslo/show/id/4910
https://nazwiska.ijp.pan.pl/haslo/show/name/TARCZYŃSKI
https://nazwiska.ijp.pan.pl/haslo/show/name/TARCZY%C5%83SKI
All of the above resolve no matter the capitalization. There's a link to a full-page printout, too:
https://nazwiska.ijp.pan.pl/haslo/print/id/4910
The printout version is in HTML and is actually more convenient to read than the standard page version.
I don't know what the unique and permanent URL structure is, and there are no indicators on the website itself. I think it would make the most sense to go by the actual surname string and default to all caps: that's how surnames are listed in the body of the page, this is how data in the PESEL registry is maintained, and this would also avoid dilemmas related to automatic injection and capitalization of compound surnames. It's my understanding that the ID would be more "unique" than the character string, but it would entail an extra step (i.e., go to the search box, look up the surname, copy the ID). Given the database is high-quality, there's little reason to believe IDs would be more stable than surname strings.
Finally, I went through the whole site and only found two instances of double entries, and there the IDs were different, so it seems that if there's potential for duplication, it'll happen whether we go by the ID or the surname string. TL;DR I suggest we go with https://nazwiska.ijp.pan.pl/haslo/show/name/TARCZYŃSKI ([A-Z])
(moved from description ArthurPSmith (talk) 19:11, 5 September 2023 (UTC))
This would provide readers with a certain "ground truth" on surnames in Poland.
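The name-based scheme described in the motivation can be sketched in a few lines. This is only an illustration under stated assumptions: the host is the one named in the property description (nazwiska.ijp.pan.pl), the helper names `to_identifier`, `to_url` and `is_allowed` are hypothetical, and nothing here contacts the site. It shows how an upper-cased surname becomes the `$1` value, why the TARCZYŃSKI and TARCZY%C5%83SKI links are the same address, and how the allowed-values pattern from the table behaves under Python's `re` module.

```python
import re
from urllib.parse import quote

# Formatter URL pattern; the host is the one named in the property description.
FORMATTER_URL = "https://nazwiska.ijp.pan.pl/haslo/show/name/$1"
# Allowed-values pattern exactly as written in the proposal table.
ALLOWED = re.compile(r"[-A-Z]+")


def to_identifier(surname: str) -> str:
    """Upper-case a surname, as the proposal suggests storing it (hypothetical helper)."""
    return surname.strip().upper()


def to_url(identifier: str) -> str:
    """Substitute the identifier for $1, percent-encoding non-ASCII letters as UTF-8."""
    return FORMATTER_URL.replace("$1", quote(identifier, safe="-"))


def is_allowed(identifier: str) -> bool:
    """True if the whole identifier matches the allowed-values pattern."""
    return ALLOWED.fullmatch(identifier) is not None
```

With these helpers, `to_url(to_identifier("Tarczyński"))` produces the percent-encoded form quoted in the motivation. Note that, at least in Python, `[-A-Z]+` accepts KOWALSKI and CZESZEJKO-SOCHACKI but not TARCZYŃSKI, because Ń falls outside the ASCII range A-Z; whether the constraint needs widening for Polish diacritics is an observation, not something settled in this proposal.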
Discussion
I'm very sorry if this isn't sufficient. First time posting something like this; it's a bit overwhelming. Update: the path can contain - (hyphens).
--Itorokelebogile (talk) 16:07, 27 August 2023 (UTC)
- @Itorokelebogile: I fixed up the proposal a bit. The examples should look like what I did for example 1 - can you add 2 more to flesh this out a little? Otherwise it looks good to me. ArthurPSmith (talk) 19:16, 5 September 2023 (UTC)
- Bless you, kind Wikidata soul! Updated the remaining 2 examples. Itorokelebogile (talk) 09:51, 6 September 2023 (UTC)
- Support ArthurPSmith (talk) 20:30, 6 September 2023 (UTC)
- Support --Gymnicus (talk) 16:34, 10 September 2023 (UTC)
- I think this should use the numeric ID. That's what is used by the links on the site's main page (where do the URLs with the name even come from?). There are ~6000 links using the numeric ID already, using described at URL (P973) (query). The template Template:R:pl:ISNP (Q122438370) and links on the Polish Wikipedia also use the numeric ID. You could use subject named as (P1810) as a qualifier to store the name as shown on the site.
For "PACAN" (20266 and 30045), https://nazwiska.ijp.pan.pl/haslo/show/name/PACAN doesn't work at all. For "KRASNY" (10237, which seems to be the Czech name "Krásný", and 20797, which seems to be the Polish name "Krasny"), https://nazwiska.ijp.pan.pl/haslo/show/name/KRASNY shows data matching the first link. If we use the name, how would we link the entries 20266, 30045 and 20797?
- Nikki (talk) 14:51, 12 September 2023 (UTC)
- I'm totally open to relying on IDs, here's a bit of context as to why I pitched the hack-y approach:
- The URLs are inconsistent across the site. When you run lookups on the home page, you get sent to pages with IDs in the URLs. However, if you start clicking through the descriptions, you'll notice that those links use "name" URLs (title capitalization, e.g., Nowak). Although the IDs are technically unique, there are two surnames that have double entries (more on that later). The URLs aren't case-sensitive, so—seeing the setup other similar tools have on Wikidata—going with all caps would be consistent with those, easy to read, and map 1-to-1 with entries in the PESEL registry, I think.
- The website's code base is a bit crappy, but it has its advantages: you can basically get all the URLs and basic information by accessing the search API: https://nazwiska.ijp.pan.pl/public/backend/nwp/ajax/hasla-tab.php and https://nazwiska.ijp.pan.pl/public/backend/nwp/ajax/hasla-tab-slownik.php (1) if you clean the output up a bit, you'll notice the dictionary has 29,997 entries; (2) like I said, two pairs are doubled, which is a confusing way of saying these guys are a bit of a problem: KRASNY (ID 10237) and KRASNY (ID 20797), as well as PACAN (ID 20266) and PACAN (ID 30045). I'm not sure if KRASNY was supposed to be written up twice (https://nazwiska.ijp.pan.pl/haslo/show/id/20797 and https://nazwiska.ijp.pan.pl/haslo/show/id/10237 ). I doubt two entries were created because of different etymologies: see https://nazwiska.ijp.pan.pl/haslo/show/name/BACH, which has three etymologies listed, two Polish and one German. In the case of PACAN, there are no real differences other than the rank order of 4744 vs 4745 and the fact that one of the entries just looks altogether incomplete. If you try to access info on these surnames via the "name in the URL" approach, you'll fail; but then again, it's not clear which ID we should be linking to anyway, so it's a matter of choosing the problem we prefer and dealing with the consequences. To be frank, 2 "unstable" links out of 29,997 doesn't feel bad. For those two guys, we could have the numeric ID URL added somewhere, or simply raise this issue with the authors of the dictionary so that they can fix things on their end.
- I agree that IDs are technically safer, i.e., we can assume that IDs being numeric and assigned (arguably) automatically should be more stable, but I'd quite enjoy the convenience of understanding that—if I type in a surname at the end—I'll get a result (rather than force people to always go to the home page to look stuff up).
- I'm new, so I'll run with whatever you guys feel is more appropriate; just thought I'd give you a bit of context. General convenience vs assuming the website developers always know better (which, even if not true, is a solid approach that's typically a bit more future-proof, so I'm down with this approach).
- Thanks for the feedback, and, again, totally open to IDs. Itorokelebogile (talk) 15:37, 12 September 2023 (UTC)
- Oh, forgot to add: if you look at this JSON file, https://nazwiska.ijp.pan.pl/public/backend/nwp/ajax/hasla-tab-slownik.php you'll notice that the listing there uses all caps for the surnames. I know it's not a "+1" in favor of links, but it does, I assume, say something about their style guide (all surname-oriented entries have headings in all caps). Itorokelebogile (talk) 15:41, 12 September 2023 (UTC)
- Question to Itorokelebogile: Can you render the three examples above with their numeric ID? And can you point out how to find out this numeric ID in the first place? Looking at your three examples I did not find any hint at the ID, so I doubt its use is encouraged by the website developers. Of course, even if the IDs are not canonical, we can still work with them if they're stable and retrievable. Jonathan Groß (talk) 16:38, 18 November 2023 (UTC)
- @Itorokelebogile:, could you please answer the questions. Regards, ZI Jony (Talk) 11:48, 20 January 2024 (UTC)
- Conditional support Dear all, I have sent a message concerning the application of identifiers to the Institute of Polish Language of the Polish Academy of Sciences, which is responsible for this dictionary. However, I still have not received any response. The dictionary is a credible source of information about Polish surnames. It consists of names with relatively high frequency. Their project ended in 2022 but they plan to include other surnames. Duplicate entries mentioned by Itorokelebogile were removed. I think we should rely on numeric IDs. Their search engine returns URLs by default with a numeric ID. Therefore, I think that the proposal should be updated with numeric IDs for Wikidata items, together with allowed values and Formatter URL: https://nazwiska.ijp.pan.pl/haslo/show/id/ M.roszkowski (talk) 14:08, 19 December 2023 (UTC)
- @M.roszkowski:, would you like to give your final opinion based on their response? Regards, ZI Jony (Talk) 11:51, 20 January 2024 (UTC)
- @ZI Jony: I have not received any response from the contact mail on their website. But I've sent a direct message to one of the members of the editorial board. I will let you know as soon as possible. M.roszkowski (talk) 09:57, 23 January 2024 (UTC)
- @ZI Jony: I received a response from a member of the editorial board. Their project will be continued and the dictionary will include Polish surnames with a lower frequency. However, they decided on a URL schema with the capitalized surname (e.g. https://nazwiska.ijp.pan.pl/haslo/show/name/KOWALSKI ). Therefore, the numeric IDs, although supplied, are not the main schema for the identification of surnames. We can find the dataset with URLs in the sitemap ( https://nazwiska.ijp.pan.pl/sitemap.xml ). Considering the source's credibility and the editorial board's response, I support the proposal. M.roszkowski (talk) 11:12, 1 February 2024 (UTC)
- Support, an important property for Poland.--Arbnos (talk) 20:12, 16 January 2024 (UTC)
- @Itorokelebogile, ArthurPSmith, Gymnicus, Arbnos, M.roszkowski: Done Internet Dictionary of Polish Surnames ID (P12401). Regards, ZI Jony (Talk) 07:38, 3 February 2024 (UTC)
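The name-versus-ID question raised in the discussion comes down to whether any headword is shared by more than one numeric ID. A minimal sketch of that check follows, assuming only the (ID, name) pairs quoted in the comments above; the JSON layout of the site's search endpoints is not shown in this discussion, so the sketch works on a hand-typed list, and the names `ENTRIES`, `ids_by_name` and `ambiguous_names` are hypothetical.

```python
from collections import defaultdict

# (numeric ID, headword) pairs quoted in this discussion; not a dump of the dictionary.
ENTRIES = [
    (4910, "TARCZYŃSKI"),
    (10237, "KRASNY"),
    (20797, "KRASNY"),
    (20266, "PACAN"),
    (30045, "PACAN"),
]


def ids_by_name(entries):
    """Group numeric IDs under their upper-cased headword."""
    grouped = defaultdict(list)
    for entry_id, name in entries:
        grouped[name.upper()].append(entry_id)
    return dict(grouped)


def ambiguous_names(entries):
    """Return the headwords that more than one numeric ID shares."""
    return {name: ids for name, ids in ids_by_name(entries).items() if len(ids) > 1}
```

On the pairs above, `ambiguous_names(ENTRIES)` reports KRASNY and PACAN and nothing else, which matches the two doubled entries reported at the time. M.roszkowski later noted that the duplicates were removed, so a fresh run against current data would be needed before relying on this.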