Wikidata:Property proposal/subject facet


subject term (OR keyword)


Originally proposed at Wikidata:Property proposal/Creative work

   Not done
Description: term or concept used to contextualize this item, for example via an external vocabulary, catalog, or other datasource
Represents: subject heading (Q1128340)
Data type: Item
Domain: creative work (Q17537576)
Allowed values: any
Example: A systematic arrangement of British plants. 4th edition (Q51423679) → Great Britain (Q23666) + plant (Q756)
Planned use: Upload a dataset of 225,000 subject keywords for books etc that we have items for from the Biodiversity Heritage Library (Q172266)
See also: main subject (P921)

Motivation

(See also the later comments in this discussion at WikiProject Books, in particular the comment by User:Valentina.Anitnelav, 10:56, 4 May 2018)

The Biodiversity Heritage Library (Q172266) (see Wikidata:WikiProject BHL) includes a dataset with 225,000 subject keywords for the books etc we have items for, that I would like to add to the relevant items here.

However, it seems to me that main subject (P921) is not the right property for the job. Consider a book like A systematic arrangement of British plants. 4th edition (Q51423679). Its "main subject" might be classified in library catalogues as "Botany -- Great Britain" and "Botany -- Ireland" (in fact, that is exactly what is stated for its subject at OCLC Worldcat).

But the keywords file gives keywords "Great Britain", "Ireland", and "Plants". However "Great Britain" is not the main subject of the book. Anyone looking for a book on "Great Britain" generally would be disappointed if a search returned this book, because it is not a book on Great Britain generally. Therefore IMO it would be wrong to record main subject (P921) = Great Britain (Q23666).

Instead, "Great Britain" is an aspect or a facet of the main subject of the book.

It's very useful to be able to record this, because one can then ask for the set of books which have "Great Britain" as a facet of their subject, and then what other facet-topics it is found in combination with, and thus progressively derive a narrower and narrower, more and more refined final solution set - a process known as faceted search (Q1519370).

This is therefore something valuable to be able to record. But these are not "main subjects" of the books, so P921 is not appropriate, and we would therefore benefit from having a new property for statements of this kind instead. Jheald (talk) 23:10, 10 June 2018 (UTC)
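
The faceted narrowing described in the motivation can be illustrated with a minimal sketch. It assumes the keywords are held as a plain mapping from item ID to a set of keyword item IDs. Only Q51423679, Q23666 and Q756 come from the proposal; `BOOK_A`, `BOOK_B`, `KW_HISTORY`, `KW_IRELAND` and the function name `narrow` are hypothetical placeholders, and nothing here reflects how Wikidata itself stores or queries statements.

```python
from typing import Dict, Set

# Hypothetical sample data: item ID -> set of keyword item IDs.
# Only the first entry uses identifiers taken from the proposal.
KEYWORDS: Dict[str, Set[str]] = {
    "Q51423679": {"Q23666", "Q756"},     # British plants: Great Britain + plant
    "BOOK_A": {"Q23666", "KW_HISTORY"},  # placeholder book and keyword
    "BOOK_B": {"Q756", "KW_IRELAND"},    # placeholder book and keyword
}


def narrow(index: Dict[str, Set[str]], selected: Set[str]) -> Set[str]:
    """Return the IDs of all items that carry every selected keyword."""
    return {item for item, kws in index.items() if selected <= kws}


# Each additional keyword can only shrink (or keep) the result set.
step1 = narrow(KEYWORDS, {"Q23666"})
step2 = narrow(KEYWORDS, {"Q23666", "Q756"})
```

Selecting "Great Britain" alone returns two items in this toy data; adding "plant" leaves only the botany book, which is the progressive refinement the proposal has in mind.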

Discussion

 Comment A subject keyword is a main subject (P921). So what you propose is a "Schlagwortkette" (subject string / subject chain)? For creating "Keyword: Great Britain (Q23666) – subfield: plant (Q756)" we would need to add the qualifier series ordinal (P1545). Otherwise subject keyword + subject keyword = main subject (P921) + main subject (P921). --Kolja21 (talk) 01:49, 11 June 2018 (UTC)

@Kolja21: No, I don't think so. As I understand it, a P921 value should correspond to a complete "Schlagwortkette". This is why in English property P921 is called "main subject". An appropriate P921 for the book above would be a non-category item corresponding to Category:Flora of the United Kingdom (Q6324043).
That is why I am proposing this new property, for an individual Schlagwort from that Schlagwortkette.
Your suggestion of how series ordinal (P1545) might be used as a qualifier is interesting, although that information will not necessarily always be available (eg when just a list of keywords is supplied), nor necessarily unique (eg if both "Great Britain -- Botany" and "Botany -- Great Britain" were given as Schlagwortketten). Jheald (talk) 07:09, 11 June 2018 (UTC)

 Support To summarize the discussion at Wikidata_talk:WikiProject_Books#subject_areas_and_genres: There is a need to express that a book is written in (or of interest for) a certain academic discipline/subject area. Sometimes genre (P136) is used, but it is at least debatable if mathematics is a genre. To use main subject (P921) would lead to problems, too, as the academic discipline is rarely the main subject of a book written in it, leading to many false positives. This property could be used to indicate this information without "dumping" it into main subject (P921) (and making it more or less useless for some topics). - Valentina.Anitnelav (talk) 15:43, 11 June 2018 (UTC)

  •  Question As usual with this kind of « fit-all property », « a facet » is something undefined. What does it mean? The proposal does not say. It just explains, tautologically, « "Great Britain" is an aspect or a facet of the main subject of the book. » (it’s a facet because it’s a facet). Actually, the book seems to be about the botany of some places in the world. « plant of Great Britain » seems to me a legitimate item, and instead we can use location and the usual, more precisely defined properties to describe this item. Why not do this? This is the Wikidata way. But I guess this kind of argument is bound to fail when I get tired of saying this again and again… Put another way, if we want to explain what a facet is, do we have to give a list of what a facet could be? We know currently « a facet can be the location of the stuff the book describes ». Or pretty much anything else? This is just a proxy for « the book has a vague relation with this topic ». Same problem with categorisation. author  TomT0m / talk page 15:23, 12 June 2018 (UTC)
@TomT0m: "Location" is not what we are looking for here. We are not trying to say where the book is located, we are trying to indicate what it is about. The whole point of keywords is that they can be almost anything, and the mechanism of combined keyword search still works the same. If one wants to look up what kind of thing a keyword is, one can look up its P31 -- one doesn't need (or want) a separate property to indicate keywords relating specifically to location.
I am all in favour of adding main subject (P921) to books as well, if the data is there, and items exist for the relevant values. But I don't have a dataset of 225,000 "main subjects", what I have is a dataset of 225,000 keywords -- which I think it would be useful to be able to add. Jheald (talk) 18:24, 12 June 2018 (UTC)
@Jheald: I did not mean putting « location » on the book item; I meant having an item « flora of GB » which would hold the value. « one doesn't need (or want) a separate property to indicate keywords relating »: this is exactly what « facet » is, it seems, namely « related to ». I guess it’s something we have always avoided in Wikidata, adding a « related to » property, which could basically be a superproperty of any property, conceptually, as the whole point of a property is to establish a relation. I’d actually be in favor of having a « keyword » property (specific to your dataset) that links our items to the lexeme used; this would be interesting, but I’ll always favor more semantic relationships, especially if the (raw) data are available elsewhere, over importing raw but less Wikidata-like data such as « keywords ». There is probably a service somewhere on the Internet where these documents can be queried by keyword, isn’t there?
Having a location as a keyword does not explain how the topic of the book is related to the location itself. It can be the location of the plot, or a book entirely dedicated to the location itself, or a book about something that is located in this location, … author  TomT0m / talk page 18:46, 12 June 2018 (UTC)
If the book is "entirely dedicated" to the location, that should be indicated with main subject (P921). What keywords allow is a more exploratory search style: having searched eg for a keyword that is a location, one can then see what different other sorts of keywords are associated with that set of items -- eg "crime", "plants", "history", etc -- and make a choice at that point as to how to further narrow down the selection, rather than having to know and precisely specify the full subject at the start. It's a powerful (and quite simple) technique, which is precisely why so many information retrieval systems find keyword indexing a useful thing to include. Indeed, it's exactly what the Structured Data project is hoping to create for Commons -- for images we already have the depicts (P180) property which works like this. This equivalent for other kinds of creative works would be useful. Jheald (talk) 19:08, 12 June 2018 (UTC)
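
The exploratory step described in the comment above (pick one keyword, then see which other keywords occur alongside it) can be sketched as a simple co-occurrence count. This is only an illustration under stated assumptions: the `BOOK_n` and `KW_*` identifiers and the function name `co_occurring` are hypothetical, and Q23666 and Q756 are the only IDs taken from the proposal.

```python
from collections import Counter
from typing import Dict, Set

# Hypothetical sample data: item ID -> set of keyword item IDs.
SAMPLE: Dict[str, Set[str]] = {
    "BOOK_1": {"Q23666", "Q756"},                # Great Britain + plant
    "BOOK_2": {"Q23666", "Q756", "KW_HISTORY"},  # placeholder keyword
    "BOOK_3": {"Q23666", "KW_CRIME"},            # placeholder keyword
    "BOOK_4": {"Q756"},
}


def co_occurring(index: Dict[str, Set[str]], keyword: str) -> Counter:
    """Count the other keywords found on items that carry `keyword`."""
    counts: Counter = Counter()
    for kws in index.values():
        if keyword in kws:
            counts.update(kws - {keyword})
    return counts


# Keywords seen together with "Great Britain", most frequent first.
suggestions = co_occurring(SAMPLE, "Q23666").most_common()
```

A search interface could offer these counts as the next narrowing choices, so the searcher does not need to know the full subject in advance.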
In a keyword system, the main topic should hopefully be reflected in the keywords :) so that does not exactly seem like a good argument, as you plan to mass-import the keywords without verifying whether a keyword is actually a main topic. For images, the semantics are pretty clear: there is a representation of the item’s topic in the picture. But if your proposal is structured data, then Wikipedia categories are as well :) But it is not: by the current definition, it’s not a set of relationships between the topic of the book and the topics of the keywords according to a data model; it’s just a bag of keywords with no hint of the relations. author  TomT0m / talk page 19:22, 12 June 2018 (UTC)
A bag of keywords can actually be a useful model. In fact, as many information retrieval and library systems have found, it can actually be the most useful retrieval model -- more useful than old-school subject-string models. I'm not against adding detailed "main subjects", but stop getting in the way of this, which is (i) available, (ii) useful, (iii) very widespread across the board as keywords given for articles, papers, journals, books, etc, etc. It's data we ought to be able to reflect. Jheald (talk) 20:08, 12 June 2018 (UTC)
@Jheald: I don’t really like that tone. Please don’t make this personal or give orders. « It's data we ought to be able to reflect »: I think I have already given my opinion on this argument. This information is already queryable on the web, so I don’t really see why it is important to have it in Wikidata in exactly the same form. Also, you are trying to fit different indexing systems into one database, Wikidata, whereas those different databases may have different policies on keywords and may index the same documents. How do we deal with this with only one property? If keywords are structured by an ontology, as you seem to imply without really explaining clearly how, are they useful without importing that ontology as well? Also, I don’t really understand the relation with faceted search, which relies on faceted classification. We can already do faceted classification with metaclassification, for example using instance of and subclass of, but faceted search means adapting the presentation of the search interface or of the results to the topic of interest, for example what Reasonator does by displaying different kinds of items differently. Or displaying a human who is both a scientist and an artist either as a scientist is displayed or as an artist is displayed. But to do this, you need information about the properties, not keywords! Or does that mean that having a « scientist » facet for a book means you can search using a search interface which is specialized for finding scientists?? For example, a field to search for the university from which he graduated? Sorry, but I’m deeply confused by your approach to the topic and by your arguments. author  TomT0m / talk page 21:18, 12 June 2018 (UTC)

  • The topic of a book could be qualified. The subject could be plants (or ferns, trees, ...) with location Great Britain. When this is done right, these would be the ingredients for a query. A similar pattern is used for "Catalog contains", where "human" combined with "position held" "position" is used extensively. So no, this would not work for me. Thanks, GerardM (talk) 18:22, 12 June 2018 (UTC)
    • @GerardM: I have 60,000 titles. I don't have the means to work out how all those keywords relate to each other, all I have is the dataset of the keywords without relationships given between them. The whole world uses keyword search and finds it useful. (cf category combines topics (P971)). The data is there for us, all nicely made ready on a plate. All we need to add, to be able to upload it and use it, is a property to link a title to a keyword. Jheald (talk) 18:33, 12 June 2018 (UTC)
      • @Jheald: When I get the time (I need a few days to achieve this), I will be able to have all the books and authors of the BHL organised so that it becomes easy to link them through identifiers and to add both books and authors to Wikidata. When there is a book with the keywords "Great Britain" and "Ferns", then with some simple logic it is easy to understand those relations. So no, we do not need this. Thanks, GerardM (talk) 19:22, 12 June 2018 (UTC)
        • @GerardM: No, Gerard, please don't upload any more BHL titles for the time being. It is enough trying to digest the first 60,000, de-duplicate them, identify authors properly, work out which ones should be editions, periodicals, etc., put them into proper work -- edition relationships, etc, etc. Until we can get this first batch properly cleaned up, more items would not be helpful right now.
          And please also don't sabotage the possibility of straightforward keyword searching. Jheald (talk) 19:59, 12 June 2018 (UTC)
          • @Jheald: In a previous life I earned a considerable amount of money as a database programmer. When I have the database downloaded and properly organised on my computer, I will perform disambiguation using database tools. Many of the books of the BHL have been identified with LoC identifiers. The BHL content typically resides on the Internet Archive. I have my contacts there. I will prepare my database in a way that allows me to ingest new content from the BHL and know that it is new. When I can get an export from Wikidata, I will be able to associate books with authors per the existing links. You cannot tell me what to do at home. What I will do when I am done is explain the process I have performed.
As to this proposal, apparently people are only allowed to agree with you. I do not. Thanks, GerardM (talk) 20:46, 12 June 2018 (UTC)
@GerardM: Beware. I was looking at the LoC identifiers this afternoon. It's not that good a dataset. Of the 60,000 BHL 'title' items we currently have, about 28,000 have LoC identifiers, but many of them are wrong. So it's important to check them against the LoC itself, to make sure they are actually correct.
Secondly beware that not all books here have proper instance of (P31)s as books or as editions, nor do they necessarily have author (P50)s rather than author name string (P2093)s.
Thirdly, please make sure that any matching is rather better than your often-proclaimed 4% error rate. In my view, to be acceptable, an error rate ought to be nearer 0.01% -- ie no more than 10 mismatches or duplications in 100,000 entries. That's the standard a data upload should be trying to achieve.
Finally, I don't mind people disagreeing, or having vigorous discussions to find the best data modelling. What I do object to is people blocking whole approaches entirely to the extent of preventing whole data retrieval structures and delaying particular data uploads indefinitely. It may be that you want to be able to represent things differently, I'm not stopping you, but get out of the way of people who do want to be able to do things this way, and accurately represent data (in this case keyword information) in the way it is/was presented in the sources it was derived from. Jheald (talk) 21:16, 12 June 2018 (UTC)
@Jheald: You are telling me what to do and what not to do. Let me do my own research and let me consider what I think is reasonable. I do not subscribe to your notions. I do not agree with your approach because your argument is basically "it is the same shit, so who cares it is shit". Your notions about error rates are unrealistic; such rates are not achieved in our data at all. Thanks, GerardM (talk) 01:08, 13 June 2018 (UTC)

 Comment Hi, I'm a librarian. Leaving aside the last part of the discussion, I think that a generic "keyword" property could be good. However, it is very important to always add references to the thesaurus or ontology of origin of the terms. Nonoranonqui (talk) 08:12, 13 June 2018 (UTC)

 Oppose Just creating complexity for nothing. Why is it a problem to say that the subject of the book is botany, Great Britain and Ireland? Is it the generation of a huge list of possible items when looking for all books having Great Britain as subject? But this is the idea. Botany of Great Britain is part of Great Britain, so I don't understand why we have to consider it in a different way. The main problem with the proposal is the lack of a rule specifying what should be recorded with the subject property and what should be recorded with the subject facet property. This is completely arbitrary, so people will do as they want, and in the end it would be impossible to propose a correct way to build queries using two different sets of data. Snipre (talk) 23:00, 13 June 2018 (UTC)


 Support, as per Valentina. But: is there a way to use a more general 'keyword' property? Is this really specific to 'subjects' or is it about facets of any sort of work more generally? Sj (talk) 19:18, 8 July 2018 (UTC)

Discussion linked to from wikicite-discuss mailing list. [1] Jheald (talk) 07:09, 27 July 2018 (UTC)

 Comment I'm generally in favor of this proposal and in agreement with the motivations, but I wonder how external data providers will feel about a controlled vocabulary being casually mapped to Wikidata items by contributors. Case in point: a bibliographic catalog maintains a vocabulary of keywords that are not yet mapped to Wikidata. Bibliographic data from this catalog is ingested into Wikidata and the community starts manually adding subject facet statements matching keywords in the original record (or creating new topical items when needed). At some point, the organization behind the catalog introduces a formal mapping of their keywords to Wikidata items. How is this situation going to be handled in the case of conflicts and is this going to be a common scenario? Basically a community member and the original authority behind the data may disagree on the proper target of this property when interpreting the meaning of a keyword. I think the proposal should also clarify if the values should be restricted to those specified by the authority for this record (in which case provenance information should be made mandatory), or can be freely added by Wikidata contributors.--DarTar (talk) 23:09, 26 July 2018 (UTC)

@DarTar: You may be more familiar than I am with organisations undertaking to map their own controlled vocabularies to Wikidata themselves. I am sure that does happen, and in future I hope it will happen more. But in most cases (as has been the case with eg the matching of external thesauruses using Mix'n'match), I would think the values will have to be matched and added by the community (or more likely by one or two motivated individuals, perhaps using OpenRefine, and then batch-adding the relevant 'subject facet' statements), because that's how I think this will get done.
But you raise an important point, about transparency and traceability and contestability and improveability of the matching of the controlled vocabulary -- because (unlike matching a thesaurus to an external ID), the matching will not have a visible localised nexus in a single particular statement on a single particular item, but will be embodied implicitly in statements across the dataset. So how to retrieve and review and if necessary correct or improve such matches, after the event? Well, I'm a big fan of broad use of the object named as (P1932) qualifier. Using this one can attach the original string to a statement, as well as the Wikidata Q-number. If this is systematically used, then that gives us the traceability, and it becomes straightforward to produce a query or a Listeria page, matching the original controlled-vocabulary string to the Q-number (or Q-numbers plural) that have been used to represent it. If the data-source has a Wikidata project page (eg Wikidata:WikiProject BHL for the Biodiversity Heritage Library), the matching of various strings could then be discussed on the talk page, or perhaps just a note left to say that some of the matching had been updated, and was everybody okay with that. If there was a Listeria page, then that would give an audit-trail history of precisely what changes had been made over time.
P1932 can thus be used for subject keywords that are given as a string. If the subject keyword is given as a coded thesaurus identifier, then it may make more sense to give that as the qualifier instead (or as well). These qualifiers would be queryable in exactly the same way.
If keyword statements are coming from an external source, then yes the source should definitely be referenced. That should indeed be a mandatory thing. But there may be cases, for example, where there isn't any external source that has subject-indexed a particular set of works. Then editors should be free to add subject keywords from their own assessment; and it's probably also important to be able to add additional subject keywords, even when a set is available from an external source. How should this be indicated? One alternative would be the keywords simply not having a reference statement in such cases. A different approach, that might be preferable, could be to require a reference statement (subject to a constraint check), but allow people to give stated in (P248) = <no value>, or inferred from (P3452) = title (Q783521), or inferred from (P3452) = "Schlagwortkette". That would then clearly distinguish keywords that were editor-supplied, as against cases where the reference had simply been forgotten. Jheald (talk) 08:28, 27 July 2018 (UTC)
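
The audit idea in the comment above (keep the source string in an object named as (P1932) qualifier, then list which items each string has been matched to) can be sketched as follows. This is a minimal illustration over in-memory tuples, not a query against Wikidata. The tuple layout, `BOOK_A`, `BOOK_B`, `ITEM_X` and both function names are hypothetical, and the string values are taken from the keywords quoted in the proposal.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (book item, keyword item, original source string kept as the P1932 qualifier)
Statement = Tuple[str, str, str]

STATEMENTS: List[Statement] = [
    ("Q51423679", "Q23666", "Great Britain"),
    ("Q51423679", "Q756", "Plants"),
    ("BOOK_A", "Q756", "Plants"),    # placeholder book
    ("BOOK_B", "ITEM_X", "Plants"),  # placeholder: same string, different item
]


def mapping_report(statements: List[Statement]) -> Dict[str, Set[str]]:
    """Group keyword items by the source string they were matched from."""
    targets: Dict[str, Set[str]] = defaultdict(set)
    for _book, item, source_string in statements:
        targets[source_string].add(item)
    return dict(targets)


def ambiguous(statements: List[Statement]) -> Dict[str, Set[str]]:
    """Return only the source strings matched to more than one item."""
    report = mapping_report(statements)
    return {s: items for s, items in report.items() if len(items) > 1}


to_review = ambiguous(STATEMENTS)
```

Strings that appear in `to_review` are the ones worth raising on a project talk page, since the same controlled-vocabulary term has been resolved to different items.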

 Comment Not sure the current framing of the property is useful/actionable as it stands. I agree that we need something besides main subject (P921), but that something and its use should be better defined (and delineated from P921) than what we currently have in this property proposal. Something like "keyword" also seems to be a more promising way of framing this than the very nebulous "facet". --Daniel Mietchen (talk) 11:32, 27 July 2018 (UTC)

 Comment Agree with Daniel that this proposal needs a little bit more thought before we proceed. It's especially important that we first think of some way to source the values so that people won't just start adding random values to items. Husky (talk) 12:09, 27 July 2018 (UTC)

  • @Jheald: it seems people would be happier with "keyword" (or "subject keyword"?) as the English label of this property, is this ok with you? Regarding the discussion on sourcing above - how is this different from any other data in Wikidata? ArthurPSmith (talk) 17:07, 27 July 2018 (UTC)
    • @ArthurPSmith: Yes, absolutely, whatever people want to call it. Keyword is obviously the much more familiar term. The one reason I didn't go for it is a slight mantra that Wikidata items represent things, not words or terms; also "keyword" might suggest something language-dependent, whereas Wikidata's aspiration is to represent things in a way that is as language-independent as possible. But I am not going to get on any horse about what the property is called -- what it operationally does is what matters. As for sourcing: agree completely, how or why is this different to anything else that goes or doesn't go on Wikidata. Jheald (talk) 19:31, 27 July 2018 (UTC)
      • Yea, it seems like a fine + useful use of this for anyone to add a keyword -- as long as we can distinguish a definitive map stated by an existing ontology, from the analysis of an individual editor. Community norms would govern what constraints if any there are on sourcing, a norm that might change by field or topic. Sj (talk) 21:29, 27 July 2018 (UTC)
    • +1 for using "keyword". As to whether that implies a word vs. a concept, I note that FRBR-aligned Bibliographic Ontology (Q44955004) has "subject term" <subclass of> "concept" defined as "A concept that defines a term within the controlled vocabulary of a particular classification system ... used as an annotation to describe the subject, meaning or content of an entity." I'd say the property fabio:hasSubjectTerm is an exact match for what is proposed here (and perhaps "subject term" would be a good alias for "keyword" on that basis). Details of "subject term" in FaBiO here, p. 7. Perhaps we should change the description of this property to something like "term or concept used to contextualize this item in an external vocabulary, catalog, or datasource". I'd like to include catalog since publishers' catalogs frequently have a list of keywords for books they publish. - PKM (talk) 21:21, 28 July 2018 (UTC)
  • @Jheald, PKM: I updated the label and description based on PKM's comment above. The "OR" is to indicate one of these should be label, the other an alias, as our property templates don't have a spot for aliases. Any further thoughts? @Valentina.Anitnelav, TomT0m, GerardM, Nonoranonqui, Snipre: @Sj, DarTar, Daniel Mietchen, Husky: please note "facet" is no longer part of label or description here. ArthurPSmith (talk) 13:45, 30 July 2018 (UTC)
  •  Support as amended, and happy with either “subject term” or “keyword” as the label. Very useful in my work for books used as references where the publisher may choose keywords like “costume” and “textiles” for cataloging options but where WD can support a more specific “main subject” like “medieval costume” or “costume of Spain”. - PKM (talk) 17:39, 30 July 2018 (UTC)
So the argument is to allow importing publisher keywords of low quality such as “costume” and “textiles” and keeping such broad keywords instead of gradually improving them with more specific terms such as “medieval costume” or “costume of Spain”?! -- JakobVoss (talk) 05:50, 8 August 2018 (UTC)
  •  Support in the amended form. Slight preference for 'subject term' because 'keyword' seems to suggest you can use arbitrary words, while 'term' feels a little bit more strict and defined. Husky (talk) 20:51, 30 July 2018 (UTC)
  •  Comment Hi, sorry for joining in late. I recognize the problem that the wording "main subject" does not always capture it, besides suggesting that there should be a "minor subject" too. To me, this proposal sounds a bit like the "minor subject", and I agree that could be a nice complementary property. But I would like to make people aware that we already have another granular mechanism to say what a paper is about: using that paper as a "reference" for statements. For example, if we have some paper as reference on the statement that the chemical compound acetic acid (Q47512) pKa (P1117) 4.74 (the reference paper actually being Small Scale Determination of the pKa Values for Organic Acids (Q23571464)), this also provides topics the paper is about, one perhaps a "main subject" (the compound) and the other (the concept of acid dissociation constant (Q325519) via the property) perhaps being a "subject facet"... --Egon Willighagen (talk) 05:38, 7 August 2018 (UTC)
  •  Comment What's to stop someone adding all the keywords from the index of a book, to the item about that book? We all know that, if someone can do so, they will. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 11:39, 7 August 2018 (UTC)
  •  Oppose A new property this broad should be justified by more and independent use cases, in addition to the dump of keywords to be imported into Wikidata. The proposal only gives a single example and wrongly assumes a narrow use of main subject (P921). The existing property is also used as proposed with "main subject" as "one of the most relevant topical aspects or facets" instead of "the primary and only subject". -- JakobVoss (talk) 20:44, 7 August 2018 (UTC)
  •  Oppose. This would be insufficiently structured. "Botany in Great Britain" could be a valid main subject, which could then be properly informative and filled with useful statements. A set of keywords tells us nothing. --Yair rand (talk) 06:24, 20 August 2018 (UTC)
@Yair rand: A set of keywords is very helpful for retrieval -- so "it tells us nothing" is simply not true. The question here is whether we use main subject (P921) = "botany", "Great Britain" for such keywords (per User:JakobVoss above), or whether we use a different property and reserve P921 for subject-strings that are more encompassing. Jheald (talk) 10:28, 20 August 2018 (UTC)