Wikidata:Property proposal/title match pattern
website title match pattern
Originally proposed at Wikidata:Property proposal/Authority control
Description | a regular expression extracting a probable label from the <title /> of a website |
---|---|
Data type | String |
Domain | property |
Allowed values | regular expression with a single capture group |
Example 1 | IMDb ID (P345) → URL match pattern (P8966) → ^https?:\/\/(?:(?:www|m)\.)?imdb\.com\/(?:(?:search\/)?title(?:\?companies=|\/)|name\/|event\/|news\/|company\/|list\/)(\w{2}\d+) → title match pattern → ^(.*)\s-\sIMDb$ |
Example 2 | X username (P2002) → URL match pattern (P8966) → ^https?:\/\/(?:mobile\.)?twitter\.com\/(?:intent.+screen_name=)?(?!home|hashtag|explore|settings)([0-9A-Za-z_]{1,15}) → title match pattern → ^(.+)\s\(@[^\)]+\)\s\/\sTwitter$ |
Example 3 | MusicBrainz artist ID (P434) → URL match pattern (P8966) → ^https?:\/\/musicbrainz\.org\/artist\/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}) → title match pattern → ^(.+)\s-\sMusicBrainz$ |
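As a rough illustration of how the example rows could be consumed, the following Python sketch pairs each URL match pattern with its title match pattern and returns the external ID together with the probable label. The `PATTERNS` dictionary, the `identify` function, and the sample URLs and titles are hypothetical and only serve the illustration; they are not taken from Wikidata for Web or from any Wikidata API.

```python
import re

# (URL match pattern, title match pattern) pairs copied from Examples 1 and 3.
# Keying them by property ID is an illustrative choice, not how Wikidata stores them.
PATTERNS = {
    "P345": (
        r"^https?:\/\/(?:(?:www|m)\.)?imdb\.com\/(?:(?:search\/)?title(?:\?companies=|\/)|name\/|event\/|news\/|company\/|list\/)(\w{2}\d+)",
        r"^(.*)\s-\sIMDb$",
    ),
    "P434": (
        r"^https?:\/\/musicbrainz\.org\/artist\/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})",
        r"^(.+)\s-\sMusicBrainz$",
    ),
}


def identify(url, title):
    """Return (property ID, external ID, probable label), or None if no URL pattern matches."""
    for prop, (url_pattern, title_pattern) in PATTERNS.items():
        url_match = re.search(url_pattern, url)
        if url_match is None:
            continue
        # The title pattern is optional help: if it fails, the label is simply unknown.
        title_match = re.search(title_pattern, title)
        label = title_match.group(1) if title_match else None
        return prop, url_match.group(1), label
    return None


# Made-up sample input; real page titles may differ.
print(identify("https://www.imdb.com/title/tt0133093/", "The Matrix (1999) - IMDb"))
```

The sketch treats a failed title match as "no label" rather than an error, on the assumption that the external ID alone is still useful to the caller.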
Motivation
Wikidata for Web (Q99894727) is a browser extension that recognises, from its URL, a website that is the value of an external-ID property on Wikidata, using the URL match pattern (P8966) property.
It can also create external-ID statements and add them to a user-selected (or new) entity.
The user, however, has to enter the label of an existing entity manually. Most websites already carry an appropriate label in their <title/>. It is, however, usually diluted with some website-specific words that are most likely not part of the label.
The Twitter Profile of Tim Berners-Lee for example has a title element that looks like this:
<title>Tim Berners-Lee (@timberners_lee) / Twitter</title>
In order to find the Wikidata label we only need whatever precedes the opening parenthesis. A regular expression to extract that string could be ^(.+)\s\(@[^\)]+\)\s\/\sTwitter
This property is meant to be used as a qualifier for URL match pattern (P8966) (see examples). --Shisma (talk) 12:53, 18 June 2022 (UTC)
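A minimal Python sketch of the extraction described above, applying the proposed expression to the example title. The helper name `extract_label` is hypothetical, and the code is not taken from the Wikidata for Web extension; it only shows how a pattern with a single capture group yields the probable label.

```python
import re

# Expression quoted in the motivation text; Python's `re` accepts the escaped slash.
TWITTER_TITLE_PATTERN = r"^(.+)\s\(@[^\)]+\)\s\/\sTwitter"


def extract_label(title, pattern):
    """Return the text of the single capture group, or None if the title does not match."""
    match = re.search(pattern, title)
    return match.group(1) if match else None


label = extract_label("Tim Berners-Lee (@timberners_lee) / Twitter", TWITTER_TITLE_PATTERN)
print(label)  # Tim Berners-Lee
```

Returning `None` for a non-matching title lets a caller fall back to manual label entry, which is the current behaviour the proposal describes.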
Discussion
- Comment I'm not sure why you need this for an existing entry - they should already have a label? But I could see this being useful for new entities - is that what you meant here? ArthurPSmith (talk) 16:09, 20 June 2022 (UTC)
- For existing items this would be merely a convenience feature. Often the title (the relevant part I'm trying to extract) of the thing is a 1:1 match to some Wikidata label/alias. I could also use this property to add subject named as (P1810) to each statement or to add aliases. -Shisma (talk) 06:37, 21 June 2022 (UTC)
- Support it would be very useful for bots. --Tinker Bell ★ ♥ 00:02, 7 July 2022 (UTC)
- Support Can I suggest possibly renaming the property to "website title match pattern" to make it clearer this is a website related property (and not a match pattern for titles of books or something else with titles). --Dhx1 (talk) 12:56, 14 July 2022 (UTC)
- Yes agreed - Shisma (talk) 15:22, 28 July 2022 (UTC)
- Oppose It's not clear to me who is going to use this other than you. Twitter and IMDb have JSON-LD in their pages and MusicBrainz has an API, so I don't see any reason to be trying to extract the name from the page title in any of the examples provided. - Nikki (talk) 23:12, 28 July 2022 (UTC)
- While a dedicated API or JSON-LD snippets require very specific code to be written to mine even the name of a thing, most websites have and use the <title/> in a uniform way, and all it needs is a simple expression to get it. – Shisma (talk) 05:01, 29 July 2022 (UTC)
- Support This would be helpful in separating the useful information in the title tag from the name of the site or whatever else they put in there. This seems a simple way to improve the quality of data in things like Wikidata's title and label tags. Back ache (talk) 09:55, 11 August 2022 (UTC)
- @Shisma, Tinker Bell, Dhx1, Nikki, Back ache: Done ArthurPSmith (talk) 16:56, 26 August 2022 (UTC)