Wikidata:Property proposal/Imagehash perceptual hash
Imagehash perceptual hash
Originally proposed at Wikidata:Property proposal/Commons
Description | Imagehash perceptual hash is a perceptual hash that indicates whether two images look nearly identical. |
---|---|
Represents | ImageHash perceptual hash (Q104884110) |
Data type | String |
Domain | mediainfo (Commons only) |
Allowed values | [a-z\d]{16} |
Example 1 | M68454019 → 878ed95a53a065e5 |
Example 2 | M68456558 → 8589da64f0b9599e |
Example 3 | M68455617 → 86c2b43c5f956e32 |
Example 4 | M68456184 → c0d09f3524e5ef19 |
Source | |
Planned use | First I would populate hash values for photos uploaded by user:FinnaUploadBot, but in general the hash could be added to all Commons files |
Number of IDs in source | currently there are 68M files in Commons and the checksum can be calculated for all photos |
Expected completeness | eventually complete (Q21873974) |
Robot and gadget jobs | checksum should be generated by a bot |
See also |
|
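The "Allowed values" constraint above can be checked mechanically. The following is a minimal sketch, not part of the proposal itself: the function name `is_valid_hash` is hypothetical, and the pattern is copied verbatim from the table. Note that the four example values are all hexadecimal, so a stricter pattern such as `[0-9a-f]{16}` would also accept them; the sketch keeps the pattern as proposed.

```python
import re

# Pattern copied from the "Allowed values" row of the proposal.
# re.ASCII keeps \d limited to 0-9 (by default it also matches other Unicode digits).
ALLOWED_VALUES = re.compile(r"[a-z\d]{16}", re.ASCII)


def is_valid_hash(value: str) -> bool:
    """Return True if the whole string matches the allowed-values pattern."""
    return ALLOWED_VALUES.fullmatch(value) is not None


# Example values taken from the proposal table (M-id -> hash).
examples = {
    "M68454019": "878ed95a53a065e5",
    "M68456558": "8589da64f0b9599e",
    "M68455617": "86c2b43c5f956e32",
    "M68456184": "c0d09f3524e5ef19",
}

for media_id, value in examples.items():
    print(media_id, value, is_valid_hash(value))
```

Using `fullmatch` rather than `search` matters here: a longer string that merely contains 16 matching characters should be rejected.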
Motivation
I am using pHash checksums to detect duplicate photos in Commons. I am also using pHashes to confirm whether a photo in Commons and one in the Finna repository are the same. However, it would be useful if the hashes could be shared so that they could be queried by any user. Pre-generated perceptual hashes of files could also be fetched from SDC as lists, without the need to download the actual files. Zache (talk) 20:15, 22 January 2021 (UTC)
Also note that there are different implementations of the pHash algorithm which generate different hashes (example: https://phash.org or https://github.com/KilianB/JImageHash). I renamed the proposal so that it refers to the Imagehash version. --Zache (talk) 04:22, 23 January 2021 (UTC)
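To illustrate how stored hashes could be compared without downloading the files, here is a minimal sketch using only the Python standard library. It assumes the 16-character values are hexadecimal encodings of 64-bit hashes (as the examples in the proposal suggest) and compares them by Hamming distance, a common way to compare perceptual hashes. The function names and the `MAX_DISTANCE` threshold are assumptions for illustration, not something the proposal specifies.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Assumed threshold: how many differing bits still count as "nearly identical".
# This value is illustrative only and would need tuning on real data.
MAX_DISTANCE = 4


def looks_similar(hash_a: str, hash_b: str, max_distance: int = MAX_DISTANCE) -> bool:
    """Return True if the two hashes differ in at most max_distance bits."""
    return hamming_distance(hash_a, hash_b) <= max_distance


# Two example values from the proposal table (different photos).
print(hamming_distance("878ed95a53a065e5", "8589da64f0b9599e"))
print(looks_similar("878ed95a53a065e5", "878ed95a53a065e5"))
```

Because hashes from different pHash implementations are not comparable, a comparison like this only makes sense between values produced by the same implementation.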
Discussion
- Comment see also T167947. Abbe98 (talk) 14:56, 26 January 2021 (UTC)
- Yes, I checked the tickets before I made the request (also a direct link to c:User:Fæ/Imagehash). It seems that they are currently stalled. In general, I think that there should be support for storing multiple different hashes in Commons, just as it is possible to store multiple different identifiers in Wikidata. SDC would work as the storage platform for the values. The next phase could be to use the data with bots and user code, and then to work out how to implement it in a scalable way as part of the Commons upload, or to do similarity searches using SPARQL. --Zache (talk) 15:11, 26 January 2021 (UTC)
- Oppose just use checksum (P4092). Checksum and hash are synonyms in this context. Multichill (talk) 23:03, 26 January 2021 (UTC)
- We can use P4092 like I was doing in my example mediainfo pages. However, hashes, checksums, digital signatures or fingerprints, etc. (no matter what we call them) are, in the general sense, very common data for files. Using specific properties in general will make the use of that data (via Lua, API, or SPARQL) a bit more straightforward. My feeling is that because of this it would be a good idea to model them using specific properties in the long run. Specific properties would also allow validation of the values and formatter URLs, which would be nice features too. --Zache (talk) 03:46, 27 January 2021 (UTC)
- Ok, I'm willing to give this a shot. So Support. Multichill (talk) 18:58, 3 March 2021 (UTC)
- Support (please add Commons only as domain) --- Jura 12:57, 5 February 2021 (UTC)
- like this: diff? --Zache (talk) 15:27, 5 February 2021 (UTC)
- Somehow "mediainfo" already specifies it, but adding both might make it clear to the casual reader. I edited it slightly --- Jura 15:43, 5 February 2021 (UTC)
- BTW, maybe datatype should be external-id as it should be unique. --- Jura 15:45, 5 February 2021 (UTC)
- It is likely that there are multiple files in Commons with the same hash value, as it tries to match the similarity of the content. After that it is up to the Commons community to decide what to do with the duplicate images, but afaik there can be legitimate reasons to have multiple versions of the same image. --Zache (talk) 16:32, 5 February 2021 (UTC)
- Interesting. Let's stay with string datatype then. I'm curious how many there will be. In any case, using P4092 (as suggested above) wouldn't simplify it, as one would need to check the determination method in addition. --- Jura 07:13, 6 February 2021 (UTC)
- Support --Tinker Bell ★ ♥ 20:14, 6 February 2021 (UTC)
- Done @Zache, Tinker Bell, Multichill: please make good use of it. It would probably be good to add two samples where the hash is the same but the images are somewhat different. --- Jura 20:53, 15 March 2021 (UTC)
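As noted in the discussion, several files may share the same hash value, which is why the string datatype was kept. A minimal sketch of how such groups could be found in a list of already fetched values follows. The function name `group_by_hash` and the `M_HYPOTHETICAL` entry are made up for illustration; the other IDs and hashes are the examples from the proposal.

```python
from collections import defaultdict


def group_by_hash(hashes: dict) -> dict:
    """Map each hash value shared by two or more files to the list of those files."""
    groups = defaultdict(list)
    for media_id, value in hashes.items():
        groups[value].append(media_id)
    # Keep only hash values that occur on more than one file.
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


sample = {
    "M68454019": "878ed95a53a065e5",
    "M68456558": "8589da64f0b9599e",
    "M68455617": "86c2b43c5f956e32",
    # Hypothetical entry, added only to show a shared hash value.
    "M_HYPOTHETICAL": "878ed95a53a065e5",
}

print(group_by_hash(sample))
```

Files in the same group are only candidates for review: an identical hash indicates visual similarity, not that one file should be deleted.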