
Discard URLs from overoptimistic Dissemin papers suggestions
Closed, Resolved (Public)

Description

https://dissem.in/p/80133892/nomenclature-for-factors-of-the-hla-system-2004 and https://dissem.in/p/70990242/rheumatic-heart-disease have merged dozens of links for over ten DOIs each. This results in interesting but ultimately unhelpful suggestions for OA links such as this PMC ID: https://en.wikipedia.org/w/index.php?title=User_talk:OAbot&diff=907224614&oldid=891940262

In that case the title differed by one word and the years were different, while the journal and authors were the same. We could repeat the entire matching on the OAbot side, but this should really be fixed at https://github.com/dissemin/dissemin/issues/512 (and certainly will be, sooner or later).

For now we can just ignore such papers. I found that anything with more than 2 DOI links is at risk.
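A minimal sketch of that threshold, not the code actually deployed in OAbot. It assumes that DOI records in a Dissemin `paper` object can be recognised by a `doi.org/` URL in their `splash_url` or `pdf_url`; the function names and the `MAX_DOI_LINKS` constant are hypothetical.

```python
# Hypothetical threshold taken from the observation above:
# papers with more than 2 DOI links are considered at risk.
MAX_DOI_LINKS = 2


def count_doi_links(paper):
    """Count distinct DOIs found in the URLs of a Dissemin paper's records.

    Assumption: a DOI record carries a URL containing 'doi.org/'.
    """
    dois = set()
    for record in paper.get("records", []):
        for key in ("splash_url", "pdf_url"):
            url = record.get(key) or ""
            if "doi.org/" in url:
                # Keep only the DOI part, case-insensitively.
                dois.add(url.split("doi.org/", 1)[1].lower())
    return len(dois)


def is_overmerged(paper, max_doi_links=MAX_DOI_LINKS):
    """Return True if the paper merges more DOIs than we are willing to trust."""
    return count_doi_links(paper) > max_doi_links
```

A caller could then skip every URL suggestion coming from a paper for which `is_overmerged(paper)` is true.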

Event Timeline

Nemo_bis triaged this task as Medium priority. Jul 22 2019, 2:31 PM
Nemo_bis created this task.

Pintoch pointed out that Poudou found some suggestions of PMC records for reviews of the work, rather than the work itself. I believe an example is https://tools.wmflabs.org/oabot/review-edit?edit=504612102b37ecb60d8cf5d20ece1490&name=Eric+Kandel where this citation:

{{Citation |last=Kandel |first=Eric R. |year=2012 |title=The Age of Insight: The Quest to Understand the Unconscious in Art, Mind, and Brain, from Vienna 1900 to the Present |publisher=Random House|location=New York |isbn=978-1-4000-6871-5}}

got as a suggestion https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3516903/ which is https://dissem.in/p/29216724/the-age-of-insight-the-quest-to-understand-the-unconscious-in-art-mind-and-brain-from-vienna-1900-to-the-present . There is nothing wrong with this record; it's just the wrong match.

Nemo_bis renamed this task from Discard URL suggestions from overmerged Dissemin papers to Discard URLs from overoptimistic Dissemin papers suggestions. Edited Jul 22 2019, 5:32 PM

And just to confirm that the mismatch comes from the title search on Dissemin:

>>> import mwparserfromhell
>>> import requests
>>> from wikiciteparser.parser import parse_citation_template
>>> parse_citation_template(mwparserfromhell.parse("{{Citation |last=Kandel |first=Eric R. |year=2012 |title=The Age of Insight: The Quest to Understand the Unconscious in Art, Mind, and Brain, from Vienna 1900 to the Present |publisher=Random House|location=New York |isbn=978-1-4000-6871-5}}").filter_templates()[0])
{u'PublisherName': u'Random House', u'Title': u'The Age of Insight: The Quest to Understand the Unconscious in Art, Mind, and Brain, from Vienna 1900 to the Present', u'Authors': [{u'last': u'Kandel', u'first': u'Eric R.'}], u'ID_list': {u'ISBN': u'978-1-4000-6871-5'}, u'PublicationPlace': u'New York', u'Date': u'2012'}
>>> args = {'title': u'The Age of Insight: The Quest to Understand the Unconscious in Art, Mind, and Brain, from Vienna 1900 to the Present', 'authors': [{u'last': u'Kandel', u'first': u'Eric R.'}], 'date': u'2012', 'doi': None}
>>> req = requests.post('https://dissem.in/api/query', json=args)
>>> req.json()
{u'status': u'ok', u'paper': {u'classification': u'UNK', u'title': u'The Age of Insight: The Quest to Understand the Unconscious in Art, Mind, and Brain, From Vienna 1900 to the Present', u'pdf_url': u'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3516903', u'records': [{u'splash_url': u'https://www.researchgate.net/publication/274780981_The_Age_of_Insight_The_Quest_to_Understand_the_Unconscious_in_Art_Mind_and_Brain_From_Vienna_1900_to_the_Present', u'contributors': u'', u'abstract': u'', u'pdf_url': u'https://www.researchgate.net/profile/Zhihao_Zhang4/publication/274780981_The_Age_of_Insight_The_Quest_to_Understand_the_Unconscious_in_Art_Mind_and_Brain_From_Vienna_1900_to_the_Present/links/55b1ba7d08aec0e5f4311d23.pdf', u'source': u'researchgate', u'keywords': u'', u'identifier': u'oai:researchgate.net:274780981', u'type': u'journal-article'}, {u'splash_url': u'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3516903', u'contributors': u'', u'abstract': u'', u'pdf_url': u'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3516903', u'source': u'base', u'keywords': u'Book Review', u'identifier': u'ftpubmed:oai:pubmedcentral.nih.gov:3516903', u'type': u'other'}], u'authors': [{u'name': {u'last': u'Zhang', u'first': u'Zhihao'}}], u'date': u'2012-12-13', u'type': u'journal-article'}}

I have deployed the change, removed the old suggestion, restarted the webservice, checked the page at https://tools.wmflabs.org/oabot/process?name=Eric_Kandel (no suggestion) and prefilled it on the command line (same). Seems fine. Will still need to regenerate the suggestions when the current run is over.

A first "victim" is the citation I'm testing for the other feature request:

{{cite journal | vauthors = Angeloni D, Lee JD, Johnson BE, Teh BT, Dean M, Lerman MI, Sterneck E | title = C306A single nucleotide polymorphism in the human CEBPD gene that maps at 8p11.1-p11.2 | journal = Molecular and Cellular Probes | volume = 15 | issue = 6 | pages = 395–7 | date = Dec 2001 | pmid = 11851384 | doi = 10.1006/mcpr.2001.0377 }}

The unstructured author list (`vauthors`) means that the first author "Angeloni" doesn't match https://dissem.in/p/26445613/c306a-single-nucleotide-polymorphism-in-the-human-cebpd-gene-that-maps-at-8p111p112 . That may be an acceptable price to pay (especially as sooner or later Unpaywall will match this too).
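The deployed change itself is not shown in this task. The following is only a sketch of a check consistent with the two cases above, assuming the suggestion is kept only when the first author's last name and the year agree between the citation and the Dissemin `paper`. The function names are hypothetical; the dictionary shapes follow the `args` and `req.json()` output quoted earlier.

```python
import re


def first_author_last_name(authors):
    """Return the lowercased last name of the first author, or '' if unknown."""
    if not authors:
        return ""
    first = authors[0] or {}
    # Dissemin nests the name under 'name'; the query arguments do not.
    name = first.get("name", first) or {}
    return (name.get("last") or "").strip().lower()


def extract_year(date):
    """Return the first four-digit run in a date string, or '' if there is none."""
    found = re.search(r"\d{4}", date or "")
    return found.group(0) if found else ""


def is_plausible_match(args, paper):
    """Accept a Dissemin paper only if first author and year both agree.

    A citation without a structured first author (e.g. only `vauthors`)
    is rejected, which is the "price to pay" mentioned above.
    """
    cited_last = first_author_last_name(args.get("authors"))
    cited_year = extract_year(args.get("date"))
    if not cited_last or not cited_year:
        return False
    return (cited_last == first_author_last_name(paper.get("authors"))
            and cited_year == extract_year(paper.get("date")))
```

With the Kandel citation, the years agree (2012) but the first authors are "Kandel" and "Zhang", so the PMC book review would be discarded. A citation whose authors could not be parsed into structured names is discarded as well, even when the Dissemin record is correct.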