Wikipedia talk:OABOT: Difference between revisions

Revision as of 20:51, 20 April 2018

Question

Naive question(s) from a non-wikipedia-person (JTW):

How much standardization is there, and how many edge cases are worth pursuing? I'm trying to figure out what tags to search for but it seems like there are layers of deprecated standards at this point.
A first pass is to just worry about references using the Cite Journal syntax. That's pretty standardized and easy to match. The simplest script that's worth writing is something like: find all cite journal tags, look for doi/pmid/pmc IDs, and look up an OA link to that paper if it's not present — Preceding unsigned comment added by Jamestwebber (talk • contribs) 18:27, 11 September 2015 (UTC)[reply]

Actually, the more I look into this the more confused I get. Can we establish a set of test-cases that this bot should handle?

[edit: as is probably obvious, I've never done anything on wikipedia before. Will sign things in the future] James Webber (talk) 19:44, 11 September 2015 (UTC)[reply]
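
The first pass described above (find cite journal templates, read their doi/pmid/pmc identifiers, and flag those with no link yet) could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the regular expression assumes templates without nested templates or piped wikilinks inside them, and "no link yet" is approximated as "no |url= parameter".

```python
import re

# Assumption: the template contains no nested {{...}} and no piped wikilinks.
CITE_JOURNAL = re.compile(r"\{\{\s*cite journal\s*\|([^{}]*)\}\}", re.IGNORECASE)
ID_PARAMS = ("doi", "pmid", "pmc")


def parse_params(body):
    """Split 'a=1 |b=2' into a dict with lower-cased, stripped keys."""
    params = {}
    for part in body.split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip().lower()] = value.strip()
    return params


def citations_needing_link(wikitext):
    """Return the identifiers of cite journal templates that have an ID but no url."""
    results = []
    for match in CITE_JOURNAL.finditer(wikitext):
        params = parse_params(match.group(1))
        ids = {name: params[name] for name in ID_PARAMS if params.get(name)}
        if ids and not params.get("url"):
            results.append(ids)
    return results
```

Each returned dictionary is a candidate for an open-access lookup; the lookup itself is left out here because it depends on which service is used.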

Getting a free copy of an article by DOI / PMID

To get a free-to-read copy of articles by DOI, we could use the CORE search engine via its API. It accepts DOIs and other identifiers as search parameters. Note however that the indexing looks a bit faulty to me: for instance, this arXiv document is associated with a DOI, and CORE harvests arXiv, but searching for this DOI from the CORE interface does not return anything. The metadata tools we have developed for the Dissemin project overcome this issue and it should not be too hard to provide a similar API to be used by this bot. Pintoch (talk) 20:17, 11 September 2015 (UTC)[reply]

Wikidata should be involved/merged with this

Hi. I was at the recent Wikipedia Science Conference. At this, Dario Taraborelli had a great suggestion that Wikidata could house ALL the mappings of useful literature: DOI <-> PMID <-> PMC <-> arXiv ID

NCBI has a useful API to help with some of this: http://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/ I'd just tack on free full text access URLs as another useful mapping in this. But this should all be done on Wikidata and used in Wikipedia from Wikidata Metacladistics (talk) 20:20, 14 September 2015 (UTC)[reply]

Coordination with OA Signalling project

Hi, the OA Signalling project is doing something very similar, just for openly licensed references. A short sketch of the workflow sits here, and it includes Wikidata's WikiProject Source Metadata (alluded to above) as well as a gadget to display information from the OA Button. It would be great if we could join forces on those aspects that are independent of paywalls and licensing. -- Daniel Mietchen (talk) 00:33, 24 November 2015 (UTC)[reply]

Indeed, please merge this page to Wikipedia:WikiProject Open Access/Signalling OA-ness or at least move under Wikipedia:WikiProject Open Access. ALL CAPS page titles aren't standard, but are ok as redirects. Nemo 14:50, 2 June 2016 (UTC)[reply]

Proof of concept

I've written a quick proof of concept here. Feedback welcome! An interesting discussion about how open access should be indicated in references is taking place here. Pintoch (talk) 18:05, 16 March 2016 (UTC)[reply]

Might "Free Version" be better in front?

There's a section where the free version link has "Free version" added at the end. I think it might be better to put "Free version" or whatever marker at the front of the link rather than at the end, especially if the links are going to look like the one in the example. Chuck Baggett (talk) 15:07, 28 May 2016 (UTC)[reply]

Hey, Chuck Baggett. Thanks for the feedback. This mockup is entirely hypothetical and would ultimately have to be refined and approved by the CS1 template editors and other reference buffs. I personally think it's problematic to put free version before basic identifiers like title and author and date. There may be many other ways to make the free version link more prominent and I'm open to modeling and demoing any/all of them. For now the immediate focus is on having the bot add a link (technically). How that link appears is important and not-yet-decided. We will raise that discussion in the next week or two. Cheers! Jake Ocaasi (WMF) (talk) 15:57, 31 May 2016 (UTC)[reply]

I think the |url= should be used as the place to put the most useful link for our readers. A free-to-read version is arguably more useful than a paywalled one. We should make sure we still link to the version from the publisher, but that is what |doi= is for. If the free-to-read link also corresponds to an identifier, then we should also add it as an identifier (so, it would appear both as |url= and |arxiv=, say). Adding a "Free version" link would generate too much clutter, I think. − Pintoch (talk) 19:18, 31 May 2016 (UTC)[reply]

I think the url should be the version the editor actually read when they cited the content, but I'm open to discussing all the options. Jake Ocaasi (WMF) (talk) 18:42, 2 June 2016 (UTC)[reply]

URL replacement

Re the "Edge cases for future development": it's always good to remove a URL to a paywalled version from the url parameter, as long as the DOI is provided (which can be used to easily reach the publisher's version). --Nemo 11:30, 15 August 2017 (UTC)[reply]

Yeah I agree - the bot should not be blocked by |url= that are resolved versions of an existing |doi=. − Pintoch (talk) 10:18, 16 August 2017 (UTC)[reply]

Another good example: in [1], the existing URL is broken and the CiteSeerX cache is probably an archived copy of that original URL. It would be very good to replace or remove the broken URL. --Nemo 11:33, 30 August 2017 (UTC)[reply]

CiteSeerX

I'm duly checking the CiteSeerX links before adding them, so I now got this (after about 20 downloads):

Download Limit Exceeded You have exceeded your daily download allowance.

Lame. --Nemo 11:40, 30 August 2017 (UTC)[reply]

You can try downloading the uncached versions that they list instead (which are not hosted by them, so you should not have any rate-limit on that). But they are not always listed though. − Pintoch (talk) 12:40, 30 August 2017 (UTC)[reply]

On the bright side, now that oaDOI was added as a source the CiteSeerX links are much less common, so it will take more time to hit the limit in any given day. --Nemo 07:36, 26 October 2017 (UTC)[reply]

cds.cern.ch links

In [2], rather than linking at [3], it should link to [4].

This applies to other links to that domain, like the 2nd link it changed in that diff. Headbomb {t · c · p · b} 12:04, 30 August 2017 (UTC)[reply]

Likewise, in [5] rather than link at [6], the bot should link at [7]. Headbomb {t · c · p · b} 12:27, 30 August 2017 (UTC)[reply]

This should be true in general. Link to the free document/PDF when possible, rather than simply to a page where the document can be found if you look hard enough. Headbomb {t · c · p · b} 12:27, 30 August 2017 (UTC)[reply]

Pinging @Nemo bis: on this as well, since you've unleashed the bot on a lot of physics articles, creating a lot of these links that need to be updated to point to the PDFs. Headbomb {t · c · p · b} 12:28, 30 August 2017 (UTC)[reply]

Personally I prefer links to the records because then the abstract is quickly accessible. I prefer the link to the PDF only when the interface makes the PDF hard to find. --Nemo 12:34, 30 August 2017 (UTC)[reply]

Repository managers also tend to prefer that, as it gives an opportunity to the reader to discover their platform. I have met multiple researchers who were explicitly told not to give direct links to the full texts but to the landing page instead (for various reasons). If a direct link to the PDF is really preferred (by a guideline somewhere on Wikipedia), then the CiteSeerX identifier should be updated to point directly to the cached PDF (and same for arXiv), as the PDF url can be obtained directly from the identifier. − Pintoch (talk) 12:44, 30 August 2017 (UTC)[reply]

CERN links are hard to find. They're buried at the bottom of a page containing videos, and half a million other links. We should put readers first, not repository managers first. Go to [8], where is the relevant link? It will take you a while to find it. Headbomb {t · c · p · b} 12:56, 30 August 2017 (UTC)[reply]

For me it took maybe a couple seconds (without knowing the repository software). There is a clear "PDF" link text and icon, with good contrast, in a clearly delimited area, in a predictable position, without a need for JavaScript, localised in my language. This is not a case of hard to find PDF. Additionally, what if the user is interested in the video after all? From the PDF URL they'll almost never be able to go back to the record. Nemo 13:06, 30 August 2017 (UTC)[reply]

For me it took me about 2 minutes, because I thought it was the video, and that didn't make any sense. Clicking on download also didn't give me the paper I was looking for. Then I scrolled to the bottom of the box, and there was still no link, so I went back up and dug in "files" where I finally found the link. Headbomb {t · c · p · b} 13:20, 30 August 2017 (UTC)[reply]

I do realise that people react differently. For instance I tend to not click anything (I'm particularly video-blind) and to use page up/down or "end" abundantly. But still, you'll probably agree this repository is a masterpiece in usability compared to, say, Elsevier's websites. --Nemo 13:43, 30 August 2017 (UTC)[reply]

I'm not saying a link shouldn't be given, but it should be a link to the document, rather than making the reader hunt for it, otherwise they'll think it's a link added in mistake, or a link only containing superficial information about the document. Headbomb {t · c · p · b} 14:16, 30 August 2017 (UTC)[reply]

I think for me the safest way to reject this change is simply to say: I am happy to deploy this change if somebody takes the time to write the code for it... − Pintoch (talk) 17:12, 30 August 2017 (UTC)[reply]
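
As a rough illustration of the remark above that the PDF URL can be obtained directly from the identifier, a sketch could look like this. The function names are hypothetical, and the URL patterns are assumptions modelled on the arXiv and CiteSeerX links quoted elsewhere on this page, not a description of how the bot or the citation templates actually build links.

```python
from urllib.parse import urlencode


def arxiv_pdf_url(arxiv_id):
    """Direct PDF link for an arXiv identifier such as 'math/9805045' (assumed pattern)."""
    return "https://arxiv.org/pdf/" + arxiv_id


def citeseerx_pdf_url(citeseerx_id):
    """Cached PDF link for a CiteSeerX identifier such as '10.1.1.611.4183' (assumed pattern)."""
    query = urlencode({"doi": citeseerx_id, "rep": "rep1", "type": "pdf"})
    return "http://citeseerx.ist.psu.edu/viewdoc/download?" + query
```

Whether the landing page or the PDF should be the target is the policy question debated in this thread; the sketch only shows that either form can be derived from the identifier alone.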

Article size

Sometimes the tool seems to time out on some articles. Do we know the largest article size or number of links it can handle? For now the biggest I found in my testing is [9], I think. --Nemo 12:38, 30 August 2017 (UTC)[reply]

That is a problem indeed. No I don't know what the maximum size would be. Note that there is some caching at reference-scale, so the request could potentially complete if you try a second time. − Pintoch (talk) 12:45, 30 August 2017 (UTC)[reply]

OAbot usage

Is there any way to see the OAbot edits a user has made? I found a link that seemed to be an error in a CiteSeerX link - namely, the link did not go to a full article but rather went to a notice that said

you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at

So, then I look up in the upper right hand corner and see a pdf link, which has the following info within the URL - www(dot)employees(dot)csbsju(dot)edu - and which, from the link, I can see is a link for a specific Economics class taught at the College of Saint Benedict/Saint John's University in Minnesota. I'm not sure this is all strictly legal-ish, to get at this article's content through a link that clearly isn't posted for public use. Shearonink (talk) 15:00, 2 September 2017 (UTC)[reply]

I'm not sure I understand your question or what case you're talking about, but if the link gets interrupted for copyright reasons that's one more reason to consider the CiteSeerX links ok: it means they handle copyright notices so we need not worry about what remains up. As for the legal implications of linking, it's generally ok to link resources which are already public in the web (see e.g. CJEU on "new public"). --Nemo 09:08, 3 September 2017 (UTC)[reply]

Adding links when doi is already free

According to § How does the bot work?, the bot should only be looking for links when the existing link doesn't lead to the full free article. That seems in keeping with general ideas of maximum compliance with WP:V and making as much source material as possible reachable to readers, while simultaneously not bloating refs with redundant external links. Things like DOI and PMID have the advantage of being stable and redirecting properly, whereas links to publishers' websites are susceptible to linkrot if the publisher changes their website. I noticed a series of edits today where following the doi link allows me to access the full article for free, but OABot was still used to add a direct link to it (example). Is that intended? DMacks (talk) 02:22, 24 October 2017 (UTC)[reply]

As an alternative, the CS1 citation templates have a field specifically to identify when a standard identifier does provide the full content for free (vs just free abstract and possible link to paywalled full). See Template:Cite/doc#Access level of identifiers for details. Seems like it would be preferable to note that an existing stable link is already free vs adding a less-stable additional link that goes to the same provider. DMacks (talk) 02:30, 24 October 2017 (UTC)[reply]

I don't see how adding an extra (accessible) link is a problem. It's a problem only to add paywalled links. :) --Nemo 14:23, 24 October 2017 (UTC)[reply]

I agree we could try to avoid proposing these links, it's just not entirely straightforward but I'll try to look into that − Pintoch (talk) 15:12, 24 October 2017 (UTC)[reply]

Actually according to the docs (WP:OABOT#Examples), it already is supposed to know about and add |doi-access=free tags. Maybe do a regex or string comparison of the proposed link and the doi (or other identifier), and if they match to within some closeness (same hostname and maybe some later string details), assume that the publisher itself (target of the doi link) hosts the free content (the url link), so presumably one can already get to the free version via the doi. DMacks (talk) 07:48, 27 October 2017 (UTC)[reply]
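
The hostname comparison suggested above might be sketched like this. It is a minimal illustration only: the function names are hypothetical, the DOI is assumed to have been resolved to its landing URL beforehand, and only the hostname is compared (ignoring a leading "www."), not the "later string details".

```python
from urllib.parse import urlsplit


def normalized_host(url):
    """Lower-cased hostname without a leading 'www.', or '' if there is none."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_redundant_with_doi(proposed_url, resolved_doi_url):
    """True when the proposed link sits on the same host the DOI resolves to."""
    proposed = normalized_host(proposed_url)
    target = normalized_host(resolved_doi_url)
    return bool(proposed) and proposed == target
```

When this returns True, a tool could skip the |url= suggestion and propose |doi-access=free instead; confirming that the DOI target is really free to read would still need a separate check.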

How to run this bot on a specific page?

Sorry if I'm missing it. Can I run this bot on a page similar to checklinks or the citation bot? Thanks. - Scarpy (talk) 16:47, 24 October 2017 (UTC)[reply]

Click "Start editing a random page", and then at the top you'll have an input field where you can type the name of the page you want to analyze. Analysis will take a long time, though. − Pintoch (talk) 17:16, 24 October 2017 (UTC)[reply]

Issues

Hello. I am having several issues:

The random pages button isn't working properly. It's giving me pages where other Wikipedians have already included links with the OAbot, e.g. Biotechnology, Ant and Economics. I've only just started today using this bot, so is it starting from the list from scratch for me?
Also, edits are not being made for me on Firefox/Microsoft Edge for Brain. I've checked the Wikipedia article to check if it's being used (which it isn't) but for some reason, I can't add the link. Thanks --MrLinkinPark333 (talk) 22:10, 24 October 2017 (UTC)[reply]

Rights checking

Hi,

I wanted to free some references as my birthday gift. It turned into checking the rights on the proposed link.

Alexander technique and Feldenkrais method: a critical overview Sanjiv Jain, MD, Kristy Janssen, PA-C, Sharon DeCelle, MS, PT, CFT

However, according to http://www.sherpa.ac.uk/romeo/search.php (checking what authors can do) "author cannot archive publisher's version/PDF". The pdf proposed in the link is clearly the publisher's version. (http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=DCD202DCC6ABDCDB516E1C80F4CB94AC?doi=10.1.1.611.4183&rep=rep1&type=pdf)

So I wonder about the rights on publications available through CiteSeerX.

When using the bot, what is the interaction with the authors? In my opinion changing access to science should be done upstream, not downstream. I emailed Sanjiv. I'll form a more considered opinion on this from such experiences.

--RP87 (talk) 08:11, 25 October 2017 (UTC)[reply]

RP87, I don't know what you mean by "upstream", but sure, it's better if the articles are archived by the authors themselves (green open access).

A good guide to share with authors: https://cyber.harvard.edu/hoap/How_to_make_your_own_work_open_access
A tool to make it very easy, already used to archive thousands of papers cited on the English Wikipedia: https://dissem.in/

Nemo 07:39, 26 October 2017 (UTC)[reply]

Broken links

Tracked in Phabricator
Task T179101

[10] is obviously a bogus URL. Is the bot confusing url with title (trying to construct a wiki-formatted link with visible alternative text) but then inducing editors to paste that as the "url" itself or is it failing to urlencode the url to protect whitespace? DMacks (talk) 17:34, 26 October 2017 (UTC)[reply]

This is indeed a whitespace error, I will fix it in the tool. − Pintoch (talk) 08:46, 27 October 2017 (UTC)[reply]
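
For illustration, protecting whitespace in a proposed URL before it is written into |url= could look like the sketch below. The function name is hypothetical and this is not the fix that was actually deployed in the tool; it simply percent-encodes characters that are not valid in a URL while leaving reserved characters and existing percent-escapes untouched.

```python
from urllib.parse import quote

# Reserved URL characters plus '%', so existing escapes are not double-encoded.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def encode_url_for_template(url):
    """Percent-encode spaces and other unsafe characters (including '|') in a URL."""
    return quote(url.strip(), safe=SAFE_CHARS)
```

A side effect that is useful in templates is that a literal pipe character is also escaped, so it cannot be mistaken for a parameter separator.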

Please fix the bot or stop

Whatever is going on with this bot, people are using it to add, and even re-add, ELNEVER violations. See this followed by this for example. I don't know if it is the bot or people not being careful enough using it, but either way COPYVIO additions are being added throughout WP. Jytdog (talk) 15:02, 28 October 2017 (UTC)[reply]

Those are not WP:ELNEVER violations by a longshot. Headbomb {t · c · p · b} 18:20, 28 October 2017 (UTC)[reply]

The paper that OAbot suggested a link for was published by Liebert and their policy is here and says authors can post preprints but says in bold: "The final published article (version of record) can never be archived in a repository, preprint server, or research network." The link there is to the final published article.

OA is important but ELNEVER must be followed. Jytdog (talk) 21:21, 28 October 2017 (UTC)[reply]

Author webpages are neither repositories, preprint servers, nor research networks. Headbomb {t · c · p · b} 21:32, 28 October 2017 (UTC)[reply]

I am going to post at ANI to have this bot paused. Done here. Jytdog (talk) 22:17, 28 October 2017 (UTC)[reply]

Jytdog, author rights for sharing their paper (and which version) can be determined at this website, which we link to right in the tool on every page where you can add a link: http://www.sherpa.ac.uk/romeo/index.php Ocaasi (WMF) (talk) 23:04, 28 October 2017 (UTC)[reply]

People appear to be ignoring that tool. And I just put in "science" and while the detailed entry is correct (final published version not allowed to be posted), the summary version on the results page (which I can't link to) is incorrect and lists Science as "green" (OK to post final published version). Jytdog (talk) 23:06, 28 October 2017 (UTC)[reply]

The solution is to educate people, not to shut the bot down, people are responsible for the edits they make. Headbomb {t · c · p · b} 23:12, 28 October 2017 (UTC)[reply]

I just had to revert this. How on earth can kitsrus.com be treated as a credible repository for reprints from the NEJM? LeadSongDog come howl! 20:09, 2 November 2017 (UTC)[reply]

ANI

There is currently a discussion at Wikipedia:Administrators' noticeboard/Incidents regarding an issue with which you may have been involved. Jytdog (talk) 22:41, 28 October 2017 (UTC)[reply]

IEEE Article

I just saw this OABOT edit that added a link to an MIT website that apparently is user "benmv"'s "public" subdirectory. The 2002 paper's authors are Han and Thorup; neither of them appears to be at MIT. Neither appears to be "benmv".

Querying https://dissem.in/api/10.1109/SFCS.2002.1181890 finds nothing.

Querying oaDOI produces a hit, but it looks like a COPYLINK. That makes me doubt the source can be trusted. Glrx (talk) 23:44, 3 November 2017 (UTC)[reply]

Bot adds pmc= to citation when PMC= is already present?

This edit is tagged as using the bot to add a |pmc= parameter when |PMC= was already present. If this is a bot bug, can you please try to fix it? Thanks. – Jonesey95 (talk) 15:38, 12 November 2017 (UTC)[reply]
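
One way to avoid this kind of duplicate is to compare parameter names case-insensitively before adding an identifier. The sketch below assumes, as the report implies, that |PMC= and |pmc= are treated as the same parameter by the template; the function names and the dictionary representation of a template are hypothetical and not taken from the bot's code.

```python
def has_parameter(params, name):
    """True if a non-empty parameter with this name exists, ignoring case."""
    wanted = name.lower()
    return any(
        key.strip().lower() == wanted and value.strip()
        for key, value in params.items()
    )


def add_if_missing(params, name, value):
    """Return a copy of params with name=value added unless it is already set."""
    updated = dict(params)
    if not has_parameter(params, name):
        updated[name] = value
    return updated
```

With this check, a template that already carries |PMC=1234567 is returned unchanged when the tool tries to add |pmc=1234567.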

New maintainers

Hi CristianCantoro, Nemo bis, Ocaasi and Samwalton9, I have added you as maintainers of oabot on the Toolsforge so you should be able to deploy the changes made to the tool yourselves. It works as follows:

ssh to tools-login.wmflabs.org with your wikitech account
run the command "become oabot"
"cd www/python" to go to the directory where the source code of the bot lives
"git checkout master ; git pull" to sync the code from github
"webservice uwsgi-python restart" to restart the web server

Cheers, − Pintoch (talk) 16:06, 15 November 2017 (UTC)[reply]

@Pintoch: Thanks! Sam Walton (talk) 18:06, 15 November 2017 (UTC)[reply]

Thank you :) --CristianCantoro (talk) 14:15, 16 November 2017 (UTC)[reply]

OABot is unable to parse these references

Tracked in Phabricator
Task T166287

OABot is currently unable to add links in articles that do not use reference templates, like this one. Can we add a citation parser to OABot so that it can find links for these references? Jarble (talk) 17:03, 7 January 2018 (UTC)[reply]

I would love to see that happening. I do not currently have the time to work on that but I am keen to help anyone find their way in the codebase and brainstorm. − Pintoch (talk) 17:51, 8 January 2018 (UTC)[reply]

@Pintoch: Instead of using a citation parser, it would also be possible to find open-access documents using scholar.py. As far as I know, OABot is not yet capable of doing this. Jarble (talk) 00:04, 21 January 2018 (UTC)[reply]

@Jarble: I do not think this is feasible because of the rate limits imposed by Google Scholar. I also think it is against their terms of service. But I would be happy to be proved wrong. − Pintoch (talk) 19:32, 21 January 2018 (UTC)[reply]

Wrong parameter for arXiv links

In Special:Diff/819541202 the bot (apparently under manual control) added an arXiv link to a paper, in the format |url=http://arxiv.org/pdf/math/9805045. The correct format for such links is instead |arxiv=math/9805045. Please fix. —David Eppstein (talk) 23:55, 9 January 2018 (UTC)[reply]

@David Eppstein: thanks for the bug report. Will do. − Pintoch (talk) 13:30, 13 January 2018 (UTC)[reply]
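
A possible shape for such a fix, shown only as a sketch: recognise arXiv URLs and emit the identifier as |arxiv= instead of |url=. The function names and the regular expression are assumptions for illustration, not the change that was made in the tool.

```python
import re

# Matches abs/ and pdf/ links, with or without a trailing '.pdf'.
ARXIV_URL = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/(?:abs|pdf)/(?P<id>[^?#\s]+?)(?:\.pdf)?$",
    re.IGNORECASE,
)


def arxiv_id_from_url(url):
    """Return the arXiv identifier contained in the URL, or None."""
    match = ARXIV_URL.match(url.strip())
    return match.group("id") if match else None


def link_parameter(url):
    """Choose the citation parameter for a link: ('arxiv', id) or ('url', url)."""
    arxiv_id = arxiv_id_from_url(url)
    if arxiv_id:
        return ("arxiv", arxiv_id)
    return ("url", url)
```

For the link from the report, `link_parameter("http://arxiv.org/pdf/math/9805045")` gives `("arxiv", "math/9805045")`, which corresponds to the |arxiv=math/9805045 format requested above.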

Zenodo

Please remove Zenodo from the sites that OABOT looks for publications to link to. I keep finding witless users of OABOT adding links using OABOT to Zenodo, and Zenodo seems to have no screen for what people upload. The last diff I reverted like this, was this one. Jytdog (talk) 19:53, 20 April 2018 (UTC)[reply]

How do you know the author did not gain authorisation for that upload? --Nemo 20:51, 20 April 2018 (UTC)[reply]
