Wikisource:WikiProject OCR

WikiProject OCR

This project lets users request that scans be OCRed for Wikisource-related projects.

Instructions


The participants listed below are users who have access to some kind of OCR software and are willing to extract text from scanned documents.

Users who want a text OCRed should place their request under the Requests section in the following format:

[[Title of the book]] (year published) - Author. # of pages. [source where pages can be found]

Note: "year published" should be the year the work was published in the U.S., as this makes it easier to determine the copyright status.
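As an illustration of the request format above, here is a minimal Python sketch that builds a request line from its parts. The function name `format_request` is hypothetical, the reading of "# of pages" as "N pages" is an assumption, and the source URL in the example is a placeholder, not a real scan location.

```python
def format_request(title: str, year: int, author: str, pages: int, source: str) -> str:
    """Build one request line in the format described above.

    Assumption: "# of pages" is rendered as "<number> pages".
    """
    return f"[[{title}]] ({year}) - {author}. {pages} pages. [{source}]"


if __name__ == "__main__":
    # The source URL is a placeholder for wherever the scans can be found.
    print(format_request(
        "Historical Library",
        1814,
        "Diodorus Siculus (trans. G. Booth)",
        677,
        "https://example.org/scans",
    ))
```

The year passed in should be the U.S. publication year, as the note above explains.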

These are the general instructions for requesting OCR of a work; individual participants may have more specific instructions for the projects they take on.

Requested uploads to Internet Archive


Uploading a scan from an external website to Internet Archive (IA) saves the trouble of extracting the OCR text and converting to DjVu. Please follow the instructions at Help:Internet Archive/Requested uploads to request an upload to IA.

Participants


Instructions


Preference given to:

  1. Smaller requests
  2. Requests where obtaining the scans is easier (such as downloading a ZIP file instead of having to access each scan and download them all individually)
  3. Works that are hard to find in text form elsewhere on the Internet
  4. Works that I do not proofread

I will only work on two large projects at a time (first come, first served) and will fit smaller projects in between as I make time for them.

Current projects

Title | Year published | Author | Pages | Source | Completion
Historical Library | 1814 | Diodorus Siculus (trans. G. Booth) | 677 | | < 5%

Instructions


Preference given to:

  1. Smaller requests
  2. Requests where obtaining the scans is easier (such as downloading a ZIP file instead of having to access each scan and download them all individually)
  3. Works that are hard to find in text form elsewhere on the Internet
  4. Works that I have not proofread

Current projects


World Revolution


Instructions


Preference given to:

  1. Larger or non-standard requests, or where image batch-processing or DjVu conversion is needed
  2. English requests
  3. Requests where obtaining the scans is hard (batch-downloading is my favourite bot activity)
  4. Works that are hard to find in text form elsewhere on the Internet
  5. Works that are likely to be proofread soon
  6. Large reference works which, even if not proofread soon, provide a valuable reference resource.

Current projects


Requests


Done

Done, but the noisy text with long-s has not OCR'd very cleanly. Is it sufficient? Inductiveload (talk/contribs) 09:23, 22 February 2012 (UTC)
:( I don't think so. Thanks for trying. Moondyne (talk) 13:33, 22 February 2012 (UTC)
But I just found this. It might be OK after all. Moondyne (talk) 13:41, 22 February 2012 (UTC)
OK, sorry that didn't turn out so well. The OCR generated by Tesseract from that kind of scan is generally only really useful for match and split, since the noise and old-fashioned font work against clean OCR. Google has much more powerful and well-tuned software for the job, but I don't know exactly what it is. Inductiveload (talk/contribs) 18:40, 22 February 2012 (UTC)

OCR bot


There is an automatic tool for OCRing a single page at a time, which is useful for repairing pages where the text is missing or incomplete. It is available from the editing toolbar in the Page: namespace by clicking the OCR button. The edit box goes grey while the server processes the image, and the OCR text appears in the edit box within a few seconds (larger pages with more text take longer). You can check the status at http://tools.wmflabs.org/phetools/ocr.php. The tool also OCRs the following page automatically whenever a page is retrieved, so that page's text should be ready by the time you edit it.
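The look-ahead behaviour described above can be sketched as a small cache that OCRs the following page whenever a page is requested. This is only an illustration under stated assumptions, not the tool's actual implementation: the class `PrefetchingOcr` and the injected `ocr_page` callable are hypothetical, and the prefetch here runs synchronously for simplicity, whereas a real service would presumably do it in the background.

```python
from typing import Callable, Dict


class PrefetchingOcr:
    """Cache OCR results and OCR the next page ahead of time (hypothetical sketch)."""

    def __init__(self, ocr_page: Callable[[int], str], last_page: int) -> None:
        self._ocr_page = ocr_page      # hypothetical function: page number -> OCR text
        self._last_page = last_page    # highest valid page number
        self._cache: Dict[int, str] = {}

    def _ensure(self, page: int) -> None:
        # Run OCR only if this page has not been processed yet.
        if page not in self._cache:
            self._cache[page] = self._ocr_page(page)

    def get(self, page: int) -> str:
        if not 1 <= page <= self._last_page:
            raise ValueError(f"page {page} is out of range")
        self._ensure(page)
        # Look ahead: prepare the following page so it is ready when requested.
        if page < self._last_page:
            self._ensure(page + 1)
        return self._cache[page]
```

With this shape, requesting page 1 also prepares page 2, so a later request for page 2 is served from the cache and only triggers OCR of page 3.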

Requested uploads to Internet Archive


Google Books


PDF scans derived from Google Books contain a warning page that needs to be stripped before the scan is added to IA for proofreading on Wikisource. Such uploads are normally done by the user/bot "tpb" (not affiliated with Internet Archive). We dream of a way to suggest books we're interested in to tpb; in the meantime we can start accumulating Google Books URLs here, and maybe tpb will fetch them at some point.

Also see this Scriptorium thread opened by Yann. Solomon7968 (talk) 10:37, 4 February 2014 (UTC)

  • The work tpb started seems to have ended years ago, but I'm not sure. In the meantime the original GBS collection has grown considerably. Maybe we need a tool that does direct searches on GBS, warning page removal, and IA upload instead? Lugusto 19:03, 7 February 2014 (UTC)
    • Many editors here who are equipped with software to remove the warnings and watermarks replace the existing IA-derived file on Commons with a clean one. This is especially troublesome for large files. An automated system for uploading to IA would certainly help. Solomon7968 (talk) 03:31, 8 February 2014 (UTC)

The easiest way to remove the warning page is with DjView; see Help:DjVu files#Removing a copyright page. The best place to ask someone to do it for you is probably the Repairs (and moves) section of the Scriptorium. Beleg Tâl (talk) 03:13, 10 January 2017 (UTC)


See also
