Investigate sentence splitting
Open, High · Public · Spike
Assigned To

Authored By

	VPuffetMichel
	Dec 2 2022, 8:18 PM

Description

This task involves the work of identifying the technical approaches we will consider using for splitting an arbitrary range of text/content into discrete sentences.

Decision(s) to be made

1. What – if any – technical approach is accurate and reliable enough for the Editing Team to depend on to identify discrete sentences, so that Edit Check can "automatically place a reference at the end of a sentence"?

Investigation output

Per what @DLynch proposed in T324363#8561900, we will strive to make a prototype (or series of prototypes depending on how many approaches seem viable) that we can use in Patch demo to evaluate how effective a given approach is at splitting arbitrary content into discrete sentences.

Findings

Not currently able to support sentence splitting in Thai
- Reason: Thai does not use punctuation to mark the end of a sentence, and the current sentence splitting approach relies on such punctuation as its cue.

Open question(s)

1. What languages will a given sentence splitting approach need to work in for us to consider it viable?

Done

Next steps are documented for all Decision(s) to be made
Answers to all Open question(s) are documented
Next steps are identified for all Findings (should there be any)

Details

	Subject	Repo	Branch	Lines +/-
	WIP: sentencebreak	unicodejs	master	+283 -2

Related Objects

Status	Subtype	Assigned	Task
Open		None	T265163 Create a system to encode best practices into editing experiences
Open		Trizek-WMF	T331946 [RELEASE TICKET] Make Edit Check (references) available to all newcomers at all Wikipedias
Open		None	T331947 [RELEASE TICKET] Make Edit Check MVP available to all newcomers at Partner Wikipedias
Resolved		ppelberg	T338907 [RELEASE TICKET] Make Edit Check MVP available as Beta Feature (all wikis)
Resolved		None	T332646 [Deployment] Make Edit Check MVP (mobile + desktop) available for testing at test.wikipedia.org
Resolved		ppelberg	T328944 Build Edit Check MVP (references)
Resolved		ppelberg	T328598 Build a technical prototype for the initial reference check (mobile)
Open		None	T325700 Conduct technical investigations to assess feasibility of various edit check UX approaches
Open	Spike	dchan	T324363 Investigate sentence splitting
Declined		dchan	T331080 Hook up WIP sentence splitting as a bookmarklet
Open		dchan	T331083 Link up WIP sentence splitting to content detection
Open	Spike	dchan	T331686 Evaluate reliability of sentence splitting approach

Event Timeline

VPuffetMichel created this task.Dec 2 2022, 8:18 PM

Restricted Application changed the subtype of this task from "Task" to "Spike".Dec 2 2022, 8:18 PM

Restricted Application added a subscriber: Aklapper.

ppelberg added a parent task: T265163: Create a system to encode best practices into editing experiences.Dec 2 2022, 11:49 PM

ppelberg subscribed.

ppelberg moved this task from Incoming to Ready to Be Worked On on the Editing-team (Kanban Board) board.Dec 3 2022, 1:18 AM

Isaac subscribed.Dec 5 2022, 9:23 PM

Copying some content from Slack to ensure it's not lost:

https://www.unicode.org/reports/tr29/#Sentence_Boundaries

https://unicode-org.github.io/icu/userguide/boundaryanalysis/#sentence-boundary is the ICU usage guide for their API

php-intl binds this; https://www.php.net/manual/en/intlbreakiterator.createsentenceinstance.php looks like it's the PHP binding

Following up on a discussion with @ppelberg : I recently helped start a project to do sentence tokenization in (ideally all) Wikipedia languages. The project is written in Python but there might be some reusable pieces. Always happy to talk. The quick summary:

- We are doing our best to compile a list of sentence-ending punctuation that covers all Wikipedia languages.
- We then have a very simple sentence segmenter that essentially looks for the presence of those punctuation marks (with some caveats like ignoring decimal points).
- We are currently trying to identify how big of an issue things like abbreviations are (we know they're very prevalent in e.g. German Wikipedia) and whether we can devise a simple solution for detecting and skipping over them.

Project: https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools
Our current list of full-stop punctuation: https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/blob/main/src/wikinlptools/config/symbols.py#L324
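
The segmenter summarized above can be sketched roughly as follows. This is a minimal illustration in Python, not the wiki-nlp-tools code: `SENTENCE_TERMINATORS` is a deliberately tiny stand-in for the real list linked above, and `split_sentences` is a hypothetical name.

```python
# Hypothetical, deliberately tiny terminator set; the real list in
# wiki-nlp-tools (symbols.py) is much longer.
SENTENCE_TERMINATORS = {".", "!", "?", "。", "।"}


def split_sentences(text):
    """Split text after each terminator, skipping decimal points."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in SENTENCE_TERMINATORS:
            continue
        # Caveat from the summary: a "." between two digits is treated as a
        # decimal point, not as the end of a sentence.
        is_decimal = (
            ch == "."
            and 0 < i < len(text) - 1
            and text[i - 1].isdigit()
            and text[i + 1].isdigit()
        )
        if is_decimal:
            continue
        sentence = text[start:i + 1].strip()
        if sentence:
            sentences.append(sentence)
        start = i + 1
    # Keep any trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With no abbreviation handling, a string such as "z. B. Berlin." is cut after every full stop, which is the over-splitting problem mentioned for German Wikipedia.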

ppelberg added a subscriber: Esanders.Dec 8 2022, 11:42 PM

ppelberg mentioned this in T325700: Conduct technical investigations to assess feasibility of various edit check UX approaches .Dec 21 2022, 12:30 AM

VPuffetMichel added a project: EditCheck.Jan 18 2023, 3:58 PM

ppelberg edited parent tasks, added: T325700: Conduct technical investigations to assess feasibility of various edit check UX approaches ; removed: T265163: Create a system to encode best practices into editing experiences.Jan 23 2023, 8:02 PM

A complexity to bear in mind is that we're not splitting text, we're splitting either wikitext or HTML. This might cause some challenges around finding valid insertion points.

Given that sentence splitting is (probably) going to need a server-side component, it might make sense for us to make a prototype tool that can be thrown onto patchdemo and output some sort of visualization of an article and how it'd be split so we can get an idea of how viable this is for placement.

ppelberg updated the task description.Jan 27 2023, 11:14 PM

What languages will a given sentence splitting approach need to work in for us to consider it viable?

While a number of languages have unique sentence-ending punctuation, the list of sentence-ending punctuation I shared earlier hopefully captures almost all of it, which makes this pretty feasible for almost all Wikipedia languages. The major caveats that we're currently aware of when it comes to language-specific trickiness:

- German Wikipedia uses a lot of abbreviations, which lead simple methods to split the text far too often.
- Thai doesn't have sentence-ending punctuation, so the only straightforward thing there without developing a custom segmentation approach is to split by paragraph -- I haven't talked with Search to see if they do anything special there that could be borrowed from.

Meeting notes from today:

@dchan said he was going to look into this?
Compare front-end version in CX with libICU (intl in PHP, required in MW since 1.36)
- @cscott concerned this might hit a brick wall when this is eventually deployed to a language which the front-end-only version doesn’t support. (CJK requires large dictionaries, eg.)
ES: https://www.php.net/manual/en/intlbreakiterator.createsentenceinstance.php Example: https://www.php.net/manual/en/class.intlbreakiterator.php#113798
https://unicode-org.github.io/icu/userguide/boundaryanalysis/

matmarex mentioned this in T327705: [SPIKE] Investigate current capacity to automatically place references.Jan 31 2023, 11:00 PM

ppelberg moved this task from Ready to Be Worked On to Doing on the Editing-team (Kanban Board) board.Feb 6 2023, 5:26 PM

ppelberg renamed this task from [edit check] Investigate sentence splitting to Investigate sentence splitting.Feb 6 2023, 5:46 PM

ppelberg mentioned this in T324730: Create the heuristic that will [initially] trigger the reference check.Feb 17 2023, 11:48 PM

ppelberg triaged this task as High priority.Feb 22 2023, 5:17 PM

We're currently envisaging two uses for sentence segmentation in Edit Check:

1. Counting sentences added. Used to decide whether to trigger the reference check.
2. Suggesting sentence boundaries as locations to add a new citation.

As @Isaac pointed out, sentence segmentation is script dependent. For some scripts/languages, there are unambiguous sentence terminators, such as U+3002 (。) IDEOGRAPHIC FULL STOP in Chinese or Japanese, or U+0964 (।) DEVANAGARI DANDA in Indic languages. For other scripts, sentence terminators can be ambiguous, e.g. in Latin, Cyrillic etc. the character U+002E (.) FULL STOP is used to end sentences but also for abbreviations, decimals etc.

Unicode TR29, as mentioned by @cscott earlier, ( https://www.unicode.org/reports/tr29/#Sentence_Boundaries ) gives a language-independent algorithm that is generally quite accurate, but for some scripts is not 100% accurate. It describes how to extend the algorithm using language-specific lexical data, e.g. to identify whether a Latin FULL STOP character is part of a common abbreviation and therefore unlikely to be acting as a sentence terminator. The ICU library (with PHP bindings) implements the TR29 rules, augmented with lexical data. But note that for sentence segmentation, there's not actually very much data there! It's just abbreviation lists for seven languages: de en es fr it pt ru (totalling a few hundred abbreviations: see https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr ). ICU contains a lot of data for other types of segmentation (e.g. what is in effect a list of Chinese words), but not that much for sentence segmentation. Using the lexical data improves sentence segmentation accuracy, but not to 100%.
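
To illustrate the abbreviation-list idea, here is a minimal sketch in Python. It is not ICU's implementation and does not use ICU's data: `ABBREVIATIONS`, `is_probable_abbreviation` and `full_stop_boundaries` are hypothetical names, and the lists hold a handful of well-known abbreviations picked for the example. It only looks at FULL STOP and does not implement the TR29 rules.

```python
# Hypothetical abbreviation lists for illustration only; these are NOT the
# ICU data, just a few well-known abbreviations per language.
ABBREVIATIONS = {
    "en": {"e.g.", "i.e.", "Dr.", "Mr.", "etc."},
    "de": {"bzw.", "usw.", "Dr.", "ca."},
}


def is_probable_abbreviation(text, stop_index, lang):
    """Return True if the FULL STOP at stop_index sits in a known abbreviation."""
    # Look at the whole space-delimited token containing the full stop, so
    # that both stops in "e.g." are recognised.
    token_start = text.rfind(" ", 0, stop_index) + 1
    token_end = text.find(" ", stop_index)
    if token_end == -1:
        token_end = len(text)
    return text[token_start:token_end] in ABBREVIATIONS.get(lang, set())


def full_stop_boundaries(text, lang):
    """Offsets just after each FULL STOP not explained by an abbreviation."""
    return [
        i + 1
        for i, ch in enumerate(text)
        if ch == "." and not is_probable_abbreviation(text, i, lang)
    ]
```

For a language with no list, every full stop is reported, which is the behaviour one would get from the rules without lexical data in this simplified picture.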

So implementing this with server-side ICU is a possibility. However, I think this is not the best option, because of the particular requirements for Edit Check, which I'll explain.

Edit Check requirements

For counting sentences, we do not need 100% accuracy: a confident answer could be counted as 1 and a tentative answer as, say, 0.5. For suggesting sentence boundaries, we also do not need 100% accuracy, but it is better to have a false positive (offering an unsuitable location that the user can simply ignore) than a false negative (not offering the correct location, making it difficult for the user to place the citation there).
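
A minimal sketch of the weighted count, in Python. The split into unambiguous and ambiguous terminators is an assumption made for the example (treating "!" and "?" as confident), and `weighted_sentence_count` is a hypothetical name. It counts terminator characters only, so something like "3.14" would still add a tentative 0.5.

```python
# Assumed, illustrative split of terminators; not an exhaustive list.
UNAMBIGUOUS_TERMINATORS = {"。", "।", "!", "?"}
AMBIGUOUS_TERMINATORS = {"."}


def weighted_sentence_count(text, tentative_weight=0.5):
    """Count confident boundaries as 1 and tentative ones as tentative_weight."""
    count = 0.0
    for ch in text:
        if ch in UNAMBIGUOUS_TERMINATORS:
            count += 1.0
        elif ch in AMBIGUOUS_TERMINATORS:
            count += tentative_weight
    return count
```

The result could then be compared against whatever threshold the reference check uses.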

Once we have the sentence boundary, we do need to support per-language configuration of where exactly citations go at the end of a sentence, relative to closing parentheses, sentence terminators and whitespace. This is because different wikis have different conventions, and it would be disruptive if an automatic tool does not respect the conventions. Here are some examples, where I've marked the citation placement with "█".

enwiki: … barely registered in my mind."█
eswiki: … y manifestó «profunda gratitud y gran humildad».█
frwiki: … vous faites appel à Roy »█.
jawiki: …（生まれはニャンザ州ラチュオニョ県カニャディアン村█）。
kowiki: … 생각하는 어떤 것을 선택할 수도 있다.█
ruwiki: … на берегах реки много парков и набережных█.
zhwiki: … 1532年法國作家弗朗索瓦·拉伯雷的《巨人傳》█。

You can see we have to support a number of different conventions about whether the citation goes before or after the sentence terminator, close parenthesis, whitespace etc. To support this, we'd really want to use Unicode character data to identify parentheses, quotation marks, whitespace and sentence terminators from all scripts.
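
A rough sketch of what per-wiki placement driven by Unicode character data could look like, in Python. Everything here is an assumption for illustration: `PLACEMENT` is inferred only from the examples above, `TERMINATORS` is a small subset, and general categories Pe/Pf are used as an approximation of "closing punctuation" (ASCII quotes are category Po, so they are listed by hand). jawiki is left out because its example places the citation before the closing bracket, which neither option below covers.

```python
import unicodedata

TERMINATORS = {".", "!", "?", "。", "।"}  # illustrative subset
ASCII_QUOTES = {'"', "'"}  # category Po, so not caught by Pe/Pf

# Hypothetical config, inferred only from the examples in this task.
PLACEMENT = {
    "enwiki": "after_close",
    "eswiki": "after_close",
    "kowiki": "after_close",
    "frwiki": "before_terminator",
    "ruwiki": "before_terminator",
    "zhwiki": "before_terminator",
}


def char_class(ch):
    """Classify a character as terminator, close, space or other."""
    if ch in TERMINATORS:
        return "terminator"
    if ch in ASCII_QUOTES or unicodedata.category(ch) in ("Pe", "Pf"):
        return "close"
    if ch.isspace():
        return "space"
    return "other"


def place_citation(sentence, wiki, marker="█"):
    """Insert marker into a sentence that ends with terminator + closers."""
    end = len(sentence)
    # Skip trailing whitespace, then closing punctuation, walking backwards.
    while end > 0 and char_class(sentence[end - 1]) == "space":
        end -= 1
    after_close = end
    while end > 0 and char_class(sentence[end - 1]) == "close":
        end -= 1
    # "end" is now just after the terminator, if there is one.
    if end > 0 and char_class(sentence[end - 1]) == "terminator":
        before_terminator = end - 1
    else:
        before_terminator = end
    offset = after_close if PLACEMENT[wiki] == "after_close" else before_terminator
    return sentence[:offset] + marker + sentence[offset:]
```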

Therefore I recommend we don't use ICU, but implement this in our own code instead. The Edit Check code will need direct access to the Unicode character data in any case, in order to place references correctly. (This is similar in essence to the Unicode data we already make available in UnicodeJS to find word boundaries). Given this data, it is a small step to implement the TR29 rules ourselves, which would give us more flexibility to tune the algorithm. For example, we’d prefer false positives to false negatives, and we may not be satisfied with the trade-off built into ICU. Additionally, we could collect and use our own lexical data for abbreviations in languages other than the seven supported by ICU (de en es fr it pt ru). The custom segmenter can also be client-side code, giving us flexibility to perform sentence segmentation without performing a server round-trip. This may make more types of UX innovation possible in the future.

Change 893832 had a related patch set uploaded (by Divec; author: Divec):

[unicodejs@master] WIP: sentencebreak

https://gerrit.wikimedia.org/r/893832

gerritbot added a project: Patch-For-Review.Mar 3 2023, 12:32 AM

dchan added a subtask: T331080: Hook up WIP sentence splitting as a bookmarklet.Mar 3 2023, 1:32 AM

ppelberg updated the task description.Mar 3 2023, 2:05 AM

dchan added a subtask: T331083: Link up WIP sentence splitting to content detection.Mar 3 2023, 3:44 AM

ppelberg mentioned this in T331686: Evaluate reliability of sentence splitting approach.Mar 10 2023, 2:11 AM

RHo mentioned this in T269655: Add a link: sentence highlighting research spike.Mar 15 2023, 9:23 PM

kostajh subscribed.Mar 16 2023, 8:20 AM

SBisson subscribed.Mar 27 2023, 5:19 PM

santhosh subscribed.Apr 12 2023, 6:08 AM

ppelberg moved this task from Doing to Blocked / Needs More Work on the Editing-team (Kanban Board) board.Jul 21 2023, 10:44 PM

ppelberg added a project: Goal.Aug 22 2023, 9:38 PM

VPuffetMichel added a project: Release.Sep 6 2023, 8:07 PM

ppelberg removed a project: Editing-team (Kanban Board).Sep 11 2023, 5:26 PM

Proof-of-concept patch set to surface sentence segmentation in VisualEditor: https://gerrit.wikimedia.org/r/c/VisualEditor/VisualEditor/+/961095

Patchdemo instance of the above patch set (thanks @DLynch) https://patchdemo.wmflabs.org/wikis/4b31a139ad/w/index.php?title=Douglas_Adams&veaction=edit

Press Enter in a paragraph to detect sentence boundaries. The following characters are added around the sentence boundary:

⓿ immediately before the sentence terminator (e.g. ? or ! or . or 。)
❶ after the sentence terminator, but before any immediately following closing punctuation (e.g. brackets or quotes)
❷ after closing brackets or quotes, but before trailing whitespace
❸ after trailing whitespace

Sometimes the sentence boundary is marked by an ambiguous terminator (e.g. the Latin FULL STOP). Such terminators also serve another purpose (e.g. the Latin FULL STOP is used for abbreviations), so the algorithm cannot be 100% certain that this is actually a sentence boundary. This demo denotes the ambiguity by inserting alternative lighter characters ⓪①②③.
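
For readers who want to reproduce the marker scheme outside the demo, here is a simplified sketch in Python. It is not the patch set's code and does not implement the TR29 rules: it treats every listed terminator as a boundary, and the split into unambiguous and ambiguous terminators, the Pe/Pf approximation of closing punctuation, and the names `annotate` and `is_close` are all assumptions made for the example.

```python
import unicodedata

CONFIDENT_MARKS = "⓿❶❷❸"
TENTATIVE_MARKS = "⓪①②③"
UNAMBIGUOUS = {"!", "?", "。"}  # assumed confident terminators
AMBIGUOUS = {"."}  # FULL STOP also appears in abbreviations
ASCII_QUOTES = {'"', "'"}  # category Po, so listed by hand


def is_close(ch):
    """Approximate 'closing punctuation' via general categories Pe and Pf."""
    return ch in ASCII_QUOTES or unicodedata.category(ch) in ("Pe", "Pf")


def annotate(text):
    """Insert the four position markers around every terminator in text."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch not in UNAMBIGUOUS and ch not in AMBIGUOUS:
            out.append(ch)
            i += 1
            continue
        marks = TENTATIVE_MARKS if ch in AMBIGUOUS else CONFIDENT_MARKS
        # Positions 0 and 1: immediately before and after the terminator.
        out.append(marks[0] + ch + marks[1])
        i += 1
        # Position 2: after any closing brackets or quotes.
        while i < n and is_close(text[i]):
            out.append(text[i])
            i += 1
        out.append(marks[2])
        # Position 3: after trailing whitespace.
        while i < n and text[i].isspace():
            out.append(text[i])
            i += 1
        out.append(marks[3])
    return "".join(out)
```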

Examining the code's behaviour on the citation placement examples above, we can see that positions ⓿❶❷❸ would be sufficient for all those examples except the Japanese one. There the citation is placed before the closing bracket, which is earlier than position ⓿, the first position marked around the Unicode TR29 sentence boundary.

In case it helps, the Language team had a very similar requirement for our machine translation service (MinT) and for CX-cxserver. We just published our sentence segmentation library in Python and JavaScript. It also clusters references with the preceding sentence. It is designed to support a large number of languages, and custom per-language rules are possible by design.

JS library demo https://santhoshtr.github.io/sentencex-js/
NPM package https://www.npmjs.com/package/sentencex
Python package https://pypi.org/project/sentencex

ppelberg awarded a token.Sep 28 2023, 11:33 PM

ppelberg mentioned this in T347643: Enable volunteers to define new content added in terms of sentences.Sep 28 2023, 11:51 PM

ppelberg mentioned this in T347644: Introduce hidden tag to identify edits when new sentences are added.Sep 29 2023, 12:05 AM

ppelberg closed subtask T331080: Hook up WIP sentence splitting as a bookmarklet as Declined.

ppelberg mentioned this in T350904: [INVESTIGATION] What additional semantic primitives could be introduced to better understood VE edits at scale?.Nov 9 2023, 5:53 PM

ppelberg mentioned this in T350922: Ensure citations are placed in the position that matches project conventions.Nov 9 2023, 11:59 PM