Parsoid/Parser Unification

From mediawiki.org

Currently, we have two separate wikitext parsers that are used in MediaWiki on the Wikimedia cluster (and several other third-party MediaWiki installations). One is the original core parser (legacy parser), and the other is Parsoid. As of early 2023, the core parser was used for all desktop and mobile web read views, while Parsoid was used to serve all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), some gadgets, mobile apps, Kiwix offline reader, Wikimedia Enterprise, and the Google knowledge graph project.

The goal of this project is to arrive at a single parser that supports all clients and use cases.

This project is primarily driven by the Content Transform team (previously Parsing Team). Because parser unification will touch them all, it also involves the MediaWiki Platform team, all the internal teams that develop Parsoid clients, the Movement Communications team (previously Community Relations Specialists), Wikimedia wiki editor communities, and third-party MediaWiki projects.

This page serves as the high-level tracking page for this unification project, with links to other pages that have additional details.

Updates will continue to be published on a separate page.

Project Goals

Longer Term Goal: Parsoid is the default wikitext engine for MediaWiki and the legacy parser is removed from the codebase

Intermediate Goal: Parsoid replaces the core parser for all wikitext use cases on the Wikimedia cluster.

How are we testing this change?

  • Parser tests: This is how Parsoid has been developed since its inception. We ensure that Parsoid continues to pass parser tests, and where divergence is known, it is recorded after careful review. We have also vastly expanded parser test coverage over the years, and all patches against Parsoid need to pass tests.
  • Round-tripping / Integration tests: In this mode, before every production deployment, we convert wikitext to HTML and HTML back to wikitext on about 180K pages from about 50 production wikis. This testing mode exists primarily to ensure our HTML -> wikitext conversion is not broken (which would impact our editing client tools), but it also implicitly flags breakages in our HTML output. However, these are not the most reliable tests for verifying that our HTML output is not broken.
  • Visual diff tests: Here, we render legacy parser HTML and Parsoid HTML, compare screenshots of the two renderings, and generate a numeric diff score. We have run this in an automated way on 25k+ pages from about 20 production wikis. This has been a reliable way to identify various breakages and bugs in Parsoid output. As we get closer to rollout, we intend to expand our testing to a wider range of wikis.
  • Parsoid reading and editing clients: Parsoid's output has been used over the years by VisualEditor, Android and iOS mobile apps, Kiwix, and other clients. We have fixed a number of bugs and incompatibilities in Parsoid over the years and continue to fix the various long-tail edge cases as they are discovered and reported.
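To make the idea of a numeric diff score concrete, here is a minimal sketch in Python. It is not the scoring used by the actual visual diff tooling, which this page does not describe. It assumes each screenshot has already been decoded into a grid of grayscale pixel values, and the names `diff_score` and `tolerance` are hypothetical.

```python
from typing import List, Optional

# Assumption: a screenshot is a list of rows of grayscale values (0-255).
Pixels = List[List[int]]


def diff_score(legacy: Pixels, parsoid: Pixels, tolerance: int = 0) -> float:
    """Return the percentage (0-100) of pixel positions that differ.

    The two images are compared over a box large enough to hold both.
    A position that exists in only one image counts as a difference.
    """
    height = max(len(legacy), len(parsoid))
    width = max(
        max((len(row) for row in legacy), default=0),
        max((len(row) for row in parsoid), default=0),
    )
    if height == 0 or width == 0:
        return 0.0  # two empty images are trivially identical

    def pixel(img: Pixels, y: int, x: int) -> Optional[int]:
        # None marks a position that lies outside this image.
        if y < len(img) and x < len(img[y]):
            return img[y][x]
        return None

    differing = 0
    for y in range(height):
        for x in range(width):
            a, b = pixel(legacy, y, x), pixel(parsoid, y, x)
            if a is None and b is None:
                continue  # outside both images (ragged rows)
            if a is None or b is None or abs(a - b) > tolerance:
                differing += 1
    return 100.0 * differing / (height * width)
```

A score of 0 means the two renderings match pixel for pixel; a higher score marks a page worth inspecting by hand. A nonzero `tolerance` is one way to ignore small anti-aliasing differences.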

As we get closer to rollout, we will identify other QA and testing methodologies as required to ensure we can roll out this change in as smooth and non-disruptive a fashion as possible, and we will update this page as that happens.

What is our deployment plan / strategy?

At this stage of this project, we have split this work into a number of steps to achieve the intermediate goal.

  1. ✅ Deploy changes to core that make media structure HTML largely identical to what Parsoid emits. This has its own deployment plan. This change has been live on mediawiki.org and officewiki since September 2021 and we expect to roll this out to all wikis gradually in 2022.
  2. ✅ Deploy individual user opt-in tools to use Parsoid for read views as part of the ParserMigration extension.
  3. ✅ Deploy changes to Wikimedia production that let DiscussionTools use Parsoid HTML directly. This lets us iron out bugs in a restricted use case.
  4. Turn on Parsoid HTML read views on additional wikis incrementally
    • ✅ officewiki
    • ✅ Talk pages on wikitech
    • (in progress) wikivoyage
    • ...
  5. Continue work to ensure Parsoid can generate the same metadata that the legacy parser generates (categories, backlink tables, page properties, etc). This is needed for tighter integration of Parsoid into MediaWiki core and to start replacing the legacy parser in additional wikitext use cases.
  6. Start rolling out on all wikis gradually -- more specific deployment plan will be developed based on what we learn in previous stages.

Confidence Framework

To validate our road-map evolution and use data-driven decision making for deployments, we have developed a Confidence Framework for Parsoid Read Views. This framework contains the guidelines for how we prioritise features, bugfixes, and deployments.

How does this impact wikis?

The switch to Parsoid-generated HTML should be transparent to most users. Below, we outline some possible impacts on readers, editors, and developers.

Readers

Parsoid models and processes wikitext differently from the legacy parser, and this can sometimes lead to rendering differences in edge cases. Where a wikitext pattern is commonly used, we have attempted to support it in Parsoid. Where that was not possible, we have either fixed the affected wikitext or provided support for fixing it. At this time, we believe all rendering differences we expect to run into will be edge cases that can likely be resolved by fixing wikitext either on individual pages or in templates.

Editors and bot, gadget, skin developers

  • Parsoid's HTML for media wikitext is different from what the legacy parser has typically generated. As part of a separate project to use semantic HTML5 output for images, the legacy parser is currently being updated to generate HTML that is close to Parsoid's HTML. We expect to roll this out this year, which might require some skins, gadgets, bots, and template styles to be updated.
  • The Cite extension that targets Parsoid relies on CSS rules to localize numbering of references rather than generate localized HTML. This requires editors with appropriate permissions to update MediaWiki:Common.css on their wikis to add suitable CSS rules targeting this HTML.

Extension developers

Parsoid's internal processing model is different from the legacy parser's. As a result, extensions may need to be updated. This only impacts extensions that do one or more of the following: (a) operate on wikitext, (b) provide handlers for parser hooks, or (c) call a public method of the legacy parser.

Extensions that process wikitext will definitely need to be updated to work with Parsoid. To date, the vast majority of such extensions have been updated. Since Parsoid continues to call the legacy parser for expanding templates and processing parser functions, any parser hooks triggered during this processing will continue to operate, and so will the extensions that rely on these hooks. For the rest, we are exploring strategies to minimize the updates needed to extensions.

We will file Phabricator tasks for all impacted extensions as we proceed with this work, and will fix whatever extensions we can within our team. If you are an extension developer, we would greatly appreciate any proactive work and code review (for patches we might submit).

Kartographer

Kartographer has been ported to Parsoid, but the port is not deployed to all wikis yet. It can be deployed on a specific wiki with a patch similar to Gerrit change 969168. It is strongly advised to prepare a list of pages to test during the deployment (via Wikimedia Debug) and to test these before moving on with the full deployment. The following cases are interesting to test:

What kind of support will we provide to impacted editor and developer communities?

The Content Transform Team is driving this project. Our goal is to make this switch to Parsoid as seamless as possible. To that end, we have rolled out changes gradually over the years.

We started with replacing HTML4 Tidy with HTML5 RemexHtml in the 2015 - 2018 timeframe. In 2019, in preparation to integrate Parsoid into MediaWiki core more closely, we ported Parsoid from JS to PHP. This switch went very smoothly. In the 2020 - 2022 timeframe, we started work to unify the media output generated by Parsoid and by core. This has mostly involved making changes to core, but we have occasionally adjusted Parsoid's output based on feedback and other technical considerations.

Going forward, we will provide support in the following ways:

  • Linter rules for any wikitext that needs fixing.
    • The vast majority of this work was completed as part of the Tidy -> Remex migration, and we don't expect to introduce a large number of new linter categories for this.
  • Communication via this page, via tech news updates, and via updates and posts to village pump and other wiki-specific forums.
  • Opt-in mechanisms for early adopter users / wikis to test and report problems.
    • See the next section for more details!

How can you help / be involved?

Starting November 2023, you can opt in to using the new Parsoid parser for reading articles on Wikipedia. See Help:Extension:ParserMigration for more information!

Other things you can do to help:

  • Test your gadgets / user scripts against Parsoid HTML to identify / fix any breakages
  • Parsoid read views will be rolled out first on wikis whose communities have elected to be early adopters; watch this space for more details.
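For gadget and user script authors who want to inspect both renderings of a page side by side, the following Python sketch builds the two request URLs. It rests on some assumptions: that the wiki uses the Wikimedia-style `/w/` script path, that the Action API's `action=parse` returns legacy parser HTML by default, and that the REST API's `/v1/page/{title}/html` endpoint returns Parsoid HTML. The function names are hypothetical, and the HTML that a read view actually serves may differ from what these endpoints return.

```python
from urllib.parse import quote, urlencode


def legacy_html_url(base: str, title: str) -> str:
    """Build an Action API URL that asks for a page's parsed HTML.

    Assumption: action=parse uses the legacy parser by default.
    """
    query = urlencode({
        "action": "parse",
        "page": title.replace(" ", "_"),
        "prop": "text",
        "format": "json",
    })
    return f"{base}/w/api.php?{query}"


def parsoid_html_url(base: str, title: str) -> str:
    """Build a REST API URL that asks for a page's HTML.

    Assumption: this endpoint serves Parsoid-generated HTML.
    """
    # Encode "/" as well, so a subpage title stays one path segment.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"{base}/w/rest.php/v1/page/{encoded}/html"
```

Fetching both URLs for a page your gadget touches, and running the gadget's selectors against each response, is one way to spot markup it depends on that differs between the two parsers.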