Sailing Steady - How you can help keep Wikimedia sites Error-free


Back in September 2020, I wrote about how we began rolling out JavaScript error logging for Wikimedia sites.

Now it’s February 2021, and I’m pleased to report progress. We’ve added English Wikipedia, our highest-traffic wiki, to the sites we log errors for. We’ve fixed many of the errors that were flagged to us and triaged all our existing issues. We have alerts in place to tell us when new errors are introduced, and we block deployments on any noticeable error spike.

If you haven’t already, and you have the relevant permissions, please check our dashboard. We log just under 40,000 errors every day, and fixing any of them would be a tremendous help!

Image source: Wikimedia Commons, CC BY-SA 4.0

So while we are in this period of sailing steady on still waters, what still needs to be done?

1) Check error logs when changing site code

This one might seem obvious, but we’re not doing it. Code health is the responsibility of everyone who introduces code.

We’re getting great at catching errors pretty soon after they happen. We’ve had various errors originating from banner campaigns. We’ve also had incidents where volunteer site administrators introduced bugs for all users via Common.js.

Unfortunately, for privacy reasons, we can’t make our error logs public, but we can make our alerts public. The challenge now is making people aware of the impact of their changes and getting them to keep an eye out for such alerts. Release Engineering now checks JavaScript errors whenever we roll the train forward, so we are covered for any code that rolls out with the train.

However, we also deploy code outside the train. We may backport changes during a backport window, edit a special wiki page that ships JavaScript to all users (e.g. [[MediaWiki:common.js]]), enable a gadget by default, or enable a banner campaign via CentralNotice.

How can you help?

If you are a wiki interface admin, deploy banners, or backport code that affects JavaScript, please check error logs after your deployment. Please tell others to do the same, and fold this into any process documentation. Alerts display in the #wikimedia-operations IRC channel as “wikimedia-client-errors” and can be monitored via this Grafana graph.

IRC screenshot of alerts Source: own

2) Create a dashboard for your team or self

Checking for errors becomes easier if you are familiar with the dashboard you are using. While we have a board for all errors, it will likely be more useful to have your own team dashboard tracking errors concerning your projects. This will allow you to see when errors get fixed in production as well as when new deploys cause spikes.

Depending on the projects you work on, it might make sense to create filters based on stack trace, page, or wiki. Feel free to suggest new filters that may be useful; we’re currently working on improving the information we provide.

On the generic dashboard, existing bugs have been triaged and given suggested owners, so with suitable filters it is a good place to start.

How can you help?

If you want a dashboard set up and need some help, feel free to file a Phabricator ticket or reach out to me via your favorite communication medium. I’ll gladly help you get one set up and show you the ropes of Logstash!

3) Help Wikimedia redefine JavaScript browser support

About 20% of our errors come from older browsers. A frequent issue has been gadgets that assume ES6 being enabled site-wide by default.

Raising our browser support baseline would give those users a more error-free experience and remove noise from our monitoring systems.

Right now we are planning to raise support to ES6 browsers, which should help considerably.
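In the meantime, a gadget that relies on ES6 can guard itself. The following is a minimal sketch, not Wikimedia’s actual browser support check: `supportsES6` and `runGadget` are hypothetical names, and the test relies on `new Function`, which a strict Content Security Policy would block. The guard is written in ES5 on purpose. A file that contains ES6 syntax fails to parse as a whole in an older browser, so the ES6 part of a gadget would have to be loaded separately.

```javascript
// Hypothetical helper: true when the JavaScript engine can parse `sample`.
// The default sample uses ES6-only syntax (class, let, arrow function,
// default parameter).
function supportsES6( sample ) {
    sample = sample || 'class A {} let f = ( x = 1 ) => x;';
    try {
        // new Function() only parses the body; the result is never called.
        new Function( sample );
        return true;
    } catch ( e ) {
        // Engines without ES6 support throw a SyntaxError here.
        return false;
    }
}

// Hypothetical wrapper: runs the gadget only when the check passed.
// Returns whether the gadget was run.
function runGadget( supported, gadget ) {
    if ( !supported ) {
        return false;
    }
    gadget();
    return true;
}

runGadget( supportsES6(), function () {
    // Load or start the ES6 part of the gadget here.
} );
```

With a guard like this, an older browser skips the gadget quietly instead of reporting an error on every page view.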

How can you help?

Subscribe to the Phabricator ticket and review existing bugs for older browsers to help inform the heuristic for determining browser support.

4) Find active maintainers for our media extensions

One big challenge of rolling out error logging has been that it uncovered errors in extensions that were built several years ago but are no longer actively supported. In particular, we have seen issues with many of the extensions that provide rich media: MultimediaViewer, TimedMediaHandler, Kartographer, Collection and Graph. These extensions are important to our users, and should be important to the Wikimedia movement, given that they help information go beyond text.

They generated a large volume of errors which, despite tickets being raised, were not addressed. For many of these bugs, I (with no experience of any of the extensions) had to jump in, suggest fixes, and find reviewers so that these errors didn’t overwhelm the error logging dashboard with noise and make it useless. In one case an “unbreak now” ticket sat open without comment for over a week, which is not a great situation.

There is a danger that, left unchecked, these extensions will degrade further as browsers evolve, leading to “unbreak now” Phabricator tickets that derail future work by becoming a source of interrupt work.

How can you help?

If you are an engineer who maintains an older extension, think about how it could be simplified, or improved alongside existing work efforts. If the extension is not a good fit for your team please make use of the Code Stewardship Review process. If you need any help constructing project proposals, please consider reaching out to someone in the product department, or an engineer with staff, architect or principal in their job title.

If you are in a position of power in management, please consider prioritizing either improving these extensions or turning them off while we still can. Otherwise they may degrade to the point where we are forced into decisions we might not want to make, decisions that could upset our community or derail important strategic initiatives. Please also prioritize finding homes for these extensions through the Code Stewardship Review process. Many extensions have been in this process for some time without a decision.

5) Help build a strategy for user scripts and gadgets

This one is a big one! MediaWiki is an unusual product in that we allow users to run arbitrary JavaScript for themselves and, if trusted to do so, for others.

As part of the rollout of client-side error logging, I have edited almost 300 user scripts, site scripts and gadgets on English Wikipedia alone at the time of writing. On other wikis the counts are: Chinese 84, Hebrew 61, Farsi 53, Spanish 123, Catalan 39 and German 98. I haven’t singled out any particular wiki, and I have done this of my own free will to make this new error logging service as useful as possible.

My Activity before and after rolling out error logging. We added error logging in December 2020. Yellow = user namespace, green = MediaWiki namespace (gadgets) https://xtools.wmflabs.org/ec/en.wikipedia.org/Jon%20%28WMF%29#year-counts source: own work, screenshot of xtools.wmflabs.org

The problem is that many of our users are not JavaScript developers and will cargo-cult program, copying and pasting text to see if it works. In my travels I have seen attempts to add wikitext and PHP code to JavaScript (wfLoadExtension is quite a common one). Many users also last edited their gadgets over 5 years ago and have not touched them since, so it is no wonder they no longer work!

This adds significant noise to our monitoring, and unfortunately, we cannot filter this easily, because user and site scripts run in global scope, so cannot be easily distinguished from other code.

In many cases, certain power users may visit over 100 pages in a given day, throwing an error on every single page. On top of this, often these errors can easily be traced to users, meaning we are basically tracking the user’s activity across the site and which pages they view, which is not great from a privacy point of view.

Recently we added a mechanism for a client to disable error logging. This means that if we identify users whose privacy is being undermined, we can add code to their user scripts to opt them out of error logging like so:

// Wait for the storage module, then set the session flag that disables error logging.
mw.loader.using( 'mediawiki.storage' ).then( function () {
    mw.storage.session.set( 'client-error-opt-out', '1' );
} );

Right now it’s not clear what our strategy for gadget and user script errors should be. Unofficially, it seems to be “Wikimedia doesn’t care about user scripts or gadgets, these are maintained on wiki”, but now that we are tracking errors and cannot filter these, we have to care about gadgets to some extent.

Gadgets are a particular problem as they can be enabled site-wide and affect many users, driving error traffic up.

My recommendations

  1. I believe we should handle the user script problem by either disabling error tracking for non-empty user scripts or validating the scripts that are saved. It seems fair that editing a user script voids your warranty in some way, but if not, a script should at least run without a syntax error! Another approach is to limit errors per session to a certain number. Perhaps reaching that number trips a user setting in preferences that turns off error logging until the user restores it. That would need careful planning, however, particularly for the scenario where a site-wide issue causes an error for every user.
  2. User scripts should expire after a certain period, or when users retire and no longer maintain them. If a user script hasn’t been edited in over 5 years, for example, a bot should blank the page.
  3. For gadgets, I think the Wikimedia Foundation needs tooling or a team that communicates with editors about their gadget errors. Over the past few months, I have been manually maintaining an anonymized wiki page for reporting errors to users, and thanks to the amazing support of many editors, those bugs have generally been fixed. I can’t do this forever, however.
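The per-session limit from the first recommendation could look something like the sketch below. It is an illustration under assumptions, not existing MediaWiki code: `createMemoryStore`, `createErrorLimiter`, the `client-error-count` key and the limit of 10 are all hypothetical, and the in-memory store stands in for session storage. Only the `client-error-opt-out` key comes from the opt-out snippet earlier in this section.

```javascript
// Hypothetical in-memory stand-in for a session store with get/set.
function createMemoryStore() {
    var data = {};
    return {
        get: function ( key ) {
            return Object.prototype.hasOwnProperty.call( data, key ) ? data[ key ] : null;
        },
        set: function ( key, value ) {
            data[ key ] = String( value );
        }
    };
}

// Returns a function that decides whether the next error should be logged.
// Once more than `limit` errors have been seen in the session, it sets the
// opt-out flag so that later errors are dropped.
function createErrorLimiter( store, limit ) {
    var COUNT_KEY = 'client-error-count'; // hypothetical key
    var OPT_OUT_KEY = 'client-error-opt-out'; // key from the opt-out snippet
    return function shouldLog() {
        if ( store.get( OPT_OUT_KEY ) === '1' ) {
            return false;
        }
        var count = Number( store.get( COUNT_KEY ) || 0 ) + 1;
        store.set( COUNT_KEY, count );
        if ( count > limit ) {
            store.set( OPT_OUT_KEY, '1' );
            return false;
        }
        return true;
    };
}

// Example: allow at most 10 logged errors per session.
var shouldLog = createErrorLimiter( createMemoryStore(), 10 );
```

The sketch does not handle the site-wide scenario raised in the first recommendation: a single bad deployment would trip the limit for every user at once, so a real implementation would need a way to tell those errors apart.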

How can you help?

Please subscribe to the Phabricator ticket, and join the conversation.

If you are an editor, please consider talking to your community about the problem of old user scripts and gadgets, or scripts without maintainers, and consider revising policies so that these scripts are taken offline rather than left to cause damage.

If you are a tool developer, I have proposed on the community wishlist a tool for reporting these errors. If you are capable of creating a tool that takes data from Logstash and publishes it to a wiki page, I’d love to hear from you.
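As a rough starting point, the publishing half of such a tool could be sketched as below. Everything here is an assumption for illustration: the record fields (`file_url`, `message`), the function name `buildErrorReport` and the table layout are hypothetical and do not reflect the actual logstash schema. Fetching the data and saving the wiki page are left out.

```javascript
// Groups error records by script URL and message, then renders the counts
// as a wikitext table. Only `file_url` and `message` are read; any other
// fields on a record are ignored.
function buildErrorReport( records ) {
    var groups = {};
    records.forEach( function ( record ) {
        var key = JSON.stringify( [ record.file_url, record.message ] );
        if ( !groups[ key ] ) {
            groups[ key ] = { file: record.file_url, message: record.message, count: 0 };
        }
        groups[ key ].count += 1;
    } );

    var rows = Object.keys( groups )
        .map( function ( key ) {
            return groups[ key ];
        } )
        // Most frequent errors first.
        .sort( function ( a, b ) {
            return b.count - a.count;
        } );

    var lines = [ '{| class="wikitable"', '! Script !! Error !! Count' ];
    rows.forEach( function ( row ) {
        lines.push( '|-' );
        lines.push( '| ' + row.file + ' || <nowiki>' + row.message + '</nowiki> || ' + row.count );
    } );
    lines.push( '|}' );
    return lines.join( '\n' );
}
```

A real tool would also need to escape message text more carefully and decide how to anonymize script URLs that point at a user’s own pages.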

If you are in a position of power in management, please consider prioritizing defining a strategy for gadgets, and for improving the communication channel between Wikimedia and the people that write them.
