Wikipedia:Wikipedia Signpost/2016-11-26/Special report
Taking stock of the Good Article backlog
- Wugapodes is a two-time GA Cup participant and WikiCup finalist. Their academic work focuses on the linguistic impacts of group behavior.
Before an English Wikipedia article can achieve good article status (the entry grade among the higher-quality article rankings), it must undergo review by an uninvolved editor. However, the number of articles nominated for review at any given time has outstripped the number of available reviewers for almost as long as the good article nominations process has existed, and the resulting backlog of unreviewed articles has been a perennial concern. Nevertheless, the backlog at Good Article Nominations (GAN) reached its lowest point in two years on 2 July 2016. The reason was the third annual Good Article Cup, which ended on 30 June 2016; the 2016-2017 GA Cup, its fourth iteration, began on 1 November and is ongoing. The GA Cup is the GA WikiProject's most successful backlog reduction initiative to date, but there is a problem that plagues this and all other backlog elimination drives: editor fatigue.
The backlog at GAN has been growing ever since the process was created, with fluctuations and trends along the way. If the GA Cup, or any elimination drive, is going to be successful, it must at some point begin to treat the cause, not simply the symptom. While the GA Cup has done a remarkable job of reducing the backlog, long-term success requires understanding what causes the backlog. The cause appears to be editor fatigue: in boom-and-bust reviewing periods, the core group of reviewers tries to reduce the backlog and then tires out, causing the backlog to rebound. This is the chief benefit of the GA Cup: its format helps counteract the cycle of fatigue with a long-term motivational structure.
The GA Cup is a multi-round competition modeled on the older and broader-purpose WikiCup (which has run annually since 2007 and concluded this year on 31 October). Members of the GA WikiProject created the GA Cup as a way to encourage editors to review nominations and reduce the backlog through good-natured competition. Participants are awarded points for reviewing good article nominations, with more points being awarded the longer a nomination has languished in the queue. Each GA Cup sees a significant reduction in the number of nominations awaiting review. On this metric alone the GA Cup is a success; but counting raw articles awaiting review only gives insight into what happens while the GA Cup is running, ignoring the origin of the backlog and masking ways in which the GA Cup can be further improved.
The GA Cup's predecessors, backlog elimination drives, only lasted a month, while the GA Cup lasts four. While the time commitment alone can be a source of fatigue, the mismatch between the time taken to review and the ease of nomination can lead to an unmanageable workload. A good article review nominally takes 7 days, so if reviews are closed more slowly than nominations are added, not only will the backlog increase, but the number of reviews being done by a given reviewer will balloon, causing them to burn out by the end of the competition. Well-known post-cup backlog spikes demonstrate the often temporary nature of GA Cup efforts.
With proper information and planning, the GA Cup can begin to treat the cause of the backlog rather than the symptom and succeed in sustaining backlog reductions after its conclusion.
A history of the Good Article project
The Good Article project was created on 11 October 2005 "to identify good content that is not likely to become featured". The criteria were similar to those we have now:
“[A Good Article] should be well written, factually accurate, neutral, and stable. It should definitely be referenced, and wherever possible it should contain images to illustrate it. Good articles may not be as comprehensive as our featured articles, but should not omit any major facets of the topic.”
At first, the project was largely a list of articles individual editors believed to be good: any editor could add an article to the list, and any other editor could remove it. This received significant pushback, with core templates {{GA}} and {{DelistedGA}} receiving nominations for deletion on 2 December 2005 as "label creep" and a suggestion that the then-guideline should be deleted as well. They were kept, but, after discussions, the GA process received a slight tweak: while editors could still freely add articles they did not write as GAs, those wishing to self-nominate their work were referred to a newly created good article nomination page.
While the first version of the Good Article page told editors to nominate all potential Good Articles at Wikipedia:Good article candidates (now Good Article Nominations), that requirement was removed 10 hours later. The current process was not adopted until a few months later. In March 2006 another suggestion was made:
“...like many people are noting, many articles are being promoted to GA status, except it doesn't do much to lead to article improvement. So I'm thinking whenever articles get promoted to GA status, rather than just put the tag up, we make a section of the talk page on the article and copy all the criteria into it. Then we can go through each criteria one by one, listing how we thought the article met each category, naming any specifics, and noting any flaws we saw in the article under each category, whether or not it made it unable to have GA status.”
— Homestarmy
The next day the GA page was updated to reflect this new assessment process, and the nominations procedure was extended to all nominations, not just self-nominations.
From there on the nomination page continued to grow. The first concerns over the backlog were raised in late 2006 and early 2007, when the nomination queue hovered around 140 unreviewed nominations. In May 2007, the first backlog elimination drive was held, lasting three weeks. The drive saw a reduction in the backlog from 168 to just 77 articles. This did not last, however, with the backlog jumping back up to 124 a week later. The next backlog drive was held the next month, from 10 July to 14 August, with 406 reviews completed, but with a net backlog reduction of just 50, leaving 73 articles still needing review. Another drive planned for September was canceled due to perceived editor fatigue. Backlog elimination drives have been held at irregular intervals ever since, with the most recent during August 2016. These drives were "moderately successful", to quote a 2015 Signpost op-ed by Figureskatingfan:
“The huge queue at GAN prevents articles from becoming the best examples of the best writing, research, and information on Wikipedia. To be honest, we haven't always been successful. Most of our attempts, like the now-defunct GA Recruitment Centre, which tried not only to train new GA reviewers but to retain editors (another chronic issue for Wikipedia), didn't succeed. Even our periodic backlog drives have been only moderately successful.”
With a looming backlog of more than 450 unreviewed articles by August 2014, a new solution was sought: the GA Cup. Figureskatingfan, who co-founded the cup with Dom497, writes of its creation:
I was in Washington, D.C., at the Wikipedia Workshop Facilitator Training in late August 2014. While I was there, I was communicating through Messenger with another editor, Dom497. We were discussing a long-standing challenge for WikiProject Good Articles—the traditionally long queue at GAN. Dom was a long-time member of the GA WikiProject. This impressive young man created several projects to encourage the reviewing of GAs, most of which I supported and participated in, but they all failed. I shared this dilemma with some of my fellow participants at the training, and in the course of the discussion, it occurred to me: Why not follow the example of the wildly successful and popular WikiCup, and create a tournament-based competition encouraging the review of GAs, but on a smaller scale, at least to start?
I was literally on the way to the airport on my way home, discussing the logistics of setting up such a competition with Dom. By the time I got home, we had set up a preliminary scoring system and Dom had created the pages necessary. We brought up our idea at the WikiProject, and most expressed their enthusiastic support. We recruited two more judges, and conducted our first competition beginning in October 2014.
A history of the backlog
Over the last nine years, the GAN backlog has grown by about three nominations per month on average (the solid blue line above). Backlog levels are almost never stable: large trends often carry the backlog above and below the regression line. These trends also have their own fluctuations, with local peaks and valleys along an otherwise upward or downward trend. What causes these fluctuations? For the three declines after 2014, the answer is relatively simple: the GA Cup. But what about the earlier declines?
The most obvious hypothesis is that the drops coincide with the backlog elimination drives, but this is not sufficient. While most backlog drives coincide with steep drops in the backlog, the ones that do are clustered towards the early years of GAN, before it was as popular as it is now. It is easier to make significant dents in the backlog when only a couple of nominations are coming in per day than when ten or more are coming in. Indeed, the last three backlog drives had a marginal impact, if any. More obviously, not all drops in the backlog stem from backlog elimination drives. Take, for instance, the reduction in the backlog in mid-2008: a reduction of 100 nominations without any backlog drive taking place. Similar reductions occurred three times in 2013. In fact, the opposite effect has also been seen: the two most recent backlog drives seemingly occurred during natural backlog reductions, and didn't accelerate things by much. If elimination drives are not, taken together, the sole cause at play, there must be some more fundamental cause that accounts for all the reductions seen.
A better explanation comes from the field of finance: the idea of support and resistance in stock prices. For a stock, there is a price that is hard to rise above—a line of resistance—and a price that it is hard to fall below—a line of support. These phenomena are caused by the behavior of investors. When a stock price rises above a certain point, investors sell, causing the price to fall; conversely, when the price falls to a certain point, investors buy, causing the price to rise.
Does this apply to good article reviews as well? By analogy, imagine GA reviewers as investors and the backlog as a stock price. When the backlog rises to a certain point, GA reviewers collectively think the backlog is too large and so begin reviewing at a higher pace to lower it—a line of resistance. When the backlog falls to a certain point, reviewers slow down their pace or get burned out, causing the backlog to grow—a line of support. This makes intuitive sense. The impetus behind most backlog elimination drives is a group of reviewers thinking the backlog has grown too large. The backlog elimination drives then are just a more organized example of reviewers picking up their pace.
If this hypothesis is correct, then backlog reduction initiatives should be held during the low tide, encouraging weary reviewers, rather than during the high, when they are more likely to review nominations anyway, initiatives notwithstanding. But how can we tell where these lines of support exist and when the backlog is likely to bounce back? Economists and investors have found the moving average to be a useful tool in describing the lines of support and resistance in stock prices, so perhaps it can be useful here. In the graph above, the dashed, red line represents a 90-day simple moving average. It seems to capture the lines of support and resistance for the backlog well, as most local peaks tend to bounce off of it, but major trend changes pass through it.
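The moving average described above can be sketched in a few lines. This is only an illustration of a trailing simple moving average, not the script used to draw the graph: the function name `simple_moving_average` and the sample backlog counts are hypothetical, and a window of 3 stands in for the article's 90 days to keep the example short.

```python
from collections import deque


def simple_moving_average(values, window):
    """Return the trailing simple moving average of `values`.

    The i-th result is the mean of the `window` most recent observations up
    to and including values[i]. Positions that do not yet have a full window
    are reported as None.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    recent = deque()
    running_total = 0.0
    for value in values:
        recent.append(value)
        running_total += value
        if len(recent) > window:
            # Drop the observation that just fell out of the window.
            running_total -= recent.popleft()
        averages.append(running_total / window if len(recent) == window else None)
    return averages


# Hypothetical daily backlog counts, invented purely for illustration.
backlog = [100, 104, 102, 110, 108, 112]
sma = simple_moving_average(backlog, window=3)
```

With real data, the same function would be called with `window=90` on the daily backlog counts; days on which the backlog touches the average and turns back would be candidate points of support or resistance.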
An example of the utility of this theory can be seen in early 2009. The backlog began to fall naturally in January, but was about to hit a line of support that might have caused the upward trend to resume. However, a backlog drive took place in February, causing an even steeper decline in the backlog and pushing it past that line. Unfortunately, the full impact of this cannot be understood, as the data for April to November 2009 were never recorded by the GA Bot.
The impact of the GA Cup
After almost a year of no backlog drives in 2013, followed by two rather unsuccessful ones, the GA Cup was started. Over the past two years, three GA Cups have been run, all with robust participation and significant reductions in nominations outstanding. But is the cup succeeding? To answer that question I looked at the daily rates of new nominations, closed nominations, nominations passed, and nominations failed during each of the GA Cups and compared them to the rates before and after the first GA Cup.
The presence of a reduction in the backlog is obvious: each cup correlates with a steep drop in the number of nominations, the most effective being the third GA Cup, which concluded on 30 June this year. That cup reduced the backlog by about two nominations per day and completed 92 more nominations than the first GA Cup, despite being significantly shorter. The third GA Cup was lauded as a success.
Yet in late April, the backlog reduction began to stagnate. The number of nominations added remained relatively stable over this period, but the number of nominations being completed dropped. In early May the backlog began to rise, crossing over the line of resistance in the process, and so began to shrink again towards the end of May, with a distinct downward trend by June.
Ultimately, the best way to conceptualize the GA review backlog is to borrow another concept from finance: it is a mismatch between the "supply" of reviewers and the "demand" for reviews. The number of nominations (the demand) is relatively consistent, at about 10 nominations per day. There is a mild decrease in the rate of nominations (the daily rate falls by one nomination every two years), but, all in all, it is relatively stable.
Measuring supply is more difficult. The change in the backlog is equal to the number of nominations added minus the number of reviews opened, so if the average demand is 10 nominations, and the average supply of reviews is 0, then the backlog would grow by 10 nominations each day; if the supply were 5, it would grow by 5. That means the average number of nominations minus the average number of reviews equals the average change in the backlog. Since the average change in the backlog (the linear regression) and the average number of nominations are both known, the average supply can easily be calculated. It turns out to be about six per day. Taken in combination with the aforementioned demand, this shows a net increase in the backlog of four nominations each day. And since this analysis includes the periods when a GA Cup was running, the backlog is actually increasing at an even higher rate whenever a Cup isn't active!
The number of open reviews does not inspire much confidence either. It drops dramatically after each GA Cup, likely due to participant burnout. Interestingly, the number of open reviews also drops before the GA Cup, causing a counterproductive uptick in the backlog. In fact, the drop just before this year's cup coincided with the announcement of the event's competition date a month prior to its start. This announcement came at a time when the number of reviews was increasing and the backlog was naturally starting to decline.
All told, these are not fatal flaws, as the GA Cup is succeeding despite them in other ways. Most obviously, the backlog has been decreasing during cups, and review quality doesn't seem to decline either. Comparing the five months before the first GA Cup with the four months during it, there is no significant difference in pass rates (t(504.97) = -1.788, p = 0.07). In fact, the pass rate may have decreased slightly, from 85% beforehand to 82% during the cup, and because the p-value is close to significance, the idea that GA Cup reviewers are more stringent may be worth examining further.
This is not to say that there is no other way to examine review quality. Reasonable minds can disagree on how well this metric describes the quality of reviews, and concerns about review quality have been raised a number of times, but it is a reasonable starting point for this analysis. We now know that the GA Cup does not lead to "drive-by" passes, and that any problems with unfit articles passing or fit articles failing are occurring at about the same rate as normal. Hopefully, then, any solutions can be more general, improving the quality of all reviews, rather than specific to the GA Cup.
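The arithmetic in this paragraph can be written out as a small sketch. The function names are hypothetical and the inputs are simply the round figures quoted in the paragraph (a demand of 10 nominations per day and a net change of 4 per day), not recomputed values.

```python
def daily_backlog_change(nominations_per_day, reviews_per_day):
    """Net change in the backlog per day: nominations added minus reviews opened."""
    return nominations_per_day - reviews_per_day


def average_supply(nominations_per_day, backlog_change_per_day):
    """Average reviews per day implied by rearranging the identity above."""
    return nominations_per_day - backlog_change_per_day


# The two illustrative cases from the text: no reviews at all, then 5 per day.
growth_without_reviews = daily_backlog_change(10, 0)
growth_with_five_reviews = daily_backlog_change(10, 5)

# Figures as quoted in the text (assumed here, not derived from raw data):
# about 10 nominations per day and a net backlog change of about 4 per day.
implied_supply = average_supply(10, 4)
```

The point of the rearrangement is that supply never has to be counted directly: once demand and the net trend are known, it falls out by subtraction.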
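The fractional degrees of freedom in the reported statistic suggest an unequal-variance (Welch's) t-test, although the article does not name the test. Under that assumption, the statistic and its degrees of freedom could be computed roughly as follows. The function name `welch_t` and the two small samples are hypothetical; they are not the pass-rate data from the analysis, and the p-value step is left out because it needs a t-distribution function that the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test.

    t is the difference in sample means divided by its standard error, and
    df is the Welch-Satterthwaite approximation to the degrees of freedom,
    which is generally not a whole number.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical samples, invented purely to exercise the function.
before = [1, 2, 3, 4, 5]
during = [2, 4, 6, 8, 10]
t_stat, deg_freedom = welch_t(before, during)
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.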
Conclusions
The GA Cups have been effective at encouraging editors to complete GA reviews. Their effect on the cause of the backlog, on the other hand, is less clear. Long-lasting backlog reductions require a nuanced approach: recruiting more reviewers, finding the correct timing, and giving proper encouragement. The GA Cup is arguably already successful at encouragement, but that does not mean the other two aspects cannot be improved as well.
The GA Cup has so far been held at times when reviewers were already increasing their efforts to reduce the backlog, and the announcement of the third GA Cup, for instance, caused these efforts to stagnate. By allowing these natural reductions to take place, and then holding the GA Cup when editors get burnt out, we can leverage the GA Cup's morale boost to help reduce backlogs even further.
Furthermore, while there was no good way to analyze how well the GA Cup recruits new reviewers, anecdotally it seems to do so. Bringing in new reviewers when the regulars are getting burnt out would reduce the backlog rebound in the short term, and may lead to an increase in the number of regular reviewers in the long term.
The organizers of the GA Cup understand that what is most needed is more reviews and more reviewers, both of which the GA Cup has done an admirable job of attracting. The third GA Cup has been the most successful so far, and hopefully the next cup will surpass it in all metrics.
Discuss this story
The results of gamification
The assertion that [...] has no basis in fact. It may be true that insufficient reviews occur at the same rate so the Cup doesn't encourage the practice but let's remember that the number of bad reviews is increasing at the same time reviews, generally, are increasing. Doing GA reviews sucks because it's actual work; I have more fun doing GOCE drives. I think the Good Article WikiProject is key to objectively improving content whereas GOCE is by-and-large just fixing word salad, which almost anyone can do. Efforts like the GA Cup are our collective means of putting these articles to stringent standards. GA status is often, though not always, a precursor to pursuit of A-class or FA. I remain concerned that these contests (of which I am a part currently) attract editors who are still unfamiliar with proper reviewing. It's demoralizing to see bad reviews done, especially when you're competing for points. WikiProject Articles for Creation had 8 drives since 2012. The last drive saw a lot of poorly done draft reviews and the results were so skewed that the WikiProject hasn't held another drive since 2014. I would hate to have these drives ruined by bad editing and we can only rely on the judges of the competition to stay alert to malfeasance. Chris Troutman (talk) 21:17, 26 November 2016 (UTC)
I don't really buy into the "gamification" of this (and various similar "challenges", "drives", etc.). Maybe it really does motivate a few people, but not everyone feels competitive about this stuff. The very nature of GA, FA, DYK, ITN, etc., as "merit badges" for editors to "earn", and the drama surrounding that, led to a rancorous ArbCom case recently, and cliquish behavior at FAC has generated further pointless psychodramatics. We really need to focus on the content and improving it for readers, not on the internal wikipolitics of labels, badges, and acceptance into politicized editorial camps.
It might be more practical and productive to have a 100-point (or whatever) scale and grade articles on it to a fixed and extensive set of criteria, with FA, GA, A-class, B, C, Start, and Stub all assigned as objectively as possible based on level of compliance with these criteria (and resolving the tension of exactly what A-Class is in this scheme, which seems to vary from "below GA" to "between GA and FA" to "FA+" to "totally unrelated to GA or FA"). There are a quite a number of GA, A and probably even FA quality articles that have no such assessments, because their principal editors just don't care about (or actively don't care for) the politics and entrenched personality conflicts of our article assessment processes as they presently stand. I, for one, will probably never attempt to promote an article to FA myself directly, because of the poisonous atmosphere at FAC (which is now an order of magnitude worse than it was when I first came to that conclusion several years ago). I guess the good news is I'll have more time for GA work. :-) The more that FA, and some of the more rigid and too-few-participants A-class processes, start to work like GA historically has, the better. If, as Kaldari suggests below, the opposite is happening, with GA sliding toward FA-style "our way or the highway" insularity, then you can expect negative results and declining participation. — SMcCandlish ☺ ☏ ¢ ≽ʌⱷ҅ᴥⱷʌ≼ 09:48, 2 December 2016 (UTC)
Age of nominations
This is very interesting read! One thing that I was looking for here that didn't get discussed was the effect of the Cup on the age of nominations - are reviews now sitting in the queue for less time than they were before these competitions started? (For those who don't know, I'm a judge in the Cup, after having competed in it the first year.)--3family6 (Talk to me | See what I have done) 05:13, 27 November 2016 (UTC)
Reviewer burnout
I took part in the first GA cup. It was a new idea with a good purpose and I felt I wasn't pulling my weight by putting more GA nominations on the pile than reducing the backlog by reviewing. Towards the end of the cup, I got burned out and reduced my activity; I still do the odd review but not as many as I used to. I know some other GA stalwarts have also stopped reviews. How can we reach out to these people and get them to participate in reviews again? Ritchie333 (talk) (cont) 14:59, 27 November 2016 (UTC)
Great article
I just wanted to say that this was a really interesting read. As someone who wasn't around for the early days of the project I'd love to learn more about how some of the other now well-established processes came to be. Sam Walton (talk) 16:13, 28 November 2016 (UTC)