
Simon Willison’s Weblog

Crowdsourced document analysis and MP expenses

As you may have heard, the UK government released a fresh batch of MP expenses documents a week ago on Thursday. I spent that week working with a small team at Guardian HQ to prepare for the release. Here’s what we built:

http://mps-expenses2.guardian.co.uk/

It’s a crowdsourcing application that asks the public to help us dig through and categorise the enormous stack of documents: around 30,000 pages of claim forms, receipts and hand-written letters, all scanned and published as PDFs.

This is the second time we’ve tried this—the first was back in June, and can be seen at mps-expenses.guardian.co.uk. Last week’s attempt was an opportunity to apply the lessons we learnt the first time round.

Writing crowdsourcing applications in a newspaper environment is a fascinating challenge. Projects have very little notice: I heard about the new document release the Thursday before, giving us less than a week to put everything together. In addition to the fast turnaround for the application itself, the 48 hours following the release are crucial. The news cycle moves fast, so if the application launches but we don’t manage to get useful data out of it quickly, the story will move on before we can have an impact on it.

ScaleCamp on the Friday meant that development work didn’t properly kick off until Monday morning. The bulk of the work was performed by two server-side developers, one client-side developer, one designer and one QA on Monday, Tuesday and Wednesday. The Guardian operations team deftly handled our EC2 configuration and deployment, and we had some extra help on the day from other members of the technology department. After launch we also had a number of journalists helping highlight discoveries and dig through submissions.

The system was written using Django, MySQL (InnoDB), Redis and memcached.

Asking the right question

The biggest mistake we made the first time round was that we asked the wrong question. We tried to get our audience to categorise documents as either “claims” or “receipts” and to rank them as “not interesting”, “a bit interesting”, “interesting but already known” and “someone should investigate this”. We also asked users to optionally enter any numbers they saw on the page as categorised “line items”, with the intention of adding these up later.

The line items, with hindsight, were a mistake. 400,000 documents makes for a huge amount of data entry and for the figures to be useful we would need to confirm their accuracy. This would mean yet more rounds of crowdsourcing, and the job was so large that the chance of getting even one person to enter line items for each page rapidly diminished as the news story grew less prominent.

The categorisations worked reasonably well but weren’t particularly interesting—knowing if a document is a claim or receipt is useful only if you’re going to collect line items. The “investigate this” button worked very well though.

We completely changed our approach for the new system. We dropped the line item task and instead asked our users to categorise each page by applying one or more tags, from a small set that our editors could control. This gave us a lot more flexibility (we changed the tags shortly before launch based on the characteristics of the documents) and had the potential to be a lot more fun as well. I’m particularly fond of the “hand-written” tag, which has highlighted some lovely examples of correspondence between MPs and the expenses office.

Sticking to an editorially assigned set of tags provided a powerful tool for directing people’s investigations, and also ensured our users didn’t start creating potentially libellous tags of their own.

Breaking it up in to assignments

For the first project, everyone worked together on the same task to review all of the documents. This worked fine while the document set was small, but once we had loaded in 400,000+ pages the progress bar became quite depressing.

This time round, we added a new concept of "assignments". Each assignment consisted of the set of pages belonging to a specified list of MPs, documents or political parties. Assignments had a threshold, so we could specify that a page must be reviewed by at least X people before it was considered reviewed. An editorial tool let us feature one "main" assignment and several alternative assignments right on the homepage.
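As a rough illustration of the threshold idea (not the production code, which kept this data in MySQL), the sketch below tracks who has reviewed each page and reports progress for the progress bar. The Assignment class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Assignment:
    """Hypothetical model: a set of pages plus a review threshold."""
    page_ids: set
    threshold: int = 1
    # page_id -> set of usernames who have reviewed that page
    reviews: dict = field(default_factory=dict)

    def record_review(self, page_id, username):
        # Ignore pages outside this assignment; count each user at most once.
        if page_id in self.page_ids:
            self.reviews.setdefault(page_id, set()).add(username)

    def is_reviewed(self, page_id):
        # A page counts as reviewed once enough distinct people have seen it.
        return len(self.reviews.get(page_id, ())) >= self.threshold

    def progress(self):
        """Fraction of pages that have met the threshold."""
        if not self.page_ids:
            return 1.0
        done = sum(1 for p in self.page_ids if self.is_reviewed(p))
        return done / len(self.page_ids)
```

Counting distinct usernames rather than raw submissions means one enthusiastic user cannot push a page over the threshold alone.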

Clicking “start reviewing” on an assignment sets a cookie for that assignment, and adds the assignment’s progress bar to the top of the review interface. New pages are selected at random from the set of unreviewed pages in that assignment.

The assignments system proved extremely effective. We could use it to direct people to the highest value documents (our top hit list of interesting MPs, or members of the shadow cabinet) while still allowing people with specific interests to pick an alternative task.

Get the button right!

Having run two crowdsourcing projects I can tell you this: the single most important piece of code you will write is the code that gives someone something new to review. Both of our projects had big “start reviewing” buttons. Both were broken in different ways.

The first time round, the mistakes were around scalability. I used a SQL “ORDER BY RAND()” statement to return the next page to review. I knew this was an inefficient operation, but I assumed that it wouldn’t matter since the button would only be clicked occasionally.

Something like 90% of our database load turned out to be caused by that one SQL statement, and it only got worse as we loaded more pages in to the system. This caused multiple site slow downs and crashes until we threw together a cron job that pushed 1,000 unreviewed page IDs in to memcached and made the button pick one of those at random.

This solved the performance problem, but meant that our user activity wasn’t nearly as well targeted. For optimum efficiency you really want everyone to be looking at a different page—and a random distribution is almost certainly the easiest way to achieve that.
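The memcached workaround can be sketched roughly as follows. This is an illustration under assumptions: a plain dict stands in for the memcached client, and the key name and function names are invented for the example.

```python
import random

POOL_KEY = "unreviewed-pool"  # hypothetical cache key
POOL_SIZE = 1000


def refresh_pool(cache, unreviewed_ids, pool_size=POOL_SIZE):
    """Run periodically (e.g. from cron): cache a bounded pool of unreviewed IDs."""
    cache[POOL_KEY] = list(unreviewed_ids)[:pool_size]


def next_page_id(cache):
    """Called by the button: pick from the cached pool, never from the database."""
    pool = cache.get(POOL_KEY) or []
    return random.choice(pool) if pool else None
```

The weakness is visible in the sketch: until the next refresh, every visitor draws from the same 1,000 IDs, so reviews pile up on that pool instead of spreading across the whole document set.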

The second time round I turned to my new favourite in-memory data structure server, Redis, and its SRANDMEMBER command (a feature I requested a while ago with this exact kind of project in mind). The system maintains a Redis set of all IDs that needed to be reviewed for an assignment to be complete, and a separate set of IDs of all pages that had been reviewed. It then uses Redis set difference (the SDIFFSTORE command) to create a set of unreviewed pages for the current assignment and then SRANDMEMBER to pick one of those pages.
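A minimal sketch of that selection step, assuming hypothetical key names. SDIFFSTORE stores the difference between the first set and the others, which here means "pages in the assignment minus pages already reviewed". So that the example runs without a server, FakeRedis is a tiny in-memory stand-in that mimics only the three redis-py calls used (sadd, sdiffstore and srandmember); a real client could be passed in its place.

```python
import random


class FakeRedis:
    """In-memory stand-in for the three set commands used below."""

    def __init__(self):
        self.sets = {}

    def sadd(self, key, *members):
        self.sets.setdefault(key, set()).update(members)

    def sdiffstore(self, dest, keys):
        # Store (first set minus all the others) under dest.
        first, *rest = keys
        result = set(self.sets.get(first, set()))
        for key in rest:
            result -= self.sets.get(key, set())
        self.sets[dest] = result
        return len(result)

    def srandmember(self, key):
        members = self.sets.get(key)
        return random.choice(list(members)) if members else None


def next_unreviewed(r, assignment_id):
    # Key names are assumptions made for this sketch.
    todo = f"assignment:{assignment_id}:pages"
    done = "pages:reviewed"
    unreviewed = f"assignment:{assignment_id}:unreviewed"
    r.sdiffstore(unreviewed, [todo, done])  # unreviewed = todo - done
    return r.srandmember(unreviewed)        # None once nothing is left
```

With a real Redis server, members come back as bytes unless the client is configured to decode responses, so the caller would need to convert the result to a page ID.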

This is where the bug crept in. Redis was just being used as an optimisation—the single point of truth for whether a page had been reviewed or not stayed as MySQL. I wrote a couple of Django management commands to repopulate the denormalised Redis sets should we need to manually modify the database. Unfortunately I missed some—the sets that tracked what pages were available in each document. The assignment generation code used an intersection of these sets to create the overall set of documents for that assignment. When we deleted some pages that had accidentally been imported twice I failed to update those sets.

This meant the “next page” button would occasionally turn up a page that didn’t exist. I had some very poorly considered fallback logic for that—if the random page didn’t exist, the system would return the first page in that assignment instead. Unfortunately, this meant that when the assignment was down to the last four non-existent pages every single user was directed to the same page—which subsequently attracted well over a thousand individual reviews.

Next time, I’m going to try and make the “next” button completely bulletproof! I’m also going to maintain a “denormalisation dictionary” documenting every denormalisation in the system in detail; such a thing would have saved me several hours of confused debugging.
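One possible shape for a safer fallback, offered as a sketch rather than a tested design: when a picked ID turns out to be stale, discard it from the candidate set and pick again, instead of sending everyone to a fixed page. Here page_exists is a hypothetical callback standing in for a database lookup.

```python
import random


def pick_next_page(candidate_ids, page_exists, max_attempts=10):
    """Return a random existing page ID, pruning stale IDs as they are found.

    candidate_ids: mutable set of unreviewed page IDs (may contain stale ones).
    page_exists: hypothetical callable, True if the ID is still in the database.
    Returns None when nothing valid is found, so the caller can show a
    "nothing left to review" message rather than a single shared page.
    """
    for _ in range(max_attempts):
        if not candidate_ids:
            return None
        page_id = random.choice(list(candidate_ids))
        if page_exists(page_id):
            return page_id
        candidate_ids.discard(page_id)  # stale entry: remove it and retry
    return None
```

Bounding the number of attempts keeps a badly out-of-sync set from turning one button click into an unbounded run of lookups.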

Exposing the results

The biggest mistake I made last time was not getting the data back out again fast enough for our reporters to effectively use it. It took 24 hours from the launch of the application to the moment the first reporting feature was added—mainly because we spent much of the intervening time figuring out the scaling issues.

This time we handled this a lot better. We provided private pages exposing all recent activity on the site. We also provided public pages for each of the tags, as well as combination pages for party + tag, MP + tag, document + tag, assignment + tag and user + tag. Most of these pages were ordered by most-tagged, with the hope that the most interesting pages would quickly bubble to the top.

This worked pretty well, but we made one key mistake. The way we were ordering pages meant that it was almost impossible to paginate through them and be sure that you had seen everything under a specific tag. If you’re trying to keep track of everything going on in the site, reliable pagination is essential. The only way to get reliable pagination on a fast moving site is to order by the date something was first added to a set in ascending order. That way you can work through all of the pages, wait a bit, hit “refresh” and be able to continue paginating where you left off. Any other order results in the content of each page changing as new content comes in.

We eventually added an undocumented /in-order/ URL prefix to address this issue. Next time I’ll pay a lot more attention to getting the pagination options right from the start.
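The ordering argument can be illustrated with a small sketch. The paginate helper and the page names are invented for the example; the point is only that an oldest-first list grows at the end, so pages you have already read never change.

```python
def paginate(items, page_number, per_page=2):
    """Return one page of a list (page_number starts at 1)."""
    start = (page_number - 1) * per_page
    return items[start:start + per_page]


# Pages tagged so far, oldest first: new arrivals only ever append.
in_order = ["p1", "p2", "p3"]
page_one_before = paginate(in_order, 1)
in_order.append("p4")  # a new page is tagged while we are reading
page_one_after = paginate(in_order, 1)
# Page 1 is unchanged; the newcomer shows up at the end of the last page.

# By contrast, a newest-first ordering shifts every page when content arrives.
newest_first_before = paginate(["p3", "p2", "p1"], 1)
newest_first_after = paginate(["p4", "p3", "p2", "p1"], 1)
```

The same shifting happens with a most-tagged ordering, since any new tag can move an item between pages while someone is reading.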

Rewarding our contributors

The reviewing experience the first time round was actually quite lonely. We deliberately avoided showing people how others had marked each page because we didn’t want to bias the results. Unfortunately this meant the site felt like a bit of a ghost town, even when hundreds of other people were actively reviewing things at the same time.

For the new version, we tried to provide a much better feeling of activity around the site. We added “top reviewer” tables to every assignment, MP and political party as well as a “most active reviewers in the past 48 hours” table on the homepage (this feature was added to the first project several days too late). User profile pages got a lot more attention, with more of a feel that users were collecting their favourite pages in to tag buckets within their profile.

Most importantly, we added a concept of discoveries—editorially highlighted pages that were shown on the homepage and credited to the user that had first highlighted them. These discoveries also added valuable editorial interest to the site, showing up on the homepage and also the index pages for political parties and individual MPs.

Light-weight registration

For both projects, we implemented an extremely light-weight form of registration. Users can start reviewing pages without going through any signup mechanism, and instead are assigned a cookie and an anon-454 style username the first time they review a document. They are then encouraged to assign themselves a proper username and password so they can log in later and take credit for their discoveries.

It’s difficult to tell how effective this approach really is. I have a strong hunch that it dramatically increases the number of people who review at least one document, but without a formal A/B test it’s hard to tell how true that is. The UI for this process in the first project was quite confusing—we gave it a solid makeover the second time round, which seems to have resulted in a higher number of conversions.
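The registration flow might look something like the sketch below. Everything in it is an assumption made for illustration: dicts stand in for the request cookies and the users table, the cookie name is invented, and a counter stands in for whatever generated the anon-454 style numbers.

```python
import itertools

_counter = itertools.count(1)  # stand-in for a database sequence


def get_or_create_reviewer(cookies, users):
    """Return this visitor's username, creating an anonymous one if needed.

    cookies: dict standing in for the request cookies (mutated to "set" one).
    users: dict of username -> profile, standing in for the users table.
    """
    username = cookies.get("reviewer")
    if username in users:
        return username
    username = f"anon-{next(_counter)}"
    users[username] = {"registered": False}
    cookies["reviewer"] = username
    return username


def claim_account(users, cookies, new_username):
    """Upgrade an anonymous account, keeping its existing profile attached."""
    if new_username in users:
        raise ValueError("username already taken")
    profile = users.pop(cookies["reviewer"])
    profile["registered"] = True
    users[new_username] = profile
    cookies["reviewer"] = new_username
    return new_username
```

Because the anonymous profile is renamed rather than replaced, anything the visitor did before signing up stays credited to them afterwards.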

Overall lessons

News-based crowdsourcing projects of this nature are both challenging and an enormous amount of fun. For the best chances of success, be sure to ask the right question, ensure user contributions are rewarded, expose as much data as possible and make the “next thing to review” behaviour rock solid. I’m looking forward to the next opportunity to apply these lessons, although at this point I really hope it involves something other than MPs’ expenses.

This is Crowdsourced document analysis and MP expenses by Simon Willison, posted on 20th December 2009.


45 comments

  1. link to mp-expenses2 points to mp-expenses1

    djerdo - 20th December 2009 13:13 - #

  2. Oops, thanks - fixed now.

    Simon Willison - 20th December 2009 14:57 - #

  3. Thanks for taking the time to write up. I love to hear the inside story.

    James - 20th December 2009 15:06 - #

  4. Thanks for this interesting read!

    jack - 20th December 2009 18:46 - #

  5. Very interesting write-up. I tried it out and it's very easy to use, and very quick. A user would have no idea of how much is happening behind the scenes. The light-weight registration is excellent, and I'm certainly an example of a user who wouldn't have tagged any pages at all if presented with an account creation page first. Instead I found it mildly addictive and did 50 or so. I even added a previously only "not interesting" handwritten letter to your favourite tag cloud.

    On the page viewer, I do wonder how many click the "not interesting" button by accident, it being the only button-shaped object on the page, and located right next to the tags you'd want to submit. I found myself automatically moving my pointer over toward it after clicking some tags.

    I wouldn't have expected it, but it's an interesting downside to using links as buttons, when one of them is contained in its own box and the other is amidst a block of text / checkboxes. But I might be more conditioned than average to expect a square button after a set of checkmark boxes. Is the "Not Interested" link acting as a form submission, allowing you to check how many people have submitted "not interesting" along with a collection of tags?

    Matt S. - 21st December 2009 17:42 - #

  6. Matt S - very good point about the "Not interesting" button. That's one of the most interesting things about crowdsourcing - the tiniest design detail can have a huge impact on the overall result.

    Not Interesting is indeed applied as a tag, but it's not visible by default. You can append /not-interesting/ to various URLs around the site to see all of the things in that context that were marked that way.

    Simon Willison - 22nd December 2009 00:55 - #

  7. Would it make sense to create a second table in MySQL that has a normal auto increment id column, and a unique random number column? You could then fetch them one at a time and use a join to get the corresponding info.

    That way you keep all this logic within MySQL, which might prevent some of the synchronisation issues you describe.

    Not sure if it is fast.

    Good luck and keep up the good work!

    Sjors Provoost - 24th December 2009 13:40 - #

  8. is the dataset available for download?

    poftp - 25th December 2009 00:04 - #

  9. Your time taken to write of the lesson learnt is much appreciated. I'm particularly interested in sprint type development and look forward to browsing your blog further.

    Phoebe Bright - 30th January 2010 20:46 - #

  10. Interesting, especially about the denormalisation issues. Was there a reason you didn't use a document oriented database such as couchdb or mongodb for the underlying datastore?

    It just seems to me that if these tools are useful at all, the example of lots of discrete documents each with metadata about them is about as close to non-relational as you could get?

    Andrew Phillipo - 16th February 2010 11:53 - #

  11. Yup - we were optimising for speed of development, and using a document oriented database would have slowed us down. Django's ORM (and its integration with things like the Django form framework) was key to turning around the project in the time that we did. If we had a lot more experience with Couch or Mongo this might have not been the case, but my experience with alternative data stores is that they usually throw up hiccups due to their poor support for completely ad-hoc queries compared to SQL.

    Simon Willison - 16th February 2010 13:10 - #

  12. You are basically doing a full table sort each time you are retrieving a single random row. That's just wrong.

    You don't need to use Redis for this. http://gist.github.com/620181

    johno - 11th October 2010 08:38 - #


Comments are closed.