

iHub Research By Angela Okune / July 16, 2013

2.5 Million to 5,000 Tweets: Sifting Through the Noise


One of the first projects to be housed under iHub Research’s new Data Lab is our IDRC-funded research on Developing a Framework for the Viability of Election-centered Crowdsourcing. In the first phase of this research, we’ve built a Kenya-specific spam filter that sifts through crowdsourced data from the elections to pull “newsworthy” events out of raw Twitter data collected during the elections (March 3, 2013 – April 9, 2013).

We had initially set out to determine whether crowdsourced data has characteristics inherent within it that can help to validate the information. We found instead that, before even addressing the validation question, most news agencies and organisations need to grapple with the sheer volume of crowdsourced data (this is explained in greater detail in Patrick Meier's recent post, What is Big (Crisis) Data?). Much of the crowdsourced data is irrelevant noise, and if an organisation or individual has no capacity to sort the irrelevant from the relevant, using crowdsourced information becomes very difficult indeed.
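To make the scale problem concrete, a first pass at separating relevant from irrelevant content can be as simple as keyword matching. The sketch below is purely illustrative: the term list is a small, hypothetical sample based on the kinds of keywords mentioned later in this post, not the project's actual filter.

```python
# Minimal sketch of a keyword-based relevance filter for raw tweets.
# The term list is hypothetical and only for illustration.

ELECTION_TERMS = {"#kenyadecides", "kisumu", "mathare", "kawangware",
                  "@uhurukenyatta", "@railaodinga", "kill", "dead"}

def is_relevant(tweet_text: str) -> bool:
    """Return True if the tweet mentions any election-related term."""
    tokens = tweet_text.lower().split()
    return any(token.strip(".,!?") in ELECTION_TERMS for token in tokens)

tweets = [
    "Long queues reported in Kisumu this morning #KenyaDecides",
    "Just had the best coffee ever!",
]
relevant = [t for t in tweets if is_relevant(t)]
print(relevant)  # only the first tweet survives the filter
```

A filter this crude keeps far too much noise on its own (a tweet can contain "dead" and have nothing to do with the election), which is why the machine-learning step described below is needed on top of it.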

We experienced this challenge firsthand when we collected over 2.5 million tweets during the 2013 Kenyan elections. We used a third-party platform called DataSift to capture and store tweets matching Kenyan election-related keywords (e.g. kill, dead), user names (e.g. @UhuruKenyatta, @RailaOdinga), place names (e.g. Kawangware, Mathare, Kisumu), and hashtags (e.g. #KenyaDecides). Over the past weeks, we have used a variety of data mining and machine-learning techniques to filter out the irrelevant (non-'newsworthy') information. Through this process, we built a spam filter able to accurately boil down our data from 2.5 million tweets to 5,000 'newsworthy' tweets, each tied to a verifiable event or activity from the Kenyan elections.
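One common machine-learning technique for this kind of newsworthy-versus-noise filtering is a Naive Bayes text classifier. The following is a minimal, self-contained sketch under the assumption of a small hand-labelled training set; it is not the project's actual model, and the example tweets and labels are invented.

```python
# Illustrative sketch only: a tiny multinomial Naive Bayes classifier
# separating "newsworthy" tweets from noise. The training examples and
# labels below are invented for demonstration.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.class_counts = Counter(labels)      # label -> document count
        self.vocab = set()
        for text, label in zip(texts, labels):
            for w in tokenize(text):
                self.word_counts[label][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, text):
        scores = {}
        total = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior for the class
            score = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in tokenize(text):
                # Laplace-smoothed log likelihood of each word
                score += math.log((self.word_counts[label][w] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

train_texts = [
    "clashes reported near polling station in Mathare",
    "ballot boxes delayed in Kisumu county",
    "check out my new mixtape dropping friday",
    "good morning everyone have a blessed day",
]
train_labels = ["newsworthy", "newsworthy", "noise", "noise"]

clf = NaiveBayes().fit(train_texts, train_labels)
print(clf.predict("violence reported near polling station"))  # newsworthy
```

In practice a real filter of this kind would need far more labelled data and richer features (hashtags, user metadata, retweet structure) than this toy version shows.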


The implications of this work are significant. Building upon the superb work done (and being done) by Chato, Aditi, Patrick Meier, and others, we have developed a tool that media agencies and organisations can use in election scenarios (where there is a much higher likelihood of false information and rumours being spread) to quickly find the relevant information.

Now we are examining the filtered data to run cross-comparisons between different news sources (traditional media, Twitter, and Uchaguzi) and the types of information disseminated on each channel. We have also conducted 85 in-depth interviews with citizens in three hotspot locations to better understand the relationship between the online space and the on-the-ground 'reality'. We look forward to sharing more results soon.



Author : Angela Okune

Angela is Research Lead at iHub. She is keen on growing knowledge on the uptake and utility of ICTs in East Africa. She is also co-lead of Waza Experience, an iHub community initiative aimed at encouraging underprivileged children to explore innovation and entrepreneurship concepts grounded in real-world experience.

  • Athman Mohamed Athman Ali at 09:38:34AM Thursday, July 18, 2013

    Interesting article and initiative. The next phases you outline remind me of a web-app I once saw while visiting Sunlight Foundation in Washington DC… here’s a link (with your permission) that I am sure you may already be aware of, but if not, check it out as it could be useful.

  • Angela Crandall at 09:39:37AM Thursday, July 18, 2013

    Cool. Thanks for the share Athman; we’ll definitely check it out.

  • 3Vs Crowdsourcing Framework for Elections launched at 17:09:27PM Thursday, August 29, 2013

    […] to publish the results of our research on developing a Crowdsourcing Framework for Elections. Over the past 6 months, we have been looking at a commonly held assumption that crowdsourced information (collected from […]



