By: Chris Orwa
As part of our Crowdsourcing Validation project, iHub Research collected over 2.5 million tweets related to the polls during the Kenyan general elections in March 2013. We are now crunching the data to profile the tweeps who are most likely to report an incident on Twitter.
In our data-mining project, we defined ‘useful information’ as that which provides situational awareness of poll-related incidents. We used human judgment to mark each tweet as TRUE, meaning useful information, or FALSE, meaning spam. We are building a classification model that predicts whether a tweet is true or false and, in turn, lets us learn the structure of that decision.
The first step is extracting features from each tweet, such as word count, presence of a link, presence of a mention, whether it is geo-tagged, the tweep's number of followers, Klout score, location, Twitter age, etc. Eighty-two features were available, broadly categorized as either ‘user-based features’, which describe a Twitter user, or ‘tweet-based features’, which describe a tweet. A feature-selection algorithm, in this case the Information Gain Filter, determines which features are ‘important’. Below is a frequency distribution of 22 selected features. Note the normal distribution of the word count feature, which makes it statistically easy to analyze. The visualization gives an insight into how well each feature is likely to perform at classification.
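To make the feature-selection step concrete, here is a minimal Python sketch of how information gain can rank features. It is not the WEKA filter we ran: the function names, the three features and the four toy tweets are all made up for illustration, and only boolean features are scored (a numeric feature such as word count would first need to be split into bins).

```python
import math
import re
from collections import Counter


def extract_features(text):
    """Derive a small, illustrative subset of tweet-based features."""
    return {
        "word_count": len(text.split()),
        "has_link": bool(re.search(r"https?://\S+", text)),
        "has_mention": bool(re.search(r"@\w+", text)),
    }


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical annotated tweets: (text, TRUE/FALSE label). Not real data.
tweets = [
    ("Gunshots heard near polling station in Kisumu", True),
    ("Long queues but calm at our polling station http://example.com/pic", True),
    ("@friend who are you voting for?", False),
    ("@someone lol this election coverage", False),
]

features = [extract_features(text) for text, _ in tweets]
labels = [label for _, label in tweets]

for name in ("has_link", "has_mention"):
    gain = information_gain([f[name] for f in features], labels)
    print(name, round(gain, 3))
```

A feature with a higher gain separates TRUE from FALSE tweets more cleanly, which is the sense in which the filter calls a feature ‘important’. In this toy set, has_mention splits the labels perfectly and so scores higher than has_link.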
Over 50,000 tweets were annotated as true or false and fed into the WEKA machine learning software. Using WEKA’s J48 classification algorithm, we developed the decision tree below. This is where it gets interesting: only six features end up being chosen by the classification algorithm.
This decision tree shows that if a tweet comes from a verified account, that user will for the most part (85% of the time) not report an ‘incident’ on Twitter. If the tweet is from an ‘unverified’ account and mentions someone, it will most likely not be an incident report.
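The two branches described above can be read as plain if/else rules. The sketch below encodes only those two branches; the function and parameter names are hypothetical, and the rest of the tree (which uses the other selected features) is deliberately left undecided because it is not spelled out here.

```python
def predict_incident(is_verified, has_mention):
    """Apply the two decision-tree branches described in the post.

    Returns False when the tweet is most likely not an incident report,
    and None when the outcome depends on deeper branches of the tree.
    """
    if is_verified:
        # Verified accounts did not report incidents about 85% of the time.
        return False
    if has_mention:
        # Unverified account, but the tweet mentions someone.
        return False
    # Unverified and no mention: decided by features not covered in this sketch.
    return None


print(predict_incident(is_verified=True, has_mention=False))
print(predict_incident(is_verified=False, has_mention=True))
print(predict_incident(is_verified=False, has_mention=False))
```

Returning None for the last case, rather than guessing TRUE, keeps the sketch faithful to what the tree description actually states.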
The next phase of our data-mining project involves studying the modes of communication on Twitter during elections. The model above suggests that people only use the broadcast mode of communication to report incidents, but an alternative model built with a Random Tree algorithm illustrates that a number of people prefer “multicast” (mentioning influential Twitter handles) when reporting election incidents. It is also worth noting that tweet-based features outranked user-based features, meaning that it is how you tweet, and not who you are, that matters during an election period.
These deductions are only valid for tweeting patterns during elections and cannot be extrapolated to general tweeting. Our study is ongoing and we will be sharing more insights. It appears that as the volume of data increases, inexorably, the proportion of it that people understand decreases. We hope our research will help people better understand data, especially in high volumes.