iHub Research By Angela Okune / July 16, 2013
Umati Project: Challenges of Capturing Relevant Data
As part of our Umati Project, which recently released its final report from Phase 1, we have been collecting online hate speech found on social media, forums, and online newspapers. The process was initially envisioned as automated, but it turned out to be quite manual because of the nuanced nature of the speech being collected and the lack of an available corpus of Kenyan hate speech.
The project therefore had to rely on 11 new media monitors to collect and code data. We had two sets of monitors, six for the weekday (Swahili/Sheng, Luo, Luhya, Kikuyu, Kalenjin, Somali; all monitored English) and five (all of the aforementioned languages except for Somali) for the weekend. Both groups worked from 8 am to 5 pm, with one hour allocated for lunch. The methodology is further detailed in the final report.
Several challenges emerged as a result of the manual data collection process. These included the possibility of misses and false alarms, as described by signal detection theory. In other words, we had to make sure that the monitors correctly categorized “dangerous speech”. We had to clean the data numerous times and still encountered errors in categorization each time.
The monitors also often displayed fatigue and varying levels of productivity as a result of task dullness. After hours on end of staring at a computer, it is no surprise that the monitors' productivity would fluctuate wildly, spiking and crashing.
Most of these challenges are inherent in using humans to collect online data systematically. Having noted them, we are interested in developing a more streamlined and efficient process for collecting online data systematically using machine learning and data mining techniques. The development of this tool will make up Phase 2 of the Umati Project. Phase 2 will use humans to calibrate the machine as the machine ‘learns’ the nuances of online dangerous speech, but eventually the tool should be able to run with very little human effort. This will result in a low-cost system that can be scaled to other countries and contexts.
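To make the signal detection framing concrete, here is a minimal sketch of how a monitor's labels could be compared against a reference coding. It is an illustration only, not the procedure Umati used: the function names, the boolean labels, and the sample data are all hypothetical, and it assumes a trusted reference label exists for each post.

```python
def signal_detection_counts(monitor_labels, reference_labels):
    """Tally the four signal detection outcomes.

    Both arguments are lists of booleans: True means "dangerous speech".
    reference_labels is an assumed trusted coding (hypothetical here).
    """
    if len(monitor_labels) != len(reference_labels):
        raise ValueError("label lists must be the same length")
    counts = {"hit": 0, "miss": 0, "false_alarm": 0, "correct_rejection": 0}
    for flagged, is_dangerous in zip(monitor_labels, reference_labels):
        if is_dangerous and flagged:
            counts["hit"] += 1                # dangerous, and flagged
        elif is_dangerous:
            counts["miss"] += 1               # dangerous, but not flagged
        elif flagged:
            counts["false_alarm"] += 1        # harmless, but flagged
        else:
            counts["correct_rejection"] += 1  # harmless, and not flagged
    return counts


def detection_rates(counts):
    """Return (hit rate, false alarm rate); None where undefined."""
    dangerous = counts["hit"] + counts["miss"]
    harmless = counts["false_alarm"] + counts["correct_rejection"]
    hit_rate = counts["hit"] / dangerous if dangerous else None
    false_alarm_rate = counts["false_alarm"] / harmless if harmless else None
    return hit_rate, false_alarm_rate


# Made-up labels for five posts, for illustration only.
monitor = [True, False, True, False, True]
reference = [True, True, False, False, True]

counts = signal_detection_counts(monitor, reference)
hit_rate, false_alarm_rate = detection_rates(counts)
print(counts, hit_rate, false_alarm_rate)
```

Tracking the two error types separately matters because they have different costs: a miss leaves dangerous speech uncollected, while a false alarm adds noise that has to be cleaned out later.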
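One common way to keep humans in the loop while a model is still being calibrated is to send only the uncertain cases to the monitors. The sketch below illustrates that idea under stated assumptions; it does not describe the Phase 2 design. The `route_posts` function, the score thresholds, and the sample scores are hypothetical, and the scores are assumed to come from some classifier that outputs a probability of dangerous speech.

```python
def route_posts(scored_posts, low=0.2, high=0.8):
    """Split posts by model confidence.

    scored_posts: list of (post_text, score) pairs, where score is an
    assumed classifier probability in [0, 1] that the post is dangerous.
    low/high are hypothetical thresholds that would need tuning.
    """
    auto_flagged, auto_cleared, needs_review = [], [], []
    for post, score in scored_posts:
        if score >= high:
            auto_flagged.append(post)   # model is confident: dangerous
        elif score <= low:
            auto_cleared.append(post)   # model is confident: not dangerous
        else:
            needs_review.append(post)   # uncertain: send to a human monitor
    return auto_flagged, auto_cleared, needs_review


# Made-up scores for illustration only.
sample = [("post A", 0.95), ("post B", 0.05), ("post C", 0.50), ("post D", 0.80)]
flagged, cleared, review = route_posts(sample)
print(flagged, cleared, review)
```

In a setup like this, the labels the monitors assign to the reviewed posts could be fed back as training data, so that the share of posts needing human review shrinks over time.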