iHub Research By Angela Okune / July 16, 2013
Umati Project: Challenges of Capturing Relevant Data
As part of our Umati Project, which recently released its final report from Phase 1, we have been collecting online hate speech found on social media, forums, and online newspapers. The process was initially envisioned as automated, but it turned out to be largely manual because of the nuanced nature of the speech being collected and the lack of an available corpus of Kenyan hate speech.
The project therefore had to rely on 11 new media monitors to collect and code data. We had two sets of monitors: six on weekdays (covering Swahili/Sheng, Luo, Luhya, Kikuyu, Kalenjin, and Somali; all also monitored English) and five on weekends (covering all of the aforementioned languages except Somali). Both groups worked from 8 am to 5 pm, with one hour allocated for lunch. The methodology is further detailed in the final report.
Several challenges emerged from the manual data collection process. These included the possibility of misses and false alarms, as described by signal detection theory: a monitor might overlook an instance of “dangerous speech” or flag speech that was not dangerous. In other words, we had to make sure that the monitors categorized “dangerous speech” correctly. We had to clean the data numerous times, for different reasons, and still encountered errors in categorization each time.
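To make the signal detection terms concrete, here is a minimal sketch of how a monitor's labels could be tallied against a reference set. It is an illustration only, not the procedure the Umati team used: the function names and the sample labels are hypothetical, and it assumes a trusted reference label exists for each post.

```python
def sdt_counts(monitor_labels, reference_labels):
    """Tally signal-detection outcomes for binary labels.

    True means "dangerous speech". Both inputs are hypothetical:
    one list of a monitor's decisions, one list of reference decisions.
    """
    if len(monitor_labels) != len(reference_labels):
        raise ValueError("label lists must have the same length")
    counts = {"hit": 0, "miss": 0, "false_alarm": 0, "correct_rejection": 0}
    for flagged, actual in zip(monitor_labels, reference_labels):
        if actual and flagged:
            counts["hit"] += 1                # dangerous, and flagged
        elif actual:
            counts["miss"] += 1               # dangerous, but overlooked
        elif flagged:
            counts["false_alarm"] += 1        # not dangerous, but flagged
        else:
            counts["correct_rejection"] += 1  # not dangerous, not flagged
    return counts


def sdt_rates(counts):
    """Return (hit rate, false-alarm rate); None where undefined."""
    signal = counts["hit"] + counts["miss"]
    noise = counts["false_alarm"] + counts["correct_rejection"]
    hit_rate = counts["hit"] / signal if signal else None
    false_alarm_rate = counts["false_alarm"] / noise if noise else None
    return hit_rate, false_alarm_rate


# Made-up labels for six posts, purely for illustration.
monitor = [True, True, False, False, True, False]
reference = [True, False, True, False, True, False]
counts = sdt_counts(monitor, reference)
hit_rate, false_alarm_rate = sdt_rates(counts)
```

Tracking the two rates separately shows whether a monitor tends to overlook dangerous speech or to over-flag ordinary speech, which a single accuracy figure would hide.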
The monitors also often displayed fatigue and varying levels of productivity because of the dullness of the task. After hours on end of staring at a computer, it is no surprise that the monitors' productivity fluctuated wildly, spiking and crashing.
Most of these challenges are inherent in using humans to collect online data systematically. Having noted them, we are interested in developing a more streamlined and efficient collection process that uses machine learning and data mining techniques. The development of this tool will make up Phase 2 of the Umati Project. Phase 2 will use humans to calibrate the machine as the machine ‘learns’ the nuances of online dangerous speech, but eventually the tool should be able to run with very little human effort. This will result in a low-cost system that can be scaled to other countries and contexts.
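One common way to have humans calibrate a machine is to let the tool label only the posts it is confident about and send the rest to a monitor. The sketch below illustrates that idea under stated assumptions; it is not the Phase 2 design. The `score` callable, the `margin` value, and the placeholder posts and scores are all hypothetical.

```python
def route_posts(posts, score, margin=0.2):
    """Split posts into machine-labelled items and items needing human review.

    `score` is a hypothetical callable returning the estimated probability
    (0.0 to 1.0) that a post is dangerous speech. Posts whose score falls
    within `margin` of 0.5 are treated as uncertain and sent to a monitor.
    """
    auto_labelled, needs_review = [], []
    for post in posts:
        probability = score(post)
        if abs(probability - 0.5) < margin:
            needs_review.append(post)  # too uncertain: ask a human
        else:
            auto_labelled.append((post, probability >= 0.5))
    return auto_labelled, needs_review


# Placeholder scores standing in for a trained model (assumed values).
toy_scores = {"post A": 0.95, "post B": 0.55, "post C": 0.05}
auto_labelled, needs_review = route_posts(list(toy_scores), toy_scores.get)
```

In a setup like this, the labels that monitors assign to the uncertain posts would be added to the training data, so the share of posts needing human review should shrink as the model improves.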