Open space for technologists, investors, tech companies and hackers in Nairobi.


Data Science Lab By Leo Mutuku / April 7, 2014

Building Swahili Stop words Corpus for Computing


With Chris Orwa,

Data Scientist, iHub Research

iHub Research commenced phase two of Umati, the online hate speech monitoring project, in January 2014, with the aim of understanding how online conversations evolve over time in an election cycle and of detecting dangerous speech – speech with the potential to catalyse violence.

In order to improve the efficiency of our past methodology, the first step was to automate data collection from various public social sites, including Facebook, Twitter, blogs and online forums. We have since been working to build a piece of software, the Umati Logger, which will collect the requisite data as well as classify it and filter out noise using Machine Learning and Natural Language Processing algorithms. The first stage of the automation process, the Facebook collector, is complete and we are now successfully collecting comments from public Facebook pages and groups (more on this in a future blog post).

As part of building the auto-classifier, it becomes necessary to analyse the body of text from the comments. This stage first involves removing stop words from the comments. Stop words, in computing, are words that provide no context to a document and therefore only increase the computational resources required to process text files. In English these words include: and, or, not, this, that, here, there, etc. What is most interesting here is that a lot of the comments we have collected contain Swahili words, or Sheng’ (local pidgin), or a mixture of all of these. However, to the best of our knowledge, there isn’t a corpus of Swahili stop words easily available, so we decided to create one.

Using 248,283 comments collected in the month of December 2013, we extracted Swahili stop-words with the following procedure (a short code sketch follows the list):

Procedure

  • Load all comments from Facebook posts.
  • Convert all comments to lowercase.
  • Break sentences into word tokens.
  • Remove all English stop-words.
  • Create a frequency table of the words.
  • Sample the top 30% of the most frequently occurring words.
  • This sample forms the set of Swahili stop-words.
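
A minimal sketch of this pipeline, assuming Python with NLTK’s English stop-word list and the comments stored one per line in a hypothetical facebook_comments.txt file; the 30% cut-off mirrors the step above:

```python
from collections import Counter
import re

from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

# Load all comments (assumed here to be one comment per line in a text file).
with open("facebook_comments.txt", encoding="utf-8") as f:
    comments = f.readlines()

english_stopwords = set(stopwords.words("english"))

# Lowercase each comment, break it into word tokens and drop English stop-words.
tokens = []
for comment in comments:
    words = re.findall(r"[a-z']+", comment.lower())
    tokens.extend(w for w in words if w not in english_stopwords)

# Build a frequency table and keep the top 30% most frequent words
# as the candidate Swahili stop-word list.
frequencies = Counter(tokens)
cutoff = int(len(frequencies) * 0.30)
swahili_stopwords = [word for word, _ in frequencies.most_common(cutoff)]

print(swahili_stopwords[:20])
```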

This is a particularly useful exercise, since we aim to deploy Umati during the forthcoming Nigerian elections. Nigeria’s context is quite similar to our own in Kenya; we expect a local version of pidgin language to be in use and, just as with Swahili and Sheng’, there are no stop words that we can feed to our software. Moving forward, we hope to crowdsource help with this activity to ensure we build a comprehensive directory for several African languages, which are not typically recognised in computing software.


Author : Leo Mutuku

Leo leads the data science lab at iHub Research. She conducts research on open data, data science and visualisation, design research methods, market and investment research.


2 Comments
  • Stephen Mwega at 17:44:41 on Thursday, April 10, 2014

    Hi,

    Excellent work. I worked on a project that processed English, Sheng and Swahili words to assess their correct ‘forms’.

    Swahili is a Subject-Verb-Object language. Bound morphemes in Swahili (viambishi) modify verbs in many ways.

    With this in mind, the spelling correction system performs segmentation on verbal complexes by breaking them into subject agreement, tense, object, and root and suffix morphemes, based on regular expressions. A verbal complex, for example ‘niliangukia’, would be segmented into: ni (subject), li (tense), anguk (verb stem), ia (suffix).
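
    For illustration, here is a rough sketch of regex-based segmentation along these lines; the subject and tense marker lists are simplified placeholders rather than the actual system:

    ```python
    import re

    # Simplified subject-agreement and tense markers; a real system would
    # cover the full paradigms, object markers and more suffixes.
    SUBJECT = r"(?P<subject>ni|u|a|tu|m|wa)"
    TENSE = r"(?P<tense>li|na|ta|me)"
    PATTERN = re.compile(SUBJECT + TENSE + r"(?P<stem>\w+?)(?P<suffix>ia|a)$")

    def segment(verbal_complex):
        """Split a Swahili verbal complex into subject, tense, stem and suffix."""
        match = PATTERN.match(verbal_complex)
        return match.groupdict() if match else None

    print(segment("niliangukia"))
    # {'subject': 'ni', 'tense': 'li', 'stem': 'anguk', 'suffix': 'ia'}
    ```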

    Once segmentation is conducted, the system determines whether the verb stem is contained in the Swahili lexicon. If it is present, then the entire verbal complex is deemed a grammatical token. Incorrectly spelt verb stems are processed by performing string similarity matching against words from the Swahili lexicon; if a word matches, the stem is replaced with the correct one. This segmentation functionality is also designed to process verbal complexes comprising morphemes from both English and Swahili.

    For example, ‘nilitry’ would be broken down as follows: ni (subject), li (tense), try (invalid morpheme).

    After segmentation has been performed, the next step involves retrieving the invalid morpheme, ‘try’, and checking it against a special dictionary handling derivations and inflections of verbs, so that the word is replaced by its Swahili translation equivalent, ‘fanya’. If the word ended with the suffix ‘-ia’, it would be replaced by its corresponding inflected counterpart, for example ‘fanyia’.

    Finally, a noisy channel model would rank the candidate corrections, scoring each by its frequency of occurrence within a corpus as well as by its structural similarity to the erroneous word. For example, the term ‘byk’ would be converted to its phonemic form ‘baik’, and if the input matched this term then ‘bike’ would be proposed as a candidate correction.
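
    A toy sketch of that scoring idea (omitting the phonemic conversion step), with made-up corpus counts and difflib’s string similarity standing in for the structural similarity measure:

    ```python
    from collections import Counter
    import difflib

    # Hypothetical corpus frequencies and a small candidate lexicon.
    corpus_counts = Counter({"bike": 120, "bake": 40, "back": 15})
    lexicon = ["bike", "bake", "back"]
    total = sum(corpus_counts.values())

    def score(candidate, error):
        """Noisy-channel style score: corpus probability x string similarity."""
        prior = corpus_counts[candidate] / total  # how common the candidate is
        likelihood = difflib.SequenceMatcher(None, candidate, error).ratio()
        return prior * likelihood

    def correct(error):
        return max(lexicon, key=lambda candidate: score(candidate, error))

    print(correct("baik"))  # -> 'bike'
    ```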

    Reply
  • Chris Orwa at 14:04:22 on Monday, April 14, 2014

    Hello Stephen,

    Thank you for getting in touch. The project you describe is similar to ours, albeit with certain differences. Your approach leverages the syntactic structure of the Swahili/Sheng language to extract different parts of speech, which requires linguistic knowledge. That said, we are cognizant of the fact that stopwords in S-V-O languages such as Swahili are, for the better part, composed of adverbs, adjectives and conjunctions.

    From a data science perspective, preliminary analysis of the text defines the approach to be used. Initial observations showed a significant difference between standard Swahili and the written Swahili/Sheng found on social media, for example xaxa for sasa, ivo for hivyo, and wee for wewe, as well as the use of non-standard words such as bana and manze.

    We deployed machine learning techniques to detect similar words, specifically computing the Levenshtein distance and clustering words with a distance of less than two, e.g. alikuwa and alikua, or vyenye and venye. Still, we did miss a few phrases such as xa wee, meaning sasa wewe.
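
    A minimal sketch of that idea, assuming a plain-Python edit-distance function and a simple greedy grouping; the word list and the threshold of two are only illustrative:

    ```python
    def levenshtein(a, b):
        """Classic dynamic-programming edit distance."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                  # deletion
                                   current[j - 1] + 1,               # insertion
                                   previous[j - 1] + (ca != cb)))    # substitution
            previous = current
        return previous[-1]

    words = ["alikuwa", "alikua", "vyenye", "venye", "sasa", "xaxa"]

    # Greedily group words whose distance to a cluster's first member is
    # below two, treating them as spelling variants of the same token.
    clusters = []
    for word in words:
        for cluster in clusters:
            if levenshtein(word, cluster[0]) < 2:
                cluster.append(word)
                break
        else:
            clusters.append([word])

    print(clusters)
    # [['alikuwa', 'alikua'], ['vyenye', 'venye'], ['sasa'], ['xaxa']]
    ```

    Note that xaxa and sasa sit at a distance of exactly two, which is the kind of variant this threshold misses, as mentioned above.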

    Given these observations, we settled on a technique that can quickly capture stopwords for moderately developed languages such as creoles and pidgins. Frequency analysis of word tokens in sentences provides a good means of isolating stopwords, since they occur with high frequency within sentence structures.

    Reply


