iHub Research By Angela Crandall / December 1, 2011
Insights from Safaricom “trash”
By Guest blogger, Elvis Bando
A few weeks ago, @chrisorwa showed me some datasets that he had been working on. I got an adrenaline kick just from looking at the data, mostly because it was challenging, but the prospect of cracking it was even more motivating. In a previous blog post, Chris wrote about trash sourcing, basically extracting information from trash, which was the basis of his project #Saiclique. When he invited me to join the project, I realized that some of his data was in a format I could not manipulate (we use different core analysis software: I use RapidMiner, he uses Weka), so we had to start the data entry process again. We started here:
We ended up getting the following from the cards (we did slightly over 1000 cards):
From this, I generated a beautiful dataset:
I assumed the serial was a concatenation of four sets of four digits, so I split the data that way (4-4-4-x, with the leftover digits in the last set). Running a DBSCAN clustering algorithm in RapidMiner gave me the following:
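For readers who want to try the split themselves, here is a minimal Python sketch. It is not the RapidMiner process used for this post: instead of DBSCAN it simply counts how often each value appears in each chunk position, which is enough to make a dominant value stand out. The serials in `sample` are made-up placeholders, not cards from the dataset, and `split_serial` and `chunk_counts` are hypothetical helper names.

```python
from collections import Counter

def split_serial(serial, widths=(4, 4, 4)):
    """Split a serial into fixed-width chunks; leftover digits form the final 'x' chunk."""
    chunks = []
    pos = 0
    for w in widths:
        chunks.append(serial[pos:pos + w])
        pos += w
    chunks.append(serial[pos:])
    return chunks

def chunk_counts(serials, widths=(4, 4, 4)):
    """Return one Counter per chunk position, counting the values seen there."""
    counters = [Counter() for _ in range(len(widths) + 1)]
    for s in serials:
        for counter, chunk in zip(counters, split_serial(s, widths)):
            counter[chunk] += 1
    return counters

# Made-up 17-digit serials, for illustration only (not real card data).
sample = ["12345678901234567", "98765678112233445", "55555678000011112"]

for position, counter in enumerate(chunk_counts(sample)):
    # The most common value per position shows which chunk is dominated by one value.
    print(position, counter.most_common(1))
```

Changing `widths` (for example to `(2, 6, 3)`) lets you compare other configurations with the same counting code.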
The tall column is 0526, which was the second set in my 4-4-4-x data split. No other configuration showed such strength. What does 0526 even mean? To confirm that 0526 meant something, I ran a frequency analysis of each digit in the entire serial and, using Benford’s Law (in many naturally occurring datasets the leading digit is 1 about 30% of the time), narrowed the clustering down to a 2-6-3-x configuration.
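A small Python sketch of that kind of check, under my own assumptions about how it could be done (the post does not show the actual RapidMiner steps). `benford_expected` gives the Benford's Law probability for a leading digit, and `position_digit_freq` is a hypothetical helper that returns the observed digit frequencies at one position of the serial, so the two can be compared side by side.

```python
import math
from collections import Counter

def benford_expected(d):
    """Benford's Law: probability that the leading digit equals d, for d in 1..9."""
    return math.log10(1 + 1 / d)

def position_digit_freq(serials, position):
    """Relative frequency of each digit at a 0-based position across the serials."""
    digits = [s[position] for s in serials if len(s) > position]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts[d] / total for d in sorted(counts)}

# Expected share of each leading digit under Benford's Law.
for d in range(1, 10):
    print(d, round(benford_expected(d), 3))
```

Calling `position_digit_freq(serials, 0)` on a list of serial strings gives the observed distribution of the first digit; repeating it for each position shows which positions look structured rather than evenly spread.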
Not to bore you with my train of thought: after a lot more modelling and analysis, the data finally spoke. Here is the transcript:
Safaricom’s serializing system seems to be similar to, or based on, the patented system described at http://www.freepatentsonline.com/5504808.html
If true, then the card serial number contains information about the card: the date and time it was produced and a unique identifier. The rest of the information is called up from the system using the code (the called info could be the amount of talk time, the expiry date, etc.). The serial is therefore the only unique identifier of a particular card and shows whether or not it has been used.
An analysis of cards produced in 2010 and earlier indicates that the serials were sequential for the most part. The initial two digits were 10 throughout the year, probably indicating the year of production. The remaining digits were sequential. The system was probably changed because it would have run out of state space. At that time, the serial was a 13-digit number, as opposed to the current 17 digits.
A Safaricom prepaid card serial number is organized into:
The batch number is a two-digit number running from 01 to 99. It splits the cards produced each hour into batches of approximately 10,000 cards. This ensures that they are easily identifiable in case of theft or a problem.
ManDate is the date of production of the cards, written in the format yy-mm-dd. The expiry date is exactly 2 years after it.
Time is the approximate hour in which the cards were produced. It runs from 000 to 220 in increments of 10 (so we have 010, 020, … 100, 110, …). In each hour, 99 batches of cards are produced (see Batch#).
Finally, there is the serial part which is a sequential number. The data I have may be inconclusive but it shows that each day (Time), about 1 million cards are serialized. All cards are serialized the same way, so there is no telling the value of a card from the serial (damn!).
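The layout described above can be sketched as a small Python parser. This is only an illustration under stated assumptions: a 2-6-3-x split (batch, ManDate as yymmdd, time, then the sequential part), years read as 20yy, and expiry computed as two years after ManDate. The function name `parse_serial` and the example serial are made up for this sketch and do not come from the dataset.

```python
from datetime import date

def parse_serial(serial):
    """Parse a serial under an ASSUMED 2-6-3-x layout: batch, yymmdd, time, sequence."""
    if len(serial) < 12 or not serial.isdigit():
        raise ValueError("expected a numeric serial of at least 12 digits")
    batch = int(serial[0:2])
    yy, mm, dd = int(serial[2:4]), int(serial[4:6]), int(serial[6:8])
    made = date(2000 + yy, mm, dd)  # assumption: two-digit years mean 20yy
    try:
        expiry = made.replace(year=made.year + 2)
    except ValueError:
        # 29 February has no counterpart two years later; fall back to 28 February.
        expiry = date(made.year + 2, 2, 28)
    return {
        "batch": batch,
        "manufactured": made,
        "expiry": expiry,
        "time_code": serial[8:11],  # kept as text, e.g. "130"
        "sequence": serial[11:],    # the sequential part ("x")
    }

# Hypothetical 17-digit serial, invented for this example.
print(parse_serial("07110526130004217"))
```

Under this assumed layout, digits 5 to 8 of the serial hold the month and day, which would be consistent with a value like 0526 dominating the second set of the earlier 4-4-4-x split.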
The dataset could have carried more information. The current analysis is limited by the absence of variables such as the location where the cards were collected and the date of collection. These could give a good picture of economic indicators, customer spending and possibly regional spending zones.
The writer is the team leader, Doban Africa Ltd. For more information or access to the raw data, contact Chris Orwa @chrisorwa.
Fred at 11:42:15AM Thursday, December 1, 2011
Quite an insight….
Nick Hargreaves at 12:07:31PM Thursday, December 1, 2011
Wow, cool work Chris.
Chris Orwa at 13:22:05PM Thursday, December 1, 2011
@Fred: Thanks.
@Nick: The research is still ongoing & more insights are coming your way [STAY TUNED]
Anon at 02:50:32AM Sunday, December 4, 2011
It’s a cryptanalyst’s treasure trove you’ve got there.. Good work.
Samuel Ngoda at 00:01:23AM Monday, December 5, 2011
This is soooo cool.
Elvis (@levisdoban) at 14:34:30PM Tuesday, December 6, 2011
Just a small correction: "shows that each hour (Time), about 1 million cards are serialized" should read "each day". I hope to publish more insights in a comprehensive report later this month.
Angela Crandall at 15:07:57PM Tuesday, December 6, 2011
Have made the change @levisdoban. Thanks.
Chris at 17:41:38PM Thursday, March 29, 2012
And why were you doing this, and for whose benefit? Might sound dumb but I am not into analytics and you might shed light on this mysterious art.
kamal twaha at 19:26:50PM Sunday, April 1, 2012
If you use mathematics equations in permutations and combinations, you will get shocking results.