Improving official statistics in emerging markets using machine learning and mobile phone data

Jahani, Eaman; Sundsøy, Pål; Bjelland, Johannes; Bengtsson, Linus; Pentland, Alex ‘Sandy’; de Montjoye, Yves-Alexandre

Terms of use

Creative Commons Attribution http://creativecommons.org/licenses/by/4.0/

Abstract

Mobile phones are one of the fastest growing technologies in the developing world, with global penetration rates reaching 90%. Mobile phone data, also called call detail records (CDR), are generated every time phones are used and are recorded by carriers at scale. CDR have generated groundbreaking insights in public health, official statistics, and logistics. However, because most phones in developing countries are prepaid, the data lack key information about the user, including gender and other demographic variables. This precludes numerous uses of the data in social science and development economics research. It also severely hinders the development of humanitarian applications, such as using mobile phone data to target aid towards the most vulnerable groups during crises. We developed a framework to extract more than 1400 features from standard mobile phone data and used them to predict useful individual characteristics and group estimates. Here we present a systematic cross-country study of the applicability of machine learning for dataset augmentation at low cost. We validate our framework by showing how it can be used to reliably predict gender and other information for more than half a million people in two countries. We show that standard machine learning algorithms trained on only 10,000 users are sufficient to predict an individual’s gender with an accuracy ranging from 74.3 to 88.4% in a developed country and from 74.5 to 79.7% in a developing country using only metadata. This is significantly higher than previous approaches and, once calibrated, gives highly accurate estimates of gender balance in groups. Performance suffers only marginally if we reduce the training size to 5,000, but decreases significantly with smaller training sets. Finally, we use factor analysis to show that our indicators capture a large range of behavioral traits, and we show that the framework can be used to predict other indicators of vulnerability such as age or socio-economic status.
Mobile phone data have great potential for good, and our framework allows these data to be augmented with vulnerability and other information at a fraction of the cost.
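
The abstract notes that individual predictions, once calibrated, yield accurate group-level estimates of gender balance. The record does not say how the calibration is done, so the following is only a minimal sketch of one common approach (the "adjusted count" correction), not the authors' method. The function names and the error rates used here are hypothetical; in practice the true and false positive rates would be measured on a labeled validation set.

```python
def predicted_share(labels):
    """Fraction of users the classifier labeled positive (1) in a group."""
    if not labels:
        raise ValueError("group must contain at least one prediction")
    return sum(labels) / len(labels)


def adjusted_group_share(raw_share, tpr, fpr):
    """Correct a raw predicted-positive share for known classifier errors.

    Assumes (hypothetically) that the classifier's true positive rate (tpr)
    and false positive rate (fpr) measured on validation data also hold in
    the target group. Under that assumption:
        raw_share = tpr * p + fpr * (1 - p)
    which is solved here for the true share p.
    """
    if not 0.0 <= raw_share <= 1.0:
        raise ValueError("raw_share must lie in [0, 1]")
    denom = tpr - fpr
    if denom <= 0:
        raise ValueError("classifier must perform better than chance")
    estimate = (raw_share - fpr) / denom
    # Sampling noise can push the estimate slightly outside [0, 1].
    return min(1.0, max(0.0, estimate))


# Illustrative numbers only: 415 of 1000 users predicted positive.
group_predictions = [1] * 415 + [0] * 585
raw = predicted_share(group_predictions)
corrected = adjusted_group_share(raw, tpr=0.80, fpr=0.25)
print(f"raw share: {raw:.3f}, corrected share: {corrected:.3f}")
```

The correction matters most when the classifier's errors are asymmetric: a raw count then over- or under-states the group share even if per-user accuracy is high.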

Date issued

2017-05

URI

http://hdl.handle.net/1721.1/109143

Department

Massachusetts Institute of Technology. Institute for Data, Systems, and Society; Program in Media Arts and Sciences (Massachusetts Institute of Technology)

Journal

EPJ Data Science

Publisher

Springer

Citation

Jahani, Eaman; Sundsøy, Pål; Bjelland, Johannes; Bengtsson, Linus; Pentland, Alex ‘Sandy’ and de Montjoye, Yves-Alexandre. "Improving official statistics in emerging markets using machine learning and mobile phone data." EPJ Data Science 6, no. 3 (May 2017): 1-21. © 2017 The Author(s)

Version: Final published version

ISSN

2193-1127

Collections

MIT Open Access Articles

DSpace@MIT