big data, democratized analytics and international development

Page 1: Big Data, Democratized Analytics and International Development

Big Data, Democratized Analytics and Deep Context will Change How We Think and Do Development

Aniket Bhushan, Senior Researcher, The North-South Institute, Mar 29, 2012

International development as a field of research and practice has been more of a laggard than a leader in using big data and powerful analytics. Much of the data is often dated or of poor quality. Huge areas including those of the greatest interest remain entirely unmapped, data poor or otherwise poorly understood.

This situation is changing faster than anyone predicted and the set of tools driving this evolution represents one of the most important trends in international development. The proliferation of mobile technologies, computing power and democratization of analytics within an open-source, open-data chapeau will fundamentally change the way we think about and do development. I provide a synopsis of the most impactful developments in three key areas: the base (data) layer, the analysis layer, and the feedback (data) layer. Together these advances are changing how we think about data in development, how we develop a deeper contextual understanding at a very micro level without losing the ability to aggregate and generalize, and how we bring it all together in a meaningful way to make better decisions.

[Figure: the three layers. Base layer (big and open): private sector (proprietary) data; public sector data at various levels; international institutions (World Bank, IMF, UN Data). Analytics layer: virtualization and visualization. Feedback layer: "crowds", both purposive (push) and anonymous (big).]

The Base Layer: Open Data and Big Data

How do we know what we know in the field of international development? What is the information, or evidence base, who generates it and how? There are at least three main collectors, sorters and repositories of development information: international institutions (UN, World Bank, IMF etc), national and sub-national official public sector institutions and the private sector. Change is afoot in each.

Take for instance the open data push by international institutions such as the World Bank and African Development Bank, or the UN’s Global Pulse initiative. Opening up the World Bank’s databank has made a huge amount of information available to a wider range of stakeholders than ever before. Similarly, Global Pulse is creating a platform to harness new data streams. Groups such as Development Gateway and Aid Data have seized the opportunity to push openness further. A good example of what this opening makes possible is a tool like Development Loop, which not only plots all World Bank and African Development Bank projects at precise geographic locations across Africa but also overlays them with feedback sourced from the intended beneficiaries of each project or initiative.

This full circle or loop is a powerful reminder of the importance of data transparency and universal standards. Opening up aid data in a standardized format will make geocoding a potent tool for real transparency and accountability.

To appreciate what a game-changer open data can be, examine the current state of affairs. Research conducted in Uganda by UK-based Publish What You Fund, aimed at simply tallying up the financial resources available for development, found that the government was unaware of how much donors planned to spend in that year (2006-07). Planned expenditure was more than double what the government was aware of; financial resources flowing into the country were far higher than had been estimated. Or take another example, what the World Bank’s Chief Economist for Africa calls a “statistical tragedy”. The majority of Africa’s population lives in countries that still use an outdated (1960s) method of national income accounting to generate fundamental data points such as Gross Domestic Product (GDP). Ghana, for instance, only shifted to the 1993 UN system of national accounts last year. When it did so, it found its GDP was 62 percent higher than previously thought, catapulting the country to ‘middle income’ status.

If even the most basic information often taken for granted is riddled with problems then what do we really know? Data reliability is one issue but time-lag is another. Most of the data used in international development is stale by the time it is called upon in decision making. The information base we rely on in international development needs to be bolstered by building bridges with new sources and data-streams.

Opening up proprietary private sector data and exposing it to the concerns of international development will be a game changer in the coming years. To date international institutions have made the most progress towards data openness, and select public sector authorities (for instance under the purview of the Open Government Partnership) are also making progress. But it is the private sector, the main repository of “big data”, that is the holy grail. If you total all the data collected by the US Library of Congress (one of the largest public sector repositories) it would be about 235 terabytes as of April 2011. Wal-Mart processes and stores about 2500 terabytes per hour! The big data revolution is changing business models fundamentally. Like capital and labour, data itself has become a commercial driver, and firms like Google, Facebook and Twitter are the first “data factories”, pioneers much like their antecedents in the industrial revolution.

This is truly big data and its growth has exploded off the charts thanks in large part to the explosion of mobile sensors (the most important of which is the mobile phone but think also credit cards, laptops, GPS and everything from radio-frequency to QR codes), and the rapid democratization of high-power analytics. Whether it is geocoded mobile phone data modeled to track slum development or predict microfinance loan defaults or provide weather-indexed insurance to small farmers big data is already a game-changer in development and we have only begun to scratch the surface.

The Analytics Layer: Virtualization and Visualization are Driving Democratization

Analytics is simply the collection of tools and techniques used to make sense of data. High-power analytics proved a game-changer in the commercial sector and can now do the same in the social sector. At its core analytics is about unearthing and understanding relationships and patterns. Analytics helped retailers discover unlikely trends, most famously that customers who came in to buy diapers also tended to buy beer! It can do the same for complex social systems. Developments in analytics have kept pace with the speed with which big data has grown.

Bringing this capacity to bear on development challenges such as food security and urbanization is just getting started.
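The diapers-and-beer story is usually told in terms of association measures. As a minimal sketch, assuming a handful of made-up shopping baskets (the `BASKETS` data and the `pair_stats` helper are hypothetical, not drawn from any retailer), support, confidence and lift for a pair of items can be computed like this:

```python
# Hypothetical toy baskets for illustration only; not real retail data.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def pair_stats(baskets, a, b):
    """Return (support, confidence, lift) for the rule 'a implies b'."""
    n = len(baskets)
    count_a = sum(1 for basket in baskets if a in basket)
    count_b = sum(1 for basket in baskets if b in basket)
    count_ab = sum(1 for basket in baskets if a in basket and b in basket)

    # Share of all baskets that contain both items.
    support = count_ab / n
    # Share of baskets with `a` that also contain `b`.
    confidence = count_ab / count_a if count_a else 0.0
    # How much more often `b` appears with `a` than it does overall.
    lift = confidence / (count_b / n) if count_b else 0.0
    return support, confidence, lift


if __name__ == "__main__":
    print(pair_stats(BASKETS, "diapers", "beer"))
```

A lift above 1 suggests the two items appear together more often than chance alone would predict. The same measures could in principle be applied to co-occurring events in social data, although this toy example makes no claim about any real dataset.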

Before we look at some examples let us put in perspective what we mean by the explosion of mobile sensors. The mobile phone is already the most celebrated example; anyone who has visited Kenya (or really any African country) has seen vividly the power of mobile money. This is old news. Anyone who has followed elections in Kenya probably also heard of Ushahidi, a locally developed open-source platform to map reports of post-election violence which went live in 2008. This is also old news.

What is new is what becomes possible when the requisite analytics are brought to bear on the huge proliferation of mobile sensors. Mobile phone subscriptions have grown from under 750 million, more than two thirds of them in developed countries, to over 5 billion, with about four times as many in developing countries as in the developed world. Of the 5 billion, about 1 billion are held by people who live on less than $5 a day. The developing world is the leading driver of mobile big data. This includes voice, text, financial, locational and positional information, which can now be overlaid on the base data layer described earlier (income, health, education and other indicators generated by official sources) to produce new insights into real behaviour and complex incentive structures.
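To make the idea of overlaying mobile data on the base layer concrete, here is a minimal sketch. It assumes anonymized call records tagged with a district code and an official census table keyed by the same code; the field names, district codes and figures are all hypothetical.

```python
from collections import Counter

# Hypothetical anonymized call records: one entry per call, tagged by district.
CALL_RECORDS = [
    {"district": "D1"}, {"district": "D1"}, {"district": "D1"},
    {"district": "D2"},
]

# Hypothetical base-layer indicators from an official source.
CENSUS = {
    "D1": {"population": 2000},
    "D2": {"population": 4000},
    "D3": {"population": 1000},
}


def calls_per_1000(call_records, census):
    """Join call counts to census population and normalize per 1000 people."""
    calls = Counter(record["district"] for record in call_records)
    return {
        district: 1000 * calls[district] / info["population"]
        for district, info in census.items()
    }


if __name__ == "__main__":
    for district, rate in sorted(calls_per_1000(CALL_RECORDS, CENSUS).items()):
        print(district, rate)
```

Districts with census data but no observed calls come out as zero rather than missing, which keeps gaps in mobile coverage visible instead of silently dropping them.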

Sticking with Kenya, take the example of the Engineering Social Systems (ESS) lab. Coupling terabytes of mobile phone data with Kenyan census information, ESS is modeling the growth of slums to inform urban planners about where to locate services such as water pumps and public toilets. In Uganda the same group is developing causal structures of food security. In Rwanda they collected a sample of every phone call over a four-year period, coupled with a random survey, to analyze how different people react to the same economic shock. What is really interesting is the way experimental initiatives are being brought out of the lab into real world application. The shift that is taking place is fundamental: away from models governed by theory and towards models informed by and built on real networks (see Reality Mining).

For a long time development analysis has at best been limited to correlations and inferences based on correlations. For the first time, big data coupled with high-power analytics is opening up the possibility, if not of entirely causal dynamics, then at least of more robust inferences. Our traditional methods of inquiry have conditioned us to think in terms of generalizing on the basis of random sampling. For the first time, the proliferation of mobile sensors is making possible highly targeted yet nonintrusive and anonymous inquiry.

And this is not another story about mobile phones alone. The point is the rapid emergence of altogether new data-streams, in step with the development of analytical capacity to draw useful inference out of them. Take Twitter for instance which generates information about the size of the entire US Library of Congress in two weeks and together with Facebook has already shown its efficacy during the Arab uprisings. At the heart of this evolution are open-source software systems and tools that allow the simultaneous collection, categorization, and analysis of various data types from Twitter hashtags to videos to positional data and machine IDs. Swift River developed by Ushahidi is an example of a free open-source platform that enables rapid simultaneous filtering and verification of real-time data from channels like Twitter, SMS, Email and others. It also visualizes the information in dashboards that the average user can understand. This is particularly powerful for monitoring immediate post-crisis developments when the information flow suddenly increases but is also only useful if immediately analyzed.

Democratization of analytics, driven by a commitment to open source, is furthered by virtualization of platforms and visualization of information (to make it engaging for the average user). An aspect of virtualization is community or crowd driven problem solving. Take for instance Data Without Borders (DWB), a pro-bono data scientist exchange. DWB organizes ‘data dives’ to help NGOs, civil society organizations and others make sense of their own information. These groups might not have the time, capacity or inclination, but may be sitting on information useful for purposes beyond their imagination. At a recent data dive DWB helped a human rights group, which allows users to anonymously upload information about violations, get a better look at who was using the system and plot trends without compromising anonymity. Using open-source tools (such as R) they also visualized the information to make it easy to understand.

Here are some tools in a fast-growing toolkit worth following:

Social network analysis: with the growing penetration of web 2.0 technologies, social media is becoming the dominant communication channel for rapid exchange. Network analysis includes a set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed, e.g., how information travels, or who has the most influence over whom. Examples of applications include identifying key opinion leaders and identifying bottlenecks in enterprise information flows. Two important techniques within network theory and analysis are exploratory data analysis (EDA) and link analysis. EDA is an approach that summarizes large datasets according to their main characteristics. EDA came about in part as a reaction to undue emphasis in statistical fields on hypothesis testing or “confirmatory analysis”. It emphasizes using the data to suggest hypotheses to test, and testing the assumptions on which inference is based. Similarly, link analysis focuses on analyzing relationships among nodes through visualization methods, including network diagrams and association matrices. Gephi is an interactive open-source (Java-based) platform for complex systems and network analyses.
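Two of the questions above, who has the most influence and how information travels, have simple textbook counterparts: degree centrality and shortest-path hop counts. The sketch below assumes a small invented friendship network (the names and `EDGES` are hypothetical) and uses only the Python standard library; tools such as Gephi provide far richer measures.

```python
from collections import defaultdict, deque

# Hypothetical undirected ties between people in a community.
EDGES = [
    ("Amina", "Brian"),
    ("Amina", "Chidi"),
    ("Amina", "Dede"),
    ("Dede", "Esi"),
    ("Esi", "Femi"),
]


def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Share of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


def hops_from(graph, source):
    """Breadth-first search: minimum number of hops from source to each node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


if __name__ == "__main__":
    g = build_graph(EDGES)
    print(max(degree_centrality(g).items(), key=lambda item: item[1]))
    print(hops_from(g, "Amina"))
```

In this toy network the most connected person is a candidate "opinion leader", while the hop counts give a rough sense of how many hand-offs a message needs to reach each member.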

Automated web-scraping: almost everything is on the web today, which means almost everything can be reached at a web (HTTP) address. Web scraping is an automated technique for collecting information from the web. It transforms typically unstructured data in HTML format into structured data that can be centralized in a spreadsheet. Simply using MS Excel it is possible, for instance, to scrape tables and other data in a variety of unstructured formats (including real-time data, e.g. weather information on world clock, or stock price info) and save them as a spreadsheet for further use. Web scraping, though somewhat mired in legal and privacy issues, has the potential to massively increase the volume of accessible information. For example, UN Global Pulse is running a project with Price Stats and the Billion Prices Project at MIT, investigating daily bread prices across Latin American countries by scraping the web for online prices. The end product is a demonstration, an e-bread index which tracks bread price inflation in real time (daily) and can be compared or contrasted with, or can complement, the traditional consumer price index (which is only published on a monthly basis). UN Global Pulse is also pioneering Hunch Works, the first social network for hypothesis formation, evidence gathering and decision making. Hunch Works allows researchers to connect with other experts with complementary resources so that together they can quickly determine whether data signals are indications of a deepening crisis and warrant further investigation.
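As a rough sketch of the scraping step, the snippet below turns an HTML table into structured rows and then into a simple daily price index. The HTML is an inline, made-up sample rather than a live page, and the index formula (each day's price relative to the first day, times 100) is an assumption for illustration; it is not the method used by Global Pulse or the Billion Prices Project.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a scraped price listing.
SAMPLE_HTML = """
<table>
  <tr><th>date</th><th>price</th></tr>
  <tr><td>2012-03-01</td><td>2.00</td></tr>
  <tr><td>2012-03-02</td><td>2.10</td></tr>
  <tr><td>2012-03-03</td><td>2.05</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


def price_index(rows, base=100.0):
    """Index each day's price to the first day (first day = base)."""
    _header, *records = rows
    base_price = float(records[0][1])
    return [(date, round(base * float(price) / base_price, 2))
            for date, price in records]


if __name__ == "__main__":
    print(price_index(scrape_table(SAMPLE_HTML)))
```

In practice the HTML would come from fetching a page, subject to the legal and privacy caveats noted above, and the resulting rows could be written out with the standard `csv` module for use in a spreadsheet.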

World Bank ADePT software platform: ADePT is a free (Stata-based) platform developed by the World Bank that automates and standardizes analysis. ADePT allows complex statistical analyses, and direct pre-configured access to a range of micro level data from the Bank’s and other sources. It is particularly useful for economic analysis, and especially so in developing countries where researchers may not have ready access to otherwise expensive statistical packages.

Hadoop (distributed parallel analytics): if big data is the latest buzzword in IT then Hadoop is a large part of the story behind it. Hadoop is a platform for distributing problems, tasks and analyses across a number of servers, speeding up analysis and shrinking the distance between data, analysis and result. Hadoop works behind services you know well, although you have probably never seen it or heard of it. For example, one of the most well-known implementations is at Facebook, which brings the core data you store into Hadoop clusters, where it is compared against your friends and their interests to suggest recommendations back to you. High-power distributed parallel analytics solutions like Hadoop help square the big data circle by making it small, in real time.
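The pattern Hadoop distributes across servers is commonly described as map and reduce: split the data, process each piece independently, then merge the partial results. The following is an in-process sketch of that pattern only. It uses Python threads rather than a cluster, nothing in it calls Hadoop itself, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words within one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(lines, workers=3):
    """Split lines into chunks, map them concurrently, then reduce."""
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    print(word_count(["big data", "open data", "big open data"]))
```

Because each chunk is processed without reference to the others, the map step can in principle be spread over as many machines as there are chunks, which is the property that lets this style of analysis scale to very large datasets.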

The Feedback Layer: Deep Context, Complex Microsystems, Real-Time Loops

The efficacy of the feedback layer is also new. This layer has two key aspects: the purposive or push driven response and the big anonymous response (discussed above). Targeted crowdsourcing has already come a long way. The Ushahidi experience in Kenya, for instance, also worked for monitoring elections in Afghanistan and tracking emergencies including the cholera outbreak that followed the Haiti earthquake. Mobile phone SMS platforms have been adapted to make participatory budgeting more inclusive in hard to reach areas such as conflict-affected South Kivu in the Democratic Republic of Congo, and results have been encouraging.

To understand how powerful the feedback layer can be, consider the experience of Mobile Accord, which at the initiative of the World Bank’s World Development Report 2011 ran Geo Poll, an SMS-based targeted poll in the DRC. The poll asked 10 sensitive questions, including about topics such as rape and violence against women in conflict zones. The survey produced 1.2 million text responses, and the outputs were turned into a video, “DRC Speaks”, which captured people’s responses to questions about their experiences in their own words. This ended up being one of the largest surveys ever conducted in the country.

Some of the most valuable data in development comes from surveys, including household, labor market, living standard and other social surveys. But there are two key problems with such surveys: time (they take time to implement and can only be done infrequently) and cost. Mobile technology is helping get around these issues. The World Bank is piloting an interesting initiative in Latin America called “Listening to LAC” (L2L; see note 1), where a range of mobile technologies are being deployed to conduct real-time (higher frequency) self-administered surveys to generate panel data on key questions pertaining to vulnerability and coping strategies.

1 http://siteresources.worldbank.org/NEWS/Resources/Gettingthenumbersright4-19-10.pdf

While still in a pilot phase this is the first time such information is being collected near real-time and with lower costs than large national surveys.

There is a pattern here. In the base layer more and more (new and old) data is opening up every day. In the analytics layer experimental ideas are leaving the lab for real world application; virtualization and visualization are helping foster new communities geared towards collaboration and collective problem solving. Similarly, in the feedback layer the tools are also democratizing. Ushahidi has created a very easy to use version of their implementation called CrowdMap. Anyone who knows how to set up an email account will be able to set up their own incident map for whatever trend, alert or issue they want feedback from the crowd on. The service is already being used to track everything from citizen report cards assessing corruption in India to the Syrian uprising to national emergencies in the small island nation of Samoa.

The feedback layer is also innovating in reverse (backwards from new to old media and formats), something essential in international development where so much remains unmapped or in non-digital formats. Take mapping. In the developing world there are still vast areas that remain absent from digital maps. But people in those communities have intimate knowledge of their surroundings. What is needed is a way to tap into that knowledge and build a bridge between that data source and an open-source digital resource like Open Street Map. Walking Papers does precisely that. Contributors can simply draw by hand, on plain paper, a map of, say, their neighbourhood and upload it to Walking Papers, where a community specializes in taking that information and using it to deepen Open Street Map.

Looking Ahead

International development as a field of practice and research has tended to be a laggard in using big data, powerful analytics and innovative sourcing techniques, and with good reason. However, this is changing faster than anyone predicted and faster than most organizations that ‘do development’ are prepared for. The open data movement has widened access to a broad range of basic contextual information. A similar push is needed to open private sector data in the service of social good. Big data is beginning to have a big impact on how we think about development challenges, because we now have the ability to make it understandable like never before. Powerful analytical tools and collaborative platforms are dramatically changing what is possible for even the most intractable challenges, like understanding socioeconomic risks and responses, dealing with urban planning, and better preparing for emergencies. For the first time we have a feedback layer which has made possible deep and near real-time awareness of what is working or not working, where and why. Together big data, democratized analytics and the ability to tap deep contexts will change the way we think about and do development in the coming years.

Bibliography

Aid Data, Development Gateway, African Development Bank - Development Loop. Development Loop. n.d. http://www.aiddata.org/content/index/Maps/development-loop-app (accessed Dec 21, 2011).

Courtney, Alexa, and David Kilcullen. "Big data, small wars, local insights: Designing for development with conflict-affected communities." What Matters. McKinsey & Company, December 2, 2011.

Crowd Map. n.d. https://crowdmap.com/mhi.

Data Without Borders. n.d. http://datawithoutborders.cc/.

Davis, Steve, and Jonathan Bays. "Harnessing big data to address the world’s problems." What Matters. McKinsey & Company, November 2, 2011.

DRC Speaks (Geo Poll). n.d. http://www.youtube.com/watch?v=1VtaWPAtHyA.

Eagle, Nathan. Reality Mining. Dataset: http://reality.media.mit.edu/, Massachusetts: Massachusetts Institute of Technology, 2009.

Engineering Social Systems. "Big Data for Social Good." Engineering Social Systems. n.d. http://ess.santafe.edu/bigdata.html (accessed December 21, 2011).

Jay Chen, Trishank Karthik, Lakshminaryanan Subramanian. Contextual Information Portals. Online: http://ai-d.org/, Association for the Advancement of Artificial Intelligence, n.d.

McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company, 2011.

Mehta, Abhishek. Big Data: Powering the Next Industrial Revolution. White paper: http://www.tableausoftware.com/learn/whitepapers/big-data-revolution, Seattle: Tableau Software, n.d.

Priebatsch, Seth. The game layer on top of the world. Video online at: http://www.ted.com/talks/lang/en/seth_priebatsch_the_game_layer_on_top_of_the_world.html, Boston: Ted Talks, 2010.

Publish What You Fund. "Aid Budgets in Uganda." Publish what you fund. n.d. http://www.publishwhatyoufund.org/resources/uganda/ (accessed December 21, 2011).

Shantayanan Devarajan. "Africa’s statistical tragedy." Africa Can End Poverty (blog of World Bank, Africa Chief Economist). October 6, 2011. http://blogs.worldbank.org/africacan/africa-s-statistical-tragedy (accessed October 6, 2011).

Swift River. n.d. http://ushahidi.com/products/swiftriver-platform.

UN Global Pulse. n.d. http://www.unglobalpulse.org/.

Ushahidi. n.d. http://ushahidi.com/.

Walking Papers. n.d. http://walking-papers.org/.
