Brown Bag Talk with Micah Altman: Sources of Big Data for Social Sciences

Sources of Big Data for the Social Sciences. Micah Altman, Director of Research, MIT Libraries. Prepared for Program on Information Science Brown Bag Series, MIT, August 2015.

Upload: micah-altman

Post on 08-Apr-2017


Sources of Big Data for the Social Sciences

Micah Altman

Director of Research

MIT Libraries

Prepared for

Program on Information Science Brown Bag Series

MIT

August 2015

Roadmap

What the @#%&! is “big data”?
Two examples of big data in social & health sciences
Open questions
Potential roles for libraries

Big Data Challenges

Acquisition

Retention

Analysis

Access

Credits & Disclaimers

DISCLAIMER

These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer:

“It’s tough to make predictions, especially about the future!”

-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Collaborators & Co-Conspirators

Workshop Series Co-Organizers
– U.S. Census Bureau: Cavan Capps, Ron Prevost

Research Support
– Supported by the U.S. Census Bureau

Related Work

Main Project: Census-MIT Big Data Workshop Series
projects.informatics.mit.edu/bigdataworkshops

Related publications (reprints available from informatics.mit.edu):
• Altman, M., D. O’Brien, S. Vadhan, A. Wood. 2014. “Big Data Study: Request for Information.”
• Altman M, Wood A, O'Brien D, Gasser M., Vadhan, S. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Journal of Technology Law. Forthcoming.
• Altman M, McDonald MP. 2014. Public Participation GIS: The Case of Redistricting. Proceedings of the 47th Annual Hawaii International Conference on Systems Science.

Workshops Series: Big Data and Official Statistics

Acquisition Challenges: Using New Forms of Information for Official Economic Statistics [August 3-4]

Privacy Challenges: Location Confidentiality and Official Surveys [October 5-6]

Inference Challenges: Transparency and Inference [December 7-8]

Expected outcomes:

Workshop reports (September, October, December)

Integrated white paper (February)

Identifying new opportunities for statistical agencies

Inform the Census Big Data Research Program.

projects.informatics.mit.edu/bigdataworkshops

What the @#%&! is Big Data?

Small, Big, Massive & Ginormous Data

Characteristics: the k “V’s” of big data
Volume, Velocity, Variety + Veracity + Variability + …

“Big” is in the use, not just the data

When do challenges of “big” exceed limits of well-selected traditional methods and practices?

Data Management – Workflow & Governance Challenge
Implementation – Performance Challenges
Analysis methods – Inferential Challenges

Why pay attention now?

Trends and Challenges

Trends
• Increasingly data-driven economy
• Individuals are increasingly mobile
• Technology changes data uses
• Stakeholder expectations are changing
• Agency budgets and staffing remain flat

The next generation of official statistics
• Utilize broad sources of information
• Increase granularity, detail, and timeliness
• Reduce cost & burden
• Maintain confidentiality and security

Multi-disciplinary challenges: Computation, Statistics, Informatics, Social Science, Policy

Two examples

(Good Cop, Bad Cop?)

Using Weibo to Discover Chinese Censorship Strategies (and U.S. Debate Strategies)

More Information

• Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.

• King, Gary, Jennifer Pan, and Margaret E Roberts. 2013. “How Censorship in China Allows Government Criticism but Silences Collective Expression.” American Political Science Review 107 (2) (May): 1-18. Copy at http://j.mp/LdVXqN

“Posts with negative, even vitriolic, criticism of the state, its leaders, and its policies are not more likely to be censored… the censorship program is aimed at curtailing collective action by silencing comments that represent, reinforce, or spur social mobilization, regardless of content.”

Data Source:
- Social Media Messages

Data Structure:
- Network, Unstructured Text, Structured metadata

Unit of Observation:
- Individuals; Interactions

Collection Design:
- Pure observational

Desired Inferences:
- Causal inference – what censorship strategies cause observed reaction
- Inference to Population Frame

Performance Challenges:
- High volume
- Complex network structure
- Scaling bespoke algorithms
- Sparsity
- Systematic and sparse metadata

Management Challenges:
- License
- Replication
- Revision Control

Inferential Challenges:
- Measurement error – extracting topics from text

Using Google Searches to Forecast Disease Outbreaks

More Information

• Ginsberg, Jeremy, et al. "Detecting influenza epidemics using search engine query data." Nature 457.7232 (2009): 1012-1014.

• Lazer, David, et al. "The parable of Google Flu: traps in big data analysis." Science 343 (14 March 2014).

“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.

Data Source:
- Google search queries

Data Structure:
- Quasi-tabular, structured metadata and unstructured text

Unit of Observation:
- Interactions with a system

Collection Design:
- Pure observational

Desired Inferences:
- Predictive inference
-- where will flu clusters appear next
-- Short-term (nearcasting)
-- small-area (fine-spatial granularity)
- Inference to general population

Performance Challenges:
- Streaming algorithms

Management Challenges:
- Replication
- Transparency
- Variability

Inferential Challenges:
- External Validity
- Measurement error – extracting topics from text
- Overfitting
- Sampling

Comparing Cases: Chinese Censorship vs. Flu Prediction

Data Source
- Chinese Censorship: Social Media Messages
- Flu Prediction: Google search queries

Data Structure
- Chinese Censorship: Network, Unstructured Text, Structured metadata
- Flu Prediction: Quasi-tabular, structured metadata and unstructured text

Unit of Observation
- Chinese Censorship: Individuals; Interactions
- Flu Prediction: Interactions with a system

Collection Design
- Chinese Censorship: Pure observational
- Flu Prediction: Pure observational

Desired Inferences
- Chinese Censorship: Causal inference (what censorship strategies cause observed reaction); Inference to Population Frame
- Flu Prediction: Predictive inference (where will flu clusters appear next; short-term (nearcasting); small-area (fine-spatial granularity)); Inference to general population

Performance Challenges
- Chinese Censorship: High volume; Complex network structure; Scaling bespoke algorithms; Sparsity; Systematic and sparse metadata
- Flu Prediction: Streaming algorithms

Management Challenges
- Chinese Censorship: License; Replication; Revision Control
- Flu Prediction: Replication; Transparency; Variability

Inferential Challenges
- Chinese Censorship: Measurement error (extracting topics from text)
- Flu Prediction: External Validity; Measurement error (extracting topics from text); Overfitting; Sampling

Why is dealing with big data hard?

Challenges of Big Data

Big Data Challenges

Acquisition: Sources, Incentives, Quality, Provenance

Retention: Change Management, Integration, Security, Storage

Analysis: Bias, Causation, Computation, Visualization

Access: Transparency, Reproducibility, Durable Access (Preservation), Confidentiality

Acquisition Challenges:

Quality, Provenance, Sources

Some Sources of Economic Information

• Smartphone sensors – GPS + Vehicle systems
• IoT – smart thermostats, fire alarms
• Transactions – online, internal
• Search behavior – search engine queries
• Social media – Twitter, Facebook, LinkedIn
• Imagery – satellite, thermal, video
• …

Source Characteristics

Unit of Observation: Location, virtual service, communication network, individual

Context: Behavior, transaction, environment, statement

Measure characteristics: Measure scale; Measure structure; Accuracy, precision

Frame & Sample characteristics

Analysis Challenges:

Bias, Computation, Causation, Integration

Some Potential Sources of Analysis Error

[Figure: diagram with the labels Laws (structures), Parameters (λ, β), (generates), Super Population, Target Population, Frame, Selection]

• Selection bias
• Frame uncertainty
• Measurement error
• Unknown measurement semantics
• Non-independence of measures
• Non-independence of samples
• Model uncertainty
• Unknown causal structure
• Shift in measurements, samples, frames
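
To make the first item in the list above concrete, here is a minimal simulation sketch. It is not taken from the talk. It shows how a large "found" sample, whose inclusion depends on the quantity being measured, can estimate a population mean worse than a much smaller random sample. The population, the inclusion rule, and all names are hypothetical assumptions chosen only for illustration.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def simulate(seed=0, population_size=100_000, small_n=500):
    """Compare a small random sample with a large self-selected sample."""
    rng = random.Random(seed)

    # Hypothetical population: one numeric trait per person, true mean near 50.
    population = [rng.gauss(50, 10) for _ in range(population_size)]

    # Designed data: a small simple random sample.
    random_sample = rng.sample(population, small_n)

    # "Found" data: the chance of being observed rises with the trait itself
    # (an assumed selection mechanism), so high values are over-represented.
    found_sample = [
        x for x in population
        if rng.random() < min(1.0, max(0.0, (x - 30) / 60))
    ]

    return {
        "true_mean": mean(population),
        "random_mean": mean(random_sample),
        "found_mean": mean(found_sample),
        "random_n": len(random_sample),
        "found_n": len(found_sample),
    }


result = simulate()
print(result)
```

Under these assumptions the found sample is many times larger than the random one, yet its mean sits several units above the true mean. Collecting more data through the same selection mechanism would not shrink that gap.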

Access Challenge:

Data Repeatability, Transparency, Preservation

Many Initiatives to Improve Scientific Reliability

• Retraction monitoring
• Data citation
• Clinical trial preregistration
• Registered replication
• Open data Badges

Some Types of Reproducibility Issues

• Fraud
• Misconduct
• Negligence
• Bit Rot
• Versioning problem
• Replication
• Reproduction
• Extension
• Result Validation
• Fact Checking
• Calibration, Extension, Reuse
• Underreporting
• Data Dredging
• Multiple Comparisons, P-Hacking
• Sensitivity, Robustness
• Reliability
• Generalizability
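
The multiple-comparisons item in the list above can be illustrated with a short simulation. This is a sketch under stated assumptions, not material from the talk: every test is run on pure noise, the test is a two-sided z-test with known variance, and the Bonferroni correction is added here only as one familiar remedy. All function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean

STANDARD_NORMAL = NormalDist()


def null_p_value(rng, n=30):
    """Two-sided z-test p-value for H0: mean = 0, on data where H0 is true."""
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z = fmean(sample) * n ** 0.5  # standard deviation is known to be 1
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def familywise_error_rate(num_tests, alpha=0.05, trials=1000, seed=1):
    """Share of simulated studies with at least one 'significant' null result."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(null_p_value(rng) < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials


simulated = familywise_error_rate(num_tests=20)
expected = 1 - (1 - 0.05) ** 20  # analytic rate for 20 independent tests
bonferroni = familywise_error_rate(num_tests=20, alpha=0.05 / 20)
print(round(simulated, 3), round(expected, 3), round(bonferroni, 3))
```

With 20 independent tests at the 0.05 level, the chance of at least one false positive is roughly 64 percent, which is why unreported repeated testing undermines a nominally significant finding.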

Ensuring Repeatability & Transparency

[Figure: diagram with the following labels]

Theory (Rules, Entities, Concepts)

Algorithms (Protocol, Operationalization)

Implementations (Software, Coding Rules, Instrumentation Design)

Executions (Deployment, House Survey Style, Equipment Setting, Operating System, Hardware, Starting Values, PRNG seeds)

Structure, Formats, Versions/Revisions, Selections, Integrations, Instantiations (copies)

Execution Context (weather, compiler, operating system, system load)

Access Challenge:

Data Confidentiality, Security

Durable, Long-Term Access

• Why durable access?
  • The rule of law requires maintaining authentic public records
  • Scientific advances rely on a cumulative, traceable evidence base
  • Art, history, culture require durable access to national heritage information
  • Our nation needs durable access to a strategic information reserve
  • Humanity needs durable long-term access to information in order to communicate to future generations

• Big data challenges to durability
  • Velocity – information is updated, sometimes overwritten
  • Many sources are commercial/private – not routinely archived, preserved
  • Modeling future value of information
  • Maintaining privacy and confidentiality

Big data challenges…

Anonymization can completely destroy utility
• The “Netflix Problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]

Observable Behavior Leaves Unique “Fingerprints”
• The “GIS” problem: fine geo-spatial-temporal data impossible to mask, when correlated with external data [Zimmerman 2008]

Big Data can be Rich, Messy & Surprising
• The “Facebook Problem”: Possible to identify masked network data, if only a few nodes controlled. [Backstrom et al. 2007]
• The “Blog problem”: Pseudonymous communication can be linked through textual analysis [Novak et al. 2004]

Source: [Calabrese 2008; Real Time Rome Project 2007]
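
The linkage idea behind the “Netflix Problem” can be sketched in a few lines. This is a deliberately crude stand-in for the statistical scoring used in the cited work, not a reproduction of it: the similarity measure (count of identical ratings), the `min_margin` rule, and all records below are hypothetical assumptions made for illustration.

```python
def overlap_score(anon_record, public_record):
    """Count items that carry the same rating in both records."""
    return sum(
        1 for item, rating in public_record.items()
        if anon_record.get(item) == rating
    )


def link(anon_db, public_db, min_margin=2):
    """Link a named profile to a pseudonym only when the best match stands out.

    Assumes anon_db holds at least two records.
    """
    links = {}
    for name, public_record in public_db.items():
        scored = sorted(
            ((overlap_score(rec, public_record), pseudonym)
             for pseudonym, rec in anon_db.items()),
            reverse=True,
        )
        best, runner_up = scored[0], scored[1]
        if best[0] - runner_up[0] >= min_margin:
            links[name] = best[1]
    return links


# Hypothetical "anonymized" release: pseudonym -> {item: rating}
anon_db = {
    "user_17": {"film_a": 5, "film_c": 2, "film_k": 4, "film_q": 1},
    "user_42": {"film_b": 3, "film_c": 4, "film_m": 5, "film_z": 2},
    "user_88": {"film_a": 4, "film_d": 1, "film_m": 3, "film_x": 5},
}

# Hypothetical public reviews posted under real names, covering only a few items.
public_db = {
    "alice": {"film_c": 2, "film_k": 4, "film_q": 1},
    "bob": {"film_m": 5, "film_z": 2, "film_b": 3},
    "carol": {"film_a": 4},
}

links = link(anon_db, public_db)
print(links)
```

In this toy data, three matching ratings are enough to single out a pseudonym because the records are sparse and rarely overlap by chance. A profile with a single public rating ("carol") does not clear the margin and stays unlinked.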

Little Data in a Big World

The “Favorite Ice Cream” problem -- public information that is not risky can help us learn information that is risky

The “Doesn’t Stay in Vegas” problem -- information shared locally can be found anywhere

The “Unintended Algorithmic Discrimination” problem -- algorithms are often not transparent, and can amplify human biases

Categorizing Challenges

Implementation – Performance Challenges

Systems challenges
• Exceed capacity of locally managed storage
• Location and migration of data becomes critical for performance
• Standard backup, recovery and data integrity mechanisms ineffective
• Communication bandwidth

Algorithmic challenges
• “In core” vs. “out-of-core” implementations
• O(n^2) vs. O(log n) complexity
• Static vs. streaming algorithms
• Serial vs. massively parallel
• Distributed – shared-nothing algorithms

Analysis methods – Inferential Challenges
• Sources: Designed vs. “found” data
• Model-based vs. data-based analysis
• Causal inference vs. descriptive/predictive (forecasting) inference

Data Management & Workflow
• Provenance
• Data quality
• Change management
• Continuous integration
• Accommodating variety – semantics, quality
• Transparency and reproducibility
• Privacy
• Security

Data Governance and Policy
• Standards
• Incentives
• Certifications
• Regulation
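
The “static vs. streaming” distinction listed under algorithmic challenges can be sketched as follows. The example uses Welford's online algorithm, chosen here as one well-known streaming method and not named in the talk, to compute a mean and variance in a single pass with constant memory. It then checks the result against the standard library's in-memory (static) functions. The class name and sample values are hypothetical.

```python
import math
import statistics


class RunningStats:
    """Welford's online algorithm: one pass over the data, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 until two values arrive."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def stream(values):
    """Stand-in for a data source that yields one observation at a time."""
    for value in values:
        yield value


data = [2, 4, 4, 4, 5, 5, 7, 9]

stats = RunningStats()
for observation in stream(data):
    stats.update(observation)  # each value is seen once and then discarded

# Static counterpart: needs the whole list in memory.
static_mean = statistics.mean(data)
static_variance = statistics.variance(data)
print(stats.mean, stats.variance, static_mean, static_variance)
```

The streaming version never holds more than three numbers of state, so it works on a feed that is too large or too fast to store. The static version is simpler but assumes the full dataset fits "in core".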

Some Open Questions About Data Sources

Preliminary Observations from First Workshop

Topic: Sources of Economic Big Data

Use Case: Commodity Flow Survey

Observations:

• Different classes of decisions require different sources of data:
  – E.g. much designed survey data contributes baseline data for decisions about infrastructure and strategic planning
  – Transaction-based big data could contribute frequency and granularity of estimates

• In big data, data sources are stakeholders
  – Businesses need to react quickly and predict the future – and need frequently updated detailed data
  – Critical to provide a value proposition to business
  – Critical to develop a trust relationship

• Some potential sources
  – ERP and DRP operations data
  – EDI
  – Mobile Phone Traffic Data

Some Non-Technical Questions About Sources

● Who are the key stakeholders in a big data source, and what are the key stakeholder incentives?
○ What key decisions does this information support for stakeholders? What are the gaps in data from the stakeholder perspective?
○ What are barriers associated with new sources of information?
○ Legal barriers
○ Economic barriers
○ Social/trust barriers

Potential Roles for Libraries

Potential Roles -- Infrastructure

Dissemination
• Catalog range of new statistics/indicators, sources
• Selection based on quality
• Guide proper use

Durability
• Ensure long-term accessibility of big data
• Manage provenance, versioning
• Provide transparency of new indicators/statistics

Security & Confidentiality
• Libraries could be a trusted and accountable 3rd party
• Store and integrate data from multiple sources
• Could develop expert implementation of privacy best practices

Potential Roles -- Leadership

Advocacy
• Advocate for quality, transparency, replication, durable access.

Standardization
• Develop new methods for big data management
• Identify “best practices” for replication, transparency, long-term access
• Standardize licenses for reuse, preservation

Additional References

● Einav, Liran, and Jonathan Levin. "Economics in the age of big data." Science 346.6210 (2014): 1243089. http://www.sciencemag.org/content/346/6210/1243089.short

● Varian, Hal R. "Big data: New tricks for econometrics." The Journal of Economic Perspectives 28.2 (2014): 3-27. http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf

● Reimsbach-Kounatze, C. (2015), “The Proliferation of “Big Data” and Implications for Official Statistics and Statistical Agencies: A Preliminary Analysis”, OECD Digital Economy Papers, No. 245, OECD Publishing. http://dx.doi.org/10.1787/5js7t9wqzvg8-en

● Kriger, David S., et al. Freight Transportation Surveys. Vol. 410. Transportation Research Board, 2011. http://www.nap.edu/catalog/13627/nchrp-synthesis-410-freight-transportation-surveys

Questions?

E-mail: [email protected]

Web: informatics.mit.edu

Creative Commons License

This work, Managing Confidential information in research, by Micah Altman (http://redistricting.info), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.