
Using Your Powers for Good

PETER BULL
ISAAC SLAVITT

@drivendataorg

[peter | isaac]@drivendata.org

www.drivendata.org

When it comes to data, nonprofits don’t know what they don’t know.

THE DATA CAPACITY GAP

• McKinsey predicts a shortage of 140k – 180k data scientists

• Average salary of a data scientist: $118,709

• Average salary of the Executive Director of a nonprofit (budget $500k – $5m): $133,000

LITERACY

CAPACITY

“Finding ways to make big data useful to humanitarian decision makers is one of the great challenges and opportunities of the network age.”

– UN Office for the Coordination of Humanitarian Affairs

What is data for good?

The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship 2014. Brew, Loewi, Majumdar, Reece, Rozier.

Lead Paint Inspections: Prediction Saves Time & Money

No Prediction: 197,157 buildings, 76 years, $98 million

Current Model: 42,695 buildings, 16.4 years, $21.3 million

Model Forecast: 378 buildings, 2 months, $189,000

Abelson, Varshney. http://www.datakind.org/blog/using-satellite-images-to-understand-poverty/

Quantifying Poverty

Protect the Environment

"The muffins are great...espthe blueberry! I have never had that good a blueberry muffin...its not super sickeysweet like most...."

"Food was soggy and cold by the time I got it and they messed up my order. Better food and better service from other places nearby."

Sentiment Analysis

sentiment score = (N_good − N_bad) / N_words
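A minimal sketch of this word-count sentiment score, assuming small hand-picked positive and negative word lists (the lists below are illustrative, not the ones used in the talk):

# Word-count sentiment score: (N_good - N_bad) / N_words.
# GOOD_WORDS and BAD_WORDS are illustrative placeholders.
GOOD_WORDS = {"great", "good", "love"}
BAD_WORDS = {"soggy", "cold", "messed"}

def sentiment_score(text):
    words = text.lower().split()
    if not words:
        return 0.0
    n_good = sum(w.strip(".!,") in GOOD_WORDS for w in words)
    n_bad = sum(w.strip(".!,") in BAD_WORDS for w in words)
    return (n_good - n_bad) / len(words)

print(sentiment_score("The muffins are great, esp the blueberry!"))       # positive
print(sentiment_score("Food was soggy and cold by the time I got it."))   # negative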

[Word clouds of example topics: one with words like brainy, rockstars, agility, unreal, fascinate, problem-solver, steadiness, greatness, works, commendably; another with words like gawk, cons, disorder, inane, martyrdom, irrecoverable, plasticky, madman, decrement]

Topic Modeling (LDA)
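A minimal sketch of topic modeling with scikit-learn's LatentDirichletAllocation, assuming text_data is a list of review strings (the component count and other parameters are illustrative, not the deck's actual settings):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# text_data is assumed to be a list of raw review strings
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(text_data)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(counts)

# Print the top words per topic, the kind of words shown in the word clouds above
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-10:][::-1]]
    print(i, top)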

Using the winning algorithm, Boston could catch the same number of health violations with 40 percent fewer inspections, simply by better targeting city resources at what appear to be dirty-kitchen hotspots.

- Mike Luca, Harvard Business School

Education Resource Strategies

[Illustration of messy school budget documents: "Budget: we are budget-ing our things", "$ Expenditures: monies for making students good", "Beware there be dollars here"]

Lots of Labels!

PETRO-VEND FUEL AND FLUIDS

MAINT MATERIALS

SATELLITE COOK

UPPER EARLY INTERVENTION PROGRAM 4

Regional Playoff Hosts

Supp.- Materials

ITEMGH EXTENDED DAY

FURNITURE AND FIXTURES

NON-CAPITALIZED AV

Water and Sewage *

Instructional Materials

Food Services - Other Costs

Capital Assets - Locally Defined Groupings

All about the features!

Text features in scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vectorized_data = vec.fit_transform(text_data)

Processing: Tokenize on Punctuation

PETRO-VEND FUEL AND FLUIDS

PETRO VEND FUEL AND FLUIDS

PETRO-VEND

FUEL AND FLUIDS

Processing: Tokenize on Punctuation

import re
from sklearn.feature_extraction.text import CountVectorizer

# Split on any non-alphabetic character instead of the default token pattern
alpha_tokens = lambda text: re.split("[^a-z]", text.lower())

vec = CountVectorizer(tokenizer=alpha_tokens)
vectorized_data = vec.fit_transform(text_data)
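As a quick illustrative check (not from the original deck), this tokenizer splits the sample label on the hyphen and the spaces:

import re
alpha_tokens = lambda text: re.split("[^a-z]", text.lower())
print(alpha_tokens("PETRO-VEND FUEL AND FLUIDS"))
# ['petro', 'vend', 'fuel', 'and', 'fluids']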

Processing: 2-grams and 3-grams

PETRO VEND FUEL AND FLUIDS

1-grams

2-grams

3-grams

import re
from sklearn.feature_extraction.text import CountVectorizer

alpha_tokens = lambda text: re.split("[^a-z]", text.lower())

# Include unigrams, bigrams, and trigrams as features
vec = CountVectorizer(tokenizer=alpha_tokens, ngram_range=(1, 3))
vectorized_data = vec.fit_transform(text_data)

Processing: 2-grams and 3-grams
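As an illustrative check (not from the original deck), the 2-grams for the sample label join adjacent tokens with a space:

import re
from sklearn.feature_extraction.text import CountVectorizer

alpha_tokens = lambda text: re.split("[^a-z]", text.lower())
vec = CountVectorizer(tokenizer=alpha_tokens, ngram_range=(2, 2))
vec.fit(["PETRO-VEND FUEL AND FLUIDS"])
print(vec.get_feature_names_out())
# ['and fluids' 'fuel and' 'petro vend' 'vend fuel']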

Computational: Hashing Trick

PETRO VEND FUEL AND FLUIDS

2954 9384 4569 1197 8947

Computational: Hashing Trick

import re
from sklearn.feature_extraction.text import HashingVectorizer

alpha_tokens = lambda text: re.split("[^a-z]", text.lower())

# Hash n-grams directly to column indices instead of storing a vocabulary
vec = HashingVectorizer(tokenizer=alpha_tokens, ngram_range=(1, 3))
vectorized_data = vec.fit_transform(text_data)
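The core idea behind the hashing trick, sketched with a plain hash function and modular indexing (HashingVectorizer itself uses a signed MurmurHash; the numbers on the slide above are just illustrative indices):

# Toy sketch: map each token to a column index with hash(token) % n_features,
# so no vocabulary has to be kept in memory.
n_features = 10_000
tokens = ["petro", "vend", "fuel", "and", "fluids"]
indices = [hash(tok) % n_features for tok in tokens]
print(indices)  # e.g. five column indices in [0, n_features)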

Statistical: Pairwise Interaction Terms

import re
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import PolynomialFeatures

alpha_tokens = lambda text: re.split("[^a-z]", text.lower())

vec = HashingVectorizer(tokenizer=alpha_tokens, ngram_range=(1, 3))
vectorized_data = vec.fit_transform(text_data)

# Add products of pairs of features (interaction terms), keeping the originals
prep = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
preprocessed_data = prep.fit_transform(vectorized_data)

Statistical: Pairwise Interaction Terms
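A small illustrative example (not from the deck) of what degree-2, interaction-only features look like on a dense toy matrix:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0]])
prep = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(prep.fit_transform(X))
# [[1. 2. 3. 2. 3. 6.]]  -> a, b, c, a*b, a*c, b*c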

Education Resource Strategies

Saving about 300 staff-hours per year

How do I get involved in data for good?

1 By joining data for good organizations

www.datakind.org

www.codeforboston.org

meetup.com/Data-Science-for-Social-Good

2 By doing #data4good in your spare time

blog.datalook.io/openimpact/

data.cityofboston.gov

www.drivendata.org

3 By participating in a fellowship program

dssg.io

dssg-atl.io

escience.washington.edu

www.codeforamerica.org

www.bayesimpact.org

4 By attending data for good events

www.kdd.org

www.bloomberg.com/company/d4gx

dogooddata.com

http://socialgoodtech.org/

5 By getting involved professionally

Look for local non-profits!

Get your company involved!

Questions?

@drivendataorg