Using Your Powers for Good - DrivenData
TRANSCRIPT
THE DATA CAPACITY GAP
• McKinsey predicts a 140k – 180k shortage of data scientists
• Average salary of a data scientist: $118,709
• Average salary of the Executive Director of a nonprofit (budget $500k – $5m): $133,000
“Finding ways to make big data useful to humanitarian decision makers is one of the great challenges and opportunities of the network age.”
- UN Office for the Coordination of Humanitarian Affairs
The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship 2014
Brew, Loewi, Majumdar, Reece, Rozier.
Buildings: 197,157; Time: 76 years; Money: $98 million
Buildings: 42,695; Time: 16.4 years; Money: $21.3 million
Buildings: 378; Time: 2 months; Money: $189,000
Prediction Saves Time & Money
No Prediction; Current Model; Model Forecast
Lead Paint Inspections
Abelson, Varshney. http://www.datakind.org/blog/using-satellite-images-to-understand-poverty/
Quantifying Poverty
"The muffins are great...espthe blueberry! I have never had that good a blueberry muffin...its not super sickeysweet like most...."
"Food was soggy and cold by the time I got it and they messed up my order. Better food and better service from other places nearby."
Sentiment Analysis
$\frac{N_{good} - N_{bad}}{N_{words}}$
brainy, rockstars, agility, unreal, fascinate, problem-solver, steadiness, greatness, works, commendably
gawk, cons, disorder, inane, martyrdom, irrecoverable, plasticky, madman, decrement
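A lexicon-based score of this kind can be sketched in a few lines, assuming the score is the number of good words minus the number of bad words, divided by the total word count. The two tiny word sets and the function name `sentiment_score` are assumptions made for illustration; they are not the lexicon or code used in the project.

```python
import re

# Assumed toy lexicons; a real project would load a full opinion lexicon.
GOOD_WORDS = {"great", "good"}
BAD_WORDS = {"soggy", "cold", "messed"}

def sentiment_score(text):
    """Return (N_good - N_bad) / N_words for one piece of text."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # avoid dividing by zero on empty text
    n_good = sum(w in GOOD_WORDS for w in words)
    n_bad = sum(w in BAD_WORDS for w in words)
    return (n_good - n_bad) / len(words)

print(sentiment_score("The muffins are great"))    # 0.25
print(sentiment_score("Food was soggy and cold"))  # -0.4
```

Dividing by the word count keeps long and short reviews on the same scale, so the score always falls between -1 and 1.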
Using the winning algorithm, Boston could catch the same number of health violations with 40 percent fewer inspections, simply by better targeting city resources at what appear to be dirty-kitchen hotspots.
- Mike Luca, Harvard Business School
Education Resource Strategies
Budget
We are budget-ing our things
$Expenditures
Monies for
making students
good
Beware there be dollars here
Lots of Labels!
PETRO-VEND FUEL AND FLUIDS
MAINT MATERIALS
SATELLITE COOK
UPPER EARLY INTERVENTION PROGRAM 4
Regional Playoff Hosts
Supp.- Materials
ITEMGH EXTENDED DAY
FURNITURE AND FIXTURES
NON-CAPITALIZED AV
Water and Sewage *
Instructional Materials
Food Services - Other Costs
Capital Assets - Locally Defined Groupings
Text features in scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vectorized_data = vec.fit_transform(text_data)
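To make the idea of count features concrete, here is a plain-Python sketch of a bag-of-words matrix built from three of the labels above. It is not scikit-learn's implementation; it only mimics the default behavior of lowercasing and keeping tokens of two or more word characters. The names `tokenize`, `vocab`, and `rows` are hypothetical.

```python
import re
from collections import Counter

labels = ["MAINT MATERIALS", "Instructional Materials", "Supp.- Materials"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted vocabulary: one column per distinct token.
vocab = sorted({tok for text in labels for tok in tokenize(text)})

# One row per label, holding the count of each vocabulary token.
rows = []
for text in labels:
    counts = Counter(tokenize(text))
    rows.append([counts.get(tok, 0) for tok in vocab])

print(vocab)  # ['instructional', 'maint', 'materials', 'supp']
print(rows)   # [[0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
```

Every label shares the "materials" column, which is the kind of overlap a classifier can learn from.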
Processing: Tokenize on Punctuation
PETRO-VEND FUEL AND FLUIDS
PETRO VEND FUEL AND FLUIDS
PETRO-VEND
FUEL AND FLUIDS
Processing: Tokenize on Punctuation
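import re
from sklearn.feature_extraction.text import CountVectorizer
# Lowercase, split on runs of non-letters, and drop the empty strings re.split can leave behind
alpha_tokens = lambda text: [t for t in re.split("[^a-z]+", text.lower()) if t]
vec = CountVectorizer(tokenizer=alpha_tokens)
vectorized_data = vec.fit_transform(text_data)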
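The effect of tokenizing on punctuation can be checked without scikit-learn. The sketch below compares a whitespace split with a split on non-letters for two labels from the slides; the helper names `split_on_whitespace` and `split_on_non_letters` are hypothetical.

```python
import re

def split_on_whitespace(text):
    return text.lower().split()

def split_on_non_letters(text):
    # Split on runs of anything that is not a-z and drop empty pieces.
    return [t for t in re.split("[^a-z]+", text.lower()) if t]

label = "PETRO-VEND FUEL AND FLUIDS"
print(split_on_whitespace(label))   # ['petro-vend', 'fuel', 'and', 'fluids']
print(split_on_non_letters(label))  # ['petro', 'vend', 'fuel', 'and', 'fluids']

# Splitting on single characters leaves empty strings between
# adjacent delimiters, which is why the helper filters them out.
print(re.split("[^a-z]", "supp.- materials"))    # ['supp', '', '', 'materials']
print(split_on_non_letters("Supp.- Materials"))  # ['supp', 'materials']
```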
import re
from sklearn.feature_extraction.text import CountVectorizer
alpha_tokens = lambda text: [t for t in re.split("[^a-z]+", text.lower()) if t]
# Count single tokens plus 2-grams and 3-grams
vec = CountVectorizer(tokenizer=alpha_tokens, ngram_range=(1, 3))
vectorized_data = vec.fit_transform(text_data)
Processing: 2-grams and 3-grams
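What a range of 1 to 3 adds to the feature set can be sketched in plain Python. The function name `ngrams` is hypothetical, and the sketch only illustrates the idea of joining consecutive tokens; it is not scikit-learn's code.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all n-grams of consecutive tokens, joined with spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams(["fuel", "and", "fluids"]))
# ['fuel', 'and', 'fluids', 'fuel and', 'and fluids', 'fuel and fluids']
```

The longer n-grams keep some word order, so "fuel and fluids" becomes its own feature rather than three unrelated words.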
Computational: Hashing Trick
import re
from sklearn.feature_extraction.text import HashingVectorizer
alpha_tokens = lambda text: [t for t in re.split("[^a-z]+", text.lower()) if t]
# Hash each token or n-gram to a column index instead of storing a vocabulary
vec = HashingVectorizer(tokenizer=alpha_tokens, ngram_range=(1, 3))
vectorized_data = vec.fit_transform(text_data)
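The hashing trick itself fits in a few lines. This sketch uses `zlib.crc32` as a stand-in hash and a deliberately tiny table of 8 columns, both assumptions chosen for readability; scikit-learn uses a different hash function and a much larger default table. The name `hashed_counts` is hypothetical.

```python
import zlib

def hashed_counts(tokens, n_features=8):
    """Map tokens to a fixed-length count vector without a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        # The hash of the token, modulo the table size, picks the column.
        idx = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[idx] += 1
    return vec

print(hashed_counts(["fuel", "and", "fluids", "fuel"]))
```

The vector length is fixed up front, so memory use does not grow with the number of distinct tokens. The trade-off is that different tokens can collide in the same column.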
import re
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import PolynomialFeatures
alpha_tokens = lambda text: [t for t in re.split("[^a-z]+", text.lower()) if t]
vec = HashingVectorizer(tokenizer=alpha_tokens, ngram_range=(1, 3))
vectorized_data = vec.fit_transform(text_data)
# Add products of each pair of distinct features (no squared terms, no constant column)
prep = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
preprocessed_data = prep.fit_transform(vectorized_data)
Statistical: Pairwise Interaction Terms
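For a single row of features, pairwise interaction terms are the products of every pair of distinct columns, appended to the original values. The sketch below shows this in plain Python; the function name `add_pairwise_interactions` is hypothetical, and it works on one dense row only, unlike the sparse matrices used above.

```python
from itertools import combinations

def add_pairwise_interactions(row):
    """Return the row followed by the product of each pair of distinct features."""
    return list(row) + [a * b for a, b in combinations(row, 2)]

# Features x0, x1, x2 gain the terms x0*x1, x0*x2, x1*x2.
print(add_pairwise_interactions([1, 0, 1]))  # [1, 0, 1, 0, 1, 0]
```

An interaction column is nonzero only when both of its source features are present, which lets a linear model respond to combinations of tokens. With n input features this adds n(n-1)/2 columns, so the output grows quickly as n increases.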