What is Data Science? Daniel D. Gutierrez, Data Scientist AMULET Analytics February 2014

Upload: amuletc

Post on 10-May-2015

DESCRIPTION

What is Data Science talk for General Assembly by Daniel D Gutierrez - February 5, 2014

TRANSCRIPT

Page 1: What is Data Science? Daniel D Gutierrez

What is Data Science?

Daniel D. Gutierrez, Data Scientist

AMULET Analytics

February 2014

A Life in Data Science

AMULET Analytics

– My personal consultancy, doing work in data science and computational marketing

– Data analysis, machine learning and visualization for enterprises

– Wide breadth of industries: startups, manufacturing, non-profit, fashion, e-commerce, market research, etc.

Big Data Journalist

– Managing Editor, insideBIGDATA.com

– Blogger, Big Data Republic (bigdatarepublic.com)

– Blogger, All Analytics (allanalytics.com)

Teaching

– Community TA, Coursera

– UCLA Extension

Writing a book: “Introduction to Machine Learning with R”

Data Science in Perspective

Facilitates a cascade of technologies

– Big Data is facilitated by data science

– Data science is facilitated by machine learning

– Machine learning is a confluence of technologies and disciplines: computer science, mathematical statistics, probability theory, visualization

Data science is nothing new!

– Components have been around for decades

– “Data science” is just a new name for something old and proven (I do love it!)

– “Machine learning” used to be called “data mining” or KDD (knowledge discovery in databases)

Much hype recently

– Harvard Business Review proclaimed data scientist the “sexiest job of the 21st century.” I’ll take it!

– Now with “big data” it’s a force barely contained

Who Does Data Science? Unicorns!

Controversy in hiring data scientists

– Some companies post job ads for unicorns, mythical creatures having no basis in reality

Hire a data science TEAM!

– Don’t expect a single individual to be both a “theorist” and an “experimentalist”

Consultant vs. full-time hire

What is Big Data?

Big Data

– “large data sets so big that commonly-used software tools are unable to capture, curate, manage, and process the data within a tolerable elapsed time.”

Apache Hadoop

Hadoop dominates the Big Data market

– Used widely by some of the world's largest websites, such as Facebook, eBay, Amazon and Yahoo

– Moving into the enterprise

– Created by Doug Cutting and Mike Cafarella; much of its early development took place at Yahoo!

Applications for Big Data

Smarter Healthcare

Multi-channel sales

Finance

Log Analysis

Homeland Security

Traffic Control

Telecom

Search Quality

Manufacturing

Trading Analytics

Fraud and Risk

Retail: Churn

“Big Data is the definitive source of competitive advantage across all industries. For those organizations that understand and embrace the new reality of Big Data, the possibilities for new innovation, improved agility, and increased profitability are nearly endless.”

Source: Wikibon 2012

The Minnesota Dad

A father and daughter walk into a Target store and ask to speak with the manager:

– The father wants to know why the store is bombarding his teenage daughter with ads for baby strollers, diapers and other baby goods. “Are you trying to encourage her to get pregnant?”

– The befuddled manager apologizes and responds that he has no idea why the company is sending her such items

The father later phones the store to apologize: it turns out his daughter was expecting

How?

– Target used Big Data to predict pregnancy. When a woman begins buying vitamins, increases her purchases of lotion, and buys an oversized purse or bag, the odds are very high she is expecting

– Target knew the daughter was pregnant before her family did
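A prediction like the one in this story can be pictured as a score built from purchase signals. The sketch below is only an illustration of that idea: the item names, the weights and the bias are invented for the example and are not Target's actual model, which the talk does not describe. It uses a logistic function so that more signals push the score closer to 1.

```python
import math

# Hypothetical weights, invented for illustration only (NOT Target's model).
SIGNAL_WEIGHTS = {"vitamins": 1.5, "lotion": 1.0, "large_bag": 0.8}
BIAS = -2.5  # hypothetical baseline: with no signals the score stays low


def expecting_score(basket):
    """Return a score in (0, 1) from the purchase signals present in `basket`."""
    z = BIAS + sum(w for item, w in SIGNAL_WEIGHTS.items() if item in basket)
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function


no_signals = expecting_score(set())
all_signals = expecting_score({"vitamins", "lotion", "large_bag"})
```

In a real system the weights would be learned from historical, labelled purchase data rather than typed in by hand.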

Machine Learning Overview

What is Machine Learning?

– Components have been around for decades

– “Data science” is just a new name for something old and proven (I do love it!)

– “Machine learning” used to be called “data mining” or KDD

Supervised learning

– Prediction and classification

– Linear regression, logistic regression, classification trees, support vector machines (SVM), neural nets

– Train the algorithm on known, labelled data so that it can make predictions on new data

Unsupervised learning

– Hierarchical clustering

– K-means clustering

– Principal component analysis (PCA)

– Dimensionality reduction to address “the curse of dimensionality”
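The difference between the two families can be sketched in a few lines of plain Python. This is a minimal illustration on toy data, not a production approach: `fit_line`, `predict` and `kmeans_1d` are names made up for the example, and the numbers are invented. The supervised part fits a line to labelled pairs by ordinary least squares and predicts a new value; the unsupervised part runs k-means (Lloyd's algorithm) on unlabelled one-dimensional points.

```python
import random


def fit_line(xs, ys):
    """Supervised: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


def kmeans_1d(points, k, iterations=20, seed=0):
    """Unsupervised: Lloyd's algorithm on 1-D points; returns sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)


# Labelled toy data generated from y = 2x + 1.
model = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = predict(model, 5.0)

# Unlabelled toy data with two obvious groups.
centroids = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], k=2)
```

The supervised functions need the answers (`ys`) to learn from, while `kmeans_1d` only sees the points and discovers the two groups on its own.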

Sentiment Analysis

R vs. Python Wars

R

– Very good for data acquisition, cleaning, munging, exploratory analysis, model selection, machine learning algorithm development and training, and model performance evaluation

– One of the best visualization tools, bar none

– Has over 4,000 packages

Python

– Good choice for production deployment

– Rapidly catching up with R in terms of data science capabilities

Visualization is Critical

Learning More About Data Science

Doing Data Science

Cathy O’Neil & Rachel Schutt

O’Reilly Media

Data Science in Action

Summary – Data Science is Here to Stay

Integral part of Big Data

– Data science and machine learning fuel big data

The shortage of data scientists is real

– Big data is expected to be a $53.4 billion industry by 2016

– Job postings for “data scientist” increased 15,000% between 2011 and 2012

– The job market currently has 140,000 to 190,000 open positions

– Projected growth of 18.7% between 2010 and 2020

Companies of all sizes need to plan out their data science strategy

– Increase the value of enterprise data assets

2014 should be a wild year!

– Conference circuit is exploding

– New books, news sources, press coverage abound

Thank you!

Follow me: @AMULETAnalytics

Contact me: [email protected]

www.amuletanalytics.com