data science
A quick introduction to the fascinating world of business and data analytics
prithwis mukerjee, ph.d.
Introduction to Data Science
Prithwis Mukerjee, PhD
Praxis Business School, Calcutta
Agenda
● Why data science?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains
Volume
Data is being acquired from a variety of sources
● EFT in Banks, Credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social Media Updates
● Blogs
● Websites
Variety / Velocity
● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph Nodes
  ○ Social Media “friends”
  ○ Websites linked to each other
Data is being generated fast and is becoming obsolete or useless equally fast
● Realtime (or near realtime) data from sensors, cameras
● Website traffic
● Social media “trends”
So what is Big Data ?
● Volume
● Velocity
● Variety ?
A new term coined by IT vendors to push new technology like
● MapReduce
● Hadoop
● NoSQL
A new way to
● collect
● store
● manage
● analyse
● visualise
data
Big Data is like Crude Oil { not new Oil }
Think of data as crude oil !
Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos
But what about refining ?
The Science (and Art) of Data
If Big Data is extracting, transporting and storing the crude oil, Data Science is the refining:
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impact
● Communicating relevant business stories
Two Perspectives
[Figure: diagram of overlapping skill areas. Labels: Programming or “Hacking” Skills; Mathematics, Statistics Knowledge; Business Domain Knowledge; Machine Learning; Operations Research; RDBMS, ERP / BI; Data Science]
10 Things {most} Data Scientists do ...
1. Ask good questions
   What is what? We do not know! We would like to know.
2. Define and test hypotheses, run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data
Statistics - World of Data
● Data comes in various types
  ○ Nominal - colour, gender, PIN code
  ○ Ordinal - scale of 1-10, {high, medium, low}
  ○ Interval - Dates, Temperature (Centigrade)
  ○ Ratio - length, weight, count
● Data comes in various structures
  ○ Structured data - nominal, ordinal, interval, ratio
  ○ Unstructured text - email, tweets, reviews
  ○ Images, voice prints
  ○ Graphs, networks - social media friendships, likes
Descriptive Statistics
● Numeric Description
  ○ Mean, Median, Mode
  ○ Quartile, Percentile
  ○ Variance / Standard Deviation
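The numeric descriptions listed above can be tried out in a few lines. This is a minimal sketch using Python's standard `statistics` module; the `sales` figures are made-up illustrative values, not data from the presentation.

```python
import statistics

# Hypothetical sample: eight monthly sales figures (illustrative values only)
sales = [12, 15, 15, 18, 21, 24, 30, 45]

mean = statistics.mean(sales)                 # arithmetic average
median = statistics.median(sales)             # middle value of the sorted data
mode = statistics.mode(sales)                 # most frequent value
quartiles = statistics.quantiles(sales, n=4)  # three cut points: Q1, Q2, Q3
variance = statistics.variance(sales)         # sample variance
std_dev = statistics.stdev(sales)             # sample standard deviation

print(mean, median, mode, quartiles, variance, std_dev)
```

The mean is pulled upwards by the single large value (45) while the median is not, which is why both are usually reported together.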
Statistics : The Path Ahead
Probability, Distributions
Testing of Hypothesis
Regression, Testing
Predictive Analysis
Data Mining / Machine Learning
Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data
Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection
Some definitions
Instance (an item or record)
● an observation that is characterised by a number of attributes
  ○ person - with attributes like age, salary, qualification
  ○ sale - with product, quantity, price
Attribute
● a measured characteristic of an instance
Class
● grouping of an instance into
  ○ acceptable, not acceptable
  ○ mammal, fish, bird
Nominal
● colour, PIN code, state
Ordinal
● ranking : tall, medium, short or feedback on a scale of 1 - 10
Ratio
● length, price, duration, quantity
Interval
● date, temperature
Data Mining : Classification
Classification
● Which loan applicant will not default on the loan ?
● Which potential customer will respond to a mailer campaign ?
Classification Example
[Figure: a table whose columns are labelled categorical, categorical, continuous and class. A Training Set feeds Learn Classifier to produce a Model, which is applied to a Test Set.]
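The train / learn / test flow can be sketched in a few lines. The example below is only an illustration: the loan records are invented, and the 1-nearest-neighbour rule with its ad hoc `distance` function is an assumed stand-in for whatever classifier the original slide had in mind.

```python
# Each record: (owns_home, marital_status, income_in_thousands, defaulted)
# Two categorical attributes, one continuous attribute and a class label.
# Hypothetical toy data, not taken from the presentation.
training_set = [
    ("yes", "single", 125, "no"),
    ("no", "married", 100, "no"),
    ("no", "single", 70, "no"),
    ("yes", "married", 120, "no"),
    ("no", "divorced", 95, "yes"),
    ("no", "single", 85, "yes"),
    ("no", "single", 90, "yes"),
    ("yes", "divorced", 220, "no"),
]
test_set = [
    ("no", "single", 88, "yes"),
    ("yes", "married", 110, "no"),
]

def distance(a, b):
    """Mismatches on the two categorical attributes plus a scaled income gap."""
    categorical = (a[0] != b[0]) + (a[1] != b[1])
    continuous = abs(a[2] - b[2]) / 100.0  # assumed scaling, chosen for the toy data
    return categorical + continuous

def classify(record, training):
    """Return the class of the closest training record (1-nearest neighbour)."""
    nearest = min(training, key=lambda row: distance(record, row))
    return nearest[3]

# Apply the "model" to the test set and compare with the known classes.
predictions = [classify(row, training_set) for row in test_set]
accuracy = sum(p == row[3] for p, row in zip(predictions, test_set)) / len(test_set)
print(predictions, accuracy)
```

Keeping the test set separate from the training set is the point of the exercise: accuracy is measured on records the classifier has not seen.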
Data Mining : Clustering
Given a set of unclassified data points, how to find a natural grouping within them
● Can we segment the market in some way that is not yet known ?
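One common way of finding such a grouping is k-means. The sketch below is a bare-bones version under stated assumptions: the customer points are invented, the number of clusters `k` is fixed in advance, and the first `k` points are used as starting centroids, which real implementations would not do.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means for 2-D points; returns (centroids, clusters)."""
    # Assumption: seed with the first k points to keep the example deterministic.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters

# Hypothetical customers: (store visits per month, average basket value)
customers = [(1.0, 2.0), (8.0, 9.0), (1.5, 1.8), (9.0, 8.5), (1.2, 2.2), (8.5, 9.5)]
centroids, clusters = kmeans(customers, k=2)
print(centroids)
print(clusters)
```

No class labels are supplied anywhere; the two market segments emerge from the distances between the points alone.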
Example of Document Clustering
Clustering points : 3204 articles from the Los Angeles Times
Similarity Measure : How many words are common in these documents ( after excluding some common words )
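The similarity measure described above can be sketched as a count of shared words. The three sentences and the tiny `STOP_WORDS` set are invented for illustration; a real system would use a much longer stop-word list and would normally strip punctuation as well.

```python
# Assumed, deliberately tiny list of "common words" to exclude
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

def shared_words(doc_a, doc_b):
    """Number of distinct non-stop words that appear in both documents."""
    words_a = set(doc_a.lower().split()) - STOP_WORDS
    words_b = set(doc_b.lower().split()) - STOP_WORDS
    return len(words_a & words_b)

doc1 = "The election results shook the stock market"
doc2 = "Stock market rallies after the election"
doc3 = "The recipe calls for a pinch of salt"

print(shared_words(doc1, doc2))  # the two finance stories share several words
print(shared_words(doc1, doc3))  # the recipe shares none with the finance story
```

Documents with a high count would be placed in the same cluster; documents with a low count would be kept apart.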
Clustering of S&P Stock Data
● Observe Stock Movements every day.
● Clustering points: Stock-{UP/DOWN}
● Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.
● We used association rules to quantify a similarity measure.
Regression
● Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
  ○ Extensively studied in statistics and neural network fields.
● Examples:
  ○ Predicting sales amounts of new product based on advertising expenditure.
  ○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  ○ Time series prediction of stock market indices.
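The first example, sales as a function of advertising expenditure, can be sketched with an ordinary least-squares straight line. The figures below are invented, and a linear model is simply assumed; nothing in the presentation says which model was intended.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend and resulting sales (same currency units)
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6.0 + intercept  # forecast for a spend of 6.0
print(slope, intercept, predicted_sales)
```

The slope reads as "extra sales per extra unit of advertising", which is the kind of actionable number a business audience usually wants from a regression.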
Data Mining : Association Rules Mining
Association Rules
● which products should be kept along with other products
● which two products should never be discounted together
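Questions like these are usually answered with two standard measures, support and confidence, which the slide does not name but which underlie most association rule mining. The sketch below computes both on a handful of invented shopping baskets.

```python
# Hypothetical market baskets (illustrative only)
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # how often the pair is bought together
print(confidence({"bread"}, {"butter"}))  # strength of the rule bread => butter
```

A rule with high support and high confidence, such as bread => butter here, suggests products worth shelving together.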
Visualisation : The need to tell a story
Definitions
Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data
Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
R : Your first step into Data Science
Try out this free interactive tutorial just now
Statistical Tools
http://r4stats.com/articles/popularity/
Some Comparisons
Map Reduce
● Input : A set of (key, value) pairs
● User supplies two functions
  ○ Map (k,v) => List(k1,v1)
  ○ Reduce (k1, list(v1)) => v2
● Output is the set of (k1,v2) pairs
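The contract above can be illustrated with the classic word count, run in a single process. This is a sketch only: `run_mapreduce` is a hypothetical helper that mimics the grouping step a real framework performs across many machines, and the input lines are invented.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: add up the counts for one word."""
    return sum(values)

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for k, v in records:
        for k1, v1 in mapper(k, v):
            groups[k1].append(v1)  # the "shuffle": collect values per key
    return {k1: reducer(k1, v1s) for k1, v1s in groups.items()}

# Input: (line number, line text) pairs
lines = [(0, "big data is big"), (1, "data science refines data")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

Only `map_fn` and `reduce_fn` are specific to the problem; everything else is plumbing, which is the part a framework takes over.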
Hadoop
A programming framework that allows you to run MapReduce jobs on a distributed cluster of low cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS
1. Hive
   a. A plug-in that allows one to use SQL-like queries that are converted into MapReduce jobs
2. Pig
   a. A scripting language for writing long queries
3. HBase
   a. A non-relational DBMS
4. Sqoop
   a. Moves data to and from HDFS
Data-in-Flight
Business Domain
● Financial Sector
  ○ Risk Management, Credit Scoring
  ○ Predict Customer Spend
  ○ Stock and Investment Analysis
  ○ Loan approval
● Telecom Sector
  ○ Fraud Detection
  ○ Churn Prediction
● Retail and Marketing
  ○ Market segmentation
  ○ Promotional strategy
  ○ Market Basket Analysis
  ○ Trend Analysis
● Healthcare & Insurance
  ○ Fraud Detection
  ○ Drug Development
  ○ Medical Diagnostic Tools
Conclusion
Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles
● Why data science?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains
Thank You
Contact
Prithwis Mukerjee
Professor, Praxis Business School
[email protected]
This presentation is accessible at the blog
http://blog.yantrajaal.com at the following URL
http://bit.ly/pm-datascience