data science

Introduction to Data Science, by Prithwis Mukerjee, PhD, Praxis Business School, Calcutta

Upload: prithwis-mukerjee

Post on 11-May-2015

DESCRIPTION

A quick introduction to the fascinating world of business and data analytics

TRANSCRIPT

Page 1: Data Science

prithwis mukerjee, ph.d.

Introduction to Data Science

Prithwis Mukerjee, PhD
Praxis Business School, Calcutta

Page 2: Data Science

Agenda

● Why data science ?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains

Page 3: Data Science

Page 4: Data Science

Volume

Data is being acquired from a variety of sources
● EFT in banks, credit card payments
● Cell phones
● Sensors attached to a variety of equipment
● Surveillance cameras, CCTV
● Social media updates
● Blogs
● Websites

Page 5: Data Science

Variety / Velocity

● Numeric data
● Structured text data
● Unstructured text data
● Images
● Sound and video recordings
● Graph nodes
  ○ Social media “friends”
  ○ Websites linked to each other

Data is being generated fast and is becoming obsolete or useless equally fast
● Realtime (or near realtime) data from sensors, cameras
● Website traffic
● Social media “trends”

Page 6: Data Science

So what is Big Data ?

● Volume
● Velocity
● Variety ?

A new term coined by IT vendors to push new technology like
● MapReduce
● Hadoop
● NoSQL

A new way to
● collect
● store
● manage
● analyse
● visualise
data

Page 7: Data Science

Big Data is like Crude Oil { not new Oil }

Think of data as crude oil !

Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos

But what about refining ?

Page 8: Data Science

The Science (and Art) of Data

Think of data as crude oil !

Big Data is like extracting the crude oil, transporting it in mega tankers, pumping it through pipelines and storing it in massive silos

Data Science
● Discovering what we do not know about the data
● Obtaining predictive, actionable insight
● Creating data products that have business impacts
● Communicating relevant business stories

Refining

Page 9: Data Science

Two Perspectives

[Figure: two diagrams. Labels: Programming or “Hacking” Skills; Mathematics, Statistics Knowledge; Business Domain Knowledge; Machine Learning; Operations Research; RDBMS, ERP / BI; Data Science]

Page 10: Data Science

10 Things {most} Data Scientists do ...

1. Ask good questions
   What is what ? We do not know ! We would like to know
2. Define, Test Hypothesis, Run experiments
3. Scoop, scrape, sample business data
4. Wrestle and tame data
5. Play with data, discover unknowns
6. Create models, algorithms
7. Understand data relationships
8. Tell the machine how to learn from the data
9. Create data products that deliver actionable insights
10. Tell relevant business stories from data

Page 11: Data Science

Statistics - World of Data

● Data comes in various types
  ○ Nominal - colour, gender, PIN code
  ○ Ordinal - scale of 1-10, {high, medium, low}
  ○ Interval - Dates, Temperature (Centigrade)
  ○ Ratio - length, weight, count
● Data comes in various structures
  ○ Structured data - nominal, ordinal, interval, ratio
  ○ Unstructured text - email, tweets, reviews
  ○ Images, voice prints
  ○ Graphs, networks - social media friendships, likes

Page 12: Data Science

Descriptive Statistics

● Numeric Description
  ○ Mean, Median, Mode
  ○ Quartile, Percentile
  ○ Variance / Standard Deviation
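
The numeric descriptions listed on this slide can be tried out with Python's standard `statistics` module. This is only a minimal sketch: the `spend` sample is made up for illustration and is not from the presentation, and the "inclusive" quantile method is one of several common conventions.

```python
import statistics

# Hypothetical sample (made-up values): monthly spend of nine customers.
spend = [12, 15, 15, 18, 21, 24, 30, 36, 45]

mean = statistics.mean(spend)        # arithmetic average
median = statistics.median(spend)    # middle value of the sorted sample
mode = statistics.mode(spend)        # most frequent value

# Quartiles: the three cut points that split the sorted sample into four parts.
q1, q2, q3 = statistics.quantiles(spend, n=4, method="inclusive")

# Percentiles: n=100 gives 99 cut points; index 89 holds the 90th percentile.
p90 = statistics.quantiles(spend, n=100, method="inclusive")[89]

variance = statistics.variance(spend)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(spend)      # square root of the sample variance

print("mean:", mean, "median:", median, "mode:", mode)
print("quartiles:", q1, q2, q3, "90th percentile:", p90)
print("variance:", variance, "standard deviation:", std_dev)
```

The second quartile is the median by definition, which is a quick sanity check on any quantile convention.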

Page 13: Data Science

Statistics : The Path Ahead

● Probability, Distributions
● Testing of Hypothesis
● Regression, Testing
● Predictive Analysis

Page 14: Data Science

Data Mining / Machine Learning

Is the process of obtaining
● novel
● valid
● potentially useful
● understandable
patterns in data

Typical tasks are
● classification
● clustering
● association rules
● sequential patterns
● regression
● deviation detection

Page 15: Data Science

Some definitions

Instance (an item or record)
● an observation that is characterised by a number of attributes
  ○ person - with attributes like age, salary, qualification
  ○ sale - with product, quantity, price

Attribute
● measuring characteristics of an instance

Class
● grouping of an instance into
  ○ acceptable, not acceptable
  ○ mammal, fish, bird

Nominal
● colour, PIN code, state

Ordinal
● ranking : tall, medium, short or feedback on a scale of 1 - 10

Ratio
● length, price, duration, quantity

Interval
● date, temperature

Page 16: Data Science

Data Mining : Classification

Classification
● Which loan applicant will not default on the loan ?
● Which potential customer will respond to a mailer campaign ?

Page 17: Data Science

Classification Example

[Figure: table with columns marked categorical, categorical, continuous and class. Labels: Training Set, Learn Classifier, Model, Test Set]
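
The train-then-test flow in this example can be sketched in a few lines of Python. The slide does not say which classifier is used, so a 1-nearest-neighbour rule is swapped in here purely for illustration. The loan records, the attribute names and the helper functions `learn`, `distance` and `classify` are all hypothetical.

```python
# Each record: ((employment, housing, income_in_thousands), class_label).
# Two categorical attributes, one continuous attribute, one class. Made-up data.
training_set = [
    (("salaried", "owner", 80), "repays"),
    (("salaried", "tenant", 60), "repays"),
    (("self-employed", "owner", 90), "repays"),
    (("self-employed", "tenant", 30), "defaults"),
    (("salaried", "tenant", 25), "defaults"),
    (("self-employed", "tenant", 40), "defaults"),
]
test_set = [
    (("salaried", "owner", 70), "repays"),
    (("self-employed", "tenant", 35), "defaults"),
    (("salaried", "tenant", 30), "defaults"),
]

def learn(examples):
    """For 1-nearest-neighbour the 'model' is the stored examples plus the
    income range used to scale the continuous attribute."""
    incomes = [attrs[2] for attrs, _ in examples]
    return {"examples": examples, "income_range": max(incomes) - min(incomes)}

def distance(a, b, income_range):
    """Categorical mismatches count 1 each; income difference is scaled to 0..1."""
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    return mismatches + abs(a[2] - b[2]) / income_range

def classify(model, attrs):
    """Return the class of the closest training example."""
    nearest = min(
        model["examples"],
        key=lambda example: distance(attrs, example[0], model["income_range"]),
    )
    return nearest[1]

model = learn(training_set)
predictions = [classify(model, attrs) for attrs, _ in test_set]
correct = sum(p == label for p, (_, label) in zip(predictions, test_set))
accuracy = correct / len(test_set)
print(predictions, accuracy)
```

The test set plays no part in `learn`; it is only used afterwards to check how well the learned model generalises.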

Page 18: Data Science

Data Mining : Clustering

Given a set of unclassified data points, how to find a natural grouping within them

● Can we segment the market in some way that is not yet known ?
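
One common way to look for such a natural grouping is k-means, sketched below. The slide does not name an algorithm, so k-means is an assumption, as are the made-up customer points, the choice of k = 2 and the naive "first k points" initialisation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iterations=100):
    """Return (labels, centroids) after alternating assignment and update steps."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical customers: (visits per month, average basket value).
customers = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]
labels, centroids = k_means(customers, k=2)
print(labels, centroids)
```

No class labels are supplied anywhere; the two segments emerge only from the distances between the points.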

Page 19: Data Science

Example of Document Clustering

Clustering points : 3204 articles from the Los Angeles Times

Similarity Measure : How many words are common in these documents ( after excluding some common words )
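
The similarity measure described here, the number of shared words after common words are excluded, could look roughly like this in Python. The stop-word list and the three sample sentences are made up for illustration; they are not the Los Angeles Times articles.

```python
import re

# Hypothetical list of "common words" to exclude; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}

def content_words(text):
    """Lower-case the text, split it into words and drop the stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return {word for word in words if word not in STOP_WORDS}

def common_word_count(doc_a, doc_b):
    """Similarity = number of distinct non-stop words the documents share."""
    return len(content_words(doc_a) & content_words(doc_b))

finance_1 = "The central bank raised interest rates to control inflation"
finance_2 = "Inflation worries pushed the bank to raise interest rates"
sports_1 = "The home team won the final match in extra time"

print(common_word_count(finance_1, finance_2))  # two finance stories
print(common_word_count(finance_1, sports_1))   # finance story vs sports story
```

Documents with a high count are placed in the same cluster. Note that "raised" and "raise" count as different words in this sketch because no stemming is applied.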

Page 20: Data Science

Clustering of S&P Stock Data

● Observe Stock Movements every day.

● Clustering points: Stock-{UP/DOWN}

● Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day.

● We used association rules to quantify a similarity measure.

Page 21: Data Science

Regression

● Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
  ○ Greatly studied in statistics, neural network fields.
● Examples:
  ○ Predicting sales amounts of new product based on advertising expenditure.
  ○ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  ○ Time series prediction of stock market indices.
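
The first example, sales as a function of advertising expenditure, can be sketched with an ordinary least-squares straight line. The linear model, the figures and the function names `fit_line` and `predict` are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predicted value of the continuous variable for a new x."""
    return slope * x + intercept

# Hypothetical figures: advertising spend and resulting sales (same currency unit).
advertising = [10, 20, 30, 40, 50]
sales = [27, 43, 66, 84, 105]

slope, intercept = fit_line(advertising, sales)
forecast = predict(slope, intercept, 60)
print(slope, intercept, forecast)
```

With these numbers the fitted line is roughly sales = 1.97 × advertising + 5.9, so a spend of 60 gives a forecast of about 124.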

Page 22: Data Science

Data Mining : Association Rules Mining

Association Rules
● which products should be kept along with other products
● which two products should never be discounted together
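
Questions like these are usually answered with two measures, support and confidence, computed over shopping baskets. The sketch below assumes those standard definitions; the baskets and the 0.4 support threshold are made up for illustration.

```python
from itertools import combinations

# Hypothetical shopping baskets (made-up data).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "biscuits"},
    {"bread", "butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def frequent_pairs(baskets, min_support):
    """All item pairs bought together in at least min_support of the baskets."""
    items = sorted(set().union(*baskets))
    return [
        pair for pair in combinations(items, 2)
        if support(set(pair), baskets) >= min_support
    ]

pairs = frequent_pairs(baskets, min_support=0.4)
print(pairs)
print(confidence({"bread"}, {"butter"}, baskets))  # rule: bread => butter
print(confidence({"butter"}, {"bread"}, baskets))  # rule: butter => bread
```

Here every basket with butter also has bread, while only three of the four bread baskets have butter, so the two rules have different confidence even though they share the same support.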

Page 23: Data Science

Visualisation : The need to tell a story

Page 24: Data Science

Visualisation : The need to tell a story

Page 25: Data Science

Definitions

Data Mining
● Is the process of extracting unknown, valid and actionable information from large databases and using this to make business decisions
● Non trivial process of identifying valid, novel, potentially useful and understandable / explainable patterns in data

Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles

Page 26: Data Science

Page 27: Data Science

R : Your first step into Data Science

Try out this free interactive tutorial just now

Page 28: Data Science

Statistical Tools

http://r4stats.com/articles/popularity/

Page 29: Data Science

Some Comparisons

Page 30: Data Science

Map Reduce

● Input : A set of (key, value) pairs
● User supplies two functions
  ○ Map (k, v) => list(k1, v1)
  ○ Reduce (k1, list(v1)) => v2
● Output is the set of (k1, v2) pairs
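
The contract above can be mimicked on a single machine with a word count, the usual first MapReduce example. This is a sketch of the programming model only, not of Hadoop: the driver `map_reduce`, the functions `map_fn` and `reduce_fn` and the input lines are all hypothetical. The grouping of values by key that happens between the two phases is done here with a plain dictionary.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map (k, v) => list of (k1, v1): emit (word, 1) for each word in a line.
    The input key (a line number here) is not needed for word counting."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce (k1, list(v1)) => v2: add up the counts collected for one word."""
    return sum(values)

def map_reduce(pairs, map_fn, reduce_fn):
    """Single-process stand-in for the framework."""
    groups = defaultdict(list)
    for k, v in pairs:                      # map phase
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)           # group intermediate values by key
    # reduce phase: one call per distinct intermediate key
    return {k1: reduce_fn(k1, v1s) for k1, v1s in groups.items()}

# Input: a set of (key, value) pairs, here (line number, line of text).
lines = [(0, "big data is big"), (1, "data science refines data")]
counts = map_reduce(lines, map_fn, reduce_fn)
print(counts)
```

Because `map_fn` looks at one input pair at a time and `reduce_fn` at one key at a time, a real framework is free to run many copies of each in parallel on different machines.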

Page 31: Data Science

Hadoop

A programming framework that allows you to run Map-Reduce jobs on a distributed cluster of low cost machines without having to bother about anything except
● the Map and Reduce functions
● loading data into HDFS

1. HIVE
   a. A plug-in that allows one to use SQL like queries that are converted into map-reduce jobs
2. PIG
   a. A scripting language for writing long queries
3. HBASE
   a. A non-relational DBMS
4. SQOOP
   a. moves data to and from HDFS

Page 32: Data Science

Data-in-Flight

Page 33: Data Science

JavaScript for Data Visualisation

Page 34: Data Science

Business Domain

● Financial Sector
  ○ Risk Management, Credit Scoring
  ○ Predict Customer Spend
  ○ Stock and Investment Analysis
  ○ Loan approval
● Telecom Sector
  ○ Fraud Detection
  ○ Churn Prediction
● Retail and Marketing
  ○ Market segmentation
  ○ Promotional strategy
  ○ Market Basket Analysis
  ○ Trend Analysis
● Healthcare & Insurance
  ○ Fraud Detection
  ○ Drug Development
  ○ Medical Diagnostic Tools

Page 35: Data Science

Conclusion

Data Science is a rare combination of multiple skills that include
● Technology : obviously !
but also
● Curiosity - a desire to go below the surface and discover a hypothesis that can be tested
● Storytelling - create a business story around the data
● Cleverness - again obviously, to look at the problem from different angles

● Why data science ?
● Techniques
  ○ Statistics
  ○ Data Mining
  ○ Visualisation
● Tools & Platforms
  ○ R
  ○ Hadoop / MapReduce
  ○ Real Time Systems
● Business Domains

Page 37: Data Science

Thank You

Contact

Prithwis Mukerjee
Professor, Praxis Business School
[email protected]

This presentation is accessible at the blog http://blog.yantrajaal.com at the following URL http://bit.ly/pm-datascience