www.bigbang-datascience.com


Page 1:  · Identify and prioritize data needs and sources Acquire data, Harmonize, rescale, clean, and share data Identify relationships in the data. Document and report findings (e.g.,

Page 2:

CAP Certificate

CAP: Certified Analytics Professional from INFORMS

Page 3:

Agenda

1. Business Problem (Question) Framing
2. Analytics Problem Framing
3. Data
4. Methodology (Approach) Selection
5. Model Building
6. Deployment
7. Model Life Cycle Management

Page 4:

Business Problem (Question) Framing (12%–18%)

CAP Domain I

Page 5:

CAP Domain I

The ability to understand a business problem and determine whether

the problem is amenable to an analytics solution

T-1 Obtain or receive problem statement and usability requirements

T-2 Identify stakeholders

T-3 Determine whether the problem is amenable to an analytics solution

T-4 Refine the problem statement and delineate constraints

T-5 Define an initial set of business benefits

T-6 Obtain stakeholder agreement on the problem statement

Page 6:

Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan

Data science is not a panacea for all business problems. Common culprits behind failed analytics efforts include:

• Inadequate pre-processing of the data
• Inadequate model validation
• Unjustified extrapolation (applying the model to data that lies in a region the model has never seen)
• Over-fitting the model to the existing data

Example: in the “Flash Crash” of May 6, 2010, the stock market rapidly lost more than 600 points.

CAP Domain I

Page 7:

Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan

Ability to understand a business problem and determine whether the

problem is amenable to an analytics solution

Obtain or receive problem statement and usability requirements

Identify stakeholders

Determine whether the problem is amenable to an analytics solution

Refine the problem statement and delineate constraints

Define an initial set of business benefits

Obtain stakeholder agreement on the problem statement

Who is your client (Stakeholders)?

What exactly is the client asking you to solve?

CAP Domain I

Page 8:

Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan

CAP Domain I

Page 9:

Decision First: designed to fully “frame” the decision to be made

https://cc.readytalk.com/cc/playback/Playback.do?id=4d7yds

Free webinar

CAP Domain I

Page 10:

Recommended Readings

Strategic Decision Making: Multi-objective Decision Analysis with Spreadsheets
Craig W. Kirkwood
ISBN-13: 978-0534516925
ISBN-10: 0534516920

Page 11:

CAP Domain II

Analytics Problem Framing (14%–20%)

Page 12:

The ability to reformulate a business problem into an analytics

problem with a potential analytics solution

T-1 Reformulate problem statement as an analytics problem

T-2 Develop a proposed set of drivers and relationships to outputs

T-3 State the set of assumptions related to the problem

T-4 Define key metrics of success

T-5 Obtain stakeholder agreement

CAP Domain II

Page 13:

Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan

Ability to reformulate a business problem into an analytics problem

with a potential analytics solution

Reformulate problem statement as an analytics problem

Develop a proposed set of drivers and relationships to outputs

State the set of assumptions related to the problem

Define key metrics of success

Obtain stakeholder agreement

How can you translate their ambiguous request into a concrete,

well-defined problem?

CAP Domain II

Page 14:

Recommended Readings

Keeping Up with the Quants: Your Guide to Understanding and Using Analytics

Stakeholder Analysis worksheet

Page 15:

Recommended Readings

Data Analysis and Decision Making
Christian Albright
ISBN-13: 978-0538476126
ISBN-10: 0538476125

Page 16:

CAP Domain III

Data (18%–26%)

Page 17:

The ability to work effectively with data to help identify potential

relationships that will lead to refinement of the business and

analytics problem

T-1 Identify and prioritize data needs and sources

T-2 Acquire data

T-3 Harmonize, rescale, clean, and share data

T-4 Identify relationships in the data

T-5 Document and report findings (e.g., insights, results, business performance)

T-6 Refine the business and analytics problem statements
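As a rough illustration of task T-3, the sketch below cleans a column and then rescales it in two common ways. The function names (`clean`, `min_max_scale`, `z_score`) and the sample revenue values are invented for this example; the slides do not prescribe any particular method.

```python
import statistics


def clean(values):
    """Drop missing entries (None) and coerce the rest to float."""
    return [float(v) for v in values if v is not None]


def min_max_scale(values):
    """Rescale to the interval [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical column: a missing value and a number stored as text.
raw_revenue = [10, None, "20", 30]
revenue = clean(raw_revenue)

print(min_max_scale(revenue))
print(z_score(revenue))
```

Min-max scaling keeps the original shape of the distribution within fixed bounds, while z-scores put columns with different units on a comparable footing. Which one fits depends on the method chosen later.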

CAP Domain III

Page 18:

Is this data already available?

If so, what parts of the data are useful?

If not, what more data do you need?

What kind of resources (time, money, infrastructure) would it

take to collect this data in a usable form?

Data Preparation: Format Data, Integrate Data, Construct Data, Clean Data, Select Data

CAP Domain III

Page 19:

• Identify and prioritize data needs and sources
• Acquire data
• Harmonize, rescale, clean, and share data
• Identify relationships in the data
• Document and report findings (e.g., insights, results, business performance)
• Refine the business and analytics problem statements
• Reformulate problem statement as an analytics problem
• Develop a proposed set of drivers and relationships to outputs
• State the set of assumptions related to the problem
• Define key metrics of success
• Obtain stakeholder agreement

Ability to work effectively with data to help identify potential

relationships that will lead to refinement of the business and analytics

problem

CAP Domain III

Data Preparation: Format Data, Integrate Data, Construct Data, Clean Data, Select Data

Page 20:

• Binary Variables

• Nominal Variables

• Ordinal Variables

• Interval Variables

• Ratio Variables

Structured vs. unstructured data

Primary-source data vs. secondary-source data
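The variable types listed above determine which summary statistics are meaningful. The sketch below is one way to encode that rule of thumb; the `summarize` function and the sample values are hypothetical, and it assumes ordinal values are already coded as ordered ranks.

```python
import statistics


def summarize(values, scale):
    """Return a central-tendency measure that is meaningful for the scale."""
    if scale in ("binary", "nominal"):
        # Categories have no order, so only the most frequent value makes sense.
        return statistics.mode(values)
    if scale == "ordinal":
        # Order is meaningful but distances are not, so use the median.
        return statistics.median_low(values)
    if scale in ("interval", "ratio"):
        # Distances are meaningful, so the arithmetic mean is valid.
        return statistics.fmean(values)
    raise ValueError(f"unknown measurement scale: {scale}")


print(summarize(["red", "blue", "red"], "nominal"))
print(summarize([1, 2, 2, 3, 5], "ordinal"))   # e.g. survey ratings coded 1-5
print(summarize([10.0, 20.0], "ratio"))        # e.g. weights in kilograms
```

Ratio variables additionally have a true zero, so statements such as "twice as heavy" are valid for them but not for interval variables such as temperature in degrees Celsius.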

CAP Domain III

Data Preparation: Format Data, Integrate Data, Construct Data, Clean Data, Select Data

Page 21:

How to Measure Anything: Finding the Value of “Intangibles” in Business
Douglas W. Hubbard
ISBN-13: 978-1118539279
ISBN-10: 1118539273

Introduction to Management Science: A Modeling and Case Studies Approach with Spreadsheets
Frederick S. Hillier
ISBN-13: 978-0534260347
ISBN-10: 0534260349

Page 22:

CAP Domain IV

Methodology (Approach) Selection (12%–18%)

Page 23:

The ability to identify and select potential approaches for solving

the business problem

T-1 Identify available problem solving approaches (methods)

T-2 Select software tools

T-3 Test approaches (methods)¹

T-4 Select approaches (methods)¹

CAP Domain IV

Page 24:

Once the data has been cleaned, the next steps are to:

• Identify available problem-solving approaches (methods)
• Select software tools
• Test approaches (methods)
• Select approaches (methods)

Ability to identify and select potential approaches for solving the business problem

Modeling: Assess Model, Build Model, Generate Test Design, Select Modeling Technique

CAP Domain IV

Page 25:

Modeling (machine learning, statistical models, algorithms): This step is usually the meat of your project, where you apply all the cutting-edge machinery of data analysis to unearth high-value insights and predictions.

How many variables are to be analyzed?
• One (univariate)
• Two (bivariate)
• Three or more (multivariate)

What kind of model?
• What variables to include?
• Prediction or classification?
• Do we want a descriptive or an inferential question answered?
• Which specific technique?

Modeling: Assess Model, Build Model, Generate Test Design, Select Modeling Technique

CAP Domain IV

Page 26:

• Understand the information contained within at a high level.

• What kinds of obvious trends or correlations do you see in the

data?

• What are the high-level characteristics and are any of them more

significant than others?

Modeling: Assess Model, Build Model, Generate Test Design, Select Modeling Technique

CAP Domain IV

Page 27:

CAP Domain IV

Page 28:

Applied Linear Regression Models
Michael H. Kutner, Christopher J. Nachtsheim

Page 29:

CAP Domain V

Model Building (13%–19%)

Page 30:

The ability to identify and build effective model structures to help

solve the business problem.

T-1 Identify model structures¹

T-2 Run and evaluate the models

T-3 Calibrate models and data¹

T-4 Integrate the models¹

T-5 Document and communicate findings (including assumptions, limitations, and

constraints)

CAP Domain V

Page 31:

Modeling: Assess Model, Build Model, Generate Test Design, Select Modeling Technique

Ability to identify and build effective model structures to help solve

the business problem

Identify model structures

Run and evaluate the models

Calibrate models and data

Integrate the models

Document and communicate findings (including assumptions,

limitations, and constraints)

CAP Domain V

Page 32:

Introduction to Operations Research

Frederick S. Hillier

ISBN-13: 978-0071238281 ISBN-10: 007123828X

Introductory Statistics

Sheldon M. Ross

ISBN-13: 978-0123743886 ISBN-10: 0123743885

Page 33:

CAP Domain VI

Deployment (7%–11%)

Page 34:

The ability to deploy the selected model to help solve the business problem

T-1 Perform business validation of the model

T-2 Deliver report with findings

T-3 Create model, usability, and system requirements for production

T-4 Deliver production model/system¹

T-5 Support deployment

CAP Domain VI

Page 35:

Evaluation: Determine Next Steps, Review Process, Evaluate Results

Perform business validation of the model

Deliver report with findings

Create model, usability, and system requirements for

production

Deliver production model/system

Support deployment

Ability to deploy the selected model to help solve the business

problem

Deployment

All the analysis and technical results that you come up

with are of little value unless you can explain to your

stakeholders what they mean, in a way that’s

comprehensible and compelling.

Data storytelling is a critical and underrated skill that you

will build and use here.

Page 36:

Making Hard Decisions: An Introduction to Decision Analysis

Robert T. Clemen

ISBN-13: 978-0534260347

ISBN-10: 0534260349

Simulation Modeling and Analysis

Averill M. Law

ISBN-13: 978-0071255196

ISBN-10: 0071255192

Page 37:

CAP Domain VII

Model Life Cycle Management (4%–8%)

Page 38:

The ability to manage the model life cycle to evaluate business

benefit of the model over time

T-1 Document initial structure

T-2 Track model quality

T-3 Recalibrate and maintain the model¹

T-4 Support training activities

T-5 Evaluate the business benefit of the model over time

¹ Tasks that are beyond the scope of the CAP® certification exam and that will not be tested.

CAP Domain VII

Page 39:

Deployment: Review Project, Produce Final Report, Plan Monitoring & Maintenance, Plan Deployment

Document initial structure

Track model quality

Recalibrate and maintain the model

Support training activities

Evaluate the business benefit of the model over time

Ability to manage the model life cycle to evaluate business benefit of

the model over time
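Tracking model quality and deciding when to recalibrate can be sketched as a comparison of recent error against the error measured at deployment. The helper names, the sample figures, and the 1.5 tolerance factor below are all assumptions chosen for illustration; a real threshold would be agreed with stakeholders.

```python
import statistics


def mean_absolute_error(actuals, predictions):
    """Average absolute difference between observed and predicted values."""
    return statistics.fmean(abs(a - p) for a, p in zip(actuals, predictions))


def needs_recalibration(baseline_mae, recent_mae, tolerance=1.5):
    """Flag the model when recent error exceeds the baseline by the given factor."""
    return recent_mae > tolerance * baseline_mae


# Hypothetical numbers: error at deployment, then in two later periods.
baseline = mean_absolute_error([100, 120, 90], [98, 123, 91])
history = {
    "period 1": mean_absolute_error([110, 95, 105], [108, 97, 104]),
    "period 2": mean_absolute_error([130, 80, 115], [121, 88, 109]),
}

for period, mae in history.items():
    print(period, round(mae, 2), "recalibrate:", needs_recalibration(baseline, mae))
```

Logging the same metric every period also gives a simple record for evaluating the business benefit of the model over time.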

CAP Domain VII

Page 40:

Business Analytics for Managers: Taking

Business Intelligence Beyond Reporting

Gert H. N. Laursen

ISBN-13: 978-0071255196

ISBN-10: 0071255192

Page 41:

Certifications

Page 42:

Certifications

• Cloudera Certified Professional (CCP) Data Scientist
• Cloudera Certified Developer for Apache Hadoop (CCDH)
• Cloudera Certified Administrator for Apache Hadoop (CCAH)
• Cloudera Certified Specialist in Apache HBase (CCSHB)

CAP: Certified Analytics Professional from INFORMS

Page 43:

Columbia Certification of Professional Achievement in Data Sciences

Digital Analytics Association Web Analyst Certification Program™

Certifications

Page 44:

EMC Data Scientist Associate (EMCDSA) Certification.

MCSE Business Intelligence Certification

SAS Certified Big Data Professional

SAS Certified Data Scientist

Certifications

Page 46:

Newsletters

• Data Science Central: http://www.datasciencecentral.com

• Analytics Vidhya: https://www.analyticsvidhya.com

• Big Data University: https://bigdatauniversity.com/

• KDnuggets: http://www.kdnuggets.com/

Page 47:

Q & A

Page 48:

BIG BANG DATA SCIENCE SOLUTIONS

LEARN . ACHIEVE. STANDOUT