lesson 1 analytics
Copyright 2014, Simplilearn, All rights reserved.
Lesson 1
Introduction to Analytics
Objectives
After completing this course, you will be able to:
● Understand what analytics is and how analysis differs from analytics
● Know the popular tools used in analytics
● Understand the role of a data scientist
● Know the processes involved in analytics
● Define a problem statement
● Collect and summarize data
● Detect and treat outliers in the data
Analytics versus Analysis
Analytics
Analytics is the science of analysis, in which statistics, data mining, computer technology, and similar tools are used to carry out the analysis
Analysis
Analysis is the process of breaking down a complex object into its simpler forms
What is Analytics?
• It is the science of acquiring meaningful results from given data using various methods and technologies.
• It aims at discovering patterns of variation in the given data.
• It helps anticipate the future from past data and understand the uncertainty a business faces.
• It is a sophisticated process that uses statistical, mathematical, and economic models to predict the future and prescribe strategies.
How analytics works
Gather Data → Organize Data → Analyze Data
Analytics Stages
Descriptive: How many students dropped out last year?
Diagnostic: Why has the drop-out rate increased in the last one year?
Predictive: Which students are most likely to drop out?
Prescriptive: Which students should I target to keep from dropping out?
Information → Insights → Decision
Popular Tools
R
Revolution R
RStudio
Tableau
SAP HANA
Weka
KXEN
SAS
Role of a Data Scientist
• Is inquisitive: can stare at data and spot trends.
• Uncovers stories hidden in the data that yield useful insights and help solve business problems.
• Works in sync with application developers to get relevant data for analysis.
• Makes an analytical plan whose results satisfy the business needs.
• Comes up with an effective data mining architecture and prepares suitable models.
• Responds to and resolves data mining performance issues.
• Generates reports that are affordable from a business perspective.
Data Analytics Methodology
DISCOVERY
DATA PREPARING
MODEL PLANNING
MODEL BUILDING
DELIVER RESULTS
PUT INTO USE
Problem Definition
WHAT IS THE PROBLEM?
WHAT IS IT NOT?
WE HAVE THIS PROBLEM BECAUSE?
WE DON’T HAVE A SOLUTION BECAUSE?
Techniques involved in defining a problem
• State the problem in a general way
• Understand the nature of the problem
• Survey the available literature
• Hold discussions to develop ideas
• Rephrase the research problem into a working proposition
Types of Data
● Data can be of two types: qualitative and quantitative
Qualitative Data
• Data expressed as groups or categories
• Descriptive data
• E.g. Dividing a population into high, medium and low height groups
Quantitative Data
• Data expressed as numbers
• Definitive data
• E.g. The height of a person
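The height example can be made concrete with a short Python sketch. It assumes hypothetical heights and illustrative cut-offs of 160 cm and 180 cm; neither comes from the course material. It shows how the same measurements can be held as quantitative data (numbers) or as qualitative data (groups).

```python
# Hypothetical heights in centimetres (quantitative data).
heights_cm = [150, 162, 171, 178, 185, 193]


def height_group(height_cm, low_cutoff=160, high_cutoff=180):
    """Map a numeric height to a category; the cut-offs are illustrative only."""
    if height_cm < low_cutoff:
        return "low"
    if height_cm < high_cutoff:
        return "medium"
    return "high"


# The same people, now described by a category (qualitative data).
height_groups = [height_group(h) for h in heights_cm]
print(height_groups)
```

Note that grouping loses detail: two people in the "medium" group may differ in height by many centimetres.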
Summarizing Data
● Summarizing is the process of converting huge amounts of raw data into a format that can be easily analyzed.
● Summaries differ based on the type of data and can be descriptive or graphical.
Marital Status    Frequency
Single                  203
Married               2,580
Widowed                 334
Divorced                367
Separated                46
Total                 3,530
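As a rough illustration, the marital status table above can be rebuilt in Python. The counts are taken from the table; the relative-frequency column is an addition for illustration and is not part of the original slide.

```python
# Frequencies taken from the marital status table above.
frequency = {
    "Single": 203,
    "Married": 2580,
    "Widowed": 334,
    "Divorced": 367,
    "Separated": 46,
}

total = sum(frequency.values())  # should match the Total row of the table

# Relative frequency: the share of each category in the whole sample.
relative = {status: count / total for status, count in frequency.items()}

for status, count in frequency.items():
    print(f"{status:<10} {count:>6,} {relative[status]:>7.1%}")
print(f"{'Total':<10} {total:>6,}")
```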
Summarizing Data
Numeric - Descriptive
• Mean
• Median
• Mode
Categorical - Descriptive
• Frequency distribution tables
Numeric - Graphical
• Box plot
• Histograms
Categorical - Graphical
• Bar charts
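The descriptive summaries listed above can be computed with Python's standard library. This is a minimal sketch: the marks and marital statuses below are hypothetical values chosen for illustration, not data from the course.

```python
import statistics
from collections import Counter

# Hypothetical exam marks (numeric data).
marks = [55, 62, 62, 70, 71, 78, 90]

mean_mark = statistics.mean(marks)      # arithmetic average
median_mark = statistics.median(marks)  # middle value of the sorted data
mode_mark = statistics.mode(marks)      # most frequent value

# Hypothetical categorical data, summarized as a frequency distribution.
statuses = ["Single", "Married", "Married", "Widowed", "Married", "Single"]
status_counts = Counter(statuses)

print(mean_mark, median_mark, mode_mark)
print(status_counts.most_common())
```

A mean or median of the marital statuses would be meaningless, which is why categorical data is summarized by counting instead.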
Data Collection
● Process of collecting relevant data that aids in solving the problem statement
● The data collection process needs to be well defined and systematic.
● Observations need to be recorded and organized for optimal usefulness
Collect Relevant Data
Categorize the Data
Organize the Data
Data Collection Methods
● Data collection methods fall broadly into two categories: primary and secondary.
● Primary methods gather data directly by investigating, experimenting on, or observing various entities.
● Secondary methods use data that was gathered before the study and is available as already published facts and reports.
Observation
Experiment
Census
Questionnaire
Survey
Reporting
Registration
Data Sources
Data Dictionary
● A data dictionary is a file that describes the structure of the database itself.
● It includes details such as:
  ● Number of records
  ● Name of each field
  ● Characteristic of each field
  ● Description of each field
  ● Relationships between different fields
● It helps in analyzing the data variables and their relationships with each other.
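A data dictionary can be sketched as a plain Python structure. Everything in the example below (the table name, field names, types, and record count) is hypothetical and only mirrors the kinds of details listed above. It holds metadata only, never the actual records.

```python
# A hypothetical data dictionary for a small "students" table.
# All names and values here are invented for illustration.
data_dictionary = {
    "table": "students",
    "number_of_records": 1200,
    "fields": [
        {"name": "student_id", "type": "integer",
         "description": "Unique identifier of a student"},
        {"name": "study_time_min", "type": "integer",
         "description": "Study time in minutes"},
        {"name": "mark_pct", "type": "float",
         "description": "Exam mark as a percentage"},
    ],
    "relationships": [
        "mark_pct is recorded once per student_id",
    ],
}


def describe(dictionary):
    """Return one readable line per field; no actual records are involved."""
    return [
        f"{field['name']} ({field['type']}): {field['description']}"
        for field in dictionary["fields"]
    ]


for line in describe(data_dictionary):
    print(line)
```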
Outlier Treatment
● An outlier is a point or an observation that deviates significantly from the other observations.
● Outliers arise from experimental errors or “special circumstances”.
● Outlier detection tests are used to check for outliers.
● Outlier treatment:
  ● Retention
  ● Exclusion
  ● Other treatment methods
[Figure: Mark (Percentage) plotted against Study time (Minutes), with one point labeled "Outlier!"]
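The slide does not name a specific detection test. One common choice is the 1.5 × IQR rule, sketched below as an assumption rather than as the course's method. The marks are hypothetical, and exclusion is shown only as one of the treatment options listed above.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical marks (percent); 5 is far below the rest.
marks = [62, 65, 68, 70, 71, 74, 77, 80, 5]

outliers = iqr_outliers(marks)

# Exclusion is one possible treatment; retention would keep the data as it is.
cleaned = [m for m in marks if m not in outliers]

print(outliers)
print(cleaned)
```

Whether to retain or exclude a flagged point depends on its cause: a recording error is a good candidate for exclusion, while a genuine "special circumstance" may carry useful information.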
Summary
Here is a quick recap of what we have learned in this lesson:
● What analytics and analysis are, and the differences between them
● Popular tools used in analytics
● What a data scientist does
● The processes involved in the analytics life cycle
● How to formally define a problem statement
● Methods of collecting and summarizing data for analytics
● The data dictionary and its contents
● What outliers are and how to detect and treat them
Quiz
QUIZ
1 Which of the following is a secondary data collection method?
a. Surveys
b. Interviews
c. Data Sources
d. Experiments
Answer: c.
Explanation: Surveys, interviews, and experiments are conducted personally by the researchers and hence are primary data collection methods. Data sources already exist before the study and thus belong to secondary methods.
QUIZ
2 Which of the following is NOT a part of a data dictionary?
a. Number of records
b. Characteristic of fields
c. Type of fields
d. Actual records
Answer: d.
Explanation: A data dictionary holds the metadata, i.e., it defines the attributes of the data. It does not contain the actual data.
QUIZ
3 Which of the following is a way of summarizing categorical data?
a. Mean
b. Frequency distribution
c. Median
d. Mode
Answer: b.
Explanation: Mean, median and mode are mathematical summaries of numeric or quantitative data. Frequency distribution is used to summarize categorical or qualitative data.
QUIZ
4 Which one of the following is NOT a step in data analytics methodology?
a. Discovery
b. Deliver results
c. Model building
d. Re-checking
Answer: d.
Explanation: Re-checking is not a step in data analytics methodology.
QUIZ
5 What are the two categories of data collection methods?
a. Primary and secondary
b. Interval and ratio
c. Nominal and ordinal
d. Random and selective
Answer: a.
Explanation: Data collection methods are classified into primary and secondary.
QUIZ
6 Which of the following is NOT a step in analytics?
a. Prescriptive
b. Predictive
c. Descriptive
d. Productive
Answer: d.
Explanation: Productive is not a step in analytics.
QUIZ
7 Which of the following is FALSE with reference to the role of a data scientist?
a. Develop and plan required analytic projects in response to business needs.
b. Work with application developers to extract data relevant for analysis.
c. There is no need to consider how statistical algorithms work.
d. Contribute to data mining architectures, modeling standards, reporting, and data analysis methodologies.
Answer: c.
Explanation: A data scientist needs to consider how statistical algorithms work.
Thank You