Data Mining 101
Intro to Data Mining
-
Data Mining 101
Okiriza Wibisono - @okiriza
Ali Akbar Septiandri - @aliakbars
-
Outline
Introduction
Terminology
Potential application
Venn diagram
Process overview
Business understanding
Data understanding (exploration)
Data preparation (preprocessing)
Modeling
Evaluation
Deployment (presentation)
Tools & Resources
-
Introduction Terminology
Data mining
Knowledge Discovery
in Databases
Big data analytics
Statistics
Data science
-
The process of collecting,
searching through, and analyzing
a large amount of data in a
database, as to discover patterns
or relationships.
Data Mining - dictionary.reference.com
-
Introduction Potential Application
Customer segmentation
Recommendation engine
Social media mining
-
What should we do?
Where to start? Do I have to get a master's degree in statistics?
-
http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg
-
Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
-
And now the business process
-
CRISP DM Methodology
http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
-
Business Understanding
CRISP DM Methodology
-
Objective Statement
Bottom-up
Top-down
-
Objective Statement
Data Problem
vs
-
Situation Assessment
Inventory of Resources
Requirements, Assumptions, and Constraints
Risks and Contingencies
Terminology
Costs and Benefits
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
-
Situation Assessment
Inventory of Resources
Resource
Data, Knowledge,
Tools
Hardware
Personnel
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
-
Situation Assessment
Requirements, Assumptions, and Constraints
Requirements
Scheduling
Accuracy
Security
Assumptions
Data quality
External factors
Reporting type
Constraints
Legal issues
Budget
Resources
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
-
Situation Assessment
Risks and Contingencies
Contingency Plan
Financial
Organizational
Business
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
-
Situation Assessment
Terminology
Write down related terminology
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg
-
Situation Assessment
Costs and Benefits
Money, money, money!
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg
-
How to evaluate the results?
Define your success criteria!
-
Data Understanding
CRISP DM Methodology
-
Data Collection
External vs Internal
-
Watch out!
-
visible accessible
storable presentable
Victor Lavrenko Text Technologies
http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf
-
Data Exploration
Visualization Heuristics
Visualize fast. Visualize reactively.
Go for high information 2D visualizations.
Select data subsets to visualize.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
-
Data Exploration
Visualization Heuristics
Never let anomalies pass you by. Dig deeper.
Use your visualizations to inform potential
models. Use your potential model to direct your
visualizations.
Expect problems in your data.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
-
This is the cheapest and most
informative stage of data
mining.
Nigel Goddard DME Visualization
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
-
Data Exploration
Visualization Tools
Column/bar: Large change
Line, curve: Small change, long periods
Histogram: Frequency distribution
https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
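As a quick illustration of the histogram entry, the sketch below bins a handful of values and prints a text-only frequency chart. The `ages` list, the `histogram` helper, and the bin width are made up for this example; they are not from the original slides.

```python
from collections import Counter

BIN_WIDTH = 10

def histogram(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages, for illustration only.
ages = [21, 22, 25, 31, 34, 35, 38, 41, 47, 63]
freq = histogram(ages, BIN_WIDTH)

# One row per non-empty bin, one '#' per observation.
for start, count in freq.items():
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * count}")
```

Even a text chart like this makes gaps and lone values easy to spot before any modeling starts.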
-
Data Preparation
CRISP DM Methodology
-
Data Selection
Which one should I include (or exclude)?
-
Data Cleaning
Dirty Data
Missing value
Incomplete
Outdated
Duplication
Outlier
Remember: Expect problems in your data.
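A minimal sketch of three of the cleaning steps listed here (duplicates, missing values, outliers), using only the Python standard library. The records, field names, median imputation, and the 1.5 x IQR rule are assumptions chosen for illustration; the right treatment always depends on the data and the business problem.

```python
from statistics import median, quantiles

# Hypothetical customer records; None marks a missing value.
rows = [
    {"id": 1, "age": 25, "income": 40},
    {"id": 2, "age": None, "income": 42},
    {"id": 2, "age": None, "income": 42},   # exact duplicate
    {"id": 3, "age": 31, "income": 45},
    {"id": 4, "age": 29, "income": 41},
    {"id": 5, "age": 35, "income": 900},    # suspicious value
]

def drop_duplicates(records):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def fill_missing(records, field):
    """Replace missing values of `field` with the median of the known ones."""
    fill = median(r[field] for r in records if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

def iqr_outliers(records, field, k=1.5):
    """Return records whose `field` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles([r[field] for r in records], n=4, method="inclusive")
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in records if not lo <= r[field] <= hi]

clean = fill_missing(drop_duplicates(rows), "age")
outliers = iqr_outliers(clean, "income")
```

Flagging outliers rather than deleting them keeps the decision visible: an extreme value may be an entry error or the most interesting record in the set.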
-
Data Construction
Feature engineering: derived attributes, e.g.:
year from timestamp
quarter from timestamp
BMI from weight and height
Log(x) for skewed data (e.g. house price)
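The derived attributes above can be sketched in a few lines of standard-library Python. The record layout and field names (`timestamp`, `weight_kg`, `height_cm`, `house_price`) are hypothetical, and the natural logarithm is assumed for the log transform.

```python
import math
from datetime import datetime

def derive_features(record):
    """Build derived attributes from one raw record."""
    ts = datetime.fromisoformat(record["timestamp"])
    height_m = record["height_cm"] / 100
    return {
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,                # months 1-3 -> Q1, etc.
        "bmi": record["weight_kg"] / height_m ** 2,        # kg / m^2
        "log_price": math.log(record["house_price"]),      # compresses a long right tail
    }

# Hypothetical raw record, for illustration only.
row = {
    "timestamp": "2015-02-14T19:00:00",
    "weight_kg": 70.0,
    "height_cm": 175.0,
    "house_price": 250_000.0,
}
features = derive_features(row)
```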
-
Data Splitting
Two kinds of data splitting:
Training-Validation-Testing
Cross Validation
-
Data Splitting
Training-Validation-Testing
Training: Construct classifier
Validation: Pick algorithm, knob settings (tree depth, k in kNN, c in SVM)
Testing: Estimate future error rate
Split randomly to avoid bias
http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
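A random three-way split could look like the sketch below. The 60/20/20 proportions, the fixed seed, and the function name are assumptions for illustration, not a recommendation from the slides.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut the shuffled list into three disjoint parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # random order avoids ordering bias
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Stand-in for 100 data points.
train, val, test = train_val_test_split(range(100))
```

The test part is touched only once, at the very end; every choice of algorithm or knob setting is made on the validation part.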
-
Data Splitting
Cross Validation
Every point is both training and testing, never at the same time
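One way to see that property is to generate the index sets for k-fold cross validation by hand. This is a plain-Python sketch with a hypothetical helper name; in practice the data would normally be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each index appears in exactly one test fold and in the training set of every other fold.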
-
Dimensionality Reduction
Principal Component Analysis vs Linear Discriminant Analysis
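For the PCA side of the comparison, here is a minimal NumPy sketch based on the singular value decomposition of the centered data. The synthetic data and the `pca` helper are assumptions for illustration. Unlike PCA, LDA also needs class labels, so it is not covered by this sketch.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance

# Synthetic data: two strongly correlated columns plus one low-variance noise column.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200), 0.1 * rng.normal(size=200)])

Z, components, var = pca(X, n_components=2)
print(Z.shape, var)
```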
-
Modeling
CRISP DM Methodology
-
Machine Learning
Classification Regression Ranking Clustering
-
Model Selection
Regression Technique
Generalization bound
Linear regression
Kernel ridge regression
Support vector regression
Lasso
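To make the difference between two of these techniques concrete, the sketch below fits ordinary least squares and plain (linear) ridge regression in closed form with NumPy. Note the substitution: this is linear ridge, not the kernel ridge regression named above, and Lasso is left out because it has no closed-form solution. The data, the true weights, and `alpha=10.0` are made up for illustration.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge weights (no intercept); alpha=0 gives ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic regression problem with known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w_ols = fit_ridge(X, y, alpha=0.0)
w_ridge = fit_ridge(X, y, alpha=10.0)
print("OLS:  ", w_ols)
print("Ridge:", w_ridge)
```

The ridge penalty shrinks the weights towards zero, trading a little bias for lower variance.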
-
Which one should I choose?
Should I use all of them?
-
It depends on
-
Model Selection
Assumptions
The predictors are linearly independent
The error is a random variable with a mean of zero conditional on the explanatory variables
The sample is representative of the population for the inference prediction
Interpretability
The understandability of why the model is true or how the model is induced
from
https://chenhaot.com/pubs/mldg-interpretability.pdf
-
Beware of Overfitting!
http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png
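Overfitting is easy to reproduce with a 1-nearest-neighbour classifier, which memorises its training set. The sketch below is an illustration under assumptions: one-dimensional synthetic data, 20% label noise, and k values of 1 and 15 picked arbitrarily.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0 or 1)."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

def make_data(n, rng):
    """Label is 1 when x > 0.5, with 20% of labels flipped as noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data

rng = random.Random(0)
train, held_out = make_data(100, rng), make_data(100, rng)
for k in (1, 15):
    print(f"k={k}: train={accuracy(train, train, k):.2f}, "
          f"held-out={accuracy(train, held_out, k):.2f}")
```

With k=1 every training point is its own nearest neighbour, so training accuracy is perfect even though a fifth of the labels are noise. Only the held-out score says anything about future data.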
-
Model Assessment
Regression
(R)MSE
Mean Absolute Error
Correlation Coefficient
Classification
Accuracy
Precision
Recall
F-score
Descriptive
Std. Error
p-value
Confidence Interval
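Several of the regression and classification measures above are short enough to write out directly. The sketch below uses the standard definitions; the sample labels and predictions are invented for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical regression targets and predictions.
reg_true, reg_pred = [3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 7.0]
# Hypothetical binary labels and predictions.
cls_true, cls_pred = [1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 1, 0, 1, 0]

print(rmse(reg_true, reg_pred), mae(reg_true, reg_pred))
print(precision_recall_f1(cls_true, cls_pred))
```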
-
Evaluation
CRISP DM Methodology
-
Does my model solve the
problem?
What is the impact? Is it novel? How useful is the solution?
-
Deployment
CRISP DM Methodology
-
The Tasks
Plan deployment
Plan monitoring and maintenance
Produce final report
Review project
-
Tools & Resources
Text mining: NLTK, spaCy, OpenNLP
Query expansion & clustering: Carrot2, Weka
Data mining & machine learning: Weka, scikit-learn
Language: R, Python, Julia, Java, Matlab, Mathematica, Haskell, Scala
Python lib: Pandas, SciPy, NumPy, scikit-learn
Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark
Visualization: D3.js
Community: Big Data & Open Data Indonesia
http://www.nltk.org/
http://honnibal.github.io/spaCy/
https://opennlp.apache.org/
http://project.carrot2.org/
http://www.cs.waikato.ac.nz/ml/weka/
http://scikit-learn.org/stable/
http://www.r-project.org/
https://www.python.org/
http://julialang.org/
https://www.oracle.com/java/index.html
http://www.mathworks.com/products/matlab/
http://www.wolfram.com/mathematica/
https://www.haskell.org/
http://www.scala-lang.org/
http://aws.amazon.com/
http://hadoop.apache.org/
https://cloud.google.com/
http://azure.microsoft.com/en-us/
https://spark.apache.org/
http://d3js.org/
-
Thank you!
Data Mining 101 Python-ID Meetup February 2015
Okiriza Wibisono - @okiriza
Ali Akbar Septiandri - @aliakbars