data mining sky surveys jan 21 2014
TRANSCRIPT
Data Mining In Modern Astronomy Sky Surveys:
Supervised Learning & Astronomy Applications
Ching-Wa Yip
[email protected]; Bloomberg 518
1/21/2014 JHU Intersession Course - C. W. Yip
Machine Learning
• We want computers to perform tasks.
• It is difficult for computers to “learn” like humans do.
• We use algorithms:
  – Supervised, e.g.: Classification, Regression
  – Unsupervised, e.g.: Density Estimation, Clustering, Dimension Reduction
Unsupervised vs. Supervised Learning
• Unsupervised:
– Given data {𝒙1, 𝒙2, 𝒙3, ⋯, 𝒙𝑛}, find patterns.
– The description of a pattern may come in the form of a function, say 𝑔(𝒙).
• Supervised:
– Given data {(𝒙1, 𝒚1), (𝒙2, 𝒚2), (𝒙3, 𝒚3), ⋯, (𝒙𝑛, 𝒚𝑛)}, find a function such that 𝑓(𝒙) = 𝒚.
– The 𝒚 are the labels.
Types of Label
• Class
– Binary: 0, 1
– Galaxy types: E, S, Sa, … etc.
• Physical Quantities
– E.g. Redshift of a galaxy
Basic Concepts in Machine Learning
• Labeled and unlabeled data
• Datasets: training set and test set
• Feature space
• Distance between points
• Cost Function (also called Error Function)
• Shape of the data distribution
• Outliers
Training Set & Test Set
• Training Set
– Data that are used to build the model.
• Validation Set
– Data that are used to evaluate the model.
– These data were not used in training.
– (Sometimes omitted.)
• Test Set
– Similar to Validation Set but for the final model.
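The split above can be sketched as follows (plain Python; the 60/20/20 proportions and the toy data are assumptions, not from the course):

```python
# A hypothetical 60/20/20 split into training, validation, and test sets
# after shuffling (fixed seed for reproducibility).
import random

data = list(range(10))          # stand-in for 10 labeled examples
rng = random.Random(0)
rng.shuffle(data)
n = len(data)
train = data[:int(0.6 * n)]                  # used to build the model
valid = data[int(0.6 * n):int(0.8 * n)]      # used to evaluate during training
test  = data[int(0.8 * n):]                  # used only for the final model
print(len(train), len(valid), len(test))     # → 6 2 2
```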
Aspects in Supervised Learning
• What are the popular algorithms?
• How to minimize the cost function (numerically)?
– Where a model is already given.
• How to select the appropriate model?
– Where we have many candidate models.
Supervised Learning: Find Decision Boundary in Labeled Data
[Figure: labeled data points in a 2D feature space with axes 𝑥1 and 𝑥2]
Supervised Learning: Find Decision Boundary in Labeled Data
[Figure: the same labeled data in the feature space (axes 𝑥1, 𝑥2), with a decision boundary separating the classes]
Algorithms for Supervised Learning: Classification and Regression
• Principal Component Analysis
• KNN
• Support Vector Machine
• Decision Tree
• Regression Tree
• Random Forest
• Etc.
K Nearest Neighbor (KNN)
• One of the simplest supervised algorithms.
• Idea: use the k nearest neighbors to estimate 𝒚 for a given 𝒙.
• The value of k can be determined (see “Model Selection”).
KNN
• Given {(𝒙1, 𝒚1), (𝒙2, 𝒚2), (𝒙3, 𝒚3), ⋯, (𝒙𝑛, 𝒚𝑛)}, find 𝒚 for a given 𝒙.
• E.g., 𝒙 has two components (𝒙 ∈ 𝑅2), 𝒚 is 0 or 1.
• Estimated Class: from the majority vote of the k nearest neighbors.
[Figure: 2D feature space (axes 𝒙𝟏, 𝒙𝟐) with training points labeled 0 and 1; a query point 𝒙 is classified by its k = 1 nearest neighbor]
KNN
• Given {(𝒙1, 𝒚1), (𝒙2, 𝒚2), (𝒙3, 𝒚3), ⋯, (𝒙𝑛, 𝒚𝑛)}, find 𝒚 for a given 𝒙.
• E.g., 𝒙 has two components (𝒙 ∈ 𝑅2), 𝒚 is 0 or 1.
• Estimated Class: from the majority vote of the k nearest neighbors.
[Figure: the same feature space; the query point 𝒙 is now classified by the majority vote of its k = 3 nearest neighbors]
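A minimal sketch of the majority-vote rule above (plain Python; the toy points and clusters are hypothetical, not the slide's actual data):

```python
# A minimal KNN sketch: classify a query point by the majority vote of
# its k nearest training points (Euclidean distance).
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of ((x1, x2), label) pairs; query: (x1, x2)."""
    # Sort the training points by distance to the query, keep the k nearest.
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    # Majority vote among the k nearest labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data echoing the slide: two clusters labeled 0 and 1.
train = [((0.2, 0.3), 0), ((0.4, 0.1), 0), ((0.3, 0.5), 0),
         ((1.5, 1.6), 1), ((1.7, 1.4), 1), ((1.6, 1.8), 1)]
print(knn_predict(train, (0.35, 0.3), k=1))  # → 0
print(knn_predict(train, (1.55, 1.5), k=3))  # → 1
```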
Support Vector Machine (SVM): Finding Hyperplanes in the Feature Space
• Map data into a higher-dimensional feature space.
• The decision boundaries, or hyperplanes, separate the feature space into classes:
  – 1D data: a point
  – 2D data: a line
  – Higher-dimensional data: a hyperplane
• More than one hyperplane can do the job.
• Support vectors are the data points located closest to the hyperplane.
  – They are the most difficult to classify.
• SVM chooses the hyperplane which maximizes the margin to the support vectors.
Many Hyperplanes can Separate the Data
[Figure: feature space (axes 𝑥1, 𝑥2) with labeled data; several candidate hyperplanes separate the classes]
SVM finds the Hyperplane which Maximizes the Margin (Perpendicular to the Hyperplane)
[Figure: feature space (axes 𝑥1, 𝑥2); the chosen hyperplane with its margin, and the support vectors lying closest to it]
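For a fixed hyperplane w·x + b = 0, the margin being maximized can be computed directly. This sketch (hypothetical data and hyperplane) only evaluates the margin; it does not solve the SVM optimization:

```python
# Margin of a given separating hyperplane w·x + b = 0 in 2D: the smallest
# perpendicular distance |w·x + b| / ||w|| over all data points. The points
# attaining it are the support vectors; SVM picks the (w, b) maximizing it.
import math

def margin(w, b, points):
    norm = math.hypot(*w)
    return min(abs(w[0] * x1 + w[1] * x2 + b) / norm for x1, x2 in points)

# Hypothetical data on either side of the line x1 + x2 = 2.
points = [(0.5, 0.5), (1.0, 0.5), (2.0, 2.0), (2.5, 1.5)]
print(margin((1.0, 1.0), -2.0, points))  # smallest distance to the line
```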
SVM in Galaxy Classification
(Classification of LAMOST galaxy spectra, Haijun Tian, 2014 in prep.)
[Figure: Correct Classification Count / Total Data Count vs. Signal-to-Noise Ratio of the data]
SVM in Galaxy Classification
[Figure: the same classification-rate plot, with the Rose Criterion marked]
(Classification of LAMOST galaxy spectra, Haijun Tian, 2014, in prep.)
Decision Tree
• A decision tree partitions the data by features.
• A decision tree has:
  – Root Node: the topmost decision node.
  – Decision Nodes: each has two or more branches (e.g., “YES” or “NO”; “x > 0.8”, “x = 0.8”, or “x < 0.8”).
  – Leaf Nodes: where we make the decision (classification or regression).
Decision Tree
• Given {(𝒙1, 𝒚1), (𝒙2, 𝒚2), (𝒙3, 𝒚3), ⋯, (𝒙𝑛, 𝒚𝑛)}, find 𝒚 for a given 𝒙.
• E.g., 𝒙 has two components, 𝒚 is 0 or 1.
[Figure: 2D feature space (axes 𝒙𝟏, 𝒙𝟐) with training points labeled 0 and 1]
Decision Tree
• Given {(𝒙1, 𝒚1), (𝒙2, 𝒚2), (𝒙3, 𝒚3), ⋯, (𝒙𝑛, 𝒚𝑛)}, find 𝒚 for a given 𝒙.
• E.g., 𝒙 has two components, 𝒚 is 0 or 1.
[Figure: the same feature space, now partitioned by the splits 𝒙𝟏 = 1, 𝒙𝟏 = 1.5, and 𝒙𝟐 = 1.2]
Binary-Decision Tree
[Figure: the partitioned feature space (splits at 𝒙𝟏 = 1, 𝒙𝟏 = 1.5, 𝒙𝟐 = 1.2) alongside the corresponding tree]

𝒙𝟏 < 1?
  Yes → (𝟕, 𝟎)
  No → 𝒙𝟐 > 1.2?
    Yes → (𝟑, 𝟏)
    No → 𝒙𝟏 < 1.5?
      Yes → (𝟏, 𝟐)
      No → (𝟎, 𝟒)

Here (𝟑, 𝟏) represents: 3 “0” votes and 1 “1” vote.
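The binary-decision tree above can be written directly as nested conditions; the thresholds and vote counts follow the slide:

```python
# The slide's binary decision tree: each leaf holds vote counts (n0, n1),
# and we classify by majority vote at the leaf the point falls into.
def classify(x1, x2):
    if x1 < 1:
        votes = (7, 0)      # leaf: 7 "0" votes, 0 "1" votes
    elif x2 > 1.2:
        votes = (3, 1)
    elif x1 < 1.5:
        votes = (1, 2)
    else:
        votes = (0, 4)
    return 0 if votes[0] >= votes[1] else 1

print(classify(0.5, 0.5))  # → 0 (x1 < 1 branch, all-"0" leaf)
print(classify(1.6, 1.0))  # → 1 (x1 ≥ 1.5, x2 ≤ 1.2 leaf)
```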
Regression Tree
• E.g., 𝒙 has one component (= 𝑥); 𝒚 is real.
[Figure: data points in the (𝑥, 𝑦) plane]
Regression Tree
• E.g., 𝒙 has one component (= 𝑥); 𝒚 is real.
[Figure: the same data, partitioned at 𝑥 = 1 and 𝑥 = 1.5, with constant predictions 𝑦 = 1.7, 0.8, and 1.6 in the three regions]

𝑥 < 1?
  Yes → 𝑦 = 1.7
  No → 𝑥 > 1.5?
    Yes → 𝑦 = 1.6
    No → 𝑦 = 0.8
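A sketch of the regression tree above, with each leaf returning its constant prediction (thresholds and values from the slide):

```python
# The slide's regression tree: each leaf predicts a constant y.
def predict(x):
    if x < 1:
        return 1.7
    elif x > 1.5:
        return 1.6
    else:
        return 0.8      # the middle region, 1 <= x <= 1.5

print(predict(0.5))   # → 1.7
print(predict(1.2))   # → 0.8
print(predict(2.0))   # → 1.6
```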
Random Forest: Reduce the Variance of Estimator
• Idea: use many trees instead of just one for estimation.
• Algorithm:
  – Sample a subset from the data.
  – Construct the i-th tree from the subset.
  – At each node, choose a random subset of m features, then split the data based on those features.
  – Repeat for i = 1, 2, …, B trees.
  – Given 𝒙:
    • Take the majority vote over the trees (Classification).
    • Take the average over the trees (Regression).
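A sketch of the final aggregation step only; tree construction and feature subsampling are elided, and the three "trees" are hypothetical threshold rules standing in for fitted trees:

```python
# Random-forest aggregation: given B fitted trees, combine their outputs
# by majority vote (classification) or average (regression).
from collections import Counter
from statistics import mean

def forest_classify(trees, x):
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

def forest_regress(trees, x):
    return mean(tree(x) for tree in trees)

# Toy stand-ins for B = 3 trees grown on different data subsets.
trees = [lambda x: int(x > 0.4),
         lambda x: int(x > 0.5),
         lambda x: int(x > 0.6)]
print(forest_classify(trees, 0.55))  # → 1 (two of the three trees vote 1)
print(forest_regress(trees, 0.55))   # average of the three tree outputs
```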
Photometric Redshift Using Random Forest
• Use photometry (e.g., using SDSS filters u,g,r,i,z, or 5 data points) and galaxy inclination to estimate redshift (distance) of galaxies (Yip, Carliles & Szalay et al. 2011).
z(spec) is the true redshift, the redshift we obtained from galaxy spectra. z(photo) is the estimated redshift, the redshift we obtained from galaxy photometry.
[Figure: z(photo) vs. z(spec), shown as a function of Galaxy Inclination]
Comparison of Supervised Learning Algorithms
• Caruana & Niculescu-Mizil (2006) compared algorithms over 11 problems and 8 performance metrics.
Columns give the probability that an algorithm ranks 1st, 2nd, 3rd, etc.
Cost Function
• In many machine learning algorithms, the idea is to find the model parameters 𝜽 which minimize the cost function 𝐽(𝜽):

𝐽(𝜽) = (1/𝑚) Σᵢ₌₁ᵐ [𝑚𝑜𝑑𝑒𝑙(𝑑𝑎𝑡𝑎ᵢ) − 𝑑𝑎𝑡𝑎ᵢ]²

where m is the size of the training set.
• That is, we want the 𝑚𝑜𝑑𝑒𝑙 to be as close to the 𝑑𝑎𝑡𝑎 as possible.
• Note that the 𝑚𝑜𝑑𝑒𝑙 depends on 𝜽.
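A toy evaluation of such a cost function, for an assumed one-parameter model model(x) = θ·x (not a model from the course):

```python
# Mean-squared cost J(theta) for the hypothetical model model(x) = theta * x,
# evaluated on a toy training set of (x, y) pairs.
def cost(theta, data):
    m = len(data)  # size of the training set
    return sum((theta * x - y) ** 2 for x, y in data) / m

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # exactly y = 2x
print(cost(2.0, data))  # → 0.0 (model matches the data)
print(cost(1.0, data))  # larger: model is far from the data
```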
Cost Function Examples: Unsupervised vs. Supervised Learning
Supervised:

𝐽(𝜽) = (1/𝑚) Σᵢ₌₁ᵐ [𝑓(𝒙ᵢ) − 𝒚ᵢ]²

𝑚 = size of the training set; (𝒙, 𝒚) = data; 𝑓(𝒙) = model fit to the training set.

Unsupervised (density estimation):

𝐽(𝜽) = (1/𝑚) Σᵢ₌₁ᵐ [𝑔(𝒙ᵢ) − 𝒙ᵢ]²

𝑚 = number of bins in the data histogram; 𝒙 = histogram data value; 𝑔(𝒙) = model fit to the histogram.
Minimization of the Cost Function in order to Find the Model Parameters
• Maximum Likelihood Estimate
• Bayesian Parameter Estimate
• Gradient Descent
• Etc.
Gradient Descent: Idea
(Source: Andrew Ng)
The minimum is local – for a different starting point, we may get a different local minimum.
Minimization of Cost Function: Gradient Descent Algorithm
• Start at a point in the parameter space.
• Look around locally, take a baby step which has the largest descent.
• Repeat until we find the minimum cost.
• Mathematically, we keep refining 𝜃ⱼ until convergence, according to:

𝜃ⱼ := 𝜃ⱼ − α ∂𝐽(𝜽)/∂𝜃ⱼ

• 𝛼 is the learning rate.
• j runs from 1 to the number of parameters in the model.
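The update rule can be sketched for a one-parameter cost J(θ) = (θ − 2)², whose minimum is at θ = 2 (a toy example, not from the slides):

```python
# Gradient descent on J(theta) = (theta - 2)**2, minimum at theta = 2.
# Each step applies the update theta := theta - alpha * dJ/dtheta.
def gradient_descent(dJ, theta0, alpha=0.1, steps=200):
    theta = theta0
    for _ in range(steps):
        theta -= alpha * dJ(theta)
    return theta

# dJ/dtheta = 2 * (theta - 2) for this cost.
theta_min = gradient_descent(lambda t: 2 * (t - 2), theta0=10.0)
print(round(theta_min, 6))  # → 2.0 (converges to the minimum)
```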
Example: Gradient Descent in 1D Data
• The learning rate is positive.
𝜃 := 𝜃 − α 𝑑𝐽(𝜃)/𝑑𝜃

Start from the right of the minimum: 𝑑𝐽(𝜃)/𝑑𝜃 = slope > 0, so 𝜃 decreases as the algorithm progresses. Starting from the left of the minimum, the slope is negative, so 𝜃 increases.

[Figure: two panels of 𝐽(𝜃) vs. 𝜃, showing the steps 𝜃_start → 𝜃_step1 → 𝜃_end approaching the minimum from either side]
Model Selection in Machine Learning
• The goal of Cost Function minimization is to select a model.
• Occam’s Razor: the best explanation is usually the simplest one.
• Some model selection approaches: – Bayesian Approach
– Cross Validation
– Akaike Information Criterion (AIC)
– Bayesian Information Criterion (BIC)
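Of these approaches, cross-validation is easy to sketch. Here a k-fold split scores a deliberately trivial constant-predictor model to stay self-contained (all names and data are hypothetical):

```python
# k-fold cross-validation: hold out each fold in turn, fit on the rest,
# and average the held-out squared error across folds.
from statistics import mean

def kfold_score(data, fit, predict, k=3):
    folds = [data[i::k] for i in range(k)]
    errs = []
    for i in range(k):
        held_out = folds[i]
        train = [d for j in range(k) if j != i for d in folds[j]]
        model = fit(train)
        errs.append(mean((predict(model, x) - y) ** 2 for x, y in held_out))
    return mean(errs)

def fit(train):
    return mean(y for _, y in train)   # "model" = mean of training targets

def predict(model, x):
    return model                       # constant prediction

data = [(float(x), 2.0) for x in range(6)]  # constant target
print(kfold_score(data, fit, predict))      # → 0.0 (no held-out error)
```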
Model Selection: How to Select from Models with Different Complexity?
• Notice that the complexity parameters are not well defined and may need the user’s judgment.
• Complexity parameter of a model, e.g.:
– The number of nearest neighbors (k) in KNN.
– The number of basis functions in regression.
Example: Fitting a combination of Basis Functions
• E.g., polynomial basis functions {𝜙0, 𝜙1, 𝜙2, ⋯, 𝜙𝐵} where 𝜙𝑘(𝑥) = 𝑥^𝑘

[Figure: scatter of data points in the (𝑥, 𝑦) plane]
Schematics: Fitting a combination of Basis Functions
• E.g., polynomial basis functions {𝜙0, 𝜙1, 𝜙2, ⋯, 𝜙𝐵} where 𝜙𝑘(𝑥) = 𝑥^𝑘

[Figure: the data with a B = 1 (straight-line) fit]
Schematics: Fitting a combination of Basis Functions
• E.g., polynomial basis functions {𝜙0, 𝜙1, 𝜙2, ⋯, 𝜙𝐵} where 𝜙𝑘(𝑥) = 𝑥^𝑘

[Figure: the data with B = 1 and B = 3 fits]
Schematics: Fitting a combination of Basis Functions
• E.g., polynomial basis functions {𝜙0, 𝜙1, 𝜙2, ⋯, 𝜙𝐵} where 𝜙𝑘(𝑥) = 𝑥^𝑘

[Figure: the data with B = 1 and B = 10 fits; the B = 10 curve follows the noise]

Large Model Complexity: Potential Overfitting
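A least-squares fit of the polynomial basis above can be sketched via the normal equations (pure Python; the toy data are hypothetical):

```python
# Least-squares fit of the polynomial basis {phi_k(x) = x**k}, k = 0..B,
# by solving the normal equations A c = b with Gaussian elimination.
def polyfit(xs, ys, B):
    n = B + 1
    # A[j][k] = sum_i x_i**(j+k);  b[j] = sum_i y_i * x_i**j.
    A = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (b[r] - s) / A[r][r]
    return coeffs  # coefficients of phi_0 .. phi_B

# Data generated from y = 1 + 2x is recovered exactly with B = 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
c = polyfit(xs, ys, B=1)
print([round(v, 6) for v in c])  # → [1.0, 2.0]
```

Raising B toward the number of data points lets the fit chase the noise, which is the overfitting pictured on the slide.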
Bias-Variance Decomposition
• Suppose we have an estimator of the model parameters.
• An estimator is denoted 𝜃̂.
• The mean squared error of an estimator can be expressed as:

MSE = bias² + variance

where MSE = (1/𝑁) Σᵢ₌₁ᴺ (𝜃̂ᵢ − 𝜃_True)².
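The decomposition can be checked numerically for a hypothetical set of estimates (using the population variance, for which the identity is exact):

```python
# Numerical check of MSE = bias**2 + variance for estimates theta_hat_i
# of a true parameter theta_true.
from statistics import mean

theta_true = 2.0
theta_hat = [1.8, 2.1, 2.3, 1.9, 2.4]  # hypothetical estimates

mse = mean((t - theta_true) ** 2 for t in theta_hat)
bias = mean(theta_hat) - theta_true
var = mean((t - mean(theta_hat)) ** 2 for t in theta_hat)  # population variance
print(abs(mse - (bias ** 2 + var)) < 1e-9)  # → True
```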
The Best Model: Tradeoff between Bias and Variance

[Figure: bias², variance, and MSE vs. model complexity; bias² falls and variance rises with complexity, and the best model sits at the minimum of the MSE curve]
Intuition: Bias-Variance Tradeoff
• E.g., polynomial basis functions {𝜙0, 𝜙1, 𝜙2, ⋯, 𝜙𝐵} where 𝜙𝑘(𝑥) = 𝑥^𝑘

[Figure: B = 1 fits to two random draws of the data (solid and dotted); at a test point 𝑥′, the spread of the estimated 𝑦 values around 𝑦(true) illustrates the variance of the estimator]
Intuition: Bias Variance Tradeoff
• E.g., polynomial basis functions {𝜙0, 𝜙1, 𝜙2, ⋯, 𝜙𝐵} where 𝜙𝑘(𝑥) = 𝑥^𝑘

[Figure: B = 1 vs. B = 10 fits; at 𝑥′, the B = 10 fits show a larger spread of estimated 𝑦 values around 𝑦(true)]
Demonstration of Overfitting in R
Some Variations of Machine Learning
• Semi-Supervised Learning:
– Given data {(𝒙1, 𝒚1), (𝒙2, 𝒚2), (𝒙3, 𝒚3), ⋯, (𝒙𝑘, 𝒚𝑘), 𝒙𝑘+1, ⋯, 𝒙𝑛}, predict labels 𝒚𝑘+1, ⋯, 𝒚𝑛 for 𝒙𝑘+1, ⋯, 𝒙𝑛.
• Active Learning:
– Similar to Semi-Supervised but we can ask for extra labels 𝒚𝒊 for particular data points 𝒙𝒊 as the algorithm runs.
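A minimal sketch of the semi-supervised setting: predict labels for the unlabeled points from the labeled ones (here via a single nearest-labeled-neighbor pass; the data are hypothetical):

```python
# Assign each unlabeled point the label of its nearest labeled neighbor,
# a one-pass, self-training-style step.
import math

labeled = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
unlabeled = [(0.1, 0.2), (0.9, 0.8)]

def nearest_label(q):
    return min(labeled, key=lambda p: math.dist(p[0], q))[1]

predicted = [(x, nearest_label(x)) for x in unlabeled]
print(predicted)  # → [((0.1, 0.2), 0), ((0.9, 0.8), 1)]
```

In active learning, the algorithm would instead request true labels 𝒚ᵢ for the points it is least certain about.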