TRANSCRIPT
Bioinformatics II Theoretical Bioinformatics and Machine Learning
Part 1
Sepp Hochreiter Institute of Bioinformatics
Johannes Kepler University, Linz, Austria
Course
6 ECTS 4 SWS VO (class), 3 ECTS 2 SWS UE (exercise)
Basic course of the Master in Bioinformatics
Class: Mo 15:30-17:00 (S3 318) and Thu 15:30-17:00 (S3 318)
Exercise: Fr 11:00-12:45 (S3 318)
VO: final exam (oral if few students subscribe)
UE: weekly homework (evaluated)
Other courses of the Master in Bioinformatics:
• Struc. BI and Gene Analysis: Fr 08:30-11:00 (SI 048)
• Infor. Systems: 5./12./19.03.2013 8:30-11:45 (S3 047); Exercise: Thu 8:30-10:00 (S3 048)
• Intro. to R (instead of Math. Modeling I): We 15:30-17:00 (S3 057)
• Alg. Disc. Meth.: Thu 13:45-15:15 (HS 12)
• Population Genetics: Thu 10:15-11:45 (S3 318)
Outline
1 Introduction
2 Basics of Machine Learning
3 Theoretical Background of Machine Learning
4 Support Vector Machines
5 Error Minimization and Model Selection
6 Neural Networks
7 Bayes Techniques
8 Feature Selection
9 Hidden Markov Models
10 Unsupervised Learning: Projection Methods and Clustering
**11 Model Selection
**12 Non-parametric Methods: Decision Trees and k-Nearest Neighbors
**13 Graphical Models / Belief Networks / Bayes Networks
1 Introduction
2 Basics of Machine Learning
2.1 Machine Learning in Bioinformatics
2.2 Introductory Example
2.3 Supervised and Unsupervised Learning
2.4 Reinforcement Learning
2.5 Feature Extraction, Selection, and Construction
2.6 Parametric vs. Non-Parametric Models
2.7 Generative vs. Descriptive Models
2.8 Prior and Domain Knowledge
2.9 Model Selection and Training
2.10 Model Evaluation, Hyperparameter Selection, and Final Model
3 Theoretical Background of Machine Learning
3.1 Model Quality Criteria
3.2 Generalization Error
3.3 Minimal Risk for a Gaussian Classification Task
3.4 Maximum Likelihood
3.5 Noise Models
3.6 Statistical Learning Theory
4 Support Vector Machines
4.1 Support Vector Machines in Bioinformatics
4.2 Linearly Separable Problems
4.3 Linear SVM
4.4 Linear SVM for Non-Linearly Separable Problems
4.5 Average Error Bounds for SVMs
4.6 nu-SVM
4.7 Non-Linear SVM and the Kernel Trick
4.8 Example: Face Recognition
4.9 Multiclass SVM
4.10 Support Vector Regression
4.11 One Class SVM
4.12 Least Squares SVM
4.13 Potential Support Vector Machine
4.14 SVM Optimization and SMO
4.15 Designing Kernels for Bioinformatic Applications
4.16 Kernel Principal Component Analysis
4.17 Kernel Discriminant Analysis
4.18 Software
5 Error Minimization and Model Selection
5.1 Search Methods and Evolutionary Approaches
5.2 Gradient Descent
5.3 Step-size Optimization
5.4 Optimization of the Update Direction
5.5 Levenberg-Marquardt Algorithm
5.6 Predictor-Corrector Methods for R(w) = 0
5.7 Convergence Properties
5.8 On-line Optimization
6 Neural Networks
6.1 Neural Networks in Bioinformatics
6.2 Motivation of Neural Networks
6.3 Linear Neurons and Perceptron
6.4 Multi Layer Perceptron
6.5 Radial Basis Function Networks
6.6 Recurrent Neural Networks
7 Bayes Techniques
7.1 Likelihood, Prior, Posterior, Evidence
7.2 Maximum A Posteriori Approach
7.3 Posterior Approximation
7.4 Error Bars and Confidence Intervals
7.5 Hyper-parameter Selection: Evidence Framework
7.6 Hyper-parameter Selection: Integrate Out
7.7 Model Comparison
7.8 Posterior Sampling
8 Feature Selection
8.1 Feature Selection in Bioinformatics
8.2 Feature Selection Methods
8.3 Microarray Gene Selection Protocol
9 Hidden Markov Models
9.1 Hidden Markov Models in Bioinformatics
9.2 Hidden Markov Model Basics
9.3 Expectation Maximization for HMM: Baum-Welch Algorithm
9.4 Viterbi Algorithm
9.5 Input Output Hidden Markov Models
9.6 Factorial Hidden Markov Models
9.7 Memory Input Output Factorial Hidden Markov Models
9.8 Tricks of the Trade
9.9 Profile Hidden Markov Models
10 Unsupervised Learning: Projection Methods and Clustering
10.1 Introduction
10.2 Principal Component Analysis
10.3 Independent Component Analysis
10.4 Factor Analysis
10.5 Projection Pursuit and Multidimensional Scaling
10.6 Clustering
Literature
• ML: Duda, Hart, Stork: Pattern Classification; Wiley & Sons, 2001
• NN: C. M. Bishop: Neural Networks for Pattern Recognition; Oxford Univ. Press, 1995
• SVM: Schölkopf, Smola: Learning with Kernels; MIT Press, 2002
• SVM: V. N. Vapnik: Statistical Learning Theory; Wiley & Sons, 1998
• Statistics: S. M. Kay: Fundamentals of Statistical Signal Processing; Prentice Hall, 1993
• Bayes Nets: M. I. Jordan: Learning in Graphical Models; MIT Press, 1998
• ML: T. M. Mitchell: Machine Learning; McGraw Hill, 1997
• NN: R. M. Neal: Bayesian Learning for Neural Networks; Springer, 1996
• Feature Selection: Guyon, Gunn, Nikravesh, Zadeh: Feature Extraction - Foundations and Applications; Springer, 2006
• BI: Schölkopf, Tsuda, Vert: Kernel Methods in Computational Biology; MIT Press, 2003
Chapter 1
Introduction
• part of the curriculum “Master of Science in Bioinformatics”
• many fields in bioinformatics are based on machine learning:
- sequencing data: RNA-Seq, copy numbers
- microarrays: data preprocessing, gene selection, prediction
- DNA data: alternative splicing, nucleosome positions, gene regulation
• methods: neural networks, support vector machines, kernel approaches, projection methods, belief networks
• goals: noise reduction, feature selection, structure extraction, classification / regression, modeling
• Examples:
- cancer treatment outcomes / microarrays
- classification of novel protein sequences into structural or functional classes
- dependencies between DNA markers (SNPs - single nucleotide polymorphisms) and diseases (schizophrenia, autism, multiple sclerosis)
• only the most prominent machine learning techniques are covered
• Goals:
- how to choose appropriate methods from a given pool
- understand and evaluate the different approaches
- where to obtain and how to use them
- adapt and modify standard algorithms
Chapter 2
Basics of Machine Learning
• deductive: the programmer must understand the problem, find a solution, and implement it
• inductive: the solution to a problem is found by a machine which learns
• inductive is data driven: biology, chemistry, biophysics, medicine, and other fields in the life sciences possess huge amounts of data
• learning: automatically finds structure in the data
• algorithms that automatically improve a solution with more data
Machine Learning:
• classification and regression (prediction)
• structure extraction (clustering, components)
• compression (redundancy reduction)
• visualization
• filtering (feature selection)
• data modeling (generative models)
Machine Learning in Bioinformatics
• gene recognition
• microarray data: normalization
• protein structure and function classification
• alternative splice site recognition
• prediction of nucleosome positions
• single nucleotide polymorphisms (SNPs) and diseases
• copy numbers and diseases
• chromatin structure, methylation, and diseases
Introductory Example
Example from “Pattern Classification”, Duda, Hart, and Stork, 2001, John Wiley & Sons, Inc.
• salmon must be distinguished from sea bass, given images
• automated system to separate fish in a fish-packing company
• Given: a set of pictures with known fish, the training set
• Goal: in the future, automatically separate images of salmon from images of sea bass, that is, generalization
• First step: preprocessing and feature extraction
• Preprocessing: contrast / brightness correction, segmentation, alignment
• Features: length of the fish, lightness
• Length:
Optimal decision boundary: minimal misclassifications.
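The slide's plot is not preserved; as a minimal sketch of the idea (assuming numpy is available; the lengths and labels are made-up toy values, not the book's data), the following snippet picks the length threshold with the fewest training misclassifications:

import numpy as np

length = np.array([40, 45, 55, 60, 70, 75, 80, 90])   # hypothetical fish lengths
label  = np.array([ 1,  1,  1,  1, -1, -1,  1, -1])   # 1 = salmon, -1 = sea bass

def errors(threshold):
    # classify as salmon (1) if shorter than the threshold, else sea bass (-1)
    pred = np.where(length < threshold, 1, -1)
    return np.sum(pred != label)

# try a boundary between every pair of consecutive sorted lengths
candidates = (np.sort(length)[:-1] + np.sort(length)[1:]) / 2
best = min(candidates, key=errors)
print(best, errors(best))   # boundary with minimal training misclassifications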
• Lightness:
Different features may be differently suited for the problem.
Misclassifications are weighted equally (otherwise a new optimal boundary results).
• Width of the fish:
Width may only be suited in combination with other features.
Hypothesis: lightness changes with age, and width indicates age.
• optimal lightness: a nonlinear function of the width, that is, the optimal boundary is a nonlinear curve
For a new fish at “?” we would guess salmon, but the system fails: low generalization; one outlier sea bass changed the curve.
• one sea bass has lightness and width typical for salmon
• a complex boundary curve also catches this outlier and assigns the surrounding space to sea bass
• future examples in this region will be wrongly classified
decision boundary with high generalization
• we selected the features which are best suited
• in bioinformatics applications the number of features is large
• selecting the best features by visual inspection is impossible
• e.g., the genes indicating a certain cancer type must be chosen from 30,000 human genes
• feature selection is important: the machine selects the features
• construct new features from the old ones: feature construction
• question of cost: how expensive is a certain error?
• measurement noise: how noisy are the features?
• classification noise: which human labeling errors are to be expected?
• a first example of a too complex model overspecialized to the training data
Supervised and Unsupervised Learning
• in our fish example an expert characterized the data by labeling them
• supervised learning: the desired output (target) for each object is given
• unsupervised learning: no desired output per object
• supervised: error value on each object; classification / regression / time series analysis
Fish example: classification: salmon vs. sea bass; regression: predict the age of the fish; time series prediction: growth from past data.
• unsupervised:
- cumulative error over all objects (entropy, statistical independence, information content, etc.)
- probability of the model producing the data: likelihood
- principal component analysis (PCA), independent component analysis (ICA), factor analysis, projection pursuit, clustering (k-means), mixture models, density estimation, hidden Markov models, belief networks
• projection: representation of objects by down-projected feature vectors; PCA: orthogonal components of maximal data variation; ICA: statistically mutually independent components; factor analysis: PCA with noise
• density estimation: density model of the observed data
• clustering: extract clusters, i.e. regions of data accumulation (typical data)
• clustering and (down-)projection: feature construction, compact representation of the data, non-redundant, noise removal
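As an illustration of down-projection, here is a minimal PCA sketch (assuming numpy is available; the random data is a made-up placeholder): the top principal components are the orthogonal directions of maximal data variation.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects, 5 features (toy data)
Xc = X - X.mean(axis=0)                # center the data

# SVD of the centered data: right singular vectors = principal directions
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T                           # top-2 principal components
Z = Xc @ W                             # down-projected feature vectors (100 x 2)

# fraction of variance captured by each retained component
explained = s[:2] ** 2 / np.sum(s ** 2)
print(Z.shape, explained)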
Isomap: method for down-projecting data
[Figure: ICA demo showing the original source signals, their observed mixtures, and the signals demixed by ICA]
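A minimal sketch of such a demixing experiment, assuming numpy and scikit-learn are available (FastICA is one common ICA implementation; the two source signals and the mixing matrix are made up for illustration):

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                         # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                # source 2: square wave
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],                  # "unknown" mixing matrix
              [0.5, 1.0]])
X = S @ A.T                                # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)               # estimated (demixed) sources
# the sources are recovered up to permutation, sign, and scaling
print(S_hat.shape)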
ICA: on images
ICA: on video components
Reinforcement Learning
Not considered because not relevant for bioinformatics:
• reinforcement learning:
- the model produces an output sequence
- reward or penalty at the sequence end or during the sequence (no target output)
• neither supervised nor unsupervised learning
• model: policy
• learning: world model or value function
• two learning techniques: direct policy optimization vs. policy / value iteration (world model)
• exploitation / exploration trade-off: better to learn or to gain reward?
• methods: Q-learning, SARSA, Temporal Difference (TD), Monte Carlo estimation
Feature Extraction, Selection, and Construction
• our salmon vs. sea bass example: features must be extracted
• fMRI brain images and EEG measurements: further raw data from which features must be extracted
Feature Selection:
• features are directly measured
• huge number of features: microarrays with 30,000 genes
• other measurements with many features: peptide arrays, protein arrays, mass spectrometry, SNPs
• many features are not related to the task (e.g., only some genes are relevant for cancer)
• features without target correlation may be helpful (see the sketch below)
• the feature with the highest target correlation may be a suboptimal selection
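A tiny numerical illustration of the first claim (assuming numpy; XOR-style toy data, not from the slides): each feature alone is uncorrelated with the target, yet the pair determines the class perfectly.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.choice([-1, 1], size=1000)
x2 = rng.choice([-1, 1], size=1000)
y = x1 * x2                           # class is the XOR (product) of the features

print(np.corrcoef(x1, y)[0, 1])       # ~0: no single-feature target correlation
print(np.corrcoef(x2, y)[0, 1])       # ~0
print(np.corrcoef(x1 * x2, y)[0, 1])  # 1: the feature pair is fully informative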
Feature Construction:
• combine features into new features:
- PCA or ICA
- averaging out
• kernel methods map into another space where new features are used
• example: a sequence of amino acids may be represented by
- an occurrence vector (see the sketch below)
- certain motifs
- its similarity to other sequences
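As a minimal sketch of the occurrence-vector idea (plain Python; the example sequence and the choice k = 2 are illustrative assumptions):

from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_occurrence_vector(seq, k=2):
    # one counter per possible k-mer over the 20-letter amino acid alphabet
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return [counts[m] for m in kmers]   # fixed-length feature vector (20^k)

vec = kmer_occurrence_vector("MKVLAAGVLK")
print(len(vec), sum(vec))               # 400 features, 9 counted 2-mers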
Parametric vs. Non-Parametric Models
• an important step in machine learning is to select a model class
• parametric models: each parameter vector represents a model
- neural networks, where the parameters are the synaptic weights
- support vector machines
• learning: paths through the parameter space
• disadvantages:
- different parameterizations of the same function
- model complexity and model class are only indirectly given via the parameters
• nonparametric models: the model is built from locally constant models or their superimpositions
- k-nearest-neighbor (k is a hyperparameter, not adjusted by learning; see the sketch below)
- kernel density estimation
- decision trees
• the constant models (rules) must be selected a priori, that is, hyperparameters must be fixed (k, kernel width, splitting rules)
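A minimal k-nearest-neighbor sketch (assuming numpy; the toy data is made up) showing the non-parametric character: nothing is trained, prediction consults the stored examples directly, and k is fixed a priori.

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    dist = np.linalg.norm(X_train - x, axis=1)   # distances to all stored points
    nearest = np.argsort(dist)[:k]               # indices of the k neighbors
    return np.sign(np.sum(y_train[nearest]))     # majority vote for labels -1/+1

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1.0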
Generative vs. Descriptive Models
• descriptive model: an additional description or another representation of the data
• projection methods (PCA, ICA)
• generative models: the model should produce the distribution observed for the real-world data points
• describes or represents the random components which drive the process
• prior knowledge about the world or the desired model
• predict new states of the data generation process (brain, cell)
Prior and Domain Knowledge
• reasonable distance measures for k-nearest-neighbor
• construct problem-relevant features
• extract appropriate features from the raw data
• bioinformatics: distances based on alignment
- string kernel
- Smith-Waterman kernel
- local alignment kernel
- motif kernel
• bioinformatics: secondary structure prediction with recurrent networks; the 3.7 amino acid period of a helix in the input
• bioinformatics: knowledge about the microarray noise (log-values)
• bioinformatics: 3D structure prediction of proteins; disulfide bonds
Model Selection and Training
• Goal: select the model with the highest generalization performance, that is, with the best performance on future data, from the model class
• model selection = training = learning
• the model which best explains or approximates the training set
• remember salmon vs. sea bass: the model which perfectly explained the training data had low generalization performance
• “overfitting”: the model is fitted (adapted) to special training characteristics
- noisy measurements
- outliers
- labeling errors
• “underfitting”: the training data cannot be fitted well enough
• trade-off between underfitting and overfitting
• overfitting is bounded by restricting the model class (k in k-nearest-neighbor, number of units in neural networks, maximal weights, etc.)
• the model class is often chosen a priori
• sometimes the model class can be adjusted during training
• structural risk minimization
• model selection parameters may influence the model complexity
- the nonlinearity of neural networks is increased during training
- the model selection procedure cannot find complex models
• hyperparameters: parameters controlling the model complexity
Model Evaluation, Hyperparameter Selection, and Final Model
• how to select the hyperparameters (e.g., the number of features)?
• kernel density estimation (KDE): the best hyperparameter (the kernel width) can be computed under certain assumptions
• n-fold cross-validation for hyperparameter selection (see the sketch below):
- the training set is divided into n parts
- n runs, where in the i-th run part i is used for testing
- average the error over all runs, for all hyperparameter combinations
- choose the parameter combination with the smallest average error
• the cross-validation error approximates the generalization error, but
- the cross-validation training sets are overlapping
- points from the withheld fold are predicted with the same model, so an outlier influences multiple predictions
• leave-one-out cross-validation: only one data point is removed per fold
• assumption: the training set size is not important (one fold is removed)
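A minimal sketch of this selection procedure, assuming numpy and scikit-learn are available (the k-NN model, the grid of k values, and the random data are illustrative placeholders, not from the slides):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = rng.choice([-1, 1], size=120)

def cv_error(k, n_folds=5):
    errs = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True,
                                     random_state=0).split(X):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X[train_idx], y[train_idx])           # train on n-1 folds
        errs.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
    return np.mean(errs)                                # average error over folds

grid = [1, 3, 5, 7, 9]               # candidate hyperparameter values
best_k = min(grid, key=cv_error)     # smallest average cross-validation error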
• How to estimate the performance of a model?
• n-fold cross-validation, but
- another k-fold cross-validation on each training set to select the hyperparameters
- feature selection and feature ranking must also be done separately for each training set, i.e. for each fold
• a well-known error: feature selection on all data followed by cross-validation
- among equally relevant features, those which are relevant also on the test fold are ranked higher
A sketch of this nested cross-validation follows below.
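A minimal nested cross-validation sketch, assuming numpy and scikit-learn are available (the SVM, the f_classif feature filter, and the C grid are illustrative placeholders); putting feature selection inside the pipeline ensures it is redone on each training fold:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = rng.choice([0, 1], size=100)

# feature selection is part of the pipeline, so it is refitted per fold
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)), ("svm", SVC())])
inner = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]},
                     cv=KFold(5, shuffle=True, random_state=0))
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(5, shuffle=True, random_state=1))
print(1 - outer_scores.mean())   # error estimated on folds never seen by selection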
Comparing models
• type I and type II errors:
- Type I: wrongly detect a difference
- Type II: miss a difference
• methods for testing the performance (see the McNemar sketch below):
- paired t-test: repeatedly dividing the data into test and training set; too many type I errors
- k-fold cross-validated paired t-test: fewer type I errors than the paired t-test
- McNemar's test: type I and type II errors well estimated
- 5x2CV (5 times two-fold cross-validation): comparable to McNemar; two-fold: many test points, no overlapping training sets
• other criteria:
- space and time complexity
- both for training and for testing (practical use)
- training time is often not relevant (wait a week to make money)
- a faster test allows averaging over many runs
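A minimal sketch of McNemar's test (assuming scipy is available; the disagreement counts are made up): only the test examples on which the two classifiers disagree enter the test.

from scipy.stats import binom

# n01: examples only classifier A gets wrong; n10: only classifier B gets wrong
n01, n10 = 12, 25

# under H0 (equal error rates) the disagreements split 50/50 -> exact binomial test
n = n01 + n10
p_value = min(1.0, 2 * binom.cdf(min(n01, n10), n, 0.5))
print(p_value)   # a small p-value indicates the classifiers differ significantly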
Chapter 3
Theoretical Background of Machine Learning
• quality criteria: the goal for model selection / learning
• approximations
• unsupervised learning: maximum likelihood
• concepts: bias and variance, efficient estimator, Fisher information
• supervised learning considered in an unsupervised framework: error model
• does learning from examples help in the future?
• “empirical risk minimization” (ERM)
• complexity is restricted and the dynamics fixed
• “learning helps”: more training examples improve the model
• the model converges to the best model for all future data
• convergence is fast
• complexity of a model class: VC-dimension (Vapnik-Chervonenkis)
• “structural risk minimization” (SRM): balances complexity and model quality
• bounds on the generalization error
Model Quality Criteria
• learning is equivalent to model selection
• quality criteria: future data is optimally processed
• other concepts: visualization, modeling, data compression
• Kohonen networks: no scalar quality criterion (no potential function)
• advantages of quality criteria:
- comparison of different models
- quality is known during learning
• supervised quality criteria: rate of misclassifications or squared error
• unsupervised criteria:
- likelihood
- ratio of between- and within-cluster distance
- independence of the components
- information content
- expected reconstruction error
Generalization Error
Now: supervised learning
• performance of a model on future data: generalization error
• error on one example: loss or error
• expected loss: risk or generalization error
Definition of the Generalization Error/Risk
Training set: $\{(x^i, y^i) \mid 1 \leq i \leq l\}$ with label or target value $y^i$ for input $x^i$; in the simple case $x^i \in \mathbb{R}^d$ and $y^i \in \mathbb{R}$.
Matrix notation for the training inputs: $X = (x^1, \ldots, x^l)^\top \in \mathbb{R}^{l \times d}$; vector notation for the labels: $y = (y^1, \ldots, y^l)^\top$; matrix notation for the training set: $Z = (X, y)$.
The loss function measures the error of the model output $g(x; w)$ on one example $z = (x, y)$:
quadratic loss: $L(y, g(x; w)) = (y - g(x; w))^2$
zero-one loss: $L(y, g(x; w)) = 0$ for $y = g(x; w)$ and $1$ otherwise
Generalization error: the expected loss (risk) $R(w) = \mathbb{E}_z\left[L(y, g(x; w))\right] = \int L(y, g(x; w))\, p(z)\, dz$
y is a function of x (target function: y = f(x)) plus noise: $y = f(x) + \epsilon$.
Now the risk can be computed as
$R(w) = \int_x \int_\epsilon L\big(f(x) + \epsilon,\, g(x; w)\big)\, p(\epsilon \mid x)\, p(x)\, d\epsilon\, dx$
The noise-free case is $y = f(x)$, and the risk simplifies to:
$R(w) = \int_x L\big(f(x),\, g(x; w)\big)\, p(x)\, dx$
Empirical Estimation of the Generalization Error
• p(z) is unknown, especially p(y|x)
• therefore the risk cannot be computed
• practical applications: approximation of the risk
• model performance estimation for the user
Test Set
Test set approximation: the expectation can be approximated by the empirical mean
$R(w) \approx \frac{1}{m} \sum_{i=l+1}^{l+m} L\big(y^i, g(x^i; w)\big)$
with test set $\{(x^i, y^i) \mid l + 1 \leq i \leq l + m\}$.
Cross-Validation
• not enough data for a test set (the data is needed for training)
• use cross-validation: divide the data into folds
n-fold cross-validation (here 5-fold):
Cross-validation is an almost unbiased estimator of the generalization error: the generalization error for training set size $l - l/n$ (the training data without one fold) can be estimated by n-fold cross-validation on the $l$ training data points.
• advantage: each test example is used only once (better than repeatedly dividing the data into training and test set)
• disadvantages:
- the training sets are overlapping
- the test examples of one fold are predicted with the same model and are therefore dependent
- due to these dependencies CV has high variance (one outlier influences several estimates)
• special case: leave-one-out cross-validation (LOO-CV)
- l-fold cross-validation, where each fold is one example
- test examples do not share the same model
- the training sets are maximally overlapping
Minimal Risk for a Gaussian Classification Task
Class y = 1 data points are drawn according to $x \sim \mathcal{N}(\mu_1, \Sigma_1)$ and class y = -1 according to $x \sim \mathcal{N}(\mu_2, \Sigma_2)$, where the Gaussian $\mathcal{N}(\mu, \Sigma)$ has density
$p(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
Linear transformations of Gaussians lead to Gaussians
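The standard result behind this statement (the slide's formula is not preserved) is, for an affine map of a Gaussian variable:
\[
x \sim \mathcal{N}(\mu, \Sigma), \quad u = A x + b
\;\;\Longrightarrow\;\;
u \sim \mathcal{N}\!\left(A \mu + b,\; A \Sigma A^{\top}\right),
\]
since the mean transforms as $\mathbb{E}[u] = A\mu + b$ and the covariance as $\mathrm{Var}[u] = A \Sigma A^{\top}$.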
• probability of observing a point at x: $p(x) = p(x \mid y{=}1)\, p(y{=}1) + p(x \mid y{=}{-1})\, p(y{=}{-1})$
(y is “integrated out”, here “summed out”)
• probability of observing a point from class y = 1 at x: $p(x, y{=}1) = p(x \mid y{=}1)\, p(y{=}1)$
• probability of observing a point from class y = -1 at x: $p(x, y{=}{-1}) = p(x \mid y{=}{-1})\, p(y{=}{-1})$
• conditional probability: $p(y{=}1 \mid x) = \frac{p(x, y{=}1)}{p(x)}$
• two-dimensional classification task
• data for each class come from a Gaussian (black: class 1, red: class -1)
• the optimal discriminant functions are two hyperbolas
• Bayes rule for the probability of x belonging to class y = 1:
$p(y{=}1 \mid x) = \frac{p(x \mid y{=}1)\, p(y{=}1)}{p(x)}$
Risk with zero-one loss:
$R(g) = \int_x \left( L(1, g(x))\, p(x, y{=}1) + L(-1, g(x))\, p(x, y{=}{-1}) \right) dx$
Loss function contributions: at each x, misclassifying class 1 contributes $p(x, y{=}1)$ and misclassifying class -1 contributes $p(x, y{=}{-1})$.
The minimal risk is
$R_{\min} = \int_x \min\{p(x, y{=}1),\, p(x, y{=}{-1})\}\, dx$
Optimal discriminant function (see later): at each position x, assign the class with the larger joint probability, so the risk integrand takes the smallest value.
• discriminant function g: if g(x) > 0, x is assigned to y = 1; if g(x) < 0, x is assigned to y = -1
• classification function: $c(x) = \mathrm{sign}(g(x))$
• optimal discriminant functions (minimal risk):
$g(x) = p(x, y{=}1) - p(x, y{=}{-1})$
or
$g(x) = \ln p(x, y{=}1) - \ln p(x, y{=}{-1})$
For Gaussians:
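The slide's formula is not preserved; the standard discriminant for two Gaussian classes (cf. Duda, Hart, and Stork) inserts the Gaussian densities into $g(x) = \ln p(x, y{=}1) - \ln p(x, y{=}{-1})$:
\[
g(x) = -\tfrac{1}{2}(x-\mu_1)^\top \Sigma_1^{-1} (x-\mu_1)
       + \tfrac{1}{2}(x-\mu_2)^\top \Sigma_2^{-1} (x-\mu_2)
       - \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|}
       + \ln\frac{p(y{=}1)}{p(y{=}{-1})},
\]
a function quadratic in $x$, which yields conic-section decision boundaries such as the hyperbolas above.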
[Figures: optimal decision boundaries for the two Gaussian classes, shown in 2D, 3D, and 1D]
Maximum Likelihood
• one of the major objectives in learning generative models
• it has appealing theoretical properties
• theoretical concepts like the efficient estimator and the biased estimator are introduced
• even supervised methods can be viewed as a special case of maximum likelihood
Loss for Unsupervised Learning
First we consider the different loss functions used for unsupervised learning:
• generative approaches: maximum likelihood
• projection methods: low information loss plus a desired property of the projection
• parameter estimation: difference between the estimated parameter vector and the optimal parameter vector
Projection Methods
• the data are projected into another space so that the projection satisfies desired requirements
• “Principal Component Analysis” (PCA): projection to a low-dimensional space under maximal information conservation
• “Independent Component Analysis” (ICA): projection into a space with statistically independent components (factorial code); often characteristics of a factorial distribution are optimized:
 - maximal entropy (given the variance)
 - cumulants
 or prototype distributions should be matched:
 - product of special super-Gaussians
• “Projection Pursuit”: components are maximally non-Gaussian
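To make the projection idea concrete, here is a minimal NumPy sketch of the PCA projection (an illustration added here, not part of the original slides; variable names are illustrative): the data are projected onto the leading eigenvectors of the sample covariance, which conserves maximal variance for the chosen target dimension.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k leading principal components:
    the k eigenvectors of the sample covariance with largest eigenvalues."""
    Xc = X - X.mean(axis=0)                # center the data
    cov = Xc.T @ Xc / (len(X) - 1)         # sample covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)   # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :k]             # k leading eigenvectors
    return Xc @ W                          # low-dimensional projection

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Z = pca_project(X, 2)                      # 200 points projected to 2D
```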
Generative Models
“generative model”: model simulates the world and produces the same data as the world
• the data generation process is probabilistic: there is an underlying distribution
• the generative model attempts to approximate this distribution
• the loss function is the distance between the model output distribution and the distribution of the data generation process
• examples: “Factor Analysis”, “Latent Variable Models”, “Boltzmann Machines”, “Hidden Markov Models”
Parameter Estimation
• the parameterized model is known
• task: estimate the actual parameters
• loss: difference between the true and the estimated parameter vector
• the estimator is evaluated by its expected loss
Mean Squared Error, Bias, and Variance
Theoretical concepts of parameter estimation
• training data: $\{x_1, \dots, x_l\}$, written simply as $X$ (the matrix of training data)
• true parameter vector: $w$
• estimate of $w$: $\hat{w} = \hat{w}(X)$, a function of the training data
• unbiased estimator: $\mathbf{E}_X\big(\hat{w}\big) = w$ —
on average (over training sets) the true parameter is obtained
• bias: $b(\hat{w}) = \mathbf{E}_X\big(\hat{w}\big) - w$
• variance: $\mathrm{var}(\hat{w}) = \mathbf{E}_X\big(\|\hat{w} - \mathbf{E}_X(\hat{w})\|^2\big)$
• mean squared error (MSE, different from the supervised loss): $\mathrm{mse}(\hat{w}) = \mathbf{E}_X\big(\|\hat{w} - w\|^2\big) = \mathrm{var}(\hat{w}) + \|b(\hat{w})\|^2$ —
the expected squared error between the estimated and the true parameter
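These definitions can be checked numerically. The following Monte Carlo sketch (illustrative, not from the slides) compares the ML variance estimator, which divides by $l$ and is biased, with the corrected estimator dividing by $l-1$, and verifies that mse = variance + bias² holds in both cases.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, l, runs = 4.0, 10, 100_000

# many training sets, each with l Gaussian samples of variance true_var
x = rng.normal(0.0, np.sqrt(true_var), size=(runs, l))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for name, est in [("ML (1/l)", ss / l), ("unbiased (1/(l-1))", ss / (l - 1))]:
    bias = est.mean() - true_var
    var = est.var()
    print(f"{name:20s} bias={bias:+.3f}  var={var:.3f}  mse={var + bias**2:.3f}")
```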
For an unbiased estimator the bias term is zero, so the MSE depends only on the variance.
Averaging reduces the variance: each of the $N$ subsets has $l$ examples, which gives $N\,l$ examples in total.
The average is $\bar{w} = \frac{1}{N} \sum_{i=1}^{N} \hat{w}_i$, where $\hat{w}_i$ is the estimate from subset $i$.
Unbiased: $\mathbf{E}(\bar{w}) = w$. Variance: $\mathrm{var}(\bar{w}) = \frac{1}{N}\,\mathrm{var}(\hat{w})$.
• averaging: the training sets are independent, therefore the covariance between the estimates vanishes
• Minimal Variance Unbiased (MVU) estimator: among all unbiased estimators, the one with minimal variance
• an MVU estimator does not always exist
• there are methods to check whether a given estimator is an MVU estimator
Fisher Information Matrix, Cramer-Rao Lower Bound, and Efficiency
• We will find a lower bound for the variance of an unbiased estimator: the Cramer-Rao Lower Bound (CRLB), which for unbiased estimators is also a lower bound for the MSE
• We need the Fisher information matrix $I_F(w)$:
If the density $p(x; w)$ satisfies the regularity condition $\mathbf{E}\left(\frac{\partial \ln p(x; w)}{\partial w}\right) = 0$, then the Fisher information matrix is
$[I_F(w)]_{ij} = -\,\mathbf{E}\left(\frac{\partial^2 \ln p(x; w)}{\partial w_i \, \partial w_j}\right)$.
Fisher information: the information an observation $x$ carries about the parameter $w$ upon which the parameterized density of $x$ depends.
The Cramer-Rao lower bound states that every unbiased estimator satisfies $\mathrm{var}(\hat{w}_i) \geq \big[I_F^{-1}(w)\big]_{ii}$.
• efficient estimator: reaches the CRLB (uses the data efficiently)
• an MVU estimator can be efficient, but need not be
[Figure: variances of unbiased estimators as a function of the parameter; the dashed line marks the CRLB]
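A quick numerical illustration of efficiency (a sketch added here, using the standard Gaussian example): for $x_i \sim \mathcal{N}(\mu, \sigma^2)$ the Fisher information about $\mu$ is $l/\sigma^2$, so the CRLB is $\sigma^2/l$; the sample mean attains this bound and is therefore efficient.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, l, runs = 2.0, 3.0, 25, 200_000

# variance of the sample-mean estimator over many training sets
means = rng.normal(mu, sigma, size=(runs, l)).mean(axis=1)
print("empirical variance of the estimator:", means.var())
print("CRLB sigma^2 / l:                   ", sigma**2 / l)
```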
Maximum Likelihood Estimator
• often the MVU estimator is unknown or does not exist
• remedy: the Maximum Likelihood Estimator (MLE)
• the MLE can be applied to a broad range of problems
• the MLE approximates the MVU estimator for large data sets
• the MLE is even asymptotically efficient and unbiased
• given enough data, the MLE does everything right, and does so efficiently
The likelihood $L$ of the data set $X$: the probability of the model to produce the data, $L(X; w) = p(X; w)$.
For iid (independent and identically distributed) data: $L(X; w) = \prod_{i=1}^{l} p(x_i; w)$.
Negative log-likelihood: $-\ln L(X; w) = -\sum_{i=1}^{l} \ln p(x_i; w)$.
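As a small worked example (added for illustration), the negative log-likelihood of a Gaussian model can be evaluated directly; the closed-form ML estimates, the sample mean and the ML variance, minimize it.

```python
import numpy as np

x = np.array([1.2, 0.7, 2.3, 1.9, 1.1])        # iid observations

def neg_log_likelihood(mu, sigma, x):
    # -ln L = sum_i [ 0.5*ln(2*pi*sigma^2) + (x_i - mu)^2 / (2*sigma^2) ]
    return (0.5 * np.log(2 * np.pi * sigma**2)
            + (x - mu)**2 / (2 * sigma**2)).sum()

mu_ml, sigma_ml = x.mean(), x.std()             # np.std divides by l (the ML choice)
print(neg_log_likelihood(mu_ml, sigma_ml, x))   # minimal value
print(neg_log_likelihood(0.0, 1.0, x))          # any other parameters give more
```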
• the likelihood is based on finitely many density values $p(x_i; w)$, and single points have zero measure: is that a problem?
• no: consider instead of $p(x_i; w)$ the probability of a small volume element (a region around $x_i$), which is proportional to the density value
• the MLE is popular because of: - its simple use - its properties
Properties of Maximum Likelihood Estimator
The MLE is:
• invariant under parameter change
• asymptotically unbiased and efficient, i.e. asymptotically optimal
• consistent if the CRLB goes to zero
MLE is Invariant under Parameter Change
If $\hat{w}$ is the maximum likelihood estimator of $w$ and $u = g(w)$ is a change of the parameters, then $g(\hat{w})$ is the maximum likelihood estimator of $u$.
MLE is Asymptotically Unbiased and Efficient
The maximum likelihood estimator is asymptotically unbiased: $\lim_{l \to \infty} \mathbf{E}\big(\hat{w}_{\mathrm{ML}}\big) = w$.
The maximum likelihood estimator is asymptotically efficient: its variance approaches the CRLB, $\lim_{l \to \infty} \mathrm{var}\big(\hat{w}_{\mathrm{ML}}\big) = I_F^{-1}(w)$.
• in practical applications with finitely many examples the performance of the MLE is unknown
• example: the general linear model $y = X w + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma)$
The MLE is $\hat{w} = \big(X^T \Sigma^{-1} X\big)^{-1} X^T \Sigma^{-1} y$, which is efficient and MVU. Note that the noise covariance $\Sigma$ must be known.
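A sketch of this example (illustrative; the matrix formula above is standard generalized least squares): with a known diagonal noise covariance, the MLE recovers the true parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
l, d = 200, 3
X = rng.normal(size=(l, d))
w_true = np.array([1.0, -2.0, 0.5])

noise_var = rng.uniform(0.5, 2.0, size=l)      # known per-observation noise
y = X @ w_true + rng.normal(0.0, np.sqrt(noise_var))

# MLE of the general linear model: w = (X^T S^-1 X)^-1 X^T S^-1 y
Si = np.diag(1.0 / noise_var)                  # Sigma^{-1}, Sigma assumed known
w_ml = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
print(w_ml)                                    # close to w_true
```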
MLE is Consistent for Zero CRLB
• consistent: the estimator converges in probability to the true parameter, $\hat{w}_l \to w$ as $l \to \infty$ —
for large training sets the estimator approaches the true value (in contrast to unbiasedness, here the variance must also decrease)
• a more formal definition of consistency is given later
Since the MLE is asymptotically unbiased and its variance approaches the CRLB, the MLE is consistent if the CRLB goes to zero as $l \to \infty$.
Expectation Maximization
• the likelihood can be optimized by gradient descent methods
• but sometimes the likelihood cannot be computed analytically, because of:
 -- hidden states
 -- many-to-one output mappings
 -- non-linearities
• hidden variables, latent variables, unobserved variables $u$
• the likelihood of an observation $x$ is determined by all hidden states $u$ that are mapped to $x$: $p(x; w) = \int p(x \mid u; w)\, p(u; w)\, du$
• Expectation Maximization (EM) algorithm:
 -- the joint probability $p(x, u; w)$ is easier to compute than the likelihood
 -- estimate the hidden states by an auxiliary distribution $Q(u \mid x)$
Jensen's inequality gives a lower bound on the log-likelihood:
$\ln p(x; w) \;\geq\; \int Q(u \mid x) \ln \frac{p(x, u; w)}{Q(u \mid x)} \, du \;=\; \mathcal{F}(Q, w)$
• the EM algorithm is an iteration between the “E”-step and the “M”-step:
E-step: $Q^{t+1} = \arg\max_Q \mathcal{F}(Q, w^t)$, i.e. $Q^{t+1}(u \mid x) = p(u \mid x; w^t)$
M-step: $w^{t+1} = \arg\max_w \mathcal{F}(Q^{t+1}, w)$
• after the E-step the lower bound is tight: $\mathcal{F}(Q^{t+1}, w^t) = \ln p(x; w^t)$
Proof: $\ln p(x; w) - \mathcal{F}(Q, w) = \mathrm{KL}\big(Q(u \mid x) \,\|\, p(u \mid x; w)\big)$,
and the Kullback-Leibler divergence is non-negative; it is zero for $Q(u \mid x) = p(u \mid x; w)$.
• EM increases the lower bound in both steps
• at the beginning of the M-step the bound equals the log-likelihood, since the E-step does not change the parameters
The EM algorithm is used for:
 -- hidden Markov models
 -- mixtures of Gaussians
 -- factor analysis
 -- independent component analysis
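As one of the listed applications, here is a compact EM iteration for a two-component one-dimensional mixture of Gaussians (an illustrative sketch, not the lecture's code): the E-step computes the posterior of the hidden component (the responsibilities), the M-step re-estimates the parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
# observed data from a two-component 1D mixture; the component is hidden
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

pis = np.array([0.5, 0.5])               # mixing weights
mu  = np.array([-1.0, 1.0])              # component means
var = np.array([1.0, 1.0])               # component variances

def gauss(x, mu, var):
    return np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: responsibilities, i.e. the posterior p(u | x) of the hidden variable
    r = pis * gauss(x, mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters with the responsibilities fixed
    n = r.sum(axis=0)
    pis, mu = n / len(x), (r * x[:, None]).sum(axis=0) / n
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n

print(pis, mu, var)                      # approx. [0.3, 0.7], [-2, 3], [1, 1]
```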
Noise Models
• noise models connect unsupervised and supervised learning
• they provide a quality measure for supervised models
• idea: assume noise on the targets
• then apply maximum likelihood
• Gaussian target noise: $y = g(x; w) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$
• linear model: $g(x; w) = w^T x$
log-likelihood: $\ln L = -\frac{l}{2} \ln(2\pi\sigma^2) \,-\, \frac{1}{2\sigma^2} \sum_{i=1}^{l} \big(y_i - g(x_i; w)\big)^2$
• maximizing the log-likelihood amounts to minimizing the least square criterion $\sum_{i=1}^{l} (y_i - w^T x_i)^2 = \|y - Xw\|^2$: the linear least square estimator
derivative with respect to $w$: $-2\, X^T (y - Xw)$
Setting the derivative to zero gives the Wiener-Hopf (normal) equations: $X^T X\, w = X^T y$
Gaussian Noise
The noise covariance matrix $\Sigma$ gives the noise for each measurement. In most cases we have the same noise for each observation: $\Sigma = \sigma^2 I$.
We obtain $\hat{w} = \big(X^T X\big)^{-1} X^T y = X^{+} y$, where $X^{+} = (X^T X)^{-1} X^T$ is the pseudo inverse or Moore-Penrose inverse of $X$.
Minimal value of the criterion: $\|y - X\hat{w}\|^2 = y^T \big(I - X X^{+}\big)\, y$.
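A minimal NumPy check (added for illustration) that the normal-equation solution and the Moore-Penrose pseudo-inverse solution coincide:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.3]) + rng.normal(0.0, 0.1, 100)

# normal (Wiener-Hopf) equations: X^T X w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# equivalently via the Moore-Penrose pseudo-inverse X^+
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_normal, w_pinv))     # True
```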
Laplace Noise and Minkowski Error
Laplace noise assumption: $p(\epsilon) = \frac{1}{2\beta}\, e^{-|\epsilon| / \beta}$, which leads to the absolute error $\sum_i |y_i - g(x_i; w)|$.
More general Minkowski error: the noise model $p(\epsilon) = \frac{r}{2\beta\,\Gamma(1/r)}\, e^{-(|\epsilon|/\beta)^r}$ gives the error $\sum_i |y_i - g(x_i; w)|^r$, where $\Gamma$ denotes the gamma function.
Binary Models
• the above noise considerations do not hold for binary targets
• classification has therefore not been treated yet
Cross-Entropy
Classification problem with $K$ classes: the targets are one-hot coded, $y_i^k \in \{0, 1\}$ with $\sum_k y_i^k = 1$, and the model outputs $g_k(x; w)$ estimate $p(k \mid x)$.
Likelihood: $L = \prod_{i=1}^{l} \prod_{k=1}^{K} g_k(x_i; w)^{\,y_i^k}$
The log-likelihood: $\ln L = \sum_{i=1}^{l} \sum_{k=1}^{K} y_i^k \ln g_k(x_i; w)$.
The negative log-likelihood as loss function is the cross entropy, which is related to the Kullback-Leibler divergence between the target and the model distribution.
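A tiny numerical example of the cross-entropy loss (added for illustration; y holds one-hot targets, g the model's class probabilities):

```python
import numpy as np

# one-hot targets y and model outputs g (rows sum to one), K = 3 classes
y = np.array([[1, 0, 0],
              [0, 0, 1]])
g = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

# negative log-likelihood = cross entropy: -sum_i sum_k y_ik * ln g_k(x_i)
cross_entropy = -(y * np.log(g)).sum()
print(cross_entropy)     # = -(ln 0.7 + ln 0.6)
```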
Logistic Regression
A function $g$ mapping $x$ onto $\mathbb{R}$ can be transformed into a probability by the logistic (sigmoid) function:
$p(y = 1 \mid x) = \frac{1}{1 + e^{-g(x)}}$
It follows that $g$ is the log-odds: $g(x) = \ln \frac{p(y=1 \mid x)}{p(y=0 \mid x)}$.
log-likelihood: $\ln L = \sum_{i=1}^{l} \big[\, y_i \ln p(y{=}1 \mid x_i) + (1 - y_i) \ln\big(1 - p(y{=}1 \mid x_i)\big) \big]$
maximum likelihood maximizes this expression.
Derivative of the log-likelihood for $g(x; w) = w^T x$:
$\frac{\partial \ln L}{\partial w} = \sum_{i=1}^{l} \big( y_i - p(y{=}1 \mid x_i) \big)\, x_i$
This is similar to the derivative of the quadratic loss function in regression: the probability $p(y{=}1 \mid x_i)$ appears instead of the linear output $w^T x_i$.
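The gradient above leads directly to a simple training loop. The following sketch (illustrative; plain gradient ascent with a hand-picked step size) fits logistic regression by maximum likelihood:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 2))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w, eta = np.zeros(2), 0.1
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))        # p(y=1 | x) under the current w
    # gradient ascent on the log-likelihood: sum_i (y_i - p_i) x_i
    w += eta * X.T @ (y - p) / len(y)

print(w)                                 # close to w_true
```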
Statistical Learning Theory
• Does learning help for future tasks?
• Does a model which explains the training data also explain new data?
• Yes, if the complexity is bounded
• the VC-dimension serves as complexity measure
• statistical learning theory: bounds for the generalization error (on future data)
• the bounds comprise the training error and the complexity
• structural risk minimization minimizes both terms simultaneously
• statistical learning theory rests on:
 -- (1) the uniform law of large numbers (empirical risk minimization)
 -- (2) complexity-constrained models (structural risk minimization)
• error bound on the mean squared error: the bias-variance formulation
 -- the bias corresponds to the training error = empirical risk
 -- the variance corresponds to the model complexity: high complexity means more models, hence more solutions, hence large variance
Error Bounds for a Gaussian Classification Task
• We revisit the Gaussian classification task
• Gaussian assumption: $p(x \mid y = i) = \mathcal{N}(x; \mu_i, \Sigma_i)$
Chernoff bound: $P(\text{error}) \;\leq\; P(y{=}1)^s \, P(y{=}2)^{1-s} \int p^s(x \mid y{=}1)\, p^{1-s}(x \mid y{=}2)\, dx$ for $0 \leq s \leq 1$; optimizing the bound with respect to $s$ gives the Chernoff bound.
Bhattacharyya bound: the special case $s = \tfrac{1}{2}$.
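For two one-dimensional Gaussians with equal variance and equal priors, both the Bayes error and the Bhattacharyya bound have closed forms, so the bound can be checked directly (an added sketch; the formulas below are the standard equal-variance specializations):

```python
from math import erf, exp, sqrt

mu1, mu2, sigma = 0.0, 2.0, 1.0          # two classes, equal priors 1/2

# Bhattacharyya bound for equal-variance Gaussians:
# P(error) <= sqrt(P1*P2) * exp(-(mu1-mu2)^2 / (8 sigma^2))
bound = 0.5 * exp(-(mu1 - mu2) ** 2 / (8 * sigma ** 2))

# exact Bayes error in this simple case: Phi(-|mu1-mu2| / (2 sigma))
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
bayes = Phi(-abs(mu1 - mu2) / (2 * sigma))

print(bayes, "<=", bound)                # 0.159 <= 0.303
```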
Empirical Risk Minimization
The empirical risk minimization (ERM) principle states: if the training set is explained by the model, then the model generalizes to future examples, provided the complexity of the model class is restricted.
Empirical risk minimization (ERM): minimize the error on the training set.
Complexity: Finite Number of Functions
• intuition for why complexity matters
• here, complexity is just the number $M$ of functions in the model class $\{g_1, \dots, g_M\}$
• we bound the difference between the training error (empirical risk $R_{\mathrm{emp}}$) and the test error (risk $R$)
• empirical risk: $R_{\mathrm{emp}}(g) = \frac{1}{l} \sum_{i=1}^{l} L\big(y_i, g(x_i)\big)$
• finite set of functions, worst case (learning may choose any function, which is unknown in advance):
Union bound: $P\Big(\max_{j} \big|R(g_j) - R_{\mathrm{emp}}(g_j)\big| > \epsilon\Big) \leq \sum_{j=1}^{M} P\Big(\big|R(g_j) - R_{\mathrm{emp}}(g_j)\big| > \epsilon\Big)$.
Distance of average and expectation: the Chernoff (Hoeffding) inequality gives, for each $j$, $P\big(|R(g_j) - R_{\mathrm{emp}}(g_j)| > \epsilon\big) \leq 2\, e^{-2 l \epsilon^2}$, where $R_{\mathrm{emp}}$ is the empirical mean of the true value $R$ over $l$ trials.
Together we obtain, with probability at least $1 - \delta$: $R(g) \leq R_{\mathrm{emp}}(g) + \sqrt{\frac{\ln M + \ln(2/\delta)}{2l}}$ — the square-root term is the complexity term.
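The finite-class bound can be simulated (an illustrative sketch added here): for $M$ fixed classifiers with known risks, the worst-case gap between empirical and true risk exceeds the bound's $\epsilon$ far less often than $\delta$.

```python
import numpy as np

rng = np.random.default_rng(6)
M, l, runs, delta = 100, 200, 10_000, 0.05

# M fixed classifiers with true risks R_j; the empirical risk of g_j on a
# training set of size l is an average of l Bernoulli(R_j) losses
R = rng.uniform(0.1, 0.9, size=M)
emp = rng.binomial(l, R, size=(runs, M)) / l

gap = np.abs(emp - R).max(axis=1)                 # worst function per training set
eps = np.sqrt((np.log(M) + np.log(2 / delta)) / (2 * l))
print((gap > eps).mean(), "<=", delta)            # violation rate stays below delta
```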
The complexity term $\sqrt{(\ln M)/l}$ should converge to zero as $l$ increases; therefore $\ln M$ must grow more slowly than $l$.
Complexity: VC-Dimension
• we want to apply the previous bound to infinite function classes
• idea: on a finite training set only finitely many functions are distinguishable
• example: all discriminant functions $g$ giving the same classification function $\mathrm{sign}\, g(\cdot)$
• parametric models $g(\cdot; w)$ with parameter vector $w$
• Does the parameter minimizing the error on the training set converge to the best solution as the training set grows?
• empirical risk minimization (ERM): consistent or not?
• do we select better models with larger training sets?
• let $w_l$ be the parameter which minimizes the empirical risk for $l$ training examples
• ERM is consistent if, as $l \to \infty$,
$R(w_l) \to \inf_w R(w)$ and $R_{\mathrm{emp}}(w_l) \to \inf_w R(w)$
(convergence in probability): the empirical risk and the expected risk both converge to the minimal risk.
• ERM is strictly consistent if for all subclasses $\{w : R(w) \geq c\}$, $c \in \mathbb{R}$,
$\inf_{w:\, R(w) \geq c} R_{\mathrm{emp}}(w) \;\to\; \inf_{w:\, R(w) \geq c} R(w)$
holds (convergence in probability). Instead of “strictly consistent” we simply write “consistent”.
• maximum likelihood is consistent for a set of densities under an analogous uniform convergence condition
• Under what conditions is ERM consistent?
• New concepts and new capacity measures:
 -- points to be shattered
 -- annealed entropy
 -- entropy (new definition)
 -- growth function
 -- VC-dimension
The idea: count the possibilities to label the input data with binary labels (“shattering” the input data). The complexity of a model class is the number of different labelings it can realize, i.e. how many points it can shatter.
[Figure: constellations of input points marked by “x”]
Note that each “x” is placed in a small circle around its position, independently of the other “x” points; therefore each constellation represents a set with non-zero probability mass.
• the number of points a function class can shatter is its VC-dimension (defined later)
• for a function class $\mathcal{F}$, the shattering coefficient $N_{\mathcal{F}}(x_1, \dots, x_l)$ is the number of labelings of $x_1, \dots, x_l$ the class can realize
• entropy of a function class: $H_{\mathcal{F}}(l) = \mathbf{E}\big(\ln N_{\mathcal{F}}(x_1, \dots, x_l)\big)$
• annealed entropy of a function class: $H^{\mathrm{ann}}_{\mathcal{F}}(l) = \ln \mathbf{E}\big(N_{\mathcal{F}}(x_1, \dots, x_l)\big)$
• growth function of a function class: $G_{\mathcal{F}}(l) = \ln \sup_{x_1, \dots, x_l} N_{\mathcal{F}}(x_1, \dots, x_l)$
By Jensen's inequality and the supremum: $H_{\mathcal{F}}(l) \leq H^{\mathrm{ann}}_{\mathcal{F}}(l) \leq G_{\mathcal{F}}(l)$.
• ERM has a fast rate of convergence (exponential convergence) if $\lim_{l \to \infty} H^{\mathrm{ann}}_{\mathcal{F}}(l)/l = 0$; the probability that the empirical risk deviates from the risk by more than $\varepsilon$ is then bounded by a term of the form $4 \exp\left(\left(H^{\mathrm{ann}}_{\mathcal{F}}(2l)/l - \varepsilon^2\right) l\right)$
• the theorems so far hold for a given probability measure on the observations: the measure enters the formulas via the expectation in the entropy and annealed entropy. Only the growth function, which takes the supremum over all samples, yields statements that are independent of the distribution.
• the VC (Vapnik-Chervonenkis) dimension $d_{\mathrm{VC}}$ is the largest integer $l$ for which $G_{\mathcal{F}}(l) = l \ln 2$ holds, i.e. for which some set of $l$ points is shattered; if the maximum does not exist, then $d_{\mathrm{VC}} = \infty$ • equivalently: the VC-dimension is the maximum number of vectors that can be shattered by the function class (see the sketch below)
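As a sanity check of the definition, the following brute-force sketch (our code, helper names ours) computes the VC-dimension of the simple class of interval classifiers $f_{a,b}(x) = +1 \Leftrightarrow a \le x \le b$ on the real line.

```python
# Brute-force VC-dimension check for interval classifiers on R:
# f_{a,b}(x) = +1 iff a <= x <= b, else -1.
import itertools
import numpy as np

def realizable(pts, lab):
    """Can some interval [a, b] produce exactly the +/-1 labels lab on the
    sorted points pts? Only endpoints between/around the points matter."""
    pts = np.asarray(pts, dtype=float)
    cand = np.concatenate([[pts[0] - 1.0], (pts[:-1] + pts[1:]) / 2.0,
                           [pts[-1] + 1.0]])
    for a, b in itertools.combinations_with_replacement(cand, 2):
        pred = np.where((pts >= a) & (pts <= b), 1, -1)
        if np.array_equal(pred, np.asarray(lab)):
            return True
    return False

def shattered(pts):
    """True iff interval classifiers realize all 2^l labelings of pts."""
    return all(realizable(pts, lab)
               for lab in itertools.product([-1, 1], repeat=len(pts)))

print(shattered([0.0, 1.0]))       # True:  two points are shattered
print(shattered([0.0, 1.0, 2.0]))  # False: labeling (+1, -1, +1) fails
# hence d_VC(intervals on R) = 2
```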
• a function class with finite VC-dimension is consistent and converges fast -- linear functions $\operatorname{sign}(w^\top x + b)$ in a $d$-dimensional input space: $d_{\mathrm{VC}} = d + 1$ -- nondecreasing nonlinear one-dimensional functions $\operatorname{sign}(f(x) - \theta)$ with $f$ nondecreasing: $d_{\mathrm{VC}} = 1$ -- nonlinear one-dimensional functions, e.g. $\operatorname{sign}(\sin(w x))$ with a single parameter $w$: $d_{\mathrm{VC}} = \infty$ (see the sketch below)
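The infinite VC-dimension of $\operatorname{sign}(\sin(wx))$ can be verified with the classical construction (the explicit choice of $w$ follows Vapnik's textbook example; the implementation is our sketch):

```python
# Classic construction: the one-parameter class f_w(x) = sign(sin(w x))
# shatters the points x_i = 10^{-i} for every n, with
# w = pi * (1 + sum_i (1 - y_i)/2 * 10^i) realizing the labels y_i.
import itertools
import math

n = 5
xs = [10.0 ** (-i) for i in range(1, n + 1)]
for labels in itertools.product([-1, 1], repeat=n):
    w = math.pi * (1 + sum((1 - y) // 2 * 10 ** i
                           for i, y in zip(range(1, n + 1), labels)))
    preds = [1 if math.sin(w * x) > 0 else -1 for x in xs]
    assert preds == list(labels)
print(f"all {2 ** n} labelings of {n} points realized -> d_VC = infinity")
```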
-- neural networks of $M$ linear threshold units with $W$ weights: $d_{\mathrm{VC}} \le 2 W \log_2(e M)$, where $e$ is the base of the natural logarithm (Baum & Haussler 1989; refined by Shawe-Taylor & Anthony 1991); for sigmoid networks with inputs restricted to a discrete range $\{-D, \dots, D\}$, Bartlett & Williamson (1996) give bounds of order $W \log_2(W D)$
Error Bounds
• idea behind deriving the error bounds: on a finite sample only finitely many functions can be distinguished; the cardinality of this set of distinguishable functions is given by the shattering coefficient $N_{\mathcal{F}}(x_1, \dots, x_{2l})$ • trick of two half-samples and their difference (“symmetrization”): the deviation between empirical risk and true risk is bounded via the deviation between the empirical risks on two independent samples, $P\left\{\sup_f \left(R(f) - R_{\mathrm{emp}}(f)\right) > \varepsilon\right\} \le 2\, P\left\{\sup_f \left(R'_{\mathrm{emp}}(f) - R_{\mathrm{emp}}(f)\right) > \varepsilon/2\right\}$, where $R'_{\mathrm{emp}}$ is the empirical risk on a second (“ghost”) sample of size $l$; therefore $2l$ examples are used for the complexity definition and $l$ examples for the empirical error (see the simulation sketch below) • minimal possible risk: $R_{\min} = \inf_{f \in \mathcal{F}} R(f)$
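To make the symmetrization step concrete, here is a small simulation sketch (our illustration, with an assumed toy setup: threshold classifiers $\operatorname{sign}(x - t)$ on uniform data with true labels $\operatorname{sign}(x - 0.5)$, so the true risk $R(f_t) = |t - 0.5|$ is known in closed form):

```python
# Symmetrization, empirically: sup-deviation of the empirical risk from the
# true risk vs. sup-deviation between two independent half-samples.
import numpy as np

rng = np.random.default_rng(1)
l, runs, eps = 50, 2000, 0.1
ts = np.linspace(0.0, 1.0, 201)      # grid over threshold classifiers f_t

def emp_risk(x, y):
    """Empirical risk of every f_t(x) = sign(x - t) on the sample (x, y)."""
    preds = np.where(x[None, :] >= ts[:, None], 1, -1)
    return (preds != y[None, :]).mean(axis=1)

true_risk = np.abs(ts - 0.5)         # closed form for this toy problem
hits_true, hits_ghost = 0, 0
for _ in range(runs):
    x1 = rng.uniform(size=l); y1 = np.where(x1 >= 0.5, 1, -1)
    x2 = rng.uniform(size=l); y2 = np.where(x2 >= 0.5, 1, -1)  # ghost sample
    r1, r2 = emp_risk(x1, y1), emp_risk(x2, y2)
    hits_true  += np.max(true_risk - r1) > eps      # deviation from true risk
    hits_ghost += np.max(r2 - r1) > eps / 2         # deviation between halves
print("P( sup_f R(f)      - R_emp(f) > eps   ) ~", hits_true / runs)
print("P( sup_f R_emp'(f) - R_emp(f) > eps/2 ) ~", hits_ghost / runs)
# symmetrization bounds the first probability by twice the second
```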
• the complexity measure depends on the ratio $d_{\mathrm{VC}}/l$ • the bound above is from Anthony and Bartlett, whereas an older bound from Vapnik is $R(f) \le R_{\mathrm{emp}}(f) + \sqrt{\frac{d_{\mathrm{VC}}\left(\ln \frac{2l}{d_{\mathrm{VC}}} + 1\right) + \ln \frac{4}{\delta}}{l}}$ • the complexity term decreases with $\sqrt{d_{\mathrm{VC}} \ln l / l}$ • for zero empirical risk the bound on the risk decreases with $d_{\mathrm{VC}} \ln l / l$, i.e. essentially with $1/l$ • later: the expected risk decreases with $1/l$
• bound on the risk: with probability at least $1 - \delta$, $R(f) \le R_{\mathrm{emp}}(f) + \text{complexity}(d_{\mathrm{VC}}, l, \delta)$ (a numerical sketch follows below) • the bound is similar to the bias-variance formulation: -- the bias corresponds to the empirical risk -- the variance corresponds to the complexity term
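A small numerical sketch (ours) evaluates the complexity term of Vapnik's bound for a few sample sizes and VC-dimensions; the exact constants of the bound vary between textbooks, so treat the numbers as illustrative:

```python
# Evaluating Vapnik's VC bound
# R <= R_emp + sqrt((d_VC (ln(2l/d_VC) + 1) + ln(4/delta)) / l)
# to see how the complexity term trades off against sample size.
import math

def vc_complexity_term(d_vc, l, delta=0.05):
    """Confidence-interval half-width of the classical Vapnik bound."""
    return math.sqrt((d_vc * (math.log(2 * l / d_vc) + 1)
                      + math.log(4 / delta)) / l)

for l in (100, 1_000, 10_000, 100_000):
    for d_vc in (3, 10, 100):
        print(f"l={l:>7}, d_VC={d_vc:>3}: "
              f"R <= R_emp + {vc_complexity_term(d_vc, l):.3f}")
# the term behaves like sqrt(d_VC * ln(l) / l): it shrinks with more data
# and grows with model complexity, mirroring a bias-variance trade-off
```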
• in many practical cases the bound itself is not useful because it is not tight (it can even exceed 1) • however, in many practical cases the minimum of the bound, viewed as a function of model complexity, lies close to the complexity at which the test error is minimal, so the bound can still guide model selection
• regression: instead of the shattering coefficient one uses the covering number $\mathcal{N}(\varepsilon, \mathcal{F}, l)$, the smallest number of functions needed so that every function of the class is within distance $\varepsilon$ of one of them on a sample of size $l$ • the growth function is then $G_{\mathcal{F}}(\varepsilon, l) = \ln \sup_{x_1, \dots, x_l} \mathcal{N}(\varepsilon, \mathcal{F}, l)$, and the bounds on the generalization error keep the same structure as in the classification case, with the capacity terms now depending on the resolution $\varepsilon$
Structural Risk Minimization
The Structural Risk Minimization (SRM) principle minimizes the guaranteed risk, i.e. the bound on the risk (empirical risk plus complexity term), instead of the empirical risk alone.
• nested set of function classes: $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \dots \subset \mathcal{F}_n \subset \dots$, where class $\mathcal{F}_j$ possesses VC-dimension $d_j$ and $d_1 \le d_2 \le \dots \le d_n \le \dots$; in each class the empirical risk is minimized, and the class with the smallest guaranteed risk is selected (a sketch follows below)
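The following toy sketch (our code, with assumptions: nested classes of sign-of-polynomial classifiers on $\mathbb{R}$, where the class of degree $j$ has $d_j = j + 1$; empirical risk minimization inside each class is approximated by a least-squares fit of the labels, a crude stand-in used only for illustration) shows how SRM picks the class minimizing the guaranteed risk:

```python
# SRM over nested classes F_j = { sign(p(x)) : p polynomial of degree j },
# which have VC-dimension d_j = j + 1 on the real line.
import math
import numpy as np

rng = np.random.default_rng(2)
l = 60
x = rng.uniform(-1, 1, size=l)
y = np.where(np.sin(3 * x) + 0.3 * rng.normal(size=l) > 0, 1, -1)  # noisy labels

def guaranteed_risk(emp_risk, d_vc, l, delta=0.05):
    # empirical risk + Vapnik's complexity term = "guaranteed risk"
    return emp_risk + math.sqrt((d_vc * (math.log(2 * l / d_vc) + 1)
                                 + math.log(4 / delta)) / l)

best = None
for degree in range(1, 11):
    coeffs = np.polyfit(x, y, degree)          # approximate ERM in F_degree
    emp = np.mean(np.sign(np.polyval(coeffs, x)) != y)
    bound = guaranteed_risk(emp, degree + 1, l)
    print(f"degree {degree:>2}: R_emp={emp:.3f}  guaranteed risk={bound:.3f}")
    if best is None or bound < best[0]:
        best = (bound, degree)
print("SRM selects degree", best[1])
```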
• example for SRM: minimum description length (MDL) -- the sender transmits a model (once) plus the inputs and the errors the model makes -- the receiver has to recover the labels from this message -- goal: minimize the transmission costs (description length): a more complex model is more expensive to transmit but leaves fewer errors to encode • Is the SRM principle consistent? How fast does it converge? SRM is consistent! Its asymptotic rate of convergence is determined by the estimation term of the selected class together with the speed at which the minimal risk $R_j = \inf_{f \in \mathcal{F}_j} R(f)$ of the classes approaches the minimal possible risk.
• if the optimal solution belongs to some class $\mathcal{F}_{j^\ast}$ of the nested structure, then the convergence rate is $O\left(\sqrt{\ln l / l}\right)$
Margin as Complexity Measure
• the VC-dimension can be reduced by imposing restrictions on the class of functions • most famous restriction: the zero isoline of the discriminant function must have a minimal distance (margin) $\gamma$ to all training data points, which are contained in a sphere with radius $R$; such margin hyperplanes satisfy $d_{\mathrm{VC}} \le \min\left(R^2 / \gamma^2,\ d\right) + 1$
• linear discriminant functions: $f(x) = w^\top x + b$ • classification function: $c(x) = \operatorname{sign}\left(w^\top x + b\right)$ • scaling $w$ and $b$ by a positive factor does not change the classification function • therefore each classification function is represented by one representative discriminant function • canonical form w.r.t. the training data $X = \{x_1, \dots, x_l\}$: $\min_i \left|w^\top x_i + b\right| = 1$
If at least one data point exists for which the discriminant function is positive and at least one for which it is negative, then we can optimize $b$ and rescale $w$ in order to obtain the smallest $\|w\|$ among all canonical forms. This gives the tightest bound and the smallest VC-dimension. After optimizing $b$ and rescaling, there exist points $x_+$ and $x_-$ for which $w^\top x_+ + b = 1$ and $w^\top x_- + b = -1$.
After this optimization: the distance of $x_+$ and $x_-$ to the boundary $\{x : w^\top x + b = 0\}$ is $1/\|w\|$; the margin is therefore $\gamma = 1/\|w\|$, so maximizing the margin amounts to minimizing $\|w\|$ (see the sketch below).
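A closing sketch (our code, not from the lecture): rescaling a given separating hyperplane into canonical form, reading off the margin $\gamma = 1/\|w\|$, and evaluating the margin-based capacity term; the optimization of $b$ is omitted here, and the data and helper names are our assumptions.

```python
# Canonical form and margin: rescale (w, b) so that min_i |w^T x_i + b| = 1;
# the margin of the canonical hyperplane is then gamma = 1 / ||w||.
import numpy as np

def canonicalize(w, b, X):
    """Rescale (w, b) by a common factor so that min_i |w^T x_i + b| = 1
    (the separate optimization of b is not performed here)."""
    m = np.min(np.abs(X @ w + b))       # smallest absolute discriminant value
    assert m > 0, "some point lies exactly on the hyperplane"
    return w / m, b / m

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(20, 2)) + 3.0,   # positive cluster
               rng.normal(size=(20, 2)) - 3.0])  # negative cluster
w, b = np.array([1.0, 1.0]), 0.0                 # assumed separating hyperplane

w_c, b_c = canonicalize(w, b, X)
gamma = 1.0 / np.linalg.norm(w_c)                # margin
R = np.max(np.linalg.norm(X, axis=1))            # radius of sphere containing X
d = X.shape[1]
print(f"margin gamma = {gamma:.3f}")
print(f"capacity term min(R^2 / gamma^2, d) + 1 = "
      f"{min(R**2 / gamma**2, d) + 1:.1f}")
```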