introduction to machine learning with scikit-learn
TRANSCRIPT
![Page 1: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/1.jpg)
Introduction to Machine Learning with Scikit-Learn
![Page 2: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/2.jpg)
- Preface- What is Machine Learning?- An Architecture for ML Data Products- What is Scikit-Learn?- Data Handling and Loading- Model Evaluation- Regressions- Classification- Clustering- Workshop
Plan of Study
![Page 3: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/3.jpg)
PrefaceOn teaching Machine Learning...
![Page 4: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/4.jpg)
Is Machine Learning a one semester course?
Statistics Artificial Intelligence
Computer Science
Probability
Normalization
Distributions
Smoothing
Bayes Theorem
Regression
Logits
Optimization
Planning
Computer Vision
Natural Language Processing
Reinforcement
Neural Models
Computer Vision
Anomaly Detection
Entropy
Function Approximation
Data Mining
Graph Algorithms
Big Data
![Page 5: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/5.jpg)
Machine Learning Practitioner ContinuumHacker Academic
Machine Learning Practitioner Domains
Statistician Expert Programmer
![Page 6: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/6.jpg)
Context to Data Mining & Statistics
Data
Machine or App
Users
Data Mining
Machine Learning
Statistics
Computer Science
![Page 7: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/7.jpg)
What is Machine Learning?
![Page 8: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/8.jpg)
Learning by Example
Given a bunch of examples (data) extract a meaningful pattern upon which to act.
Problem Domain Machine Learning Class
Infer a function from labeled data Supervised learning
Find structure of data without feedback Unsupervised learning
Interact with environment towards goal Reinforcement learning
![Page 9: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/9.jpg)
How do you make predictions?
![Page 10: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/10.jpg)
What patterns do you see?
![Page 11: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/11.jpg)
What is the Y value?
![Page 12: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/12.jpg)
How do you determine red from blue?
![Page 13: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/13.jpg)
Input training data to fit a model which is then used to predict incoming inputs into ...
Types of Algorithms by Output
Type of Output Algorithm Category
Output is one or more discrete classes Classification (supervised)
Output is continuous Regression (supervised)
Output is membership in a similar group Clustering (unsupervised)
Output is the distribution of inputs Density Estimation
Output is simplified from higher dimensions Dimensionality Reduction
![Page 14: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/14.jpg)
Classification
Given labeled input data (with two or more labels), fit a function that can determine for any input, what the label is.
![Page 15: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/15.jpg)
Regression
Given continuous input data fit a function that is able to predict the continuous value of input given other data.
![Page 16: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/16.jpg)
Clustering
Given data, determine a pattern of associated data points or clusters via their similarity or distance from one another.
![Page 17: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/17.jpg)
“Model” is an overloaded term.
• Model family describes, at the broadest possible level, the connection between the variables of interest.
• Model form specifies exactly how the variables of interest are connected within the framework of the model family.
http://had.co.nz/stat645/model-vis.pdf
Hadley Wickham (2015)
• A fitted model is a concrete instance of the model form where all parameters have been estimated from data, and the model can be used to generate predictions.
![Page 18: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/18.jpg)
Dimensions and Features
In order to do machine learning you need a data set containing instances (examples) that are composed of features from which you compose dimensions.
Instance: a single data point or example composed of fieldsFeature: a quantity describing an instance Dimension: one or more attributes that describe a property from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data # X.shape == (n_samples, n_features)
y = digits.target # y.shape == (n_samples,)
![Page 19: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/19.jpg)
Feature SpaceFeature space refers to the n-dimensions where your variables live (not including a target variable or class). The term is used often in ML literature because in ML all variables are features (usually) and feature extraction is the art of creating a space with decision boundaries.
Target1. Y ≡ Thickness of car tires after some testing period
Variables1. X1 ≡ distance travelled in test2. X2 ≡ time duration of test3. X3 ≡ amount of chemical C in tires
The feature space is R3, or more accurately, the positive quadrant in R3 as all the X variables can only be positive quantities.
http://stats.stackexchange.com/questions/46425/what-is-feature-space
![Page 20: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/20.jpg)
MappingsDomain knowledge about tires might suggest that the speed the vehicle was moving at is important, hence we generate another variable, X4 (this is the feature extraction part):
X4 = X1*X2 ≡ the speed of the vehicle during testing.
This extends our old feature space into a new one, the positive part of R4.
A mapping is a function, ϕ, from R3 to R4:
ϕ(x1,x2,x3) = (x1,x2,x3,x1x2)
http://stats.stackexchange.com/questions/46425/what-is-feature-space
![Page 21: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/21.jpg)
Your Task
Given a data set of instances of size N, create a model that is fit from the data (built) by extracting features and dimensions. Then use that model to predict outcomes …
1. Data Wrangling (normalization, standardization, imputing)
2. Feature Analysis/Extraction3. Model Selection/Building4. Model Evaluation5. Operationalize Model
![Page 22: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/22.jpg)
A Tour of Machine Learning Algorithms
![Page 23: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/23.jpg)
Models: Instance Methods
Compare instances in data set with a similarity measure to find best matches. - Suffers from curse of dimensionality. - Focus on feature representation and
similarity metrics between instances
● k-Nearest Neighbors (kNN)● Self-Organizing Maps (SOM)● Learning Vector Quantization (LVQ)
![Page 24: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/24.jpg)
Models: Regression
Model relationship of independent variables, X to dependent variable Y by iteratively optimizing error made in predictions.
● Ordinary Least Squares● Logistic Regression● Stepwise Regression● Multivariate Adaptive Regression Splines (MARS)● Locally Estimated Scatterplot Smoothing (LOESS)
![Page 25: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/25.jpg)
Models: Regularization Methods
Extend another method (usually regression), penalizing complexity (minimize overfit)- simple, popular, powerful - better at generalization
● Ridge Regression● LASSO (Least Absolute Shrinkage & Selection Operator)● Elastic Net
![Page 26: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/26.jpg)
Models: Decision Trees
Model of decisions based on data attributes. Predictions are made by following forks in a tree structure until a decision is made. Used for classification & regression.
● Classification and Regression Tree (CART)● Decision Stump● Random Forest● Multivariate Adaptive Regression Splines (MARS)● Gradient Boosting Machines (GBM)
![Page 27: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/27.jpg)
Models: Bayesian
Explicitly apply Bayes’ Theorem for classification and regression tasks. Usually by fitting a probability function constructed via the chain rule and a naive simplification of Bayes.
● Naive Bayes● Averaged One-Dependence Estimators (AODE)● Bayesian Belief Network (BBN)
![Page 28: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/28.jpg)
Models: Kernel Methods
Map input data into higher dimensional vector space where the problem is easier to model. Named after the “kernel trick” which computes the inner product of images of pairs of data.
● Support Vector Machines (SVM)● Radial Basis Function (RBF)● Linear Discriminant Analysis (LDA)
![Page 29: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/29.jpg)
Models: Clustering Methods
Organize data into into groups whose members share maximum similarity (defined usually by a distance metric). Two main approaches: centroids and hierarchical clustering.
● k-Means● Affinity Propegation● OPTICS (Ordering Points to Identify Cluster Structure)● Agglomerative Clustering
![Page 30: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/30.jpg)
Models: Artificial Neural Networks
Inspired by biological neural networks, ANNs are nonlinear function approximators that estimate functions with a large number of inputs.- System of interconnected neurons that activate - Deep learning extends simple networks recursively
● Perceptron● Back-Propagation● Hopfield Network● Restricted Boltzmann Machine (RBM)● Deep Belief Networks (DBN)
![Page 31: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/31.jpg)
Models: Ensembles
Models composed of multiple weak models that are trained independently and whose outputs are combined to make an overall prediction.
● Boosting● Bootstrapped Aggregation (Bagging)● AdaBoost● Stacked Generalization (blending)● Gradient Boosting Machines (GBM)● Random Forest
![Page 32: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/32.jpg)
Models: Other
The list before was not comprehensive, other algorithm and model classes include:
● Conditional Random Fields (CRF)● Markovian Models (HMMs)● Dimensionality Reduction (PCA, PLS)● Rule Learning (Apriori, Brill)● More ...
![Page 33: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/33.jpg)
An Architecture for Operationalizing Machine Learning Algorithms
![Page 34: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/34.jpg)
Operational Phase
Build Phase
Architecture of Machine Learning Operations
Training Data
Labels
Feature Vectors
EstimationAlgorithm
New Data Feature Vector
Predictive Model Prediction
![Page 35: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/35.jpg)
Feedback!
![Page 36: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/36.jpg)
The Learning Part of Machine Learning
Model Building
Initial Fixtures
Service/API
Feedback
![Page 37: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/37.jpg)
Deploying Machine Learning as a Web Service
![Page 38: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/38.jpg)
Annotation Service Example
![Page 39: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/39.jpg)
Architecture Demohttps://github.com/DistrictDataLabs/product-classifier
![Page 40: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/40.jpg)
![Page 41: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/41.jpg)
What is Scikit-Learn?
Extensions to SciPy (Scientific Python) are called SciKits. SciKit-Learn provides machine learning algorithms. ● Algorithms for supervised & unsupervised learning● Built on SciPy and Numpy● Standard Python API interface● Sits on top of c libraries, LAPACK, LibSVM, and Cython● Open Source: BSD License (part of Linux)
Probably the best general ML framework out there.
![Page 42: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/42.jpg)
Where did it come from?
Started as a Google summer of code project in 2007 by David Cournapeau, then used as a thesis project by Matthieu Brucher.
In 2010, INRIA pushed the first public release, and sponsors the project, as do Google, Tinyclues, and the Python Software Foundation.
![Page 43: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/43.jpg)
Who uses Scikit-Learn?
![Page 44: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/44.jpg)
Primary Features
- Generalized Linear Models- SVMs, kNN, Bayes, Decision Trees, Ensembles- Clustering and Density algorithms- Cross Validation- Grid Search- Pipelining- Model Evaluations- Dataset Transformations- Dataset Loading
![Page 45: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/45.jpg)
A Guide to Scikit-Learn
![Page 46: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/46.jpg)
Object-oriented interface centered around the concept of an Estimator:
“An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data.”
- Scikit-Learn Tutorial
Scikit-Learn API
![Page 47: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/47.jpg)
The Scikit-Learn Estimator API
class Estimator(object):
def fit(self, X, y=None):
"""Fits estimator to data. """
# set state of ``self``
return self
def predict(self, X):
"""Predict response of ``X``. """
# compute predictions ``pred``
return pred
![Page 48: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/48.jpg)
- fit(X,y) sets the state of the estimator.- X is usually a 2D numpy array of shape (num_samples, num_features).
- y is a 1D array with shape (n_samples,)- predict(X) returns the class or value- predict_proba() returns a 2D array of
shape (n_samples, n_classes)
Estimators
![Page 49: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/49.jpg)
Basic methodology
from sklearn import svm
estimator = svm.SVC(gamma=0.001)
estimator.fit(X, y)
estimator.predict(x)
![Page 50: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/50.jpg)
We’ve already discussed a broad workflow, the following is a development workflow:
Wrapping fit and predict
Load & Transform Data
Raw Data Feature Extraction
Build Model Evaluate Model
Feature Evaluation
![Page 51: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/51.jpg)
Transformers
class Transformer(Estimator):
def transform(self, X):
"""Transforms the input data. """
# transform ``X`` to ``X_prime``
return X_prime
from sklearn import preprocessing
Xt = preprocessing.normalize(X) # Normalizer
Xt = preprocessing.scale(X) # StandardScaler
imputer =Imputer(missing_values='Nan',
strategy='mean')
Xt = imputer.fit_transform(X)
![Page 52: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/52.jpg)
Evaluation
![Page 53: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/53.jpg)
Underfitting
Not enough information to accurately model real life. Can be due to high bias, or just a too simplistic model.
Solution: Cross Validation
![Page 54: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/54.jpg)
OverfittingCreate a model with too many parameters or is too complex. “Memorization of the data” - and the model can’t generalize very well.
Solution: Benchmark Testing, Ridge Regression, Feature Analyses, Dimensionality Reduction
![Page 55: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/55.jpg)
Error: Bias vs Variance
Bias: the difference between expected (average) prediction of the model and the correct value.
Variance: how the predictions for a given point vary between different realizations for the model.
http://scott.fortmann-roe.com/docs/BiasVariance.html
![Page 56: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/56.jpg)
Bias vs. Variance Trade-Off
Related to model complexity:
The more parameters added to the model (the more complex), Bias is reduced, and variance increased.
Sources of complexity:- k (nearest neighbors)- epochs (neural nets)- # of features- learning rate
http://scott.fortmann-roe.com/docs/BiasVariance.html
![Page 57: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/57.jpg)
Assess how model will generalize to independent data set (e.g. data not in the training set).
1. Divide data into training and test splits2. Fit model on training, predict on test3. Determine accuracy, precision and recall4. Repeat k times with different splits then average as F1
Cross Validation (classification)
Predicted Class A Predicted Class B
Actual A True A False B #A
Actual B False A True B #B
#P(A) #P(B) total
![Page 58: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/58.jpg)
https://en.wikipedia.org/wiki/Precision_and_recall
accuracy = true positives + true negatives / total
precision = true positives / (true positives + false positives)
recall = true positives / (false negatives + true positives)
F1 score = 2 * ((precision * recall) / (precision + recall))
![Page 59: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/59.jpg)
Cross Validation in Scikit-Learn
from sklearn import metrics
from sklearn import cross_validation as cv
splits = cv.train_test_split(X, y, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = ClassifierEstimator()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print metrics.classification_report(expected, predicted)
print metrics.confusion_matrix(expected, predicted)
print metrics.f1_score(expected, predicted)
![Page 60: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/60.jpg)
MSE & Coefficient of Determination
In regressions we can determine how well the model fits by computing the mean square error and the coefficient of determination.
MSE = np.mean((predicted-expected)**2)
R2 is a predictor of “goodness of fit” and is a value ∈ [0,1] where 1 is perfect fit.
![Page 61: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/61.jpg)
K-Part Cross Validation
from sklearn import metrics
from sklearn import cross_validation as cv
splits = cv.train_test_split(X, y, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = RegressionEstimator()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(y_test)
print metrics.mean_squared_error(expected, predicted)
print metrics.r2_score(expected, predicted)
![Page 62: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/62.jpg)
How to evaluate clusters? Visualization (but only in 2D)
Other Evaluation
![Page 63: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/63.jpg)
Unstable Data
Randomness is a significant part of data in the real world but problems with data can significantly affect results:
- outliers- skew- missing information- incorrect data
Solution: seam testing/integration testing
![Page 64: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/64.jpg)
Unpredictable Future
Machine learning models attempt to predict the future as new inputs come in - but human systems and processes are subject to change.
Solution: Precision/Recall tracking over time
![Page 65: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/65.jpg)
Standardized Data Model Demo(Wheat Kernel Sizes)
![Page 66: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/66.jpg)
A Tour of Scikit-Learn
![Page 67: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/67.jpg)
Workshop
Select a data set from:https://archive.ics.uci.edu/ml/index.html
- Layout the data in our data model- Choose regression, classification, or clustering
and build the best model you can from it. - Report an evaluation of the model built- Visualize aspects of your model - Compare and contrast different algorithms
Submit your code via pull-request to repository!
![Page 68: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/68.jpg)
Advanced Scikit-Learn
![Page 69: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/69.jpg)
sklearn.pipeline.Pipeline(steps)
- Sequentially apply repeatable transformations to final estimator that can be validated at every step.
- Each step (except for the last) must implement Transformer, e.g. fit and transform methods.
- Pipeline itself implements both methods of Transformer and Estimator interfaces.
Pipelines
![Page 70: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/70.jpg)
The Scikit-Learn Transformer API
class Transformer(Estimator):
def transform(self, X):
"""Transforms the input data. """
# transform ``X`` to ``X_prime``
return X_prime
![Page 71: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/71.jpg)
The most common use for the Pipeline is to combine multiple feature extraction methodologies into a single, repeatable processing step.
- FeatureUnion- SelectKBest- TruncatedSVD- DictVectorizer
An example of a distance based ML pipeline:
https://github.com/mclumd/shaku/blob/master/Shaku.ipynb
Pipelined Feature Extraction
![Page 72: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/72.jpg)
Pipelined Model
>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.pipeline import make_pipeline
>>> model = make_pipeline(PolynomialFeatures(2), linear_model.
Ridge())
>>> model.fit(X_train, y_train)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=2,
include_bias=True, interaction_only=False)), ('ridge', Ridge
(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver='auto', tol=0.001))])
>>> mean_squared_error(y_test, model.predict(X_test))
3.1498887586451594
>>> model.score(X_test, y_test)
0.97090576345108104
![Page 73: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/73.jpg)
Grid Search
![Page 74: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/74.jpg)
- Prevent overfit/collinearity by penalizing the size of coefficients - minimize the penalized residual sum of squares:
- Said another way, shrink the coefficients to zero.
- Where > 0 is complexity parameter that controls shrinkage. The larger , the more robust the model to collinearity.
- Said another way, this is the bias/variance tradeoff: the larger the ridge alpha, the higher the bias and the lower the variance.
Ridge Regression
![Page 75: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/75.jpg)
Ridge Regression
>>> clf = linear_model.Ridge(alpha=0.5)
>>> clf.fit(X_train, y_train)
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver='auto', tol=0.001)
>>> print mean_squared_error(y_test, clf.predict(X_test))
8.34260312032
>>> clf.score(X_test, y_test)
0.92129741176557278
![Page 76: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/76.jpg)
We can search for the best parameter using the RidgeCV which is a form of Grid Search, but uses a more efficient form of leave-one-out cross-validation.
>>> import numpy as np
>>> n_alphas = 200
>>> alphas = np.logspace(-10, -2, n_alphas)
>>> clf = linear_model.RidgeCV(alphas=alphas)
>>> clf.fit(X_train, y_train)
>>> print clf.alpha_
0.0010843659686896108
>>> clf.score(X_test, y_test)
0.92542477512171173
Choosing alpha
![Page 77: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/77.jpg)
Error as a function of alpha
>>> clf = linear_model.Ridge(fit_intercept=False)
>>> errors = []
>>> for alpha in alphas:
... splits = tts(dataset.data, dataset.target('Y1'), test_size=0.2)
... X_train, X_test, y_train, y_test = splits
... clf.set_params(alpha=alpha)
... clf.fit(X_train, y_train)
... error = mean_squared_error(y_test, clf.predict(X_test))
... errors.append(error)
...
>>> axe = plt.gca()
>>> axe.plot(alphas, errors)
>>> plt.show()
![Page 78: Introduction to Machine Learning with SciKit-Learn](https://reader030.vdocuments.net/reader030/viewer/2022013108/55a525281a28abdf0e8b4622/html5/thumbnails/78.jpg)
Questions, Comments?