![Page 1: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/1.jpg)
1
Advanced Scikit-Learn
Andreas Mueller (NYU Center for Data Science, scikit-learn)
![Page 2: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/2.jpg)
2
Me
![Page 3: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/3.jpg)
3
Classification
Regression
Clustering
Semi-Supervised Learning
Feature Selection
Feature Extraction
Manifold Learning
Dimensionality Reduction
Kernel Approximation
Hyperparameter Optimization
Evaluation Metrics
Out-of-core learning
...
![Page 4: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/4.jpg)
4
![Page 5: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/5.jpg)
5
Overview
● Reminder: Basic sklearn concepts
● Model building and evaluation:
– Pipelines and Feature Unions
– Randomized Parameter Search
– Scoring Interface
● Out of Core learning
– Feature Hashing
– Kernel Approximation
● New stuff in 0.16.0
– Overview
– Calibration
![Page 8: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/8.jpg)
8
Supervised Machine Learning
Training Data
Training Labels
Model
Test Data
Prediction
Test Labels
Evaluation
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
clf.score(X_test, y_test)
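A minimal runnable sketch of the fit / predict / score cycle from this slide. The iris dataset, the split, and the `n_estimators` and `random_state` values are assumptions for illustration, not part of the talk; `train_test_split` is imported from `sklearn.model_selection`, where newer releases keep it.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed toy data: the slides do not say which dataset was used.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)             # learn from training data and labels
y_pred = clf.predict(X_test)          # predict labels for unseen data
accuracy = clf.score(X_test, y_test)  # mean accuracy against the test labels
print(accuracy)
```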
![Page 10: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/10.jpg)
10
pca = PCA(n_components=3)
pca.fit(X_train)
X_new = pca.transform(X_test)
Training Data
Test Data
Model
Transformation
Unsupervised Transformations
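The same pattern as a runnable sketch: the transformer is fitted on training data only and then applied to test data. The iris data and the split are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Assumed toy data; labels are not needed for an unsupervised transformation.
X, _ = load_iris(return_X_y=True)
X_train, X_test = train_test_split(X, random_state=0)

pca = PCA(n_components=3)
pca.fit(X_train)               # estimate the components from training data
X_new = pca.transform(X_test)  # project test data onto those components
print(X_new.shape)
```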
![Page 11: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/11.jpg)
11
Basic API
estimator.fit(X, [y])
estimator.predict: Classification, Regression, Clustering
estimator.transform: Preprocessing, Dimensionality reduction, Feature selection, Feature extraction
![Page 14: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/14.jpg)
14
Cross-Validation
from sklearn.cross_validation import cross_val_score, ShuffleSplit, LeaveOneLabelOut
from sklearn.svm import SVC
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)
>> [ 0.92 1. 1. 1. 1. ]
cv_ss = ShuffleSplit(len(X), test_size=.3, n_iter=10)
scores_shuffle_split = cross_val_score(SVC(), X, y, cv=cv_ss)
cv_labels = LeaveOneLabelOut(labels)
scores_pout = cross_val_score(SVC(), X, y, cv=cv_labels)
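The slides target scikit-learn 0.16. In later releases the `sklearn.cross_validation` module was replaced by `sklearn.model_selection`, `ShuffleSplit` takes `n_splits` instead of a sample count and `n_iter`, and `LeaveOneLabelOut` became `LeaveOneGroupOut`. A sketch of the same three strategies against that newer API, assuming the iris data and an arbitrary, hypothetical grouping:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneGroupOut, ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # assumed toy data

# Plain 5-fold cross-validation
scores = cross_val_score(SVC(), X, y, cv=5)

# 10 random 70/30 splits
cv_ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores_shuffle_split = cross_val_score(SVC(), X, y, cv=cv_ss)

# Hypothetical groups: each sample is assigned to one of three groups,
# and every split holds out one whole group.
groups = np.arange(len(X)) % 3
scores_groups = cross_val_score(SVC(), X, y, cv=LeaveOneGroupOut(), groups=groups)
print(scores, scores_shuffle_split.mean(), scores_groups)
```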
![Page 15: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/15.jpg)
15
Cross-Validated Grid Search
![Page 16: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/16.jpg)
16
Cross-Validated Grid Search
import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(X, y)
param_grid = {'C': 10. ** np.arange(-3, 3), 'gamma': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(SVC(), param_grid=param_grid)
grid.fit(X_train, y_train)
grid.predict(X_test)
grid.score(X_test, y_test)
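A runnable sketch of the same search. It assumes a recent scikit-learn, where `GridSearchCV` and `train_test_split` live in `sklearn.model_selection`, and uses the iris data as a stand-in for `X, y`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # assumed toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same logarithmic grid as on the slide: 6 values of C times 6 values of gamma
param_grid = {'C': 10. ** np.arange(-3, 3), 'gamma': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)  # cross-validates every combination, then refits the best

test_score = grid.score(X_test, y_test)
print(grid.best_params_, test_score)
```

The held-out test set is only touched once, after the search has picked its parameters on the training folds.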
![Page 21: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/21.jpg)
21
Training Data
Training Labels
Model
Feature Extraction
Scaling
Feature Selection
Cross Validation
![Page 22: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/22.jpg)
22
Pipelines
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)
pipe.predict(X_test)
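A self-contained sketch of this pipeline, assuming the iris data. It also prints the step names that `make_pipeline` generates from the lowercased class names; those names are what prefixes such as `svc__C` refer to when parameters are searched.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # assumed toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)      # fits the scaler, then the SVC on scaled data
y_pred = pipe.predict(X_test)   # scales X_test with the training statistics

step_names = [name for name, _ in pipe.steps]
print(step_names)
```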
![Page 23: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/23.jpg)
23
Combining Pipelines and Grid Search
Proper cross-validation
param_grid = {'svc__C': 10. ** np.arange(-3, 3), 'svc__gamma': 10. ** np.arange(-3, 3)}
scaler_pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(scaler_pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
![Page 24: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/24.jpg)
24
Combining Pipelines and Grid Search II
Searching over parameters of the preprocessing step
param_grid = {'selectkbest__k': [1, 2, 3, 4], 'svc__C': 10. ** np.arange(-3, 3), 'svc__gamma': 10. ** np.arange(-3, 3)}
select_pipe = make_pipeline(SelectKBest(), SVC())
grid = GridSearchCV(select_pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
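A runnable sketch of searching over a preprocessing parameter and the classifier parameters together. The iris data (4 features, so `k` ranges over 1 to 4) and the `sklearn.model_selection` import path of newer releases are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # assumed toy data with 4 features

param_grid = {'selectkbest__k': [1, 2, 3, 4],
              'svc__C': 10. ** np.arange(-3, 3),
              'svc__gamma': 10. ** np.arange(-3, 3)}
pipe = make_pipeline(SelectKBest(), SVC())

# Feature selection is refitted inside every training fold,
# so the held-out fold never influences which features are kept.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```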
![Page 25: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/25.jpg)
25
Feature Union
Training Data
Training Labels
Model
Feature Extraction I
Feature Extraction II
![Page 27: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/27.jpg)
27
Feature Union
char_and_word = make_union(CountVectorizer(analyzer="char"), CountVectorizer(analyzer="word"))
text_pipe = make_pipeline(char_and_word, LinearSVC(dual=False))
param_grid = {'linearsvc__C': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(text_pipe, param_grid=param_grid, cv=5)
param_grid2 = {'featureunion__countvectorizer-1__ngram_range': [(1, 3), (1, 5), (2, 5)],
               'featureunion__countvectorizer-2__ngram_range': [(1, 1), (1, 2), (2, 2)],
               'linearsvc__C': 10. ** np.arange(-3, 3)}
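A small sketch showing where the `countvectorizer-1` and `countvectorizer-2` names come from: `make_union` numbers transformers that share a class name. The four-document corpus and its labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, only to make the example runnable.
docs = ["the cat sat on the mat", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings"]
labels = [0, 0, 1, 1]

char_and_word = make_union(CountVectorizer(analyzer="char"),
                           CountVectorizer(analyzer="word"))
X_combined = char_and_word.fit_transform(docs)  # char and word counts side by side

union_names = [name for name, _ in char_and_word.transformer_list]
print(union_names, X_combined.shape)

text_pipe = make_pipeline(char_and_word, LinearSVC(dual=False))
text_pipe.fit(docs, labels)
predictions = text_pipe.predict(docs)
```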
![Page 30: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/30.jpg)
30
Randomized Parameter Search
Source: Bergstra and Bengio
Step-size free for continuous parameters
Decouples runtime from search-space size
Robust against irrelevant parameters
![Page 31: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/31.jpg)
31
Randomized Parameter Search
params = {'featureunion__countvectorizer-1__ngram_range':[(1, 3), (1, 5), (2, 5)],
'featureunion__countvectorizer-2__ngram_range': [(1, 1), (1, 2), (2, 2)], 'linearsvc__C': 10. ** np.arange(-3, 3)}
![Page 33: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/33.jpg)
33
Randomized Parameter Search
from scipy.stats import expon
from sklearn.grid_search import RandomizedSearchCV
params = {'featureunion__countvectorizer-1__ngram_range': [(1, 3), (1, 5), (2, 5)],
          'featureunion__countvectorizer-2__ngram_range': [(1, 1), (1, 2), (2, 2)],
          'linearsvc__C': expon()}
rs = RandomizedSearchCV(text_pipe, param_distributions=params, n_iter=50)
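A runnable sketch of randomized search with continuous distributions. To stay self-contained it swaps the text pipeline for a scaler plus SVC on the iris data; the `scale` values of the exponential distributions are arbitrary assumptions, and the import path is that of newer releases.

```python
from scipy.stats import expon
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # assumed toy data
pipe = make_pipeline(StandardScaler(), SVC())

# Any object with an rvs() method can serve as a distribution;
# plain lists are sampled uniformly.
param_distributions = {'svc__C': expon(scale=10),
                       'svc__gamma': expon(scale=0.1)}

# n_iter fixes the budget regardless of how large the search space is.
rs = RandomizedSearchCV(pipe, param_distributions=param_distributions,
                        n_iter=20, cv=5, random_state=0)
rs.fit(X, y)
print(rs.best_params_, rs.best_score_)
```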
![Page 34: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/34.jpg)
34
Randomized Parameter Search
● Always use distributions for continuous variables.
● Don't use for low dimensional spaces.
● Future: Bayesian optimization based search.
![Page 35: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/35.jpg)
35
Generalized Cross-Validation and Path Algorithms
![Page 40: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/40.jpg)
40
rfe = RFE(LogisticRegression())
param_grid = {'n_features_to_select': range(1, n_features)}
gridsearch = GridSearchCV(rfe, param_grid)
gridsearch.fit(X, y)
rfecv = RFECV(LogisticRegression())
rfecv.fit(X, y)
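A minimal sketch of the `RFECV` variant, which picks the number of features by cross-validation in a single elimination path instead of rerunning RFE for every grid point. The iris data, `cv=5`, and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # assumed toy data

rfecv = RFECV(LogisticRegression(max_iter=1000), cv=5)
rfecv.fit(X, y)

# n_features_ is the selected feature count; support_ is the boolean mask.
print(rfecv.n_features_, rfecv.support_)
```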
![Page 41: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/41.jpg)
41
![Page 42: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/42.jpg)
42
Linear Models: LogisticRegressionCV [new], RidgeCV, RidgeClassifierCV, LarsCV, ElasticNetCV, ...
Feature Selection: RFECV
Tree-Based models [possible]: [DecisionTreeCV], [RandomForestClassifierCV], [GradientBoostingClassifierCV]
![Page 43: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/43.jpg)
43
Scoring Functions
![Page 44: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/44.jpg)
44
For GridSearchCV, RandomizedSearchCV, cross_val_score, validation_curve, ...CV
Typical: accuracy
Default: Accuracy (classification), R2 (regression)
![Page 48: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/48.jpg)
48
Scoring with imbalanced data
cross_val_score(SVC(), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
cross_val_score(DummyClassifier(strategy="most_frequent"), X_train, y_train)
>>> array([ 0.9, 0.9, 0.9])
cross_val_score(SVC(), X_train, y_train, scoring="roc_auc")
>>> array([ 0.99961591, 0.99983498, 0.99966247])
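A runnable sketch of why accuracy misleads on imbalanced data. The synthetic 90/10 dataset from `make_classification` is an assumption, so the numbers will differ from the slide. A classifier that always predicts the majority class reaches about 90% accuracy but only 0.5 ROC AUC.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Assumed synthetic data with roughly 90% of samples in class 0.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

dummy = DummyClassifier(strategy="most_frequent")
dummy_acc = cross_val_score(dummy, X, y, cv=3)                     # looks good
dummy_auc = cross_val_score(dummy, X, y, cv=3, scoring="roc_auc")  # chance level
svc_auc = cross_val_score(SVC(), X, y, cv=3, scoring="roc_auc")

print(dummy_acc, dummy_auc, svc_auc)
```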
![Page 49: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/49.jpg)
49
Available metrics
print(SCORERS.keys())
>> ['adjusted_rand_score', 'f1', 'mean_absolute_error', 'r2', 'recall', 'median_absolute_error', 'precision', 'log_loss', 'mean_squared_error', 'roc_auc', 'average_precision', 'accuracy']
![Page 50: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/50.jpg)
50
Defining your own scoring
def my_super_scoring(est, X, y):
    return accuracy_scorer(est, X, y) - np.sum(est.coef_ != 0)
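A self-contained sketch of passing such a callable as `scoring`. Any function with the signature `(estimator, X, y)` that returns a number works. Here `accuracy_score` on the predictions stands in for the slide's `accuracy_scorer`; the function name, the iris data, and the estimator settings are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score


def sparsity_penalized_accuracy(est, X, y):
    """Accuracy minus the number of nonzero coefficients (hypothetical name)."""
    return accuracy_score(y, est.predict(X)) - np.sum(est.coef_ != 0)


X, y = load_iris(return_X_y=True)  # assumed toy data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=3, scoring=sparsity_penalized_accuracy)
print(scores)
```

Higher is still better, so a model with fewer nonzero coefficients wins when accuracy is similar.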
![Page 51: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/51.jpg)
51
Out of Core Learning
![Page 52: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/52.jpg)
52
Or: save ourselves the effort
![Page 53: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/53.jpg)
53
Think twice!
● Old laptop: 4 GB RAM
● 1073741824 float32
● Or 1 million data points with 1000 features
● EC2: 256 GB RAM
● 68719476736 float32
● Or 68 million data points with 1000 features
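The arithmetic behind these figures, as a short sketch. It assumes 1 GB means 2**30 bytes and ignores any memory used by the operating system or by copies made during training; the function name is illustrative.

```python
BYTES_PER_FLOAT32 = 4


def float32_capacity(ram_gib):
    """Number of float32 values that fit into ram_gib GiB of memory."""
    return ram_gib * 2**30 // BYTES_PER_FLOAT32


laptop_values = float32_capacity(4)    # 4 GiB laptop
ec2_values = float32_capacity(256)     # 256 GiB EC2 instance

# Rows of a dense matrix with 1000 float32 features per row
laptop_rows = laptop_values // 1000
ec2_rows = ec2_values // 1000
print(laptop_values, laptop_rows, ec2_values, ec2_rows)
```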
![Page 54: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/54.jpg)
54
Supported Algorithms
● All SGDClassifier derivatives
● Naive Bayes
● MiniBatchKMeans
● IncrementalPCA
● MiniBatchDictionaryLearning
![Page 55: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/55.jpg)
55
Out of Core Learning
import pickle
sgd = SGDClassifier()
for i in range(9):
    with open("batch_%02d" % i, "rb") as f:
        X_batch, y_batch = pickle.load(f)
    sgd.partial_fit(X_batch, y_batch, classes=range(10))
Possibly go over the data multiple times.
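A runnable sketch of the same loop, including several passes over the data. In-memory slices of the digits dataset stand in for the pickled batch files; that substitution, the pixel scaling, and the number of passes are assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0                      # scale pixel values to [0, 1]
classes = np.unique(y)            # all labels must be declared up front

sgd = SGDClassifier(random_state=0)
for epoch in range(3):            # go over the data multiple times
    # Nine slices play the role of nine batch files loaded one at a time.
    for X_batch, y_batch in zip(np.array_split(X, 9), np.array_split(y, 9)):
        sgd.partial_fit(X_batch, y_batch, classes=classes)

train_accuracy = sgd.score(X, y)
print(train_accuracy)
```

`classes` is required because a single batch may not contain every label.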
![Page 56: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/56.jpg)
56
Stateless Transformers
● Normalizer
● HashingVectorizer
● RBFSampler (and other kernel approx)
![Page 57: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/57.jpg)
57
Text data and the hashing trick
![Page 60: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/60.jpg)
60
Bag Of Word Representations
CountVectorizer / TfidfVectorizer
“You better call Kenny Loggins”
tokenizer
['you', 'better', 'call', 'kenny', 'loggins']
Sparse matrix encoding
[0, …, 0, 1, 0, … , 0, 1 , 0, …, 0, 1, 0, …., 0 ]
(one column per vocabulary word, from 'aardvark' to 'zyxst'; the ones shown sit in the 'better', 'call' and 'you' columns)
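The same steps as a runnable sketch with `CountVectorizer`. With a single document the learned vocabulary has only five entries, whereas a real corpus gives the long, mostly zero vector drawn on the slide.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["You better call Kenny Loggins"])

# The vocabulary maps each lowercased token to a column index.
tokens = sorted(vectorizer.vocabulary_)
print(tokens)
print(issparse(X), X.shape)
```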
![Page 61: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/61.jpg)
61
Hashing Trick
HashingVectorizer
“You better call Kenny Loggins”
tokenizer
['you', 'better', 'call', 'kenny', 'loggins']
hashing
[hash('you'), hash('better'), hash('call'), hash('kenny'), hash('loggins')] = [832412, 223788, 366226, 81185, 835749]
Sparse matrix encoding
[0, …, 0, 1, 0, … , 0, 1 , 0, …, 0, 1, 0, ... 0 ]
![Page 62: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/62.jpg)
62
Out of Core Text Classification
import pickle
sgd = SGDClassifier()
hashing_vectorizer = HashingVectorizer()
for i in range(9):
    with open("text_%02d" % i, "rb") as f:
        text_batch, y_batch = pickle.load(f)
    X_batch = hashing_vectorizer.transform(text_batch)
    sgd.partial_fit(X_batch, y_batch, classes=range(10))
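A runnable sketch of this loop. Because `HashingVectorizer` is stateless, `transform` can be called on each batch without a prior `fit`. The two tiny in-memory batches, the binary labels, and `n_features=2**10` are hypothetical stand-ins for the pickled text files.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy batches standing in for files loaded from disk.
batches = [
    (["good great fine", "bad awful poor"], [1, 0]),
    (["great good nice", "poor bad terrible"], [1, 0]),
]

hashing_vectorizer = HashingVectorizer(n_features=2**10)
sgd = SGDClassifier(random_state=0)

for epoch in range(5):
    for text_batch, y_batch in batches:
        # No vocabulary is stored: each token is hashed to a column index.
        X_batch = hashing_vectorizer.transform(text_batch)
        sgd.partial_fit(X_batch, y_batch, classes=[0, 1])

predictions = sgd.predict(hashing_vectorizer.transform(["good nice", "awful"]))
print(X_batch.shape, predictions)
```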
![Page 63: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/63.jpg)
63
Kernel Approximations
![Page 67: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/67.jpg)
67
Reminder: Kernel Trick
Classifier linear → need only inner products
Linear:
Polynomial:
RBF:
Sigmoid:
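For reference, the usual definitions of these four kernels, written in the parameterization scikit-learn's `SVC` uses, where $\gamma$, $r$ and $d$ correspond to `gamma`, `coef0` and `degree`. The notation is an assumption and may differ from the slide. In each case $k(x, x')$ plays the role of the inner product $\langle \phi(x), \phi(x') \rangle$ in some feature space.

```latex
\begin{aligned}
\text{Linear:}\quad     & k(x, x') = \langle x, x' \rangle \\
\text{Polynomial:}\quad & k(x, x') = \left(\gamma \langle x, x' \rangle + r\right)^{d} \\
\text{RBF:}\quad        & k(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right) \\
\text{Sigmoid:}\quad    & k(x, x') = \tanh\left(\gamma \langle x, x' \rangle + r\right)
\end{aligned}
```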
![Page 68: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/68.jpg)
68
Complexity
● Solving kernelized SVM: ~O(n_samples ** 3)
● Solving linear (primal) SVM: ~O(n_samples * n_features)
n_samples large? Go primal!
![Page 69: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/69.jpg)
69
Undoing the Kernel Trick
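● Kernel approximation: an explicit feature map \hat{\phi} with k(x, x') ≈ \hat{\phi}(x)^T \hat{\phi}(x')
● k = RBF kernel: \hat{\phi} = RBFSampler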
![Page 70: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/70.jpg)
70
Usage
import pickle
sgd = SGDClassifier()
kernel_approximation = RBFSampler(gamma=.001, n_components=400)
for i in range(9):
    with open("batch_%02d" % i, "rb") as f:
        X_batch, y_batch = pickle.load(f)
    if i == 0:
        kernel_approximation.fit(X_batch)
    X_transformed = kernel_approximation.transform(X_batch)
    sgd.partial_fit(X_transformed, y_batch, classes=range(10))
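A runnable sketch of this usage, with in-memory slices of the digits dataset standing in for the pickled batches (an assumption). With a numeric `gamma`, fitting `RBFSampler` only draws the random projection for the given number of input features, which is why fitting on the first batch is enough.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)  # assumed toy data, 64 features

kernel_approximation = RBFSampler(gamma=.001, n_components=400, random_state=0)
sgd = SGDClassifier(random_state=0)

batches = zip(np.array_split(X, 9), np.array_split(y, 9))
for i, (X_batch, y_batch) in enumerate(batches):
    if i == 0:
        kernel_approximation.fit(X_batch)  # draw the random features once
    # Map each batch into the 400-dimensional approximate RBF feature space
    X_transformed = kernel_approximation.transform(X_batch)
    sgd.partial_fit(X_transformed, y_batch, classes=np.arange(10))

print(X_transformed.shape, sgd.coef_.shape)
```

The linear model is trained on the transformed features, so the cost stays linear in the number of samples.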
![Page 72: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/72.jpg)
72
Highlights from 0.16.0
● Multinomial Logistic Regression, LogisticRegressionCV.
● IncrementalPCA.
● Probability calibration of classifiers.
● Birch clustering.
● LSHForest.
● More robust integration with pandas.
![Page 73: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/73.jpg)
73
Probability Calibration
SVC().decision_function() → CalibratedClassifierCV(SVC()).predict_proba()
RandomForestClassifier().predict_proba() → CalibratedClassifierCV(RandomForestClassifier()).predict_proba()
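A minimal sketch of the first case: wrapping an `SVC`, which by default exposes only `decision_function`, so that it yields probabilities. The synthetic data, the split, and `cv=3` are assumptions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumed synthetic binary classification data.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The wrapper fits the SVC on cross-validation folds and learns a mapping
# from its decision values to probabilities on the held-out parts.
calibrated = CalibratedClassifierCV(SVC(), cv=3)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)
print(proba[:3])
```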
![Page 74: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/74.jpg)
74
![Page 75: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/75.jpg)
75
CDS is hiring Research Engineers
![Page 77: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/77.jpg)
77
Bias Variance Tradeoff
(why we do cross validation and grid searches)
![Page 80: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/80.jpg)
80
Overfitting and Underfitting
Model complexity
Accuracy
Training
Generalization
Underfitting Overfitting
Sweet spot
![Page 81: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/81.jpg)
81
Linear SVM
![Page 82: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/82.jpg)
82
Linear SVM
![Page 83: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/83.jpg)
83
(RBF) Kernel SVM
![Page 84: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/84.jpg)
84
(RBF) Kernel SVM
![Page 85: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/85.jpg)
85
(RBF) Kernel SVM
![Page 86: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/86.jpg)
86
(RBF) Kernel SVM
![Page 87: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/87.jpg)
87
Decision Trees
![Page 88: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/88.jpg)
88
Decision Trees
![Page 89: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/89.jpg)
89
Decision Trees
![Page 90: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/90.jpg)
90
Decision Trees
![Page 91: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/91.jpg)
91
Decision Trees
![Page 92: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/92.jpg)
92
Decision Trees
![Page 93: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/93.jpg)
93
Random Forests
![Page 94: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/94.jpg)
94
Random Forests
![Page 95: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/95.jpg)
95
Random Forests
![Page 96: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/96.jpg)
96
Know where you are on the bias-variance tradeoff
![Page 97: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/97.jpg)
97
Validation Curves
train_scores, test_scores = validation_curve(SVC(), X, y, param_name="gamma", param_range=param_range)
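A runnable sketch of this call, assuming the iris data, an arbitrary logarithmic `param_range`, and the `sklearn.model_selection` import path of newer releases. Each row of the returned arrays belongs to one parameter value, each column to one cross-validation fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)      # assumed toy data
param_range = np.logspace(-3, 2, 6)    # assumed gamma values to compare

train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=3)

# A large gap between the two means indicates overfitting at that gamma.
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```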
![Page 98: Nyc open-data-2015-andvanced-sklearn-expanded](https://reader034.vdocuments.net/reader034/viewer/2022042701/55a706891a28ab58348b4869/html5/thumbnails/98.jpg)
98
Learning Curves
train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, train_sizes=train_sizes)
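A runnable sketch of `learning_curve`, assuming the iris data and five relative training-set sizes. `shuffle=True` is set because iris is sorted by class, so unshuffled prefixes of the training folds could contain a single class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)       # assumed toy data
train_sizes = np.linspace(0.2, 1.0, 5)  # fractions of each training fold

train_sizes_abs, train_scores, test_scores = learning_curve(
    SVC(), X, y, train_sizes=train_sizes, cv=5,
    shuffle=True, random_state=0)

# One row per training-set size, one column per cross-validation fold.
print(train_sizes_abs)
print(test_scores.mean(axis=1))
```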