Image Classification and Retrieval on Spark
SPARK MBUTO
Design & Engineering Machine Learning Pipelines
Gianvito Siciliano
Use Case: Image Classification and Retrieval
OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
OUTLINE
1. Spark ‘Mbuto intro
• Abstractions
• Basic Examples
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
SPARK MBUTO
• A Spark proof of concept for easily creating, running and testing pipelines and workflows
• A pipeline is a sequence of steps in a SparkJobApp
• Each step is a SparkJob
• All jobs share the same Spark/SQL context
• The JobRunner runs the jobs consecutively
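The relationship between the three abstractions can be sketched in a few lines. This is a plain-Python analogy, not the Spark ‘Mbuto API: the class names mirror the slide, but the `dict` standing in for the shared Spark/SQL context and the two example jobs (`LoadNumbers`, `SumNumbers`) are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class SparkJob(ABC):
    """One pipeline step. `context` stands in for the shared Spark/SQL context."""

    @abstractmethod
    def execute(self, context: dict) -> None:
        ...


class LoadNumbers(SparkJob):
    # Hypothetical first step: puts some data into the shared context.
    def execute(self, context: dict) -> None:
        context["numbers"] = [1, 2, 3, 4]


class SumNumbers(SparkJob):
    # Hypothetical second step: reads what the previous job left behind.
    def execute(self, context: dict) -> None:
        context["total"] = sum(context["numbers"])


class JobRunner:
    """Runs jobs one after another against the same context."""

    def __init__(self, context: dict):
        self.context = context

    def run(self, jobs: list) -> dict:
        for job in jobs:
            job.execute(self.context)
        return self.context


def main() -> dict:
    # Plays the role of SparkJobApp: declare the steps, hand them to the runner.
    return JobRunner(context={}).run([LoadNumbers(), SumNumbers()])


if __name__ == "__main__":
    print(main())
```

Because every job only sees the context, a step can be tested on its own by calling `execute` with a hand-built context.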
SPARKJOB
JOBRUNNER
SPARKJOBAPP
PIPELINE
App.main → JobRunner.run → Job.execute → (next job) → Job.execute
JOB READY TO USE
READABLE APP
PERFORMANCE LOOKUP
OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
• Classification
• Retrieval
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
IMAGE CLASSIFICATION
• Multiclass image classification:
1. Choose model (NN, SVM, TREE…)
2. Train/test model (with labeled images)
3. Predict the label of new images
4. Tune the model
IMAGE RETRIEVAL
• Multiclass image classification:
1. Choose metric (Euclidean, cosine…)
2. Build dictionary
3. Train/test the model
4. Query and search
5. Tune the model
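The choice of metric in step 1 changes which images count as "similar". The sketch below ranks a query against a tiny index with both metrics named on the slide. The image names and two-dimensional vectors are made up for illustration, and `search` is a hypothetical brute-force helper, not part of any library.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search(query, index, metric, top=1):
    """Return the names of the `top` indexed vectors closest to the query."""
    ranked = sorted(index, key=lambda name: metric(query, index[name]))
    return ranked[:top]


# Hypothetical image descriptors (name -> feature vector).
index = {
    "small_cat": [1.0, 1.0],
    "big_cat": [10.0, 10.0],
    "dog": [3.0, 0.0],
}
query = [2.5, 2.5]

top_euclidean = search(query, index, euclidean, top=2)
top_cosine = search(query, index, cosine_distance, top=2)
```

Euclidean distance ranks `dog` ahead of `big_cat` because it is closer in absolute terms, while cosine distance ignores vector length and puts both cats first.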
WHAT CHANGES?
• Pipelines architecture
• Classification logic
• How to update the model?
CLASSIFICATION PIPELINE
DATA → TRAIN CLASSIFIER → MODEL
NEW DATA → MODEL → PREDICTION
RETRIEVAL PIPELINE
DATA → TRAIN CLASSIFIER → MODEL
QUERY → MODEL → PREDICTION
OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
CLASSIFICATION & RETRIEVAL
• Extract keypoints from each image
• Cluster the keypoints universe
• Represent each image as a weighted cluster vector
• Train & test the model
• Query the model (find the most similar images)
Features Engineering
Build the Dictionary
Build the classifier
Query the model
C. & R. JOBS
• Load the whole dataset
• Extract keypoints
• Reduce the keypoints universe
• Transform the features space
• Create the dictionary (aka Codebook)
• Train, test & evaluate the classifier
• Query and get prediction
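The middle of this job list (extract keypoints, cluster them, describe each image by its clusters) can be sketched without Spark. The example assumes the keypoint descriptors already exist and uses made-up two-dimensional points in place of real SIFT descriptors. The `kmeans` here is a plain Lloyd's loop seeded with the first k points, chosen only to keep the run repeatable. The reduction and feature-space transformation steps are left out.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm, seeded with the first k points for repeatability."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(axis) / len(group) for axis in zip(*group)]
    return centroids


def histogram(descriptors, centroids):
    """Count how many of an image's keypoints fall into each cluster."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest(d, centroids)] += 1
    return counts


# Made-up 2-D "descriptors"; real SIFT descriptors have 128 dimensions.
images = {
    "img_a": [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1)],
    "img_b": [(5.2, 4.9), (4.8, 5.0), (0.1, 0.1)],
}
universe = [d for descriptors in images.values() for d in descriptors]
clusters = kmeans(universe, k=2)
vectors = {name: histogram(d, clusters) for name, d in images.items()}
```

Each image ends up as a fixed-length vector of cluster counts, whatever number of keypoints it started with, which is what lets a standard classifier consume it.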
DATA → TRAIN CLASSIFIER → MODEL → PREDICTION
DICTIONARY
Stages: Features Engineering, Build the Dictionary
Components: KMeans CLASSIFIER, Image LOADER, Sift EXTRACTOR, KMeans QUANTISER, CLUSTERS, CfIif TRANSFORMER, ClusterVector PIVOTER, CODEBOOK
Legend: TRANSFORMER (.transform), ESTIMATOR (.fit)
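The slides do not define the CfIif weighting. A plausible reading, and only an assumption here, is a TF-IDF analogue: cluster frequency within an image multiplied by the inverse image frequency of that cluster. Under that assumption, the weighted cluster vectors could be computed roughly as follows. The function name `cf_iif` and the input histograms are hypothetical.

```python
import math


def cf_iif(histograms):
    """Weight cluster counts by frequency within the image and rarity across images.

    Assumed formula (TF-IDF analogue): weight = (count / total) * log(N / n_c),
    where N is the number of images and n_c the number of images using cluster c.
    """
    n_images = len(histograms)
    n_clusters = len(next(iter(histograms.values())))
    images_with = [
        sum(1 for h in histograms.values() if h[c] > 0) for c in range(n_clusters)
    ]
    weighted = {}
    for name, h in histograms.items():
        total = sum(h)
        weighted[name] = [
            (h[c] / total) * math.log(n_images / images_with[c]) if h[c] else 0.0
            for c in range(n_clusters)
        ]
    return weighted


# Hypothetical cluster histograms: cluster 0 occurs in every image,
# so it carries no discriminative weight; clusters 1 and 2 do.
histograms = {
    "img_a": [2, 2, 0],
    "img_b": [3, 0, 1],
}
codebook_vectors = cf_iif(histograms)
```

With this weighting a cluster that appears in every image contributes zero, so the vectors emphasise the visual words that actually tell images apart.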
CLASSIFIER
Stages: Train classifier, Evaluate classifier
Components: Vector ASSEMBLER, Label INDEXER, KNN CLASSIFIER, KMeans CLASSIFIER, TRAIN/TEST (.split), EVALUATOR, IN-SAMPLE PREDICTION, OUT-OF-SAMPLE PREDICTION
Legend: TRANSFORMER (.transform), ESTIMATOR (.fit)
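The TRANSFORMER / ESTIMATOR legend follows the usual pipeline convention: an estimator learns something from data in `.fit` and returns a transformer, whose `.transform` then applies it. The sketch below imitates that contract in plain Python with a label indexer and a train/test split. The classes are simplified stand-ins written for this example, not Spark ML's own implementations.

```python
from collections import Counter


class LabelIndexerModel:
    """Transformer: maps string labels to numeric indices."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        return [{**row, "label_index": self.index[row["label"]]} for row in rows]


class LabelIndexer:
    """Estimator: learns the label -> index mapping from data."""

    def fit(self, rows):
        counts = Counter(row["label"] for row in rows)
        # Most frequent label gets index 0; ties are broken alphabetically.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        return LabelIndexerModel({label: i for i, label in enumerate(ordered)})


def split(rows, train_fraction=0.75):
    """Deterministic train/test split that keeps the original order."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical labelled images.
rows = [
    {"image": "a.jpg", "label": "cat"},
    {"image": "b.jpg", "label": "dog"},
    {"image": "c.jpg", "label": "cat"},
    {"image": "d.jpg", "label": "dog"},
]
train, test = split(rows)
model = LabelIndexer().fit(train)     # estimator -> transformer
indexed_test = model.transform(test)  # transformer applied to unseen rows
```

Fitting on the training rows only and then transforming the test rows keeps the out-of-sample prediction honest. In this sketch a label that never occurred in training would raise a `KeyError`.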
KNN IMPLEMENTATION
• kNN is a comparison-based model: the similarity metric is crucial!
• Nearest-neighbour search (in the codebook) is the pain point:
• KDTree: not parallel (although…)
• LSH: hyperparameters are difficult to tune
• Metric tree: splits the feature points into disjoint regions
• Spill tree: too many shared points
=> Hybrid Tree
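As a baseline for what the tree structures are meant to speed up, a brute-force kNN classifier fits in a few lines. The metric is a parameter, which makes the first bullet concrete. The training vectors and labels below are invented for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, training, k=3, metric=euclidean):
    """Label `query` by majority vote among its k nearest training vectors."""
    # Brute force: every query is compared with every training vector.
    neighbours = sorted(training, key=lambda item: metric(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical codebook vectors with their labels.
training = [
    ([0.9, 0.1], "cat"),
    ([0.8, 0.2], "cat"),
    ([0.7, 0.4], "cat"),
    ([0.1, 0.9], "dog"),
    ([0.2, 0.8], "dog"),
]

prediction = knn_predict([0.75, 0.25], training, k=3)
```

The full scan costs one distance computation per training vector for every query, which is why the search structure matters once the codebook grows.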
HYBRID TREE
• The TopTree is a metric tree
• The SubLeaf trees are spill trees, trained in parallel
• Nodes can be:
• OVERLAP => defeatist search
• NON OVERLAP => backtracking
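The two node types can be illustrated with a small single-process sketch. It is not the deck's implementation: the split rule (a hyperplane halfway between two far-apart pivots), the overlap width `tau`, and the `max_overlap` threshold that decides when a node falls back to a non-overlapping split are all assumptions, and nothing here is trained in parallel.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def margin(point, node):
    """Signed distance from `point` to the node's splitting hyperplane."""
    t = sum((x - ax) * ux for x, ax, ux in zip(point, node["a"], node["u"]))
    return t - node["mid"]


def build(points, tau=0.5, leaf_size=2, max_overlap=0.7):
    if len(points) <= leaf_size:
        return {"leaf": points}
    # Two far-apart pivots define the splitting direction.
    a = max(points, key=lambda p: dist(points[0], p))
    b = max(points, key=lambda p: dist(a, p))
    span = dist(a, b)
    if span == 0:  # all points identical: nothing to split
        return {"leaf": points}
    node = {"a": a, "u": [(y - x) / span for x, y in zip(a, b)], "mid": span / 2}
    margins = [margin(p, node) for p in points]
    # Spill split: points within tau of the boundary go to both children.
    left = [p for p, m in zip(points, margins) if m <= tau]
    right = [p for p, m in zip(points, margins) if m >= -tau]
    node["overlap"] = max(len(left), len(right)) <= max_overlap * len(points)
    if not node["overlap"]:  # too many shared points: use a disjoint split
        left = [p for p, m in zip(points, margins) if m <= 0]
        right = [p for p, m in zip(points, margins) if m > 0]
    node["left"] = build(left, tau, leaf_size, max_overlap)
    node["right"] = build(right, tau, leaf_size, max_overlap)
    return node


def search(node, query, best=None):
    """Return (distance, point) for the nearest neighbour found."""
    if "leaf" in node:
        for p in node["leaf"]:
            d = dist(query, p)
            if best is None or d < best[0]:
                best = (d, p)
        return best
    m = margin(query, node)
    near, far = (node["left"], node["right"]) if m <= 0 else (node["right"], node["left"])
    best = search(near, query, best)
    # OVERLAP node: defeatist search, the other side is never visited.
    # NON OVERLAP node: backtrack if the other side could hold a closer point.
    if not node["overlap"] and (best is None or abs(m) < best[0]):
        best = search(far, query, best)
    return best


points = [(x, y) for x in range(6) for y in range(6)]
tree = build(points)
found = search(tree, (2.2, 3.1))
```

In this sketch the defeatist descent is only guaranteed to find the true nearest neighbour when that neighbour lies within `tau` of the query. A wider overlap buys accuracy at the cost of duplicated points, which is the trade-off the hybrid tree tries to balance.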
NEURAL NETWORK
• Convolutional networks work well with images
• Hyperparameter tuning is the pain point, but it can be automated (see the new algorithm)
• Training is not trivial, and updating the model is troublesome
WHAT MORE?
• Features engineering
• Hyperparameters tuning
• Parallel optimizations
• Persist/update steps
• Ensemble models
DATA → PREDICTION
Additional components: Combiner, Normalizer, pipelineModel, Cross Validator
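Hyperparameter tuning with a cross validator amounts to scoring each candidate setting on held-out folds and keeping the best one. The sketch below does this for the number of neighbours in a brute-force kNN. The one-dimensional data (including one deliberately noisy label), the fold assignment and the candidate grid are all invented for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, training, k):
    """Majority vote among the k nearest training vectors."""
    neighbours = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def k_fold(rows, folds):
    """Yield (train, validation) pairs; row i goes to fold i % folds."""
    for f in range(folds):
        train = [r for i, r in enumerate(rows) if i % folds != f]
        valid = [r for i, r in enumerate(rows) if i % folds == f]
        yield train, valid


def cross_validate(rows, k, folds=3):
    """Mean accuracy of kNN with `k` neighbours over all folds."""
    scores = []
    for train, valid in k_fold(rows, folds):
        hits = sum(knn_predict(x, train, k) == y for x, y in valid)
        scores.append(hits / len(valid))
    return sum(scores) / len(scores)


# Hypothetical 1-D features; the row at 1.5 carries a noisy "dog" label.
rows = [
    ([0.0], "cat"), ([1.0], "cat"), ([2.0], "cat"),
    ([10.0], "dog"), ([11.0], "dog"), ([12.0], "dog"),
    ([1.5], "dog"), ([3.0], "cat"), ([9.0], "dog"),
]
grid = [1, 3, 5]
scores = {k: cross_validate(rows, k) for k in grid}
best_k = max(grid, key=lambda k: scores[k])
```

On this toy data k = 1 is misled by the noisy label and k = 5 oversmooths within the small folds, so the middle setting wins. The same loop works for any hyperparameter that can be passed to the model.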