A Beginner's Guide to Machine Learning with Scikit-Learn
DESCRIPTION
Given at the PyData NYC 2013 conference (http://vimeo.com/79517341) and to be given again at PyTennessee 2014. Scikit-learn is one of the best-known Python modules for machine learning. But how does it work, and what, for that matter, is machine learning? For those who have programming experience but are new to machine learning, this talk gives a beginner-level overview of how machine learning can be useful, important machine learning concepts, and how to implement them with scikit-learn. We’ll use real-world data to look at supervised and unsupervised machine learning algorithms and why scikit-learn is useful for performing these tasks.
TRANSCRIPT
A Beginner’s Guide to Machine Learning with Scikit-Learn
Sarah Guido
PyTennessee 2014
All about me
• Grad student at the University of Michigan
• Data analyst for HathiTrust
• Organizer of Ann Arbor PyLadies chapter
My talk
• Machine learning and scikit-learn
• Supervised and unsupervised learning
• Preprocessing, validation and testing, strategies for machine learning
What is machine learning?
• Application of algorithms that learn from examples
• Representation and generalization
Why should we care?
• Useful in everyday life
  • Email spam, handwriting analysis, stock market analysis, Netflix
• Especially useful in data analysis
  • Feature extraction, linear regression, classification, clustering
Machine Learning Vocab
• Instance
• Feature
• Class
• Categorical
  • Nominal
  • Ordinal
• Continuous
Machine Learning Vocab
[Figure: diagram labeling Feature, Class, and Instance]
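To make the vocabulary concrete, here is a minimal sketch using scikit-learn's built-in iris dataset. The choice of iris is an assumption for illustration; the talk does not say which dataset its vocabulary slide showed.

```python
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data    # one row per instance, one column per feature
y = iris.target  # one class label per instance, stored as an integer

n_instances, n_features = X.shape
print("instances:", n_instances)
print("features:", list(iris.feature_names))
print("classes:", list(iris.target_names))

# A single instance: its (continuous) feature values and its (nominal) class.
first_instance = X[0]
first_class = iris.target_names[y[0]]
print(first_instance, "->", first_class)
```

Here the features are continuous measurements, while the class is categorical and nominal: the species names have no natural order.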
Scikit-Learn
• Machine learning module
• Open-source
• Built-in datasets
• Good resources for learning
Scikit-Learn
• Model = EstimatorObject()
• Model.fit(dataset.data, dataset.target)
  • dataset.data = dataset
  • dataset.target = labels
• Model.predict(dataset.data)
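The estimator pattern above can be sketched as follows. `GaussianNB` and the iris dataset stand in for `EstimatorObject()` and `dataset`; both are assumptions chosen for illustration, and any scikit-learn classifier follows the same fit/predict shape.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

dataset = load_iris()

# Model = EstimatorObject()
model = GaussianNB()

# Model.fit(dataset.data, dataset.target): learn from features and labels.
model.fit(dataset.data, dataset.target)

# Model.predict(dataset.data): predict a label for each instance.
predictions = model.predict(dataset.data)

# Fraction of instances whose predicted label matches the true label.
training_accuracy = (predictions == dataset.target).mean()
print(training_accuracy)
```

Accuracy measured on the same data the model was fitted on is optimistic, which is why the talk later splits data into training and test sets.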
Scikit-Learn
• Supervised
• Unsupervised
• Semi-supervised
• Reinforcement learning
• Neural networks
• …and many more!
Supervised learning
• Labeled data
• You know what you’re looking for
• Classification: predict categorical labels
• Regression: predict continuous target variables
Classification
• Categorical variables
• Relationship between instance and feature
• Classification algorithms == classifiers
Classification
• Naïve Bayes classifier
• Assumes features are independent
• Fast performance
• Decent classifier
Classification
• Car Evaluation dataset (UCI)
• Features: buying price, maintenance price, number of doors, number of seats, trunk size, and safety ranking
• Labels: unacceptable, acceptable, good, or very good
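A rough sketch of what Naïve Bayes classification on data shaped like this could look like. The rows below are invented for illustration and are not taken from the UCI file, the category strings are assumptions, and `CategoricalNB` with `OrdinalEncoder` is one possible choice rather than the code shown in the talk.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Invented rows for illustration only (NOT real rows from the UCI dataset).
# Columns: buying price, maintenance price, doors, seats, trunk size, safety
features = [
    ["vhigh", "vhigh", "2", "2", "small", "low"],
    ["high", "high", "2", "2", "small", "low"],
    ["vhigh", "high", "4", "2", "med", "low"],
    ["med", "med", "4", "4", "med", "med"],
    ["med", "low", "4", "4", "big", "med"],
    ["low", "med", "4", "more", "med", "med"],
    ["low", "low", "4", "4", "big", "high"],
    ["low", "low", "4", "more", "big", "high"],
]
labels = [
    "unacceptable", "unacceptable", "unacceptable",
    "acceptable", "acceptable", "acceptable",
    "very good", "very good",
]

# Turn each category string into an integer code, column by column.
encoder = OrdinalEncoder(dtype=int)
X = encoder.fit_transform(features)

# Naive Bayes for categorical features; treats features as independent.
classifier = CategoricalNB()
classifier.fit(X, labels)

# Classify a new car described with categories seen during fitting.
new_car = [["low", "low", "4", "more", "big", "high"]]
prediction = classifier.predict(encoder.transform(new_car))[0]
print(prediction)
```

With only eight invented rows this says nothing about real accuracy; it only shows the encode, fit, predict sequence.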
Unsupervised algorithms
• Unlabeled data
• You might have no idea what you’re looking for
• Clustering: splitting observations into groups
• Dimensionality reduction: flatten data to fewer dimensions
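As a small illustration of dimensionality reduction, the sketch below uses principal component analysis (PCA) on the built-in iris data. PCA is an assumption here: the slide names the idea but not a specific algorithm.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 instances, 4 features; labels are not used

# Project the 4 original features down to 2 new dimensions.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Share of the original variance kept by the 2 new dimensions.
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, explained)
```

The two-dimensional result can be plotted directly, which is a common reason to reduce dimensions when exploring unlabeled data.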
Clustering
• Exploring the data
• Similar objects in the same group
• Distance between data points
Clustering
• K-means clustering
• Three steps
  • Chooses initial cluster centers
  • Assigns each data instance to a cluster
  • Recalculates each cluster center
• Efficient
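The three steps can be sketched in plain NumPy. This is a toy version under stated assumptions (random instances as initial centers, Euclidean distance, a fixed iteration cap), not scikit-learn's implementation; the function name `k_means` and the sample points are made up for the example.

```python
import numpy as np


def k_means(X, k, n_iterations=10, seed=0):
    """Toy k-means: returns (labels, centers) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose initial cluster centers (assumption: k distinct random instances).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iterations):
        # Step 2: assign each instance to its nearest center (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recalculate each center as the mean of its assigned instances.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so assignments are stable
        centers = new_centers
    return labels, centers


# Two well-separated groups of made-up points.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = k_means(X, k=2)
print(labels)
```

In practice, `sklearn.cluster.KMeans` follows the same fit/predict pattern as other estimators and handles initialization more carefully than this sketch.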
Data preprocessing
• Encoding categorical features
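A minimal sketch of two common encodings, tied to the nominal/ordinal distinction from the vocabulary slides. The feature values below are hypothetical, and the talk's slides may have used a different encoder.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical ordinal feature: the categories have a meaningful order,
# so they are mapped to increasing integers in the order given here.
safety = [["low"], ["high"], ["med"], ["low"]]
ordinal = OrdinalEncoder(categories=[["low", "med", "high"]])
safety_encoded = ordinal.fit_transform(safety)
print(safety_encoded.ravel())

# Hypothetical nominal feature: no order, so each category gets its own
# 0/1 column (columns follow the sorted category names).
color = [["red"], ["blue"], ["green"], ["red"]]
one_hot = OneHotEncoder()
color_encoded = one_hot.fit_transform(color).toarray()
print(one_hot.categories_[0])
print(color_encoded)
```

One-hot encoding avoids implying an order that does not exist, at the cost of one extra column per category.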
Data preprocessing
• Split the dataset into training and test data
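One way to do the split, sketched with the built-in iris data and `GaussianNB` as stand-ins (both are assumptions). The import path is for current scikit-learn; releases from around the time of this talk exposed the same function under `sklearn.cross_validation`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Hold out 25% of the instances for testing; stratify keeps class
# proportions similar in both parts, random_state makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training part only, then score on instances the model never saw.
model = GaussianNB().fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(len(X_train), len(X_test), test_accuracy)
```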
Validation and testing
• Model evaluation
• Cross-validation
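Cross-validation repeats the train/test idea over several different splits and averages the scores. A minimal sketch, again assuming iris and `GaussianNB` purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is cut into 5 parts, and each part
# takes one turn as the test set while the other 4 are used for fitting.
scores = cross_val_score(GaussianNB(), X, y, cv=5)

print(scores)          # one accuracy value per fold
print(scores.mean())   # average accuracy across folds
```

The spread of the per-fold scores gives a rough sense of how much the estimate depends on which instances land in the test set.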
Good strategies
• Avoid overfitting
• Use lots of data
• Intuition fails in high dimensions
My materials
• Scikit-learn.org documentation and tutorials
• Machine learning class at U of M
• Scikit-learn talks
Resources
• Scikit-learn documentation and tutorials
  • scikit-learn.org/stable/documentation.html
• Other resources
  • http://archive.ics.uci.edu/ml/datasets.html
  • Mldata.org
• Videos
  • Scikit-learn tutorial: http://vimeo.com/53062607
  • Intro to scikit-learn: http://vimeo.com/72859487
Contact me!
• @sarah_guido
• Linkedin.com/sarahguido
• github.com/sarguido