Introduction to Data Science
DESCRIPTION
Two-hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick, non-technical overview of concepts and methodologies in data science. Topics include a wide overview of both pattern mining and machine learning. See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile (click the face).

TRANSCRIPT
INTRODUCTION TO DATA SCIENCE
NIKO VUOKKO
JYVÄSKYLÄ SUMMER SCHOOL
AUGUST 2013
DATA SCIENCE WITH A BROAD BRUSH
Concepts and methodologies
DATA SCIENCE IS AN UMBRELLA, A FUSION
• Databases and infrastructure
• Pattern mining
• Statistics
• Machine learning
• Numerical optimization
• Stochastic modeling
• Data visualization
… of specialties needed for data-driven business optimization
DATA SCIENTIST
• Data scientist is defined as DS : business problem → data solution
• Combination of strong programming, math, computational and business skills
• Recipe for success
1. Convert vague business requirements into measurable technical targets
2. Develop a solution to reach the targets
3. Communicate business results
4. Deploy the solution in production
UNDERSTANDING DATA
PATTERN MINING AND DATA ANALYSIS
UNSUPERVISED LEARNING
• Could be called pattern recognition or structure discovery
• What kind of a process could have produced this data?
• Discovery of “interesting” phenomena in a dataset
• Now how do you define interesting?
• Learning algorithms exist for a huge collection of pattern types
• Analogy: You decide if you want to see westerns or comedies,
but the machine picks the movies
• But does “interesting” imply useful and significant?
EXAMPLES OF STRUCTURES IN DATA
• Clustering and mixture models: separation of data into parts
• Dictionary learning: a compact grammar of the dataset
• Single class learning: learn the natural boundaries of data
Example: Early detection of machine failure or network intrusion
• Latent allocation: learn hidden preferences driving purchase decisions
• Source separation: find independent generators of the data
Example: Independent phenomena affecting exchange rates
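To make the first item concrete, here is a minimal clustering sketch (not part of the original slides), assuming scikit-learn and NumPy are available; it separates unlabeled synthetic data into parts with k-means.

```python
# Minimal sketch: k-means separates unlabeled data into parts (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data with three hidden groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)   # one center per discovered part
print(model.labels_[:10])       # cluster assignments of the first points
```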
MORE EXAMPLES OF “INTERESTING” PATTERNS
• { charcoal, mustard } ⇒ sausage
• Grocery customer types with differing paths around the trading floor
• Pricing trend change in a web ad exchange
• Communities and topics in a social network
• Distinct features of a person’s face and fingerprints
• Objects emerging in front of a moving car
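To illustrate the first pattern above, a minimal plain-Python sketch (the baskets are invented) computing the support and confidence of { charcoal, mustard } ⇒ sausage:

```python
# Toy shopping baskets (invented data) for the rule { charcoal, mustard } => sausage.
baskets = [
    {"charcoal", "mustard", "sausage", "beer"},
    {"charcoal", "mustard", "sausage"},
    {"charcoal", "mustard"},
    {"bread", "milk"},
]
antecedent = {"charcoal", "mustard"}
rule_items = antecedent | {"sausage"}

n_antecedent = sum(antecedent <= b for b in baskets)   # baskets containing the left-hand side
n_rule = sum(rule_items <= b for b in baskets)         # baskets containing the whole rule

print("support:", n_rule / len(baskets))               # 0.5
print("confidence:", n_rule / n_antecedent)            # ~0.67
```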
KNOW YOUR EIGENS AND SINGULARS
• Eigenvalue and singular value decompositions are central data analysis tools
• They describe the energy distribution and static core structures of data
Examples
• Face detection, speaker adaptation
• Google PageRank is basically just the world’s largest EVD
• Zombie outbreak risk is determined by its eigenvalues
• As a sub-component in every second learning algorithm
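As a sketch of the EVD connection, here is a tiny power iteration on an invented link matrix, assuming NumPy; this is the core computation behind a PageRank-style ranking.

```python
import numpy as np

# Toy link matrix of a 3-page "web" (invented); columns sum to 1.
A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

v = np.ones(3) / 3
for _ in range(100):
    v = A @ v                      # repeated multiplication converges to the
    v /= np.linalg.norm(v, 1)      # leading eigenvector (the page importances)

print(v)
```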
DIMENSION REDUCTION
• Some applications encounter large dimension counts up to millions
• Dimension reduction may either
1. Retain space: preserve the most “descriptive” dimensions
2. Transform space: trade interpretability for powerful rendition
• Usually transformations are oblivious to the data (they are simple, fixed functions)
• Curvilinear transformations try to see how the data is “folded” and build new
dimensions specific to the given dataset
DIMENSION REDUCTION EXAMPLE
• Singular value decomposition is commonly used to remove the “noise
dimensions” with little energy
• Example: gene expression data and movie preferences have lots of these
• After this, more complex methods can be used for unfolding the data
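A minimal sketch of this SVD-based denoising, assuming NumPy and using synthetic low-rank data:

```python
import numpy as np

rng = np.random.default_rng(0)
structure = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 40))  # true low-rank signal
X = structure + 0.1 * rng.normal(size=(100, 40))                  # plus noise

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5                                       # keep the dimensions with most energy
X_denoised = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-k reconstruction

print(s[:8].round(2))                       # singular values drop sharply after the 5th
```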
DIMENSION REDUCTION EXAMPLE (figure)
BLIND SOURCE SEPARATION
• Find latent sources that generated the data
• Tries to discover the real truth beneath all noise and convolution
• Examples:
• Air defense missile guidance systems
• Error-correcting codes
• Language modeling
• Brain activity factors
• Industrial process dynamics
• Factors behind climate change
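A minimal source-separation sketch using FastICA from scikit-learn on two synthetic, mixed signals (all data invented):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two latent sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # unknown mixing
X = S @ A.T                                        # what we actually observe

ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)                 # recovered sources (up to scale and order)
```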
(STATISTICAL) SIGNIFICANCE TESTING
• Example: Rejection rate increase in a manufacturing plant
• “What is the probability of observing this increase if everything was OK?”
• “What is the probability of having a valid alert if there really was something
wrong?”
• Reliability of significance testing results is wholly dependent on correct
modeling of the data source and pattern type
• Statistical significance is different from material significance
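A minimal sketch of the first question above, with made-up numbers and assuming SciPy: the one-sided probability of seeing at least this many rejects if the rate were still at its baseline.

```python
from scipy.stats import binom

baseline_rate = 0.02        # historical rejection rate (invented)
n_items = 1000              # items inspected
observed_rejects = 35

# P(rejects >= observed | everything is OK)
p_value = binom.sf(observed_rejects - 1, n_items, baseline_rate)
print(f"p-value = {p_value:.4g}")
```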
CORRELATION IS NOT CAUSALITY
A correlation may hide an almost arbitrary truth
• Cities with more firemen have more fires
• Companies spending more in marketing have higher revenues
• Marsupials exist mainly in Australia
• However, making successful predictions does not require causality
MACHINE LEARNING
Basics
SUPERVISED LEARNING
• Simplistically, the task is to find a function f : f(input) = output
• Examples: spam filtering, speech recognition, steel strength estimation
• Risks for different types of errors can be very skewed
• Complex inputs may confuse or slow down models
• Unsupervised methods often useful in improving results by simplifying the input
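A minimal supervised-learning sketch for the spam example, assuming scikit-learn; the training messages are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = ["win money now", "meeting at noon", "cheap money offer", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                  # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(messages, labels)            # learn f : input -> output from examples

print(model.predict(["free money"]))   # most likely [1], i.e. spam
```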
SEMI-SUPERVISED LEARNING
• Only part of the data is labeled
• Needed when labeling data is expensive
• Understanding the structure of unlabeled data enhances learning by bringing
diversity and generalization and by constraining learning
• Relates to multi-source learning, some sources labeled, some not
• Examples:
• Object detection from a video feed
• Web page categorization
• Sentiment analysis
• Transfer learning between domains
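A minimal semi-supervised sketch, assuming scikit-learn: only 10 of 200 synthetic points are labeled (label -1 means unlabeled), and the structure of the unlabeled points constrains the result.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=200, noise=0.1, random_state=0)

y = np.full_like(y_true, -1)     # -1 marks unlabeled points
y[:10] = y_true[:10]             # labeling is expensive: only 10 labels available

model = LabelSpreading().fit(X, y)
print((model.transduction_ == y_true).mean())   # fraction of points labeled correctly
```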
TRAINING, TESTING, VALIDATION
• A model is trained using a training dataset
• The quality of the model is measured by using it on a separate testing dataset
• A model often contains hyper-parameters chosen by the user
• A separate validation dataset is split off from the training data
• Validation data is used for testing and finding good hyper-parameter values
• Cross-validation is common practice and asymptotically unbiased
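A minimal sketch of the split-and-validate workflow, assuming scikit-learn and its bundled iris data: cross-validation on the training part chooses the hyper-parameter, the held-out test part measures quality.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Cross-validation on the training data picks the hyper-parameter C.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X_train, y_train)

print(search.best_params_)            # hyper-parameters chosen by validation
print(search.score(X_test, y_test))   # quality measured on separate test data
```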
BIAS AND VARIANCE
• Squared error of predictions consists of bias and variance (and noise)
• BIAS Model incapability of approximating the underlying truth
• VARIANCE Model reliance on whims of the observed data
• Complex models often have low bias and high variance
• Simple models often have high bias and low variance
• Having more data instances (rows) may reduce variance
• Having more detailed data (variables) may reduce bias
• Testing different types of models can explain how to improve your data
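The decomposition behind these bullets, written out for the expected squared error at a point x (f is the underlying truth, f̂ the trained model, σ² the irreducible noise):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \sigma^2
```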
TRAINING AND TESTING, BIAS AND VARIANCE
(Figure: training and testing error as model complexity grows from simple to complex; the minimal training error and minimal testing error are marked.)
MACHINE LEARNING
Learning new tricks
THE KERNEL TRICK
• Many learning methods rely on inner products of data points
• The “kernel trick” maps the data to an implicitly defined, high dimension space
• Kernel is the matrix of the new inner products in this space
• Mapping itself often left unknown
• Example: Gaussian kernel associates local Euclidean neighborhoods to similarity
• Example: String kernels are used for modeling DNA sequence structure
• Kernels can be combined and custom built to match expert knowledge
A kernel is a dataset-specific space transformation,
success depends on good understanding of the dataset
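A minimal Gaussian-kernel sketch, assuming scikit-learn and synthetic data: the kernel matrix holds the new pairwise similarities, and an SVM learns from the same kernel without the mapping ever being formed.

```python
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)

K = rbf_kernel(X, gamma=2.0)     # kernel matrix: local Euclidean neighborhoods as similarity
print(K.shape)                   # (300, 300)

clf = SVC(kernel="rbf", gamma=2.0).fit(X, y)   # same kernel inside a classifier
print(clf.score(X, y))
```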
ENSEMBLE LEARNING
• The power of many: combine multiple models into one
• Wide and strong proof of superior performance
• Extra bonus: often trivially parallelizable
“OUR EXPERIENCE IS THAT MOST EFFORTS SHOULD BE CONCENTRATED IN DERIVING SUBSTANTIALLY DIFFERENT APPROACHES, RATHER THAN REFINING A SINGLE TECHNIQUE.”
— Netflix $1M prize winner (ensemble of 107 models)
ENSEMBLE LEARNING IN PRACTICE
• Boosting: weigh and focus simple models (⇒ low bias)
• Bagging: average (⇒ low variance) results of simple models (⇒ low bias)
• What aspect of the data am I still missing?
• Variable mixing, discretized jumps, independent factors, transformations, etc.
• Questions about practical implementability and ROI
• Failure: Netflix winner solution never taken to production
• Success: The official US hurricane model is an ensemble of 43 models
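A minimal bagging-versus-boosting sketch on synthetic data, assuming scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging averages many independently trained simple models.
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=0)
# Boosting builds simple models that focus on previous mistakes.
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```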
RANDOMIZED LEARNING
• Motivation: random variation beats expert guidance surprisingly often
• Introducing randomness can improve generalization performance (smaller
variance)
• Randomness allows methods to discover unexpected success
• Examples: genetic models, simulated annealing, parallel tempering
• Increasingly useful to allow scale-out for large datasets
• Many successful methods combine random models as an ensemble
• Example: combining random projections or transformations can often beat optimized
unsupervised models
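A minimal random-projection sketch, assuming scikit-learn: a purely random, unoptimized transformation that reduces dimension and could be combined with others in an ensemble.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5000))          # synthetic high-dimensional data

projector = GaussianRandomProjection(n_components=100, random_state=0)
X_small = projector.fit_transform(X)       # 5000 -> 100 dimensions, no optimization at all
print(X_small.shape)                       # (1000, 100)
```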
ONLINE LEARNING
• Instead of ingesting a training dataset, adjust the data model after every
incoming (instance, label) pair
• Allows quick adaptation and “always-on” operation
• Finds good models fast, but may miss the great one
⟹ suitable also as a burn-in for other models
• Useful especially for the present trend towards analyzing data streams
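A minimal online-learning sketch with scikit-learn's SGDClassifier: the model is updated after each (instance, label) pair from a simulated stream of synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)
model = SGDClassifier()

classes = [0, 1]                         # must be declared up front for partial_fit
for xi, yi in zip(X, y):                 # pretend the rows arrive one at a time
    model.partial_fit(xi.reshape(1, -1), [yi], classes=classes)

print(model.score(X, y))                 # quality on the data seen so far
```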
BAYESIAN BASICS
• Bayesians see data as fixed and parameters as distributions
• Parameters have prior assumptions that can encode expert knowledge
• Data is used as evidence for possible parameter values
• Final output is a set of posterior distributions for the parameters
• Models may employ only the most probable parameter values or their full
probability distribution
• Variational Bayes approximates the posterior with a simpler distribution
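A minimal Bayesian sketch with invented numbers, assuming SciPy: a Beta prior encodes the expert guess, binomial evidence updates it, and the output is a posterior distribution rather than a single value.

```python
from scipy.stats import beta

prior_a, prior_b = 2, 8            # prior: "the rate is probably low" (expert knowledge)
successes, failures = 30, 170      # observed evidence

posterior = beta(prior_a + successes, prior_b + failures)   # conjugate update

print(posterior.mean())            # a model may use a point summary like this...
print(posterior.interval(0.95))    # ...or the full posterior / a credible interval
```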
MODEL COMPLEXITY
• Limiting model size and complexity can be used to avoid overfitting (excessive variance)
• Minimum description length and Akaike/Bayesian information criteria are the
Occam’s razor of data science
• VC dimension of a model provides a theoretical limit for generalization error
• Regularization can limit instance weights or parameter sizes
• Bayesian models use hyper-parameters to limit parameter overfit
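A minimal regularization sketch on synthetic data, assuming scikit-learn: increasing the ridge penalty shrinks the learned parameter sizes, which is one way of limiting model complexity.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=30, noise=10.0, random_state=0)

for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    # Stronger regularization (larger alpha) -> smaller weights.
    print(alpha, np.abs(model.coef_).max().round(2))
```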
THE END