Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science
Science / Data Acquisition / Machines / ToolBox / Conclusion
The Hitch-Hackers Guide to Data Science... or what I wish I’d known when I was younger
Jaroslav Vážný
Masaryk University / Astronomical Institute / Gauss Algorithmic / 4comfort.cz
April 3, 2014
Jaroslav Vážný Practical approach
1 Science
2 Data Acquisition
3 Machines
4 ToolBox
5 Conclusion
What is Science?
The whole of science is nothing more than a refinement of everyday thinking. Albert Einstein
More than Science
Mistakes/Feedback
No pain no gain
Pain == gain?
Everything is hard until someone makes it easy
MOOC == new era?
https://www.khanacademy.org/
https://www.coursera.org/
https://www.udacity.com/
https://www.edx.org/
Reproducibility
http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/
http://nbviewer.ipython.org/
http://pdos.csail.mit.edu/scigen/ ;-)
We are all humans
We are all humans/animals
We are all humans/animals/idiots
Probability
Test your intuition!
Roll dice. 5 times you got 6. What is P(6)=?
Monty Hall problem
Show examples in IPython!
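The Monty Hall problem is easy to test by simulation. The following is a minimal sketch using only the standard library, assuming the usual rules (three doors, the host always opens a goat door that is not the player's pick); the function names and the trial count are illustrative choices, not taken from the talk.

```python
import random


def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the single remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the winning probability for one strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


stay = win_rate(switch=False)
swap = win_rate(switch=True)
print(f"stay: {stay:.3f}  switch: {swap:.3f}")
```

Staying should win about one time in three and switching about two times in three, which is the result most people's intuition gets wrong.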
Bayes’s theorem
Suppose the probability (for anyone) to have AIDS is:
P(AIDS) = 0.001
P(no AIDS) = 0.999
Consider an AIDS test: result is + or -
P(+|AIDS) = 0.98
P(-|AIDS) = 0.02
P(+|no AIDS) = 0.03
P(-|no AIDS) = 0.97
Bayes’s theorem solution
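$$P(\mathrm{AIDS} \mid +) = \frac{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS})}{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS}) + P(+ \mid \mathrm{no\ AIDS})\,P(\mathrm{no\ AIDS})}$$
$$= \frac{0.98 \times 0.001}{0.98 \times 0.001 + 0.03 \times 0.999} \approx 0.032$$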
Your viewpoint: my degree of belief that I have AIDS is 3.2%
Your doctor’s viewpoint: 3.2% of people like this will have AIDS
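The same calculation can be checked in a few lines of Python. This is a minimal sketch; the function and parameter names are illustrative, and the numbers are the ones given on the slide.

```python
def posterior(prior, p_pos_given_d, p_pos_given_not_d):
    """Bayes's theorem: P(disease | positive test)."""
    # Total probability of a positive result (true positives + false positives).
    evidence = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)
    return p_pos_given_d * prior / evidence


p = posterior(prior=0.001, p_pos_given_d=0.98, p_pos_given_not_d=0.03)
print(f"P(AIDS|+) = {p:.3f}")
```

The posterior stays small because the false positives from the 99.9% of healthy people outnumber the true positives from the 0.1% who are ill.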
We are all humans/animals/idiots/liars
Data Avalanche?
Large Synoptic Survey Telescope
20 TB per night
60 PB for the raw data (after 10 years)
15 PB for the catalog database
The total data volume after processing will be several hundred PB
CERN
1 PB per day
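As a quick sanity check on the LSST figures, the nightly rate and the ten-year raw total can be related by simple arithmetic. This sketch assumes decimal units (1 PB = 1000 TB); the implied number of observing nights is an inference from the slide's numbers only, not an official figure.

```python
TB_PER_PB = 1000  # assumption: decimal units

per_night_tb = 20   # from the slide
raw_total_pb = 60   # from the slide, after 10 years
years = 10

# Number of observing nights needed to accumulate the raw total.
nights = raw_total_pb * TB_PER_PB / per_night_tb
nights_per_year = nights / years
print(nights, nights_per_year)
```

Under these assumptions the two numbers are consistent with roughly 300 observing nights per year.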
Sloan Digital Sky Survey
Why is it important?
Lots of data (>10^6 objects)
Perfect documentation
Tools to access the data
Where can I learn it?
http://www.sdss3.org/
Virtual Observatory
Why is it important?
Uniform access to astronomy data
Based on Web standards
Many tools with VO support (TOPCAT, Aladin, TAPsh)
Where can I learn it?
http://physics.muni.cz/~vazny/wiki/index.php/Diploma_work
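"Based on Web standards" means a catalogue query is just an HTTP request. The sketch below builds (but does not send) a synchronous query URL in the style of the Virtual Observatory's Table Access Protocol (TAP), with the query written in ADQL. The service address, table name, and column names are hypothetical placeholders; a real service publishes its own.

```python
from urllib.parse import urlencode


def tap_sync_url(base_url, adql):
    """Build the URL of a synchronous TAP query (the request is not sent)."""
    params = {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql}
    return f"{base_url.rstrip('/')}/sync?{urlencode(params)}"


BASE = "http://example.org/tap"                  # hypothetical service
QUERY = "SELECT TOP 10 ra, dec FROM my_catalog"  # hypothetical table and columns

url = tap_sync_url(BASE, QUERY)
print(url)
```

Tools such as those listed above wrap this kind of request, so the same query can be reused across archives that expose a TAP endpoint.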
What is
Machine Learning (Data astrology)
Data Mining
Artificial Intelligence
Supervised Machine Learning
[Figure: Supervised Learning Model. Training: text, documents, images, etc. → feature vectors (+ labels) → machine learning algorithm → predictive model. Prediction: new text, document, image, etc. → feature vector → predictive model → expected label.]
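The supervised workflow (labelled feature vectors in, predictive model out, then a label for a new feature vector) can be sketched with a deliberately simple algorithm. The example below uses a nearest-centroid classifier written with the standard library only; this is an illustrative stand-in, not the method used in the talk, and the toy data and labels are made up.

```python
from collections import defaultdict
from math import dist


def fit(feature_vectors, labels):
    """Training: compute one mean feature vector (centroid) per label."""
    groups = defaultdict(list)
    for x, y in zip(feature_vectors, labels):
        groups[y].append(x)
    return {
        y: tuple(sum(column) / len(xs) for column in zip(*xs))
        for y, xs in groups.items()
    }


def predict(model, x):
    """Prediction: return the label whose centroid is closest to x."""
    return min(model, key=lambda y: dist(model[y], x))


# Made-up training data: two features per object, two classes.
X_train = [(1.0, 1.2), (0.8, 1.0), (3.0, 3.1), (3.2, 2.9)]
y_train = ["A", "A", "B", "B"]

model = fit(X_train, y_train)
print(predict(model, (0.9, 1.1)), predict(model, (3.1, 3.0)))
```

Libraries such as scikit-learn follow the same fit-then-predict pattern with far more capable algorithms.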
Overfit/underfit
Unsupervised Machine Learning
[Figure: Unsupervised Learning Model. Training: text, documents, images, etc. → feature vectors → machine learning algorithm → predictive model. Prediction: new text, document, image, etc. → feature vector → predictive model → likelihood, cluster ID, or better representation.]
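In the unsupervised case there are no labels, and the model has to find structure on its own, for example by assigning a cluster ID to each feature vector. The sketch below is a minimal k-means implementation using the standard library only; k-means is one illustrative choice of algorithm, not necessarily the one discussed in the talk, and the data points are made up.

```python
import random
from math import dist


def kmeans(points, k, iterations=20, seed=0):
    """Return k centroids found by plain k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(column) / len(cluster) for column in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


def cluster_id(centroids, p):
    """Cluster ID of a (possibly new) feature vector."""
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))


# Made-up feature vectors forming two visible groups.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
centroids = kmeans(points, k=2)
ids = [cluster_id(centroids, p) for p in points]
print(ids)
```

Which group receives ID 0 and which ID 1 depends on the random initialisation; only the grouping itself is meaningful.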
Star spectrum
Example of feature extraction
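The figure for this slide is not preserved. One plausible kind of feature, suggested by the names ug, gr, and ri used in the decision tree example in this deck, is the colour index: the difference between magnitudes in two adjacent photometric bands. The sketch below assumes that interpretation (ug = u - g, and so on); the magnitudes are made up.

```python
def color_indices(mags):
    """Turn band magnitudes into colour-index features (assumed definition)."""
    return {
        "ug": mags["u"] - mags["g"],
        "gr": mags["g"] - mags["r"],
        "ri": mags["r"] - mags["i"],
    }


star = {"u": 19.2, "g": 18.1, "r": 17.7, "i": 17.5}  # made-up magnitudes
features = color_indices(star)
print(features)
```

Reducing each object to a handful of such numbers is what turns raw observations into the feature vectors a learning algorithm expects.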
Example: Decision Tree
ug <= 0.663668
|   gr <= -0.191208: 1 (7.0)
|   gr > -0.191208: 3 (104.0/5.0)
ug > 0.663668
|   ri <= 0.285854: 1 (88.0/5.0)
|   ri > 0.285854
|   |   ri <= 0.314657
|   |   |   gr <= 0.692108: 2 (6.0)
|   |   |   gr > 0.692108: 1 (3.0)
|   |   ri > 0.314657: 2 (90.0/2.0)
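A learned decision tree is just a set of nested threshold tests, so the tree above can be written directly as a function. The thresholds and class numbers are copied from the slide; the function name is illustrative, and what classes 1, 2, and 3 stand for is not stated in the source. The numbers in parentheses after each leaf are not used here.

```python
def classify(ug, gr, ri):
    """Apply the decision tree from the slide; returns class 1, 2, or 3."""
    if ug <= 0.663668:
        return 1 if gr <= -0.191208 else 3
    # From here on: ug > 0.663668
    if ri <= 0.285854:
        return 1
    if ri <= 0.314657:
        return 2 if gr <= 0.692108 else 1
    return 2  # ri > 0.314657


print(classify(ug=0.5, gr=0.0, ri=0.0))
print(classify(ug=1.0, gr=0.5, ri=0.4))
```

Being able to read the model as plain if/else rules is the main practical appeal of decision trees.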
Example: Support Vector Machine
Data exploration
http://ipython.org/
http://scikit-learn.org/stable/
http://pandas.pydata.org/
Development
https://github.com/
Tests
Funny hat
https://www.python.org/
References
http://ipython.org/
http://www.greenteapress.com/thinkstats/
http://www.greenteapress.com/thinkpython/
http://scikit-learn.org/stable/
http://pandas.pydata.org/
http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/
http://www.galaxyzoo.org/
http://www.planethunters.org/
http://www.sdss3.org/
Discussion