Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science

Upload: kisk-ff-mu

Post on 26-Jan-2015


Page 1: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Science | Data Acquisition | Machines | ToolBox | Conclusion

The Hitch-Hacker’s Guide to Data Science
... or what I wish I’d known when I was younger

Jaroslav Vážný

Masaryk University / Astronomical Institute / Gauss Algorithmic / 4comfort.cz

3 April 2014

Jaroslav Vážný Practical approach

Page 2: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

1 Science

2 Data Acquisition

3 Machines

4 ToolBox

5 Conclusion

Page 3: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

What is Science?

The whole of science is nothing more than a refinement of everyday thinking. Albert Einstein

Page 4: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

More than Science

Mistakes/Feedback
No pain no gain
Pain == gain?
Everything is hard until someone makes it easy

Page 5: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

MOOC == new era?

https://www.khanacademy.org/

https://www.coursera.org/

https://www.udacity.com/

https://www.edx.org/

Page 6: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Reproducibility

http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/

http://nbviewer.ipython.org/

http://pdos.csail.mit.edu/scigen/ ;-)

Page 7: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

We are all humans

Page 8: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

We are all humans/animals

Page 9: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

We are all humans/animals/idiots

Page 10: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Probability

Test your intuition!

Roll a die. Five times you got a 6. What is P(6)?
Monty Hall problem
Show examples in IPython!
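The Monty Hall problem mentioned on this slide is easy to check by simulation, which is one way to test intuition in an IPython session. The following is a minimal sketch, not the speaker's own demo: the function names `monty_hall_trial` and `win_rate`, the number of trials, and the seed are all illustrative assumptions.

```python
import random


def monty_hall_trial(switch, rng):
    """Play one game; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the only door that is still closed and not picked.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning by repeated simulation."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    wins = sum(monty_hall_trial(switch, rng) for _ in range(trials))
    return wins / trials


stay = win_rate(switch=False)
swap = win_rate(switch=True)
print(f"stay:   {stay:.3f}")
print(f"switch: {swap:.3f}")
```

With enough trials the estimates should settle near 1/3 for staying and 2/3 for switching, which is the counter-intuitive result the puzzle is known for.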

Page 11: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Bayes’s theorem

Suppose the probability (for anyone) of having AIDS is:
P(AIDS) = 0.001
P(no AIDS) = 0.999
Consider an AIDS test: the result is + or -
P(+|AIDS) = 0.98
P(-|AIDS) = 0.02
P(+|no AIDS) = 0.03
P(-|no AIDS) = 0.97

Page 12: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Bayes’s theorem solution

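$$
P(\mathrm{AIDS} \mid +) = \frac{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS})}{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS}) + P(+ \mid \text{no AIDS})\,P(\text{no AIDS})} = \frac{0.98 \times 0.001}{0.98 \times 0.001 + 0.03 \times 0.999} \approx 0.032
$$

Your viewpoint: my degree of belief that I have AIDS is 3.2%
Your doctor’s viewpoint: 3.2% of people like this will have AIDS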
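The same calculation can be reproduced in a few lines of Python. This is a small sketch using the numbers from the slide; the function name `posterior` and its parameter names are illustrative assumptions.

```python
def posterior(prior, p_pos_given_d, p_pos_given_not_d):
    """Bayes's theorem for a positive test result.

    prior             -- P(disease)
    p_pos_given_d     -- P(+ | disease)
    p_pos_given_not_d -- P(+ | no disease)
    """
    # Total probability of a positive result, over both cases.
    evidence = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)
    return p_pos_given_d * prior / evidence


# Numbers from the slide: P(AIDS) = 0.001, P(+|AIDS) = 0.98, P(+|no AIDS) = 0.03
p = posterior(prior=0.001, p_pos_given_d=0.98, p_pos_given_not_d=0.03)
print(f"P(AIDS|+) = {p:.3f}")  # prints P(AIDS|+) = 0.032
```

The result stays small even though the test is 98% sensitive, because the 3% false-positive rate applies to the 99.9% of people who do not have the disease.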

Page 13: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

We are all humans/animals/idiots/liars

Page 14: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Data Avalanche?

Large Synoptic Survey Telescope
  20 TB per night
  60 PB for the raw data (after 10 years)
  15 PB for the catalog database
  The total data volume after processing will be several hundred PB
CERN
  1 PB per day

Page 15: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Sloan Digital Sky Survey

Why is it important?
  Lots of data (>10^6 objects)
  Perfect documentation
  Tools to access the data

Where can I learn it?
  http://www.sdss3.org/

Page 16: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Virtual Observatory

Why is it important?
  Uniform access to astronomy data
  Based on Web standards
  Many tools with VO support (TOPCAT, Aladin, TAPsh)

Where can I learn it?
  http://physics.muni.cz/~vazny/wiki/index.php/Diploma_work

Page 17: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

What is

Machine Learning (Data astrology)
Data Mining
Artificial Intelligence

Page 18: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Supervised Machine Learning

[Diagram: Supervised Learning Model]
Training: training text, documents, images, etc. -> feature vectors (+ labels) -> machine learning algorithm -> predictive model
Prediction: new text, document, image, etc. -> feature vector -> predictive model -> expected label
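The flow in this diagram (labelled feature vectors go in, a predictive model comes out, and the model assigns an expected label to a new feature vector) can be sketched with a 1-nearest-neighbour classifier in plain Python. This is only an illustration of the fit/predict pattern: the talk does not say which algorithm was used, and the data, labels, and function names below are made up.

```python
import math


def fit(features, labels):
    """'Training' for 1-nearest-neighbour just stores the labelled examples."""
    return list(zip(features, labels))


def predict(model, x):
    """Return the label of the stored example closest to x."""
    _, label = min(model, key=lambda pair: math.dist(pair[0], x))
    return label


# Made-up 2-D feature vectors and labels, for illustration only.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_y = ["A", "A", "B", "B"]

model = fit(train_X, train_y)           # training step
prediction = predict(model, (0.85, 0.75))  # new, unlabelled feature vector
print(prediction)  # prints B
```

Libraries such as scikit-learn, listed later in the talk, expose the same two steps as `fit` and `predict` methods on their estimator objects.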

Page 19: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Overfit/underfit

Page 20: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Unsupervised Machine Learning

[Diagram: Unsupervised Learning Model]
Training: training text, documents, images, etc. -> feature vectors -> machine learning algorithm -> predictive model
Prediction: new text, document, image, etc. -> feature vector -> predictive model -> likelihood, cluster ID, or better representation
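As an illustration of the "cluster ID" output in this diagram, here is a minimal k-means sketch in plain Python. No labels are supplied; the algorithm groups the feature vectors on its own. The choice of k-means, the toy data, and the deterministic initialisation from the first k points are all assumptions made for this example, not something taken from the talk.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Return (centroids, cluster_ids) after a fixed number of iterations."""
    # Simplification: start from the first k points instead of random ones.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, [nearest(p, centroids) for p in points]


# Made-up, unlabelled 2-D feature vectors.
points = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
centroids, cluster_ids = kmeans(points, k=2)
print(cluster_ids)  # prints [0, 1, 0, 1]
```

The cluster IDs themselves carry no meaning beyond "these points belong together"; interpreting the groups is left to the analyst.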

Page 21: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Star spectrum

Page 22: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Example of feature extraction

Page 23: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Example: Decision Tree

ug <= 0.663668
|   gr <= -0.191208: 1 (7.0)
|   gr > -0.191208: 3 (104.0/5.0)
ug > 0.663668
|   ri <= 0.285854: 1 (88.0/5.0)
|   ri > 0.285854
|   |   ri <= 0.314657
|   |   |   gr <= 0.692108: 2 (6.0)
|   |   |   gr > 0.692108: 1 (3.0)
|   |   ri > 0.314657: 2 (90.0/2.0)
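A decision tree like the one on this slide is just a set of nested conditionals, so it can be written directly as a Python function. The thresholds and the feature names `ug`, `gr`, and `ri` are taken from the slide; reading them as colour indices (u-g, g-r, r-i) is an assumption, and the slide does not say what the class labels 1, 2, and 3 stand for. The function name `classify` is illustrative.

```python
def classify(ug, gr, ri):
    """Return the class label (1, 2 or 3) that the tree assigns to one object."""
    if ug <= 0.663668:
        if gr <= -0.191208:
            return 1
        return 3
    # From here on: ug > 0.663668
    if ri <= 0.285854:
        return 1
    if ri <= 0.314657:
        if gr <= 0.692108:
            return 2
        return 1
    return 2  # ri > 0.314657


print(classify(ug=0.5, gr=0.1, ri=0.2))  # prints 3
print(classify(ug=1.2, gr=0.5, ri=0.4))  # prints 2
```

The numbers in parentheses on the slide are not needed for prediction. In output of this style they usually give the number of training instances reaching a leaf and, after the slash, how many of those were misclassified, though the slide does not state this.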

Page 24: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Example: Support Vector Machine

Page 25: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Data exploration

http://ipython.org/

http://scikit-learn.org/stable/

http://pandas.pydata.org/

Page 26: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Development

https://github.com/

Tests
Funny hat
https://www.python.org/

Page 27: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

References

http://ipython.org/
http://www.greenteapress.com/thinkstats/
http://www.greenteapress.com/thinkpython/
http://scikit-learn.org/stable/
http://pandas.pydata.org/
http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/
http://www.galaxyzoo.org/
http://www.planethunters.org/
http://www.sdss3.org/

Page 28: Jaroslav Vážný: The Hitch-Hacker’s Guide to Data Science  

Discussion
