data science
DESCRIPTION
Tutorial on data science: what it's like to be a data scientist, big data, the data scientific method, probabilistic algorithms, map-reduce, sensor data analysis, visualization of Twitter and Foursquare feeds, open source tools (R, Python, NoSQL)
TRANSCRIPT
![Page 1: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/1.jpg)
Data Science
Joaquin Vanschoren
![Page 2: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/2.jpg)
![Page 3: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/3.jpg)
WHAT IS DATA SCIENCE?
![Page 4: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/4.jpg)
Hacking skills
Maths & Stats
Expertise
![Page 5: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/5.jpg)
Venn diagram of three circles: Hacking skills, Maths & Stats, Expertise
Overlap labels: Machine Learning, Research, Danger zone!, Data Science
[Drew Conway]
![Page 6: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/6.jpg)
(data) science officer
![Page 7: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/7.jpg)
Venn diagram of four circles: Maths & Stats, Hacking skills, Expertise, Evil
Overlap labels: danger zone!, machine learning, James Bond villain, data science, outside committee member, identity thief, thesis advisor, office mate, NSA
[Joel Grus]
![Page 8: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/8.jpg)
THE HYPE
“Whenever you read about data science or data analysis, it’s about the ability to store petabytes of data, retrieve that data in nanoseconds, then turn it into a rainbow with a unicorn dancing on it.”
– David Coallier
“Data Scientist: The Sexiest Job of the 21st Century”
– Harvard Business Review
![Page 9: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/9.jpg)
THE REALITY
• You’ll clean a lot of data. A LOT
• A lot of mathematics. Get over it
• Some days will be long. Get more coffee
• Not everything is about Big Data
• Most people don’t care about data
• Spend time finding the right questions
[David Coallier]
![Page 10: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/10.jpg)
BIG DATA: THE END OF THEORY?
“Out with every theory of human behavior, from linguistics to sociology. Who knows why people do what they do? The point is they do it… With enough data, the numbers speak for themselves.”
– Chris Anderson, WIRED
![Page 11: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/11.jpg)
All models are wrong.
But some are useful.
![Page 12: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/12.jpg)
All models are wrong, and
increasingly you can succeed
without them.
![Page 13: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/13.jpg)
Big data is a step forward.
But our problems are
not lack of access to data, but
understanding them.
![Page 14: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/14.jpg)
Data Scientific Method
[DJ Patil, J Elman]
![Page 15: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/15.jpg)
START WITH A QUESTION
Based on an observation
![Page 16: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/16.jpg)
START WITH A QUESTION
What (just) happened? Why did it happen?
What will/could happen next?
![Page 17: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/17.jpg)
ANALYSE CURRENT DATA
Create a Hypothesis
![Page 18: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/18.jpg)
CREATE FEATURES, EXPERIMENT
Test Hypothesis
![Page 19: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/19.jpg)
ANALYSE RESULTS
Won’t be pretty, repeat
![Page 20: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/20.jpg)
LET DATA FRAME THE CONVERSATION
Data gives you the what
Humans give you the why
![Page 21: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/21.jpg)
LET DATA FRAME THE CONVERSATION
Let the dataset change your mindset
![Page 22: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/22.jpg)
CONVERSE
• What data is missing? Where can we get it?
• Automate data collection
• Clean data, then clean it more
• Visualize data: the brain sees
• Merge various sources of information
• Reformulate hypotheses
• Reformulate questions
![Page 23: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/23.jpg)
DATA SCIENCE TOOLS
![Page 24: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/24.jpg)
We can't solve problems at the same level of thinking with which
we've created them…
![Page 25: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/25.jpg)
Probabilistic algorithms
When polynomial time is just too slow
![Page 26: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/26.jpg)
[1,54,853,23,4,73,…] Have we seen 73?
min H(1,54,853,…) < H(73) ?
min H2(1,54,853,…) < H2(73) ?
min H3(1,54,853,…) < H3(73) ?
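One way to read the check on this slide, sketched in Python: keep only the minimum hash value of the stream under several hash functions. If a query value hashes below one of those minima, it cannot have been in the stream; otherwise the answer is only "maybe". The seeded SHA-256 hashes and the names `make_hash` and `MinSketch` are assumptions made for illustration, not taken from the slides, and the test is deliberately coarse.

```python
import hashlib


def make_hash(seed):
    """Return a seeded hash function mapping a value to a 64-bit integer."""
    def h(x):
        digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return h


class MinSketch:
    """Stores one number per hash function: the minimum hash seen so far."""

    def __init__(self, num_hashes=3):
        self.hashes = [make_hash(seed) for seed in range(num_hashes)]
        self.minima = [None] * num_hashes

    def add(self, x):
        for i, h in enumerate(self.hashes):
            v = h(x)
            if self.minima[i] is None or v < self.minima[i]:
                self.minima[i] = v

    def definitely_unseen(self, x):
        # If x had been added, every stored minimum would be <= H_i(x).
        return any(m is None or h(x) < m
                   for h, m in zip(self.hashes, self.minima))


sketch = MinSketch()
for value in [1, 54, 853, 23, 4, 73]:
    sketch.add(value)

print(sketch.definitely_unseen(73))
```

A value that was added is never reported as unseen; a value that was not added is usually reported as "maybe", which is the price of storing only three numbers.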
![Page 27: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/27.jpg)
MINHASHING
Bloom filters
Find similar documents, photos, …
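A minimal MinHash sketch for finding similar documents, assuming documents are represented as sets of words. It relies on the known property that two sets share the same minimum hash with probability equal to their Jaccard similarity. The function names, the seeded SHA-256 hash and the signature length of 128 are illustrative choices, not from the slides.

```python
import hashlib


def seeded_hash(seed, token):
    """64-bit hash of a token under one of many seeded hash functions."""
    digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=128):
    """One minimum per hash function; the list is the document's signature."""
    return [min(seeded_hash(seed, t) for t in tokens)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox leaps over a sleepy dog".split())
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)

print(exact_jaccard(doc_a, doc_b), estimated_jaccard(sig_a, sig_b))
```

Only the fixed-size signatures need to be stored and compared, so the cost per comparison no longer depends on document length.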
![Page 28: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/28.jpg)
MapReduce
![Page 29: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/29.jpg)
Diagram: data (HDFS) → read → split 1, split 2, …, split n → map (worker nodes, local) → shuffle (remote read) → reduce (worker nodes, local) → write → data (HDFS)
![Page 30: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/30.jpg)
mapper node
reducer node
remote read
![Page 31: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/31.jpg)
![Page 32: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/32.jpg)
![Page 33: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/33.jpg)
![Page 34: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/34.jpg)
![Page 35: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/35.jpg)
![Page 36: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/36.jpg)
![Page 37: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/37.jpg)
Mapper
<a,apple> <a’,slices>
Reducer
Input file
Intermediate file (local)
Output file
![Page 38: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/38.jpg)
Input file, Intermediate file, Output file
<a,apple> <a’,slices>
<o,orange> <o’,slices>
<p,pineapple> <p’,slices>
![Page 39: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/39.jpg)
split 0
split 1
split 2
1 mapper/split
1 reducer/key(set)
split
shuffle
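The split, map, shuffle and reduce stages can be simulated in a few lines of single-process Python. This is only a sketch: word counting is an example chosen here (it is not the fruit-slicing example on the slides), and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(split):
    """Mapper: emit a (key, value) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle_phase(mapped_outputs):
    """Shuffle: group all emitted values by key and sort the keys."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reducer: combine all values that share one key."""
    return key, sum(values)


def run_job(splits):
    mapped = [map_phase(split) for split in splits]   # one mapper per split
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())  # one reducer per key


splits = [["apple orange"], ["orange pineapple"], ["apple apple"]]
counts = run_job(splits)
print(counts)
```

In a real cluster each `map_phase` call would run on the node that holds its split, and only the shuffle would move data across the network.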
![Page 40: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/40.jpg)
Diagram (repeated): data (HDFS) → read → split 1, split 2, …, split n → map (worker nodes, local) → shuffle (remote read) → reduce (worker nodes, local) → write → data (HDFS)
![Page 41: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/41.jpg)
Diagram: data (HDFS) → read → split 1, split 2, …, split n → map → shuffle + parallel sort (remote read) → reduce → write → data (HDFS)
master node
• assigns map/reduce jobs
• reassigns if nodes fail
![Page 42: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/42.jpg)
![Page 43: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/43.jpg)
![Page 44: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/44.jpg)
![Page 45: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/45.jpg)
![Page 46: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/46.jpg)
Nearest bar
? nearest within distance d?
Input graph (node,label)
![Page 47: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/47.jpg)
Nearest bar
Map: for each [icon missing], search graph with radius d, emit <node, {bar, distance}>
Input graph (node,label)
![Page 48: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/48.jpg)
Nearest bar
Map: for each [icon missing], search graph, emit <node, {bar, distance}>
Shuffle/ Sort by id
Input graph (node,label)
![Page 49: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/49.jpg)
Nearest bar
Input graph (node,label)
Map: for each [icon missing], search graph, emit <node, {bar, distance}>
Shuffle/Sort by id
Reduce: <node, [{bar, distance}, {bar, distance}]> -> min()
Output: <node, bar>, <node, bar> (marked graph)
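The nearest-bar job can be sketched as below. The icons on the slides did not survive extraction, so this sketch assumes that the map step runs once per bar, does a breadth-first search up to radius d, and emits every reached node with the bar and its distance. The graph, labels and function names are made up for illustration.

```python
from collections import defaultdict, deque


def map_bar(bar, graph, radius):
    """Mapper: breadth-first search from one bar, emit (node, (distance, bar))."""
    seen = {bar: 0}
    queue = deque([bar])
    while queue:
        node = queue.popleft()
        if seen[node] == radius:
            continue  # do not expand beyond the search radius
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return [(node, (dist, bar)) for node, dist in seen.items()]


def shuffle(pairs):
    """Shuffle/sort: group the emitted candidates by node id."""
    grouped = defaultdict(list)
    for node, value in pairs:
        grouped[node].append(value)
    return dict(sorted(grouped.items()))


def reduce_nearest(node, candidates):
    """Reducer: keep the candidate bar with the smallest distance."""
    dist, bar = min(candidates)
    return node, bar


# Hypothetical input graph: a path a - b - c - d - e with bars at both ends.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
         "d": ["c", "e"], "e": ["d"]}
labels = {"a": "bar", "b": "house", "c": "house", "d": "house", "e": "bar"}

bars = [node for node, label in labels.items() if label == "bar"]
emitted = [pair for bar in bars for pair in map_bar(bar, graph, radius=2)]
nearest = dict(reduce_nearest(n, c) for n, c in shuffle(emitted).items())
print(nearest)
```

Ties, such as node `c` here, are broken by bar name because the candidates are compared as `(distance, bar)` tuples.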
![Page 50: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/50.jpg)
Sensor data
![Page 51: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/51.jpg)
Vibration
Strain (longitudinal)
Strain (transverse)
Temperature
![Page 52: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/52.jpg)
NOISE
![Page 53: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/53.jpg)
MATHS
Convolution
signal
kernel
convolution
[Book page excerpt: “nr3”, page 642, Chapter 13. Fourier and Spectral Applications]

Figure 13.1.2. Convolution of discretely sampled functions (panels: $s_j$, $r_k$ and $(r*s)_j$, each plotted over $0 \ldots N-1$). Note how the response function for negative times is wrapped around and stored at the extreme right end of the array $r_k$.

Example: A response function with $r_0 = 1$ and all other $r_k$'s equal to zero is just the identity filter. Convolution of a signal with this response function gives identically the signal. Another example is the response function with $r_{14} = 1.5$ and all other $r_k$'s equal to zero. This produces convolved output that is the input signal multiplied by 1.5 and delayed by 14 sample intervals.

Evidently, we have just described in words the following definition of discrete convolution with a response function of finite duration $M$:

$$(r * s)_j \equiv \sum_{k=-M/2+1}^{M/2} s_{j-k}\, r_k \tag{13.1.1}$$

If a discrete response function is nonzero only in some range $-M/2 < k \le M/2$, where $M$ is a sufficiently large even integer, then the response function is called a finite impulse response (FIR), and its duration is $M$. (Notice that we are defining $M$ as the number of nonzero values of $r_k$; these values span a time interval of $M - 1$ sampling times.) In most practical circumstances the case of finite $M$ is the case of interest, either because the response really has a finite duration, or because we choose to truncate it at some point and approximate it by a finite-duration response function.

The discrete convolution theorem is this: If a signal $s_j$ is periodic with period $N$, so that it is completely determined by the $N$ values $s_0, \ldots, s_{N-1}$, then its discrete convolution with a response function of finite duration $N$ is a member of the discrete Fourier transform pair,
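The identity-filter and delay examples in the excerpt can be checked with a few lines of Python. This is a sketch under one simplification: the response function is indexed from $k = 0$ upward (a causal filter), so the wrap-around storage for negative times described in the figure caption is not needed. The function name `convolve` and the test signal are illustrative.

```python
def convolve(signal, response):
    """Full linear convolution: out[j] = sum over k of signal[j - k] * response[k]."""
    out = [0.0] * (len(signal) + len(response) - 1)
    for j, s in enumerate(signal):
        for k, r in enumerate(response):
            out[j + k] += s * r
    return out


signal = [float(n % 7) for n in range(30)]

# r_0 = 1, all other r_k = 0: the identity filter.
identity = [1.0]
unchanged = convolve(signal, identity)

# r_14 = 1.5, all other r_k = 0: scale by 1.5 and delay by 14 samples.
delay = [0.0] * 14 + [1.5]
delayed = convolve(signal, delay)

print(unchanged == signal)
print(delayed[14:20], signal[:6])
```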
[Book page excerpt: “nr3”, page 602, Chapter 12. Fast Fourier Transform]

$$h(at) \Longleftrightarrow \frac{1}{|a|} H\!\left(\frac{f}{a}\right) \quad \text{time scaling} \tag{12.0.5}$$

$$\frac{1}{|b|} h\!\left(\frac{t}{b}\right) \Longleftrightarrow H(bf) \quad \text{frequency scaling} \tag{12.0.6}$$

$$h(t - t_0) \Longleftrightarrow H(f)\, e^{2\pi i f t_0} \quad \text{time shifting} \tag{12.0.7}$$

$$h(t)\, e^{-2\pi i f_0 t} \Longleftrightarrow H(f - f_0) \quad \text{frequency shifting} \tag{12.0.8}$$

With two functions $h(t)$ and $g(t)$, and their corresponding Fourier transforms $H(f)$ and $G(f)$, we can form two combinations of special interest. The convolution of the two functions, denoted $g * h$, is defined by

$$g * h \equiv \int_{-\infty}^{\infty} g(\tau)\, h(t - \tau)\, d\tau \tag{12.0.9}$$

Note that $g * h$ is a function in the time domain and that $g * h = h * g$. It turns out that the function $g * h$ is one member of a simple transform pair,

$$g * h \Longleftrightarrow G(f)\, H(f) \quad \text{convolution theorem} \tag{12.0.10}$$

In other words, the Fourier transform of the convolution is just the product of the individual Fourier transforms.

The correlation of two functions, denoted $\mathrm{Corr}(g, h)$, is defined by

$$\mathrm{Corr}(g, h) \equiv \int_{-\infty}^{\infty} g(\tau + t)\, h(\tau)\, d\tau \tag{12.0.11}$$

The correlation is a function of $t$, which is called the lag. It therefore lies in the time domain, and it turns out to be one member of the transform pair:

$$\mathrm{Corr}(g, h) \Longleftrightarrow G(f)\, H^*(f) \quad \text{correlation theorem} \tag{12.0.12}$$

[More generally, the second member of the pair is $G(f) H(-f)$, but we are restricting ourselves to the usual case in which $g$ and $h$ are real functions, so we take the liberty of setting $H(-f) = H^*(f)$.] This result shows that multiplying the Fourier transform of one function by the complex conjugate of the Fourier transform of the other gives the Fourier transform of their correlation. The correlation of a function with itself is called its autocorrelation. In this case (12.0.12) becomes the transform pair

$$\mathrm{Corr}(g, g) \Longleftrightarrow |G(f)|^2 \quad \text{Wiener-Khinchin theorem} \tag{12.0.13}$$

The total power in a signal is the same whether we compute it in the time domain or in the frequency domain. This result is known as Parseval’s theorem:

$$\text{total power} \equiv \int_{-\infty}^{\infty} |h(t)|^2\, dt = \int_{-\infty}^{\infty} |H(f)|^2\, df \tag{12.0.14}$$

Frequently one wants to know “how much power” is contained in the frequency interval between $f$ and $f + df$. In such circumstances, one does not usually distinguish between positive and negative $f$, but rather regards $f$ as varying from 0 (“zero frequency” or D.C.) to $+\infty$. In such cases, one defines the one-sided power spectral density (PSD) of the function $h$ as

$$P_h(f) \equiv |H(f)|^2 + |H(-f)|^2 \qquad 0 \le f < \infty \tag{12.0.15}$$
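The convolution theorem (12.0.10) and Parseval's theorem (12.0.14) have discrete counterparts that are easy to verify numerically. The sketch below uses a plain O(N²) discrete Fourier transform from the standard library rather than an FFT, and a negative exponent for the forward transform; the sign convention is a choice made here and does not affect either result. The sequences `g` and `h` are arbitrary illustrative values.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(N^2)); the inverse divides by N."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n)
               for t in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def circular_convolve(g, h):
    """Periodic convolution: out[t] = sum over k of g[k] * h[(t - k) mod N]."""
    n = len(g)
    return [sum(g[k] * h[(t - k) % n] for k in range(n)) for t in range(n)]


g = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 3.0]
h = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]

# Convolution theorem: transform, multiply pointwise, transform back.
direct = circular_convolve(g, h)
product = [a * b for a, b in zip(dft(g), dft(h))]
via_dft = [v.real for v in dft(product, inverse=True)]

# Parseval: total power agrees in both domains (with a 1/N factor for the DFT).
power_time = sum(abs(v) ** 2 for v in g)
power_freq = sum(abs(v) ** 2 for v in dft(g)) / len(g)

print(max(abs(a - b) for a, b in zip(direct, via_dft)))
print(power_time, power_freq)
```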
![Page 54: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/54.jpg)
COMPLEXITY
![Page 55: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/55.jpg)
MATHS, AGAIN
scale space decomposition
signal
kernel 1
convolution 2
kernel 2
convolution 1
![Page 56: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/56.jpg)
SCALE-SPACE
![Page 57: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/57.jpg)
SCALE-SPACE DECOMPOSITION
• Baseline (σ64)
• Traffic Jams (σ16 - σ64)
• Slowdown (σ4 - σ16)
• Vehicles (σ0 - σ4)
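A minimal sketch of this decomposition: smooth the signal at scales σ = 0, 4, 16 and 64, then take differences between neighbouring scales. The Gaussian kernel, the edge clamping and the synthetic signal are assumptions made here (the slides do not name the kernel), and the code is plain standard-library Python to stay self-contained.

```python
import math


def gaussian_kernel(sigma):
    """Normalised Gaussian weights truncated at three standard deviations."""
    radius = int(3 * sigma)
    weights = [math.exp(-0.5 * (i / sigma) ** 2)
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve with a Gaussian; sigma 0 returns the signal unchanged."""
    if sigma == 0:
        return list(signal)
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for j in range(n):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = min(max(j + i - radius, 0), n - 1)  # clamp at the edges
            acc += w * signal[idx]
        out.append(acc)
    return out


def decompose(signal, sigmas=(0, 4, 16, 64)):
    """Return the band-pass layers (fine to coarse) and the coarsest baseline."""
    smoothed = [smooth(signal, s) for s in sigmas]
    bands = [[a - b for a, b in zip(fine, coarse)]
             for fine, coarse in zip(smoothed, smoothed[1:])]
    return bands, smoothed[-1]


signal = [math.sin(t / 50.0) + 0.3 * math.sin(t / 3.0) for t in range(400)]
bands, baseline = decompose(signal)   # bands: vehicles, slowdown, traffic jams
reconstructed = [sum(parts) for parts in zip(baseline, *bands)]
```

Because the differences telescope, adding the three bands back to the baseline recovers the original signal, so no information is lost by the decomposition.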
![Page 58: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/58.jpg)
VOLUME
MapReduce
• 145 sensors
• 100Hz
• 5GB/day
• 2TB/year
• 50MB/s disk I/O
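A quick back-of-the-envelope check of these figures, assuming decimal units (1 GB = 10⁹ bytes) and a 365-day year; the implied bytes per sample and the read time are derived here, not stated on the slide.

```python
SENSORS = 145
SAMPLE_RATE_HZ = 100
BYTES_PER_DAY = 5e9             # 5 GB/day (decimal units assumed)
DISK_BYTES_PER_SECOND = 50e6    # 50 MB/s sequential disk I/O

samples_per_day = SENSORS * SAMPLE_RATE_HZ * 24 * 60 * 60
bytes_per_sample = BYTES_PER_DAY / samples_per_day
bytes_per_year = BYTES_PER_DAY * 365
hours_to_read_year = bytes_per_year / DISK_BYTES_PER_SECOND / 3600

print(samples_per_day)                 # 1252800000
print(round(bytes_per_sample, 2))      # 3.99
print(bytes_per_year / 1e12)           # 1.825 (TB, rounded to 2 on the slide)
print(round(hours_to_read_year, 1))    # 10.1
```

Under these assumptions a single disk needs about ten hours just to read one year of data once, which is presumably why the work is spread over many nodes.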
![Page 59: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/59.jpg)
CONVOLUTION
Diagram: five parallel Map tasks (build windows) → Shuffle → five parallel Reduce tasks (convolute)
![Page 60: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/60.jpg)
CONVOLUTE-ADD
Map (convolute with 0-padding)
Reduce (add)
Add values in overlapping regions
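Diagram: chunks A, B and C are each zero-padded and convolved; the outputs overlap, giving the regions A | A+B | B | B+C | C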
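The convolute-add scheme is the overlap-add method, and can be sketched as follows: each map task convolves one chunk in full (which is the same as zero-padding it), and the reduce step adds the partial results where they overlap. Chunk size, kernel and function names are illustrative assumptions.

```python
def convolve(signal, kernel):
    """Full linear convolution; output is len(signal) + len(kernel) - 1 long."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for j, s in enumerate(signal):
        for k, w in enumerate(kernel):
            out[j + k] += s * w
    return out


def map_chunk(offset, chunk, kernel):
    """Map: convolve one chunk and key the result by its start offset."""
    return offset, convolve(chunk, kernel)


def reduce_add(total_length, partials):
    """Reduce: add the partial results, summing values in overlapping regions."""
    out = [0.0] * total_length
    for offset, values in partials:
        for i, v in enumerate(values):
            out[offset + i] += v
    return out


signal = [float((3 * n) % 11) for n in range(30)]
kernel = [0.25, 0.5, 0.25]
chunk_size = 10   # three chunks: A, B, C

partials = [map_chunk(start, signal[start:start + chunk_size], kernel)
            for start in range(0, len(signal), chunk_size)]
result = reduce_add(len(signal) + len(kernel) - 1, partials)
expected = convolve(signal, kernel)

print(max(abs(a - b) for a, b in zip(result, expected)))
```

Each partial result is two samples longer than its chunk, so neighbouring results overlap by two samples; those are the A+B and B+C regions.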
![Page 61: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/61.jpg)
SEGMENTATION
• You don’t need 100Hz data for everything
• Approximate signal with linear segments
• Key points: 0-crossings of 1st, 2nd, 3rd derivative
• Maths: derivative of smoothed signal = convolution with derivative of kernel
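The identity in the last bullet can be checked numerically: differentiating the smoothed signal gives the same result as convolving the raw signal with the differentiated kernel. This sketch uses a first difference as the discrete derivative and a small Gaussian kernel; both are assumptions for illustration, as are the function names.

```python
import math


def convolve(signal, kernel):
    """Full linear convolution."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for j, s in enumerate(signal):
        for k, w in enumerate(kernel):
            out[j + k] += s * w
    return out


def difference(values):
    """Discrete first derivative, equal to convolution with [1, -1]."""
    padded = [0.0] + list(values) + [0.0]
    return [b - a for a, b in zip(padded, padded[1:])]


def zero_crossings(values):
    """Indices where the sign changes: candidate key points for segmentation."""
    return [i for i in range(1, len(values)) if values[i - 1] * values[i] < 0]


sigma = 2.0
kernel = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-6, 7)]
norm = sum(kernel)
kernel = [w / norm for w in kernel]

signal = [math.sin(t / 5.0) for t in range(100)]

smooth_then_diff = difference(convolve(signal, kernel))
diff_kernel_conv = convolve(signal, difference(kernel))
key_points = zero_crossings(diff_kernel_conv)

print(max(abs(a - b) for a, b in zip(smooth_then_diff, diff_kernel_conv)))
print(key_points)
```

The practical benefit is that the derivative kernels can be computed once, so smoothing and differentiation cost a single convolution pass per derivative order.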
![Page 62: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/62.jpg)
1st, 2nd, 3rd degree derivatives
signal
convolution
segmentation
![Page 63: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/63.jpg)
SEGMENTATION RESULT
![Page 64: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/64.jpg)
VISUALIZATION
![Page 65: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/65.jpg)
Twitter data
![Page 66: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/66.jpg)
TRACKING NEWS STORIES
![Page 67: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/67.jpg)
Geospatial data
![Page 68: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/68.jpg)
OPEN SOURCE TOOLS
![Page 69: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/69.jpg)
R
modelling, testing, prototyping
• lubridate, zoo: dates, time series
• reshape2: reshape data
• ggplot2: visualize data
• RCurl, RJSONIO: find more data
• Hmisc: miscellaneous
• DMwR, mlr: machine learning
• forecast: time series forecasting
• garch: time series modelling
• quantmod: statistical financial trading
• xts: extensible time series
• igraph: study networks
• maptools: read and view maps
![Page 70: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/70.jpg)
PYTHON
scientific computing
• numpy: linear algebra
• scipy: optimization, signal/image processing, …
• scikits: toolkits for scipy
• scikit-learn: machine learning toolkit
• statsmodels: advanced statistical modelling
• matplotlib: plotting
• NLTK: natural language processing
• PyBrain: more machine learning
• PyMC: Bayesian inference
• Pattern: web mining
• NetworkX: study networks
• Pandas: easy-to-use data structures
![Page 71: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/71.jpg)
OTHER
• CouchDB
• D3.js
![Page 72: Data science](https://reader038.vdocuments.net/reader038/viewer/2022103014/54b700ca4a7959aa2a8b46d5/html5/thumbnails/72.jpg)