digital signal processing in hadoop

48
digital signal processing in hadoop with scalding Adam Laiacano [email protected] @adamlaiacano Thursday, October 17, 13

Upload: adam-laiacano

Post on 15-Aug-2015

4.565 views

Category:

Technology


4 download

TRANSCRIPT

Page 1: Digital Signal Processing in Hadoop

digital signal processing

in hadoop

with scalding

Adam [email protected]

@adamlaiacanoThursday, October 17, 13

Page 2: Digital Signal Processing in Hadoop

Overview

• Intro to digital signals and filters

• sampling

• frequency domain

• FIR / IIR filters

• Very quick intro to Scalding

• Filtering tons of signals at once

• Application: Finding trending blogs on tumblr

Thursday, October 17, 13

Page 3: Digital Signal Processing in Hadoop

1 sample / day

Thursday, October 17, 13

Page 4: Digital Signal Processing in Hadoop

7-day average 1 sample / day

Thursday, October 17, 13

Page 5: Digital Signal Processing in Hadoop

Some DefinitionsSignal - Any series of data (Volts, posts, etc) that is measured at regular intervals.

Sampling period, Ts - Time between samples (my example was Ts = 1 day)

Sampling frequency fs - 1 / Ts

Nyquist frequency - Highest frequency that can be represented = fs/2

Filter - A system to reduce or enhance certain aspects (phase, magnitude) of a signal.

Stopband - The frequency range we want to eliminate

Passband - The frequency range we want to preserve

Cutoff frequency, fc - The boundary of the stopband/passband

Thursday, October 17, 13

Page 6: Digital Signal Processing in Hadoop

SignalsThursday, October 17, 13

Page 7: Digital Signal Processing in Hadoop

Sampling

Orignal Analog 10 samples/period

Thursday, October 17, 13

Page 8: Digital Signal Processing in Hadoop

Sampling

1 sample/period 2 samples/period

Thursday, October 17, 13

Page 9: Digital Signal Processing in Hadoop

FiltersThursday, October 17, 13

Page 10: Digital Signal Processing in Hadoop

FILTER

Thursday, October 17, 13

Page 11: Digital Signal Processing in Hadoop

Passband Stopband

fc fn

Low-Pass Filter

Thursday, October 17, 13

Page 12: Digital Signal Processing in Hadoop

Passband Stopband

fc fn

Low-Pass Filter

Closer to reality

Thursday, October 17, 13

Page 13: Digital Signal Processing in Hadoop

Moving Average Filter

y[t] = 1/7 * x[t] + 1/7 * x[t-1] + 1/7 * x[t-2] + ... 1/7 * x[t-6]

Thursday, October 17, 13

Page 14: Digital Signal Processing in Hadoop

FIR Digital Filter

y[t] = h[0] * x[t] + h[1] * x[t-1] + h[2] * x[t-2] + ... h[N-1] * x[t-N-1]

y <- filter(x, h)

R code:

Thursday, October 17, 13

Page 15: Digital Signal Processing in Hadoop

Frequency Domain

x = 1.0 * sin(0.5*2*pi*t) + 0.5 * sin(250*2*pi*t) + 0.1 * sin(400*2*pi*t)

Thursday, October 17, 13

Page 16: Digital Signal Processing in Hadoop

Frequency Domain

Thursday, October 17, 13

Page 17: Digital Signal Processing in Hadoop

Frequency Domain

h = [-0.0201, -0.0584, -0.0612, -0.0109, 0.0513, 0.0332, -0.0566, -0.0857, 0.0634, 0.3109, 0.4344, 0.3109, 0.0634, -0.0857, -0.0566, 0.0332, 0.0513, -0.0109, -0.0612, -0.0584, -0.0201]

21-point low-pass filter with 250Hz cutoff

http://t-filter.appspot.com/fir/index.htmlThursday, October 17, 13

Page 18: Digital Signal Processing in Hadoop

FIR vs IIR

y[t] = h[0] * x[t] + ... h[N-1] * x[t-N-1]

y[t] = h[0] * x[t] + ... h[N-1] * x[t-N-1] -

g[1] * y[t-1] - ... g[M] * y[t-M]

1

Thursday, October 17, 13

Page 19: Digital Signal Processing in Hadoop

Delta Function

Thursday, October 17, 13

Page 20: Digital Signal Processing in Hadoop

FIR vs IIR

y[t] = 1/7 * x[t] + 1/7 * x[t-1] + ... 1/7 * x[t-6]

y[t] = 1/2 * x[t] + 1/2 * y[t-1]

Thursday, October 17, 13

Page 21: Digital Signal Processing in Hadoop

IIR - be careful!

y[t] = 0.5 * x[t] + 1.1 * y[t-1]

Thursday, October 17, 13

Page 22: Digital Signal Processing in Hadoop

Impulse Response

Thursday, October 17, 13

Page 23: Digital Signal Processing in Hadoop

Recap - FIR Filters

• FIR filters are weighted sums of previous input.

• Can think of them as a generalized Moving Average

• Required to apply:

• Filter h of length N

• Previous N inputs x

Thursday, October 17, 13

Page 24: Digital Signal Processing in Hadoop

Thursday, October 17, 13

Page 25: Digital Signal Processing in Hadoop

Super General Overview

• DSL on top of Cascading, written in scala

• Cascading: Workflow language for dealing with lots of data. Often in hadoop.

• Similar to pig or hive, but easier to extend (no UDFs! one language!).

• Feels like “real programming” - compiler! types!

• Is awesome

Thursday, October 17, 13

Page 26: Digital Signal Processing in Hadoop

Less General Overview

• Similar to split/apply/combine paradigm (plyr, pandas)

• Load data into Pipes (like data.frames)

• Each pipe has one or more Fields (columns)

• Perform row-wise operations with map (d$a+d$b)

• Perform field-wise operations in groupBy

Thursday, October 17, 13

Page 27: Digital Signal Processing in Hadoop

Hello World

Thursday, October 17, 13

Page 28: Digital Signal Processing in Hadoop

Scalding Resources

• The best resource is the Scalding wiki page.https://github.com/twitter/scalding/wiki/Fields-based-API-Reference

• Edwin Chen’s post about recommendations.http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/

• Source code is FULL of undocumented features!https://github.com/twitter/scalding

Thursday, October 17, 13

Page 29: Digital Signal Processing in Hadoop

import Matrix._

Thursday, October 17, 13

Page 30: Digital Signal Processing in Hadoop

DataVector

SlidingFilter

= Sliding subset of input

Thursday, October 17, 13

Page 31: Digital Signal Processing in Hadoop

000000000000000000000000

0

DataVector

0000000000000000000000000

0000000000000

00000000000

0000000000000000000000000

... ...

SlidingFilter

= Sliding filter

0

00000000000000000000000

Thursday, October 17, 13

Page 32: Digital Signal Processing in Hadoop

00 00 0 00 0 0 00 0 0 0 00 0 0 0 0 00 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0

0 0 0 0 0 0 00 0 0 0 0 0

0 0 0 0 00 0 0 0

0 0 00 0

0

T

DataVector

FilterMatrix

* =

FilteredOutput

T

Thursday, October 17, 13

Page 33: Digital Signal Processing in Hadoop

00 00 0 00 0 0 00 0 0 0 00 0 0 0 0 00 0 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0

0 0 0 0 0 0 00 0 0 0 0 0

0 0 0 0 00 0 0 0

0 0 00 0

0

T

DataVector

FilterMatrix

* =

FilteredOutput

T

Thursday, October 17, 13

Page 34: Digital Signal Processing in Hadoop

N = number of blogsM = number of samples

N MxM N

* =

T T

M M

X * H = YT T

Thursday, October 17, 13

Page 35: Digital Signal Processing in Hadoop

Matrix Filter: Square Waves

Thursday, October 17, 13

Page 36: Digital Signal Processing in Hadoop

Matrix Filter: Square Waves

Thursday, October 17, 13

Page 37: Digital Signal Processing in Hadoop

Matrix Filter: Square Waves

Thursday, October 17, 13

Page 38: Digital Signal Processing in Hadoop

• Scalding has a Matrix library!

• Stores data in a Pipe as ('row, 'col, 'val)

• Ideal for sparse matricies

• L0, L1, L2 norm, inverse, +, -, *

• QR Factorization: http://bit.ly/1hxWF17

• More!

import Matrix._

Thursday, October 17, 13

Page 39: Digital Signal Processing in Hadoop

Tumblr Social Graph

• 140+ Million Nodes

• 3.5 Billion Edges

• About 100GB of raw text data

• 3 columns: fromId, toId, timestamp

GOAL: Calculate followers / day for every blog, apply 1-week moving average.

Thursday, October 17, 13

Page 40: Digital Signal Processing in Hadoop

Apply Low-Pass filter to 140,000,000 blogs

Thursday, October 17, 13

Page 41: Digital Signal Processing in Hadoop

Find blogs who have accelerating follower countsfor the most consecutive days.

“Accelerating”:New Followers Today > New Followers Yesterday

Thursday, October 17, 13

Page 42: Digital Signal Processing in Hadoop

Blog A

Blog B

Blog C

Thursday, October 17, 13

Page 43: Digital Signal Processing in Hadoop

Blog A

Blog B

Blog C

Thursday, October 17, 13

Page 44: Digital Signal Processing in Hadoop

Blog A

Blog B

Blog C

Thursday, October 17, 13

Page 45: Digital Signal Processing in Hadoop

• Binary input: 1 if more followers today than yesterday, otherwise 0

• Filter the binary signal to produce a value between 0 and 1

• Anything above a threshold (0.75) is “accelerating”

Days of consecutive acceleration

Thursday, October 17, 13

Page 46: Digital Signal Processing in Hadoop

15 days

36 days

48 days

Days of consecutive acceleration(filtered signal)

Thursday, October 17, 13

Page 47: Digital Signal Processing in Hadoop

15 days

36 days

48 days

Days of consecutive acceleration(filtered signal)

Thursday, October 17, 13

Page 48: Digital Signal Processing in Hadoop

Thanks!

@adamlaiacanoadamlaiacano.tumblr.com

github.com/alaiacano/dsp-scalding

Thursday, October 17, 13