Proteomics Informatics – Signal processing I: analysis of mass spectra (Week 3)

Upload: horatio-parker

Post on 18-Dec-2015



Example data – MALDI-TOF

[Figure: MALDI-TOF spectrum from the file ATP.txt shown at five zoom levels. Axes: intensity vs m/z. Panels: m/z 1000-4500 (intensity 0-1800), m/z 2280-2400 (0-700), m/z 1300-1460 (0-45), m/z 1444.0-1458.0 (0-35), m/z 2378.0-2394.0 (0-700).]

Peptide intensity vs m/z

Fragment intensity vs m/z

Example data – ESI-LC-MS/MS

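[Figure: LC-MS data shown as time vs m/z, together with an MS/MS spectrum. MS/MS axes: % relative abundance (0-100) vs m/z (250-1000). Labeled peaks: [M+2H]2+ at m/z 762; further labels at m/z 260, 292, 389, 405, 504, 534, 633, 663, 778, 875, 907, 1020, 1022, 1080.]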

Peptide intensity vs m/z vs time

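Sine

[Figure: a sine wave with its amplitude and wavelength marked, and a right triangle with sides a and b, hypotenuse c, and angle θ.]

$\sin(\theta) = a/c$

Sine and Cosine

$\sin(\theta) = a/c, \qquad \cos(\theta) = b/c$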

Two Frequencies

Fourier Transform

$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \xi}\, dx$

$e^{2\pi i \theta} = \cos(2\pi\theta) + i \sin(2\pi\theta)$

Fourier Transform

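import numpy as np

# 1000 samples; sin(1000*x) completes 10 cycles and sin(10000*x) completes 100
x = 2.0 * np.pi * np.arange(1000.0) / 100000.0
sin1 = np.sin(1000.0 * x)
sin2 = 0.2 * np.sin(10000.0 * x)
sin12 = sin1 + sin2

# Fourier transform of the real-valued sum
fft12 = np.fft.rfft(sin12)

[Figure; x-axis: frequency]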
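
To read the two frequencies off the transform, the coefficients need a frequency axis and an amplitude scale. The sketch below is one way to do that with NumPy; the names `amplitude`, `cycles` and `top2`, and the choice to express frequency in cycles per record, are assumptions made for this illustration and are not from the slides.

```python
import numpy as np

n = 1000
x = 2.0 * np.pi * np.arange(float(n)) / 100000.0
sin12 = np.sin(1000.0 * x) + 0.2 * np.sin(10000.0 * x)

# Real-input FFT: n//2 + 1 complex coefficients.
fft12 = np.fft.rfft(sin12)

# Scale so that a sine of amplitude A gives a bin magnitude of about A.
amplitude = 2.0 * np.abs(fft12) / n

# Frequency of each bin in cycles per record (d = 1/n makes bin k equal to k cycles).
cycles = np.fft.rfftfreq(n, d=1.0 / n)

# Frequencies of the two largest bins, in ascending order.
top2 = np.sort(cycles[np.argsort(amplitude)[-2:]])
```

Because both sines complete a whole number of cycles in the record, each one lands in a single bin; a frequency that does not fit the record exactly would leak into neighboring bins.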

Inverse Fourier Transform

$f(x) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i x \xi}\, d\xi$

[Figure; x-axis: frequency]

Inverse Fourier Transform

import numpy as np

x = 2.0 * np.pi * np.arange(1000.0) / 100000.0
sin1 = np.sin(1000.0 * x)
sin2 = 0.2 * np.sin(10000.0 * x)
sin12 = sin1 + sin2
fft12 = np.fft.rfft(sin12)

# Transform back; passing the length recovers all len(sin12) samples
sin12_ = np.fft.irfft(fft12, len(sin12))

[Figures; x-axis: frequency]

A Peak

[Figure: a peak (y-axis: intensity) annotated with the quantities that describe it.]

- centroid
- full width at half maximum (FWHM)
- area
- height
- maximum
- mean, variance, skewness, kurtosis

Mean and variance

A peak is defined by $f(x)$, normalized so that $\sum_x f(x) = 1$.

Mean: $\mu = \sum_x x\, f(x)$

Variance: $\sigma^2 = \sum_x (x - \mu)^2 f(x)$

Skewness and kurtosis

With mean $\mu$ and variance $\sigma^2$:

Skewness: $\sum_x (x - \mu)^3 f(x) / \sigma^3$

Kurtosis: $\sum_x (x - \mu)^4 f(x) / \sigma^4 - 3$
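
The four moments can be evaluated directly on sampled data. The sketch below assumes the peak is given as arrays `x` and `f` and normalizes `f` itself; the function name `peak_moments` and the test peak (center 0.2, width 0.05) are made up for illustration.

```python
import numpy as np

def peak_moments(x, f):
    """Mean, variance, skewness and excess kurtosis of a sampled peak f(x)."""
    f = np.asarray(f, dtype=float)
    f = f / f.sum()                      # normalize so that sum f(x) = 1
    mean = np.sum(x * f)
    var = np.sum((x - mean) ** 2 * f)
    sigma = np.sqrt(var)
    skewness = np.sum((x - mean) ** 3 * f) / sigma ** 3
    kurtosis = np.sum((x - mean) ** 4 * f) / sigma ** 4 - 3.0
    return mean, var, skewness, kurtosis

# Hypothetical test peak: a Gaussian centered at 0.2 with sigma = 0.05.
x = np.linspace(-1.0, 1.0, 1001)
y = np.exp(-(x - 0.2) ** 2 / (2 * 0.05 ** 2))
mean, var, skewness, kurtosis = peak_moments(x, y)
```

For this symmetric test peak the skewness and the kurtosis (with the 3 subtracted) both come out close to zero, and the mean and variance recover the center and the squared width.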

A Gaussian Peak

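import numpy as np

def gaussian(x, x0, s):
    # Gaussian with unit height, centered at x0, with standard deviation s
    return np.exp(-(x - x0)**2 / (2 * s**2))

x = np.linspace(-1, 1, 1000)
y = gaussian(x, 0, 0.1)
ffty = np.fft.rfft(y)

[Figure; x-axis: frequency]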

A Gaussian Peak

Skewness = 0

Kurtosis = 0

$\mathrm{FWHM} = 2\sqrt{2 \ln 2}\,\sigma$

$\mathrm{area} = \sqrt{2\pi}\,\sigma \cdot \mathrm{height}$

[Figure; x-axis: frequency]
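
The two Gaussian relations (FWHM and area in terms of the standard deviation and the height) can be checked numerically. This is a minimal sketch on a fine grid; the grid size and the variable names are choices made for this example only.

```python
import numpy as np

def gaussian(x, x0, s):
    return np.exp(-(x - x0) ** 2 / (2 * s ** 2))

s = 0.1
x = np.linspace(-1.0, 1.0, 20001)
y = gaussian(x, 0.0, s)
dx = x[1] - x[0]

height = y.max()
area_numeric = y.sum() * dx                       # simple Riemann sum
area_formula = height * s * np.sqrt(2.0 * np.pi)

# Width of the region where the peak is at or above half its height.
above = x[y >= height / 2.0]
fwhm_numeric = above[-1] - above[0]
fwhm_formula = 2.0 * np.sqrt(2.0 * np.log(2.0)) * s
```

The numerical FWHM is only accurate to about one grid step on each side, so a coarser grid gives a visibly coarser width.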

Peak with a longer tail

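$f(x) = \frac{1}{\pi\gamma} \cdot \frac{1}{1 + \left((x - x_0)/\gamma\right)^2}$

$\mathrm{FWHM} = 2\gamma$

$\mathrm{area} = \pi\gamma \cdot \mathrm{height}$

[Figure; x-axis: frequency]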

A skewed peak

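import numpy as np
from scipy.special import erf

def pdf(x):
    # standard normal probability density
    return 1 / np.sqrt(2 * np.pi) * np.exp(-x**2 / 2)

def cdf(x):
    # standard normal cumulative distribution
    return (1 + erf(x / np.sqrt(2))) / 2

def skew(x, e=0, w=1, a=0):
    # skew-normal density with location e, scale w and shape a
    t = (x - e) / w
    return 2 / w * pdf(t) * cdf(a * t)

[Figure; x-axis: frequency]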

Normal noise

import numpy as np

x = np.linspace(-1, 1, 1000)
y = 0.2 * np.random.normal(size=len(x))

If the noise is not normally distributed, try to find a transform that makes it normal.

[Figure; x-axis: frequency]

Lognormal noise

import numpy as np

x = np.linspace(-1, 1, 1000)
y = 0.2 * np.random.lognormal(size=len(x))

[Figure; x-axis: frequency]

Skewed noise

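import numpy as np

# Rejection sampling; x is the 1000-point grid from the previous slides
# and skew() is the function defined under "A skewed peak"
x_test = np.random.uniform(-1.0, 1.0, size=10 * len(x))
y_test = np.random.uniform(0.0, 1.0, size=10 * len(x))
yskew = skew(x_test, -0.1, 0.2, 10)
yskew = yskew / yskew.max()
yn_skew = x_test[y_test < yskew][:len(x)]

[Figure; x-axis: frequency]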

Gaussian peak with normal noise

[Figures; x-axis: frequency]

Removing High Frequencies

[Figure; x-axis: frequency]
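
One simple way to remove high frequencies is to zero the upper Fourier coefficients and transform back. The sketch below assumes that approach; the helper name `lowpass_fft`, the cutoff `keep=20` and the random seed are choices made for this illustration, and the slides may have used a different filter.

```python
import numpy as np

def lowpass_fft(y, keep):
    """Zero every Fourier coefficient above index `keep` and transform back."""
    coeffs = np.fft.rfft(y)
    coeffs[keep + 1:] = 0.0
    return np.fft.irfft(coeffs, len(y))

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 1000)
clean = np.exp(-x ** 2 / (2 * 0.1 ** 2))          # Gaussian peak, sigma = 0.1
noisy = clean + 0.2 * rng.normal(size=len(x))     # add normal noise
smooth = lowpass_fft(noisy, keep=20)

# Root-mean-square error relative to the noise-free peak, before and after.
rms_before = np.sqrt(np.mean((noisy - clean) ** 2))
rms_after = np.sqrt(np.mean((smooth - clean) ** 2))
```

The cutoff has to stay above the frequencies that make up the peak itself: a broad peak tolerates a low cutoff, while a narrow peak is distorted by it.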

Convolution

http://en.wikipedia.org/wiki/Convolution

$(f * g)(t) = \sum_{\tau} f(\tau)\, g(t - \tau)$

Convolution describes the response of a linear and time-invariant system to an input signal.

It equals the inverse Fourier transform of the pointwise product of the two signals' Fourier transforms.
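
The equivalence between convolution and a product in frequency space can be checked on a small example. The arrays `f` and `g` below are arbitrary values chosen for illustration.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 0.25, 0.25])

# Direct (linear) convolution; the output has len(f) + len(g) - 1 points.
direct = np.convolve(f, g)

# Zero-pad both signals to the full output length so that the circular
# convolution computed through the FFT equals the linear one.
n = len(f) + len(g) - 1
via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)
```

Without the zero-padding the FFT route would wrap the end of the result around to the beginning.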

Smoothing by convolution

Smoothing

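import numpy as np

# Moving average over 2*width+1 points; 'valid' drops width points at each end
w = np.ones(2 * width + 1, 'd')
y_smooth = np.convolve(w / w.sum(), y, 'valid')

[Figures: three panels; axes: intensity vs frequency]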

Adaptive Background Correction (unsharp masking)

$I'(l, w, d) = I(l) - \frac{d}{2w + 1} \sum_{k=l-w}^{l+w} I(k)$

Unsharp masking

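[Figure: original signal]

import numpy as np

# Kernel weights fall off with distance k from the center as 1/(2k+1)
wi = np.linspace(1, window_len, window_len)
w = 1 / (2 * np.r_[wi[::-1], 0, wi] + 1)
# Subtract d times the smoothed signal; 'same' keeps the output aligned with x
x_ = x - d * np.convolve(w / w.sum(), x, 'same')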

Adaptive Background Correction

Smoothing and Adaptive Background Correction

Savitzky-Golay smoothing

[Figure grid: polynomial order = 3, 5, 7 and bin size = 25, 75, 150.]
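
A grid like this can be reproduced with SciPy's `savgol_filter`. The sketch below is an assumption about how one might do it, not the code behind the slides: the test peak, the noise level and the seed are invented, and the window of 151 points stands in for the bin size of 150 because an odd window length is used here.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 1000)
clean = np.exp(-x ** 2 / (2 * 0.1 ** 2))
noisy = clean + 0.2 * rng.normal(size=len(x))

# One smoothed trace per (window length, polynomial order) combination.
smoothed = {
    (window, order): savgol_filter(noisy, window_length=window, polyorder=order)
    for window in (25, 75, 151)
    for order in (3, 5, 7)
}
```

Wider windows remove more noise but flatten the peak; a higher polynomial order follows the peak shape more closely at the cost of letting more noise through.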

Background

[Figures; x-axis: frequency]

Background Subtraction Using Smoothing

[Figure: smoothing and background subtraction for bin size = 100, 200, 300.]

Root Mean Square Deviation (RMSD)

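$\mathrm{RMSD}(l) = \sqrt{\frac{1}{w} \sum_{|k - l| \le w/2} \left(I(k) - \bar{I}(l)\right)^2}$, where $\bar{I}(l)$ is the mean intensity in the window around $l$.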

The Root Mean Square Deviation (RMSD) is often constant for the noise and larger for the peak if the window size is approximately the size of the peak.
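
A sliding-window RMSD can be computed with two moving averages. The sketch below assumes the deviation is taken from the local mean of each window; the helper name `rolling_rmsd`, the simulated data and the window of 41 points are choices made for illustration.

```python
import numpy as np

def rolling_rmsd(y, w):
    """RMSD of y around the local mean, in a centered window of w points (w odd).

    mode="same" zero-pads the signal, so the first and last w//2 values are biased.
    """
    kernel = np.ones(w) / w
    local_mean = np.convolve(y, kernel, mode="same")
    local_mean_sq = np.convolve(y ** 2, kernel, mode="same")
    # Mean of squares minus square of mean; clip tiny negative rounding errors.
    return np.sqrt(np.clip(local_mean_sq - local_mean ** 2, 0.0, None))

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 1000)
# Normal noise (sigma 0.2) plus one narrow peak of height 5 at x = 0.
y = 0.2 * rng.normal(size=len(x)) + 5.0 * np.exp(-x ** 2 / (2 * 0.02 ** 2))
rmsd = rolling_rmsd(y, w=41)
```

Away from the peak the RMSD stays near the noise level of 0.2, while windows that overlap the peak give much larger values, which is what makes it usable as a peak detector.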

Background Subtraction using RMSD

[Figure: RMSD and intensity for bin size = 100, 200, 300.]

Convolution, Cross-correlation, and Autocorrelation

http://en.wikipedia.org/wiki/Convolution

Convolution describes the response of a linear and time-invariant system to an input signal. It equals the inverse Fourier transform of the pointwise product in frequency space.

Cross-correlation is a measure of similarity of two signals. It can be used for finding a shift between two signals.

Autocorrelation is the cross-correlation of a signal with itself. It can be used for finding periodic signals obscured by noise.

Cross-correlation and autocorrelation

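Cross-correlation (real-valued signals): $(f \star g)(t) = \sum_{\tau} f(\tau)\, g(t + \tau)$

http://en.wikipedia.org/wiki/Convolution

Autocorrelation: $(f \star f)(t) = \sum_{\tau} f(\tau)\, f(t + \tau)$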

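Autocorrelation

[Figure: a signal, the same signal, and their autocorrelation.]

Cross-correlation

[Figure: a signal, a shifted copy of the signal, and their cross-correlation.]

Cross-correlation

[Figure: a signal, a copy with half of the peaks shifted, and their cross-correlation.]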
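
Finding the shift between two signals amounts to locating the maximum of their cross-correlation. The sketch below uses `np.correlate`; the helper name `best_lag` and the test signal (one peak, shifted by 30 samples) are made up for this example.

```python
import numpy as np

def best_lag(f, g):
    """Lag (in samples) at which the cross-correlation of f and g peaks.

    A positive result means g is delayed relative to f.
    """
    corr = np.correlate(g, f, mode="full")
    lags = np.arange(-(len(f) - 1), len(g))   # lag belonging to each entry of corr
    return lags[np.argmax(corr)]

x = np.arange(500)
signal = np.exp(-(x - 200.0) ** 2 / (2 * 5.0 ** 2))
shifted = np.roll(signal, 30)                 # same peak, 30 samples later

lag = best_lag(signal, shifted)
self_lag = best_lag(signal, signal)           # autocorrelation peaks at lag 0
```

When only some of the peaks are shifted, the cross-correlation shows one maximum per shift, so taking a single argmax is no longer enough.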

How similar are two signals?

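Dot product

$A = (a_1, a_2, \ldots, a_n)$, $B = (b_1, b_2, \ldots, b_n)$

$A \cdot B = \sum_i a_i b_i = |A|\,|B| \cos\theta$

Identical vectors: $\theta = 0$, $A \cdot B = 1$ (for vectors of unit length)

Perpendicular vectors: $\theta = \pi/2$, $A \cdot B = 0$

The dot product is the same as the cross-correlation at zero:

$(f \star g)(0) = \sum_{\tau} f(\tau)\, g(\tau)$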
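
As a small illustration of the dot product as a similarity score, the sketch below normalizes by the vector lengths so that identical signals score 1 and signals with no peaks in common score 0. The function name `cosine_similarity` and the example vectors are invented for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Normalized dot product: cos(theta) between two intensity vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.0, 3.0, 0.0, 4.0])
b = np.array([5.0, 0.0, 2.0, 0.0])   # no peaks in common with a

same = cosine_similarity(a, a)
perpendicular = cosine_similarity(a, b)

# With equal-length inputs, mode="valid" returns only the zero-lag term
# of the cross-correlation, which is the plain dot product.
zero_lag = np.correlate(a, 2 * a, mode="valid")[0]
```

The normalization makes the score independent of the overall intensity scale, so a spectrum and a scaled copy of it still score 1.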

What are the characteristics of the dot product?

[Figure grid: dot product for signal+noise and for noise alone, at S/N = 10, 3, 1, 0.3, 0.1 and 10, 100, 1000 dimensions.]

Autocorrelation

[Figure: a signal, a shifted signal, the sum of the signal and the shifted signal, and the autocorrelation.]

Coincidence – enhances the signal

The signal to noise can be dramatically increased by measuring several independent signals of the same phenomenon and combining these signals.
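
The effect of combining independent measurements by multiplication can be simulated. This is a sketch under invented assumptions: four measurements of one Gaussian peak, each with independent normal noise of standard deviation 0.2; the helper `peak_to_noise` and its definition of signal to noise are also choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 1000)
ideal = np.exp(-x ** 2 / (2 * 0.02 ** 2))

# Four independent measurements of the same peak, each with its own noise.
measurements = ideal + 0.2 * rng.normal(size=(4, len(x)))

# Coincidence: a point only stays large if it is large in all four.
product = np.prod(measurements, axis=0)

def peak_to_noise(y):
    """Value at the known peak position divided by the spread far from the peak."""
    far = np.abs(x) > 0.5
    return y[np.argmax(ideal)] / np.std(y[far])

# How much the product improves on a single measurement.
gain = peak_to_noise(product) / peak_to_noise(measurements[0])
```

The noise in the product is much smaller than in any single measurement, but it is also no longer normally distributed, so thresholds chosen for normal noise do not carry over.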

[Figure: ideal signal, four measurements, and the product of the four measurements.]

Coincidence – suppresses and transforms the noise

[Figure: original noise and noise in the product.]

Coincidence – suppresses interference

[Figure: ideal signal, four measurements with interference, and the product of the four measurements.]

Peak Finding

The derivative of a function is zero at its minima and maxima.

The second derivative is negative at maxima and positive at minima.
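
On sampled data the derivative test becomes a sign change in the first difference. The sketch below assumes noise-free data with two Gaussian peaks placed on grid points; the helper name `local_maxima` and the test signal are made up for illustration.

```python
import numpy as np

def local_maxima(y):
    """Indices where the first difference changes sign from positive to non-positive."""
    dy = np.diff(y)
    # dy[i-1] > 0 and dy[i] <= 0: rising into point i, not rising out of it
    return np.where((dy[:-1] > 0) & (dy[1:] <= 0))[0] + 1

x = np.linspace(-1.0, 1.0, 2001)
y = (np.exp(-(x + 0.5) ** 2 / (2 * 0.05 ** 2))
     + 0.5 * np.exp(-(x - 0.3) ** 2 / (2 * 0.05 ** 2)))

peaks = local_maxima(y)

# Numerical second derivative; it should be negative at every maximum.
second = np.gradient(np.gradient(y, x), x)
```

On noisy data this test fires on every noise spike, which is why the data are usually smoothed before looking for maxima.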

Peak Finding

1. Characterize the signal and the noise
2. Make a model of the data
3. Select detection method
4. Select parameters using simulations

[Figure; y-axis: intensity]

Peak Finding: Characterizing the noise

[Figure; y-axis: intensity]

Let’s first try without removing the peaks.

Peak Finding: Characterizing the noise

[Figure; y-axis: intensity]

Removing the peaks by looking for outliers in the root mean square deviation (RMSD).

[Figure: RMSD]

Peak Finding: Characterizing the peaks

[Figure; y-axis: intensity]

Peak Finding: Model of data

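import numpy as np

# noise and signal set the S/N; gaussian() is defined under "A Gaussian Peak"
points = 1000
x = np.linspace(-1, 1, points)
y = noise * np.random.normal(size=len(x))
y += signal * gaussian(x, 0, 0.01)

[Figure: simulated data for S/N=1, S/N=2, S/N=4]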

Peak Finding: Detection method

[Figure: S/N=1, S/N=2, S/N=4]

Peaks can be detected by finding maxima in the moving average with a window size similar to the peak width.

$S(l) = \sum_{k=l-w}^{l+w} I(k)$
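
The windowed sum can be computed with a convolution and the peak taken as its maximum. The sketch below simulates one peak at S/N = 4 using the same model as the slides; the helper name `moving_sum`, the seed and the choice w = 5 (about one peak standard deviation, in points) are assumptions for this example.

```python
import numpy as np

def moving_sum(intensity, w):
    """S(l) = sum of I(k) for k = l-w .. l+w (zero-padded at the edges)."""
    return np.convolve(intensity, np.ones(2 * w + 1), mode="same")

rng = np.random.default_rng(4)
points = 1000
x = np.linspace(-1.0, 1.0, points)
noise, signal = 1.0, 4.0                      # S/N = 4
y = noise * rng.normal(size=points) + signal * np.exp(-x ** 2 / (2 * 0.01 ** 2))

s = moving_sum(y, w=5)
detected = x[np.argmax(s)]                    # position of the largest windowed sum
```

Summing over a window close to the peak width adds the signal coherently while the noise grows only with the square root of the window size; a much wider window mostly adds noise.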

Peak Finding: Detection method – moving average

[Figure grid: moving average for S/N = 1, 2, 4 and bin size = 5, 20, 80, with the signal shown for reference.]

Peak Finding: Detection method – RMSD

[Figure grid: RMSD for S/N = 1, 2, 4 and bin size = 5, 20, 80, with the signal shown for reference.]

Peak Finding: Information about the Peak

[Figure: a peak (y-axis: intensity) annotated with the quantities that describe it.]

- centroid (mean)
- full width at half maximum (FWHM)
- area
- height
- maximum
- mean, variance, skewness, kurtosis

Information about a Peak

A peak is defined by $f(x)$.

Centroid or mean: $\sum_x x\, f(x) / \sum_x f(x)$

$\mathrm{area} = \sum_x f(x)$

$\mathrm{height} = \max(f(x))$

To calculate any of these measures we need to know where the peak starts and ends.
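
Once the start and end of a peak are known, these measures are a few sums over that range. The sketch below assumes evenly spaced x values and multiplies the summed intensity by the point spacing to express the area in x units; the function name `peak_info`, the example peak at m/z 1450 and the chosen start and end indices are hypothetical.

```python
import numpy as np

def peak_info(x, f, start, end):
    """Centroid, area, height and FWHM of f(x) for indices start..end (inclusive)."""
    xs, fs = x[start:end + 1], f[start:end + 1]
    dx = xs[1] - xs[0]                         # assumes evenly spaced x
    centroid = np.sum(xs * fs) / np.sum(fs)
    area = np.sum(fs) * dx
    height = fs.max()
    above = xs[fs >= height / 2.0]
    fwhm = above[-1] - above[0]                # accurate to about one grid step per side
    return centroid, area, height, fwhm

# Hypothetical peak: height 35, centered at m/z 1450, sigma 0.5.
x = np.linspace(1440.0, 1460.0, 2001)
f = 35.0 * np.exp(-(x - 1450.0) ** 2 / (2 * 0.5 ** 2))
centroid, area, height, fwhm = peak_info(x, f, start=700, end=1300)
```

Choosing the range too narrow cuts off part of the area, and choosing it too wide pulls in noise or neighboring peaks, which shifts the centroid.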

Where does a peak start and end?

Estimating peptide quantity

[Figure: intensity vs m/z, annotated with peak height, curve fitting, and peak area.]

Time dimension

[Figures: intensity vs m/z vs time, and time vs m/z.]

Sampling

[Figure: two peaks plotted as intensity vs retention time (0.8 to 1), each sampled with 3 points; acquisition time = 0.05 s; annotations: 5%.]

Sampling

[Figure: thresholds (90%) vs # of points (1 to 10).]

What is the best way to estimate quantity?

Peak height
- resistant to interference
- poor statistics

Peak area
- better statistics
- more sensitive to interference

Curve fitting
- better statistics
- needs to know the peak shape
- slow
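
The three estimates can be compared on simulated data. The sketch below assumes a Gaussian peak shape and uses SciPy's `curve_fit` for the fitting step; the function name `gaussian_peak`, the noise level, the seed, the integration range of plus or minus 3 sigma and the starting guess are all choices made for this example.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_peak(x, height, x0, s):
    return height * np.exp(-(x - x0) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(5)
x = np.linspace(-1.0, 1.0, 400)
true_height, true_s = 10.0, 0.1
y = gaussian_peak(x, true_height, 0.0, true_s) + 0.5 * rng.normal(size=len(x))
dx = x[1] - x[0]

# 1. Peak height: a single point, so it carries the full noise of that point.
height_est = y.max()

# 2. Peak area: sum over the peak region (here +/- 3 sigma, assumed known).
region = np.abs(x) <= 3 * true_s
area_est = y[region].sum() * dx

# 3. Curve fitting: needs a peak-shape model and a starting guess.
params, _ = curve_fit(gaussian_peak, x, y, p0=[y.max(), x[np.argmax(y)], 0.2])
fit_height, fit_x0, fit_s = params
fit_area = fit_height * abs(fit_s) * np.sqrt(2.0 * np.pi)
```

The height estimate depends on one noisy point, the area averages over the whole peak but also over anything that overlaps it, and the fit uses every point but only works when the assumed shape matches the data.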

Homework: Background Subtraction Using Smoothing

Summary

Fourier transform - transformation to frequency space and back

Signal – how do we detect and characterize signals?

Noise – how do we characterize noise?

Modeling signal and noise

Simulation to select thresholds and parameters

Filters – filtering by low-pass (i.e. smoothing) and high-pass filters (e.g. adaptive background correction)

Detection methods based on moving average and RMSD

Convolution - describes the response of a linear and time-invariant system to an input signal

Cross-correlation is a measure of similarity of two signals

Autocorrelation can be used for finding periodic signals obscured by noise

The dot product can be used to determine how similar two signals are

Coincidence measurements enhance the signal and suppress noise

The quantity associated with a peak – height and area

Sampling – how often do we need to sample a peak to get a good estimate of its area?
