EE269 Signal Processing for Machine Learning - Cepstrum
TRANSCRIPT
EE269 Signal Processing for Machine Learning
Cepstrum
Instructor: Mert Pilanci
Stanford University
October 11, 2021
Linear systems and additive noise
- Linear systems, e.g., filters, can easily separate additive noise from useful information when we know the frequency ranges of the noise and the information
  y[n] = x[n] + w[n]
- In vector notation, applying a linear operator H:
  Hy = Hx + Hw
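The additive case can be illustrated with a short sketch (the signals and the filter below are my own choices, not from the slides): a moving-average lowpass filter H applied to y = x + w. By linearity, H{x + w} = H{x} + H{w}, and since w lives at high frequencies, H{w} is strongly attenuated while H{x} passes almost unchanged.

```python
import numpy as np

n = np.arange(512)
x = np.sin(2 * np.pi * 0.01 * n)        # low-frequency "information"
w = 0.5 * np.sin(2 * np.pi * 0.45 * n)  # high-frequency "noise"
y = x + w

h = np.ones(9) / 9                      # 9-tap moving-average filter h[n]
def filt(s):
    return np.convolve(s, h, mode="same")

# Linearity: filtering the sum equals the sum of the filtered parts
assert np.allclose(filt(y), filt(x) + filt(w))
```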
Multiplicative or convolutive noise
- This is harder if the signal and noise are convolved, e.g., in speech processing:
  y[n] = w[n] ∗ h[n]
- w[n] is the flowing air (excitation source)
- h[n] is the vocal tract (filter)
- We can develop an operator that separates convolved components by transforming convolution into addition
Cepstrum
- Developed to separate convolved signals
  y[n] = x[n] ∗ w[n]
- In the discrete Fourier domain:
  Y[k] = X[k] W[k]
- Take logarithms:
  log Y[k] = log X[k] + log W[k]
- We can apply a linear filter to log Y[k] to separate the components
- Equivalently, we can take the DFT of log Y[k] and process in the frequency domain
- The cepstrum is the DFT (or DCT) of the log spectrum
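A minimal sketch of the idea (helper names are mine): compute the cepstrum as the DFT of the log magnitude spectrum. For y[n] = x[n] ∗ w[n] we have Y[k] = X[k] W[k], so log|Y[k]| = log|X[k]| + log|W[k]|, and the cepstra add.

```python
import numpy as np

def log_spectrum(s, nfft=256):
    # tiny floor guards against log(0) on an exactly-zero bin
    return np.log(np.abs(np.fft.fft(s, nfft)) + 1e-30)

def cepstrum(s, nfft=256):
    # DFT of the log spectrum, as defined on the slide
    return np.real(np.fft.fft(log_spectrum(s, nfft)))

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
w = rng.standard_normal(32)
y = np.convolve(x, w)   # y[n] = x[n] * w[n]

# Convolution in time -> addition of log spectra and of cepstra
assert np.allclose(log_spectrum(y), log_spectrum(x) + log_spectrum(w), atol=1e-8)
assert np.allclose(cepstrum(y), cepstrum(x) + cepstrum(w), atol=1e-4)
```

Zero-padding to the same FFT length is what makes the DFT of the linear convolution equal the product of the individual DFTs.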
Application: Mel-frequency spectrum
- A perceptual scale of pitches
- 1000 mels = 1000 Hz
- A formula to convert f hertz into m mels:
  m = 2595 log10(1 + f/700)
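The conversion formula above, with its inverse, as a short sketch (function names are mine):

```python
import math

def hz_to_mel(f):
    # m = 2595 * log10(1 + f / 700), as on the slide
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # inverse of the formula above
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# By construction, 1000 Hz maps to (approximately) 1000 mels
print(round(hz_to_mel(1000.0)))  # → 1000
```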
Application: Mel-frequency spectrum
- Weighted DFT magnitude
- The mel-frequency spectrum MF[r] is defined as
  MF[r] = Σ_k |V_r[k] X[k]|^2
- V_r[k] is the triangular weighting function for the r-th filter
- Bandwidths are constant for center frequencies below 1 kHz and then increase exponentially
- Identical to convolutions with 22 filters
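The steps above can be sketched as follows (helper names and the filterbank construction details are my own; a common choice places triangle centers uniformly on the mel scale, which makes bandwidths grow with frequency as the slide describes):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, fs):
    # filter edge frequencies, uniformly spaced in mels
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    V = np.zeros((n_filters, nfft // 2 + 1))
    for r in range(n_filters):
        lo, c, hi = bins[r], bins[r + 1], bins[r + 2]
        V[r, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)  # rising edge
        V[r, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)  # falling edge
    return V

def mel_spectrum(x, V, nfft):
    X = np.fft.rfft(x, nfft)
    return (np.abs(V * X) ** 2).sum(axis=1)  # MF[r] = sum_k |V_r[k] X[k]|^2

fs, nfft = 16000, 512
V = mel_filterbank(22, nfft, fs)             # 22 filters, as on the slide
x = np.random.default_rng(2).standard_normal(400)
MF = mel_spectrum(x, V, nfft)
print(MF.shape)  # → (22,)
```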
Application: Mel-frequency spectrum
MF[r] = Σ_k |V_r[k] X[k]|^2
- Mel-Frequency Cepstral Coefficient (MFCC):
  MFCC[m] = Σ_{r=1}^{R} log(MF[r]) cos((2π/R)(r + 1/2)m)   (1)
- i.e., an inner product with cosines: MFCC[m] = ⟨log MF[r], c_m[r]⟩
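Equation (1) can be sketched directly as an inner product with cosine vectors c_m[r] = cos((2π/R)(r + 1/2)m); the helper name is mine, and the slide's sum over r = 1..R is written here as r = 0..R-1:

```python
import numpy as np

def mfcc_from_mel(MF, n_coeffs):
    # MFCC[m] = <log MF[r], c_m[r]> with c_m[r] = cos((2*pi/R)(r + 1/2) m)
    R = len(MF)
    r = np.arange(R)
    log_mf = np.log(MF)
    return np.array([np.dot(log_mf, np.cos(2 * np.pi / R * (r + 0.5) * m))
                     for m in range(n_coeffs)])

MF = np.array([2.0, 4.0, 8.0, 4.0, 2.0, 1.0])  # toy mel spectrum, R = 6
coeffs = mfcc_from_mel(MF, 4)
print(coeffs.shape)  # → (4,)
```

Note that the m = 0 coefficient is just the sum of the log mel energies, since c_0[r] = 1 for all r.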
Application: Speaker Identification
- Train a k-Nearest Neighbor classifier to classify frames
Application: Speaker Identification
- AN4 dataset (CMU): 5 male and 5 female subjects speaking words and numbers
- Collect the training samples into frames of 30 ms with an overlap of 75%
- Calculate MFCCs
- Train a k-Nearest Neighbor classifier on the frames
- For a given test signal, a prediction is made for every frame
- The most frequently occurring label is declared as the speaker
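The classification steps above can be sketched on synthetic features (the data below are toy Gaussian clusters standing in for per-frame MFCC vectors, not the AN4 recordings): train a k-nearest-neighbor classifier on frames, predict every frame of a test signal, and declare the majority label as the speaker.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    # label of the majority among the k nearest training frames
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

rng = np.random.default_rng(3)
# two "speakers" as feature clusters with different means (13-dim, like MFCCs)
train_X = np.vstack([rng.normal(0.0, 1.0, (50, 13)),
                     rng.normal(3.0, 1.0, (50, 13))])
train_y = np.array([0] * 50 + [1] * 50)

test_frames = rng.normal(3.0, 1.0, (20, 13))   # frames from "speaker 1"
frame_preds = [knn_predict(train_X, train_y, f) for f in test_frames]
speaker = Counter(frame_preds).most_common(1)[0][0]  # majority vote over frames
print(speaker)  # → 1
```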
Application: Speaker Identification
[Figures: speaker 1 (blue) and speaker 2 (red) time-domain signals; frame-based MFCC features]
Application: Speaker Identification
- The average accuracy is 92.93%