Mel-frequency cepstral coefficients (MFCCs) sgn14006/pdf/s04-mfcc.pdf ·...

SGN-14006 Audio and Speech Processing: Mel-frequency cepstral coefficients (MFCCs). Slides for this lecture are based on those created by Katariina Mahkonen for TUT course "Puheenkäsittelyn menetelmät" in Spring 2013.

Introduction
•  MFCC coefficients model the spectral energy distribution in a perceptually meaningful way
•  MFCCs are the most widely used acoustic feature for speech recognition, speaker recognition, and audio classification
•  MFCCs take into account certain properties of the human auditory system
–  Critical-band frequency resolution (approximately)
–  Log-power (dB magnitudes)

Spectrogram of piano notes C1 – C8. Note that the fundamental frequency (f0) doubles in each octave, and the spacing between harmonic partials doubles too.

Mel scale
•  The Mel-frequency scale represents subjective (perceived) pitch. It is one of the perceptually motivated frequency scales (see figure below).
–  The Mel scale is constructed using pairwise comparisons of sinusoidal tones: a reference frequency is fixed, and a test subject (human listener) is then asked to adjust the frequency of the other tone until it sounds twice as high or half as high
–  Models the non-linear perception of frequencies in the human auditory system
•  For comparison, the Bark critical-band scale has been constructed based on the masking properties of nearby frequency components.
–  Constructed by filling the audible bandwidth with adjacent critical bands 1…26
•  Note that all the scales are related, and $f_{\mathrm{Mel}} \approx 100\, f_{\mathrm{Bark}}$ (very roughly)

[Figure: position on the basilar membrane (mm) compared with frequency in kHz, mel, and Bark]

Post on 26-Mar-2020

Page 1: Mel-frequency cepstral coefficients (MFCCs) sgn14006/PDF/S04-MFCC.pdf · Calculation of MFCC coefficients – Define triangular "bandpass filters" uniformly distributed on the Mel scale (usually about 40 filters in range 0…8 kHz).

SGN-14006 Audio and Speech Processing

Mel-frequency cepstral coefficients (MFCCs)

Slides for this lecture are based on those created by Katariina Mahkonen for TUT course "Puheenkäsittelyn menetelmät" in Spring 2013.

Introduction

•  MFCC coefficients model the spectral energy distribution in a perceptually meaningful way
•  MFCCs are the most widely used acoustic feature for speech recognition, speaker recognition, and audio classification
•  MFCCs take into account certain properties of the human auditory system
–  Critical-band frequency resolution (approximately)
–  Log-power (dB magnitudes)

[Figure: Spectrogram of piano notes C1 – C8, with the fundamental frequency f0 marked]

Note that the fundamental frequency doubles in each octave, and the spacing between harmonic partials doubles too.
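The octave doubling noted above can be checked numerically. The sketch below is only an illustration: it assumes twelve-tone equal temperament with A4 tuned to 440 Hz (neither is stated in the slides) and ideal harmonic tones whose partials sit at integer multiples of f0. The name `c_note_frequency` is hypothetical.

```python
A4_HZ = 440.0  # assumed tuning reference (not given in the slides)

def c_note_frequency(octave: int) -> float:
    """Fundamental frequency of the note C in the given octave (equal temperament)."""
    # C4 lies 9 semitones below A4; each octave spans 12 semitones.
    semitones_from_a4 = (octave - 4) * 12 - 9
    return A4_HZ * 2.0 ** (semitones_from_a4 / 12.0)

c_notes = {f"C{n}": c_note_frequency(n) for n in range(1, 9)}
for name, f0 in c_notes.items():
    # For an ideal harmonic tone, partials are at f0, 2*f0, 3*f0, ...
    # so the spacing between adjacent partials equals f0 itself.
    print(f"{name}: f0 = {f0:8.2f} Hz, partial spacing = {f0:8.2f} Hz")
```

Each printed f0 is twice the previous one, and because the partial spacing equals f0, it doubles as well.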

Mel scale

•  The Mel-frequency scale represents subjective (perceived) pitch. It is one of the perceptually motivated frequency scales (see figure below).
–  The Mel scale is constructed using pairwise comparisons of sinusoidal tones: a reference frequency is fixed, and a test subject (human listener) is then asked to adjust the frequency of the other tone until it sounds twice as high or half as high
–  Models the non-linear perception of frequencies in the human auditory system
•  For comparison, the Bark critical-band scale has been constructed based on the masking properties of nearby frequency components.
–  Constructed by filling the audible bandwidth with adjacent critical bands 1…26
•  Note that all the scales are related, and $f_{\mathrm{Mel}} \approx 100\, f_{\mathrm{Bark}}$ (very roughly)

[Figure: position on the basilar membrane (mm) compared with frequency in kHz, mel, and Bark]
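To get a feel for how rough the Mel-to-Bark relation is, the two scales can be compared at a few frequencies. This is a sketch under stated assumptions: the Mel mapping is the formula given on the next Mel scale slide, while the Bark mapping uses Traunmüller's approximation, which is not part of these slides and is brought in purely for illustration.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Mel value from the slides' formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz: float) -> float:
    """Bark value via Traunmüller's approximation (an assumption, not from the slides)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

ratios = []
for f_hz in (250.0, 500.0, 1000.0, 2000.0, 4000.0):
    mel, bark = hz_to_mel(f_hz), hz_to_bark(f_hz)
    ratios.append(mel / bark)
    print(f"{f_hz:6.0f} Hz: {mel:7.1f} mel, {bark:5.2f} Bark, mel/Bark = {mel / bark:6.1f}")
```

With these particular formulas the ratio stays in the same order of magnitude as 100 across the range, which is all the "very roughly" relation claims.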

Mel scale

The anchor point for the Mel scale is chosen so that 1000 Hz = 1000 Mel:

$$f_{\mathrm{Mel}} = 2595 \log_{10}\left(1 + \frac{f_{\mathrm{Hz}}}{700}\right)$$
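The mapping above is simple to evaluate and to invert. The following sketch uses NumPy; the function names `hz_to_mel` and `mel_to_hz` are illustrative choices, and the inverse is obtained by solving the formula for the frequency in Hz. It checks the 1000 Hz anchor point and shows how points spaced evenly in Mel spread out in Hz.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Algebraic inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # close to 1000 mel (the anchor point)

# Nine points uniformly spaced on the Mel scale between 0 Hz and 8 kHz
edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(8000.0), 9))
print(np.round(edges_hz))            # the points expressed in Hz
print(np.round(np.diff(edges_hz)))   # gaps in Hz grow with frequency
```

Equal steps in Mel correspond to ever wider steps in Hz, which is the non-linear frequency resolution the scale is meant to capture.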

[Figure: Piano tones C1 – C8, shown as a Mel-frequency spectrogram and a Bark-scale spectrogram]

Properties of human hearing: perception of loudness differences

•  Weber's rule says that the perceived change in a physical quantity is proportional to the relative change (the change divided by the current value), not to the absolute change.
•  Therefore it makes sense to measure sound levels in decibels: $L_I = 10 \log_{10}(I)$
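A short numeric check shows why the decibel scale matches this rule: the same relative change in intensity always gives the same change in level, whatever the starting point. In this sketch the intensity is assumed to be expressed relative to a reference of 1, since the formula above is written without a reference value; `level_db` is an illustrative name.

```python
import math

def level_db(intensity: float) -> float:
    """Level in decibels, L_I = 10 * log10(I), with I relative to a reference of 1."""
    return 10.0 * math.log10(intensity)

steps_db = []
for intensity in (1.0, 10.0, 1000.0):
    # Apply the same relative change (a doubling) at very different starting intensities
    step = level_db(2.0 * intensity) - level_db(intensity)
    steps_db.append(step)
    print(f"I = {intensity:7.1f} -> {2.0 * intensity:7.1f}: level rises by {step:.2f} dB")
```

Every doubling adds about 3 dB, although the absolute increase in intensity differs by a factor of 1000 between the first and last case.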

Now let's get back to the calculation of MFCC coefficients, the most widely used acoustic feature for representing a speech frame (in speech recognition, for example).

Calculation of MFCC coefficients

–  Define triangular "bandpass filters" uniformly distributed on the Mel scale (usually about 40 filters in range 0…8 kHz).
–  DFT bin energies within the passband of each filter are accumulated (J(ω) is the triangular response):

$$E(k) = \sum_{b=\mathrm{minbin}}^{\mathrm{maxbin}} \left| J(\omega)\, S(\omega) \right|^2$$

–  Take the logarithm of each E(k), k = 1, 2, …, K
–  Calculate the discrete cosine transform (DCT II) of the vector log(E)

→ MFCCs are the DCT coefficients of the vector log(E)
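The four steps above can be sketched for a single frame with NumPy. This is a minimal illustration rather than a reference implementation. Several details are assumptions not given in the slides: the Hann window, the small floor added before the logarithm, the unnormalised DCT-II scaling, keeping 13 coefficients, and reading the energy formula as the squared magnitude of the filter response times the spectrum. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, f_min=0.0, f_max=8000.0):
    """Triangular responses J_k (one row per filter) sampled at the DFT bin frequencies."""
    bin_hz = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # n_filters + 2 edge frequencies, uniformly spaced on the Mel scale
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2))
    bank = np.zeros((n_filters, bin_hz.size))
    for k in range(n_filters):
        lo, centre, hi = edges_hz[k], edges_hz[k + 1], edges_hz[k + 2]
        rising = (bin_hz - lo) / (centre - lo)
        falling = (hi - bin_hz) / (hi - centre)
        # Triangle: rises from lo to centre, falls to hi, zero elsewhere
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(x):
    """Unnormalised DCT-II: c[n] = sum_k x[k] * cos(pi / K * (k + 1/2) * n)."""
    K = x.size
    n = np.arange(K)[:, None]
    k = np.arange(K)[None, :]
    return np.cos(np.pi / K * (k + 0.5) * n) @ x

def mfcc_frame(frame, sample_rate, n_filters=40, n_coeffs=13):
    """MFCCs of one frame: filterbank energies -> log -> DCT-II -> lowest coefficients."""
    spectrum = np.fft.rfft(frame * np.hanning(frame.size))    # S(w), Hann window assumed
    bank = mel_filterbank(n_filters, frame.size, sample_rate)
    energies = np.sum(np.abs(bank * spectrum) ** 2, axis=1)   # E(k), one value per filter
    log_e = np.log(energies + 1e-12)                          # small floor avoids log(0)
    return dct_ii(log_e)[:n_coeffs]

# Synthetic test frame: two sinusoids sampled at 16 kHz (hypothetical input)
sample_rate = 16000
t = np.arange(512) / sample_rate
frame = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 2200.0 * t)
coeffs = mfcc_frame(frame, sample_rate)
print(coeffs.shape)
```

In practice the filterbank would be built once and reused for every frame; it is rebuilt inside `mfcc_frame` here only to keep the sketch short.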

Why are MFCC coefficients successful in audio classification?

–  Perceptually motivated (near log-f) frequency resolution
–  Perceptually motivated decibel-magnitude scale
–  Discrete cosine transform decorrelates the features (improves statistical properties by removing correlations between the features)
–  Convenient control of the model order: picking only the lowest N coefficients gives a lower-resolution approximation
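The last point can be illustrated by transforming a log-energy vector, keeping only the lowest N coefficients, and transforming back. The sketch below makes several assumptions: the log filterbank energies are synthetic (a smooth envelope plus random ripple), and an orthonormal scaling of the DCT-II is used so that the inverse transform is simply the matrix transpose. `dct_matrix` is a hypothetical helper.

```python
import numpy as np

def dct_matrix(K):
    """Orthonormal DCT-II matrix; row n is the n-th cosine basis vector."""
    n = np.arange(K)[:, None]
    k = np.arange(K)[None, :]
    C = np.sqrt(2.0 / K) * np.cos(np.pi / K * (k + 0.5) * n)
    C[0] /= np.sqrt(2.0)  # scale the constant row so that C @ C.T is the identity
    return C

K = 40
rng = np.random.default_rng(0)
band = np.arange(K)
# Synthetic log filterbank energies: smooth spectral envelope plus fine ripple
log_e = 5.0 - 0.1 * band + np.cos(2 * np.pi * band / K) + 0.3 * rng.standard_normal(K)

C = dct_matrix(K)
cepstrum = C @ log_e

errors = {}
for n_keep in (5, 13, 40):
    truncated = np.zeros(K)
    truncated[:n_keep] = cepstrum[:n_keep]   # keep only the lowest n_keep coefficients
    approx = C.T @ truncated                 # inverse of an orthonormal transform
    errors[n_keep] = np.sqrt(np.mean((approx - log_e) ** 2))
    print(f"lowest {n_keep:2d} coefficients: RMS error = {errors[n_keep]:.4f}")
```

Fewer coefficients give a smoother, coarser approximation of the log spectrum, and keeping all K coefficients reproduces it exactly up to rounding.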