The Fundamental Frequency Variation Spectrum
FONETIK 2008
Kornel Laskowski, Mattias Heldner and Jens Edlund
interACT, Carnegie Mellon University, Pittsburgh PA, USA
Centre for Speech Technology, KTH Stockholm, Sweden
Speaker: Hsiao-Tsung
Introduction
While speech recognition systems long ago transitioned from formant localization to spectral (vector-valued) formant representations, prosodic processing continues to rely squarely on a pitch tracker’s ability to identify a peak corresponding to the fundamental frequency (F0) of the speaker.
Even if a robust, local, analytic, statistical estimate of absolute pitch were available, applications require a representation of pitch variation, and they go to considerable additional effort to identify a speaker-dependent quantity for normalization.
The Fundamental Frequency Variation Spectrum
Instantaneous variation in pitch is normally computed by determining a single scalar, the F0, at two temporally adjacent instants and forming their difference.
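As a minimal sketch of this conventional scalar approach, the following estimates F0 in two frames and subtracts the results. It assumes a plain autocorrelation-peak estimator, which is a stand-in for illustration and not the method of any particular pitch tracker; the function names, the frame length and the F0 search range are all assumptions.

```python
import numpy as np

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return an autocorrelation-peak F0 estimate (Hz) for one frame."""
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep the non-negative lags only.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    # The strongest peak inside the allowed lag range gives the pitch period.
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag

def delta_f0(signal, sample_rate, start_left, start_right, frame_len=512):
    """Difference between the scalar F0 estimates of two frames (right minus left)."""
    left = signal[start_left:start_left + frame_len]
    right = signal[start_right:start_right + frame_len]
    return estimate_f0(right, sample_rate) - estimate_f0(left, sample_rate)
```

Note that the result depends entirely on the peak picked in each frame: a single octave error in either estimate corrupts the difference.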
We propose a vector-valued representation of pitch variation, inspired by vanishing-point perspective.
While the standard inner product between two vectors can be viewed as the summation of pair-wise products, with pairs selected by orthonormal projection onto a point at infinity, the proposed vanishing-point product induces a 1-point perspective projection onto a point at […]
F: the signal's spectral content (512-point FFT)
The FFV spectrum is then given by […]
[…] is undefined over the interval [-T0, +T0].
A support for which […] is continuous over […]
In practice, the computation uses magnitude rather than complex spectra.
FL and FR are 512-point Fourier transforms, computed every 8 ms.
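The framing described here can be sketched as follows. The FFT size and the 8 ms step come from the slide; everything else is an assumption made for illustration, in particular the Hann window, the hypothetical `offset` between the left and right segments, and the function name. The slides do not say how the two spectra are windowed.

```python
import numpy as np

N_FFT = 512          # from the slides: 512-point Fourier transforms
HOP_SECONDS = 0.008  # from the slides: computed every 8 ms

def left_right_spectra(signal, sample_rate, offset=64):
    """Return a list of (FL, FR) magnitude-spectrum pairs, one pair per 8 ms step.

    Assumption: FL and FR come from two Hann-windowed segments of the same
    length whose starts are `offset` samples apart.
    """
    hop = int(round(HOP_SECONDS * sample_rate))
    window = np.hanning(N_FFT)
    span = N_FFT + offset
    pairs = []
    for start in range(0, len(signal) - span + 1, hop):
        left = signal[start:start + N_FFT] * window
        right = signal[start + offset:start + offset + N_FFT] * window
        # Magnitude spectra, since the computation uses magnitudes in practice.
        f_left = np.abs(np.fft.fft(left, N_FFT))
        f_right = np.abs(np.fft.fft(right, N_FFT))
        pairs.append((f_left, f_right))
    return pairs
```

At a 16 kHz sampling rate the 8 ms step corresponds to 128 samples, so consecutive analysis positions overlap heavily.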
However, the discrete transforms FL and FR are in general not defined at the corresponding dilated frequencies, so we resort to linear interpolation between neighboring coefficients.
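The interpolation step can be illustrated with a short sketch. It evaluates two one-sided magnitude spectra at dilated bin positions using `np.interp`, then takes a normalized dot product for each candidate dilation. The `2 ** rho` parameterization, the sign convention, the energy normalization and the function names are assumptions of this sketch, since the exact formula did not survive in the transcript.

```python
import numpy as np

def dilate(spectrum, factor):
    """Sample a magnitude spectrum at positions factor * k by linear interpolation."""
    bins = np.arange(len(spectrum))
    # np.interp interpolates linearly between neighboring bins;
    # positions outside the available range are set to zero.
    return np.interp(factor * bins, bins, spectrum, left=0.0, right=0.0)

def ffv_like_spectrum(f_left, f_right, rhos):
    """Normalized dot product of oppositely dilated spectra for each rho.

    Assumed convention: a pitch ratio r between the right and left spectra
    produces a peak at rho = log2(r) / 2.
    """
    values = []
    for rho in rhos:
        a = dilate(f_left, 2.0 ** (-rho))
        b = dilate(f_right, 2.0 ** (+rho))
        norm = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
        values.append(np.dot(a, b) / norm if norm > 0 else 0.0)
    return np.array(values)

def harmonic_comb(f0_bin, n_bins=257, width=1.5):
    """Synthetic one-sided magnitude spectrum with Gaussian peaks at multiples of f0_bin."""
    k = np.arange(n_bins)
    comb = np.zeros(n_bins)
    for m in range(1, int(n_bins / f0_bin) + 1):
        comb += np.exp(-0.5 * ((k - m * f0_bin) / width) ** 2)
    return comb

rhos = np.linspace(-0.3, 0.3, 61)
f_left = harmonic_comb(10.0)
f_right = harmonic_comb(10.0 * 2.0 ** 0.2)  # pitch raised by a factor of 2**0.2
spectrum = ffv_like_spectrum(f_left, f_right, rhos)
peak_rho = rhos[int(np.argmax(spectrum))]
```

With these synthetic inputs the peak of `spectrum` should fall near `rho = 0.1`. No peak in either input spectrum is ever identified explicitly, and scaling either input leaves the result unchanged because of the normalization.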
[Figure labels: energy independent; filterbank; rapidly changing; slowly changing]
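One way a filterbank could reduce such a spectrum to a few features is sketched below. The slides name only "rapidly changing" and "slowly changing" filters; the rectangular shapes, the band edges, the extra flat band and all names here are hypothetical choices made for illustration, not the filters used by the authors.

```python
import numpy as np

# Hypothetical band edges on the variation axis (illustrative values only).
FILTER_BANDS = {
    "rapid_negative": (-0.30, -0.10),
    "slow_negative": (-0.10, -0.02),
    "flat": (-0.02, 0.02),
    "slow_positive": (0.02, 0.10),
    "rapid_positive": (0.10, 0.30),
}

def apply_filterbank(ffv, rhos, bands=FILTER_BANDS):
    """Average the spectrum inside each band, giving one feature per filter."""
    features = {}
    for name, (low, high) in bands.items():
        mask = (rhos >= low) & (rhos < high)
        features[name] = float(np.mean(ffv[mask])) if mask.any() else 0.0
    return features

# Example: a spectrum concentrated at a small positive variation.
rhos = np.linspace(-0.3, 0.3, 61)
ffv = np.exp(-0.5 * ((rhos - 0.05) / 0.02) ** 2)
features = apply_filterbank(ffv, rhos)
strongest = max(features, key=features.get)
```

The resulting low-dimensional feature vector is the kind of input a frame-level model could consume in place of a scalar pitch difference.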
Discussion
Initial experiments along these lines show that such HMMs, when trained on dialogue data, corroborate research on human turn-taking behavior in conversations.
The FFV representation does not require peak identification, dynamic time warping, median filtering, landmark detection, linearization, or mean pitch estimation and subtraction.
Immediate next steps include fine-tuning the filter banks and the HMM topologies, and testing the results on other tasks where pitch movements are expected to play a role, such as the attitudinal coloring of short feedback utterances, speaker verification, and automatic speech recognition for tonal languages.