Improved ASR in noise using harmonic decomposition
• Introduction
• Pitch-Scaled Harmonic Filter
• Recognition Experiments
• Results
• Conclusion
Motivation & Aims
• Most speech sounds are predominantly voiced or unvoiced.
What happens when the two components are “mixed”?
• Voiced and unvoiced components have different natures:
unvoiced: aperiodic signal from turbulence-noise sources
voiced: quasi-periodic signal from vocal-fold vibration
Why not extract their features separately?
Do the two contributions contain complementary information?
• Human speech recognition still performs well in noise.
How? Does it take advantage of harmonic properties?
Introduction
Voiced and unvoiced parts of a speech signal
[Figure: production of /z/, showing the periodic and aperiodic contributions]
Automatic Speech Recognition
speech signal → Front End → Pattern Recognition → speech labels
Feature Extraction:
conversion of speech signals to a sequence of parameter vectors
Dynamic Programming:
matching of observation sequences to models of known utterances
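The dynamic-programming idea above can be sketched with dynamic time warping (DTW), a simpler template-matching technique than the model-based recogniser these slides evaluate. The function name `dtw_distance` and the toy one-dimensional "feature vectors" are assumptions made for illustration only.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Cumulative cost of the best monotonic alignment of two vector sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # frame-to-frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # seq_b frame held
                                     cost[i][j - 1],      # seq_a frame held
                                     cost[i - 1][j - 1])  # both advance
    return cost[rows][cols]

# Hypothetical templates: a slowed-down copy still matches its template exactly.
template = [(0.0,), (1.0,), (2.0,)]
slow_copy = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]
other = [(0.0,), (3.0,), (2.0,)]
print(dtw_distance(template, slow_copy), dtw_distance(template, other))
```

The recursion tolerates differences in speaking rate because a frame in one sequence may be matched to several consecutive frames in the other.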
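PSHF block diagram
[Figure: block diagram. Inputs: waveform s(n) and raw pitch f0raw. Pitch optimisation yields the optimised pitch (f0opt, Nopt). The windowed signal sw(n) enters the Harmonic Decomposition block, which gives the periodic estimate v̂w(n); a sum node (+, −) gives the aperiodic estimate ûw(n). Each estimate passes through a window w(n) to produce the periodic waveform v(n) and the aperiodic waveform u(n).]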
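A much-simplified sketch of the pitch-scaled idea: if a frame spans a whole number of pitch periods (four is assumed here), the harmonics of the pitch fall exactly on every fourth DFT bin. Keeping those bins gives a periodic estimate, and subtracting it from the frame gives the aperiodic estimate. This sketch assumes a rectangular window and an already known pitch period; it omits the pitch optimisation and windowing shown in the block diagram, and the names `dft`, `idft` and `harmonic_split` are illustrative.

```python
import cmath
import math
import random

def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT, returning the real part of the result."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def harmonic_split(frame, periods=4):
    """Split a frame spanning `periods` pitch periods into two parts."""
    n = len(frame)
    if n % periods != 0:
        raise ValueError("frame length must be a multiple of `periods`")
    spectrum = dft(frame)
    # Pitch harmonics sit on bins 0, periods, 2*periods, ...
    harmonic = [spectrum[k] if k % periods == 0 else 0.0 for k in range(n)]
    periodic = idft(harmonic)
    aperiodic = [s - v for s, v in zip(frame, periodic)]
    return periodic, aperiodic

# Synthetic test signal: two harmonics with a 40-sample period, plus noise.
period = 40
rng = random.Random(0)
voiced = [math.sin(2 * math.pi * t / period)
          + 0.5 * math.sin(4 * math.pi * t / period + 0.3)
          for t in range(4 * period)]
noisy = [v + 0.1 * rng.gauss(0.0, 1.0) for v in voiced]
clean_periodic, clean_aperiodic = harmonic_split(voiced)
noisy_periodic, noisy_aperiodic = harmonic_split(noisy)
```

For the noise-free input the aperiodic part is numerically zero; for the noisy input the periodic part repeats every 40 samples and the remainder carries the noise that does not fall on the harmonic bins.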
PSHF
Decomposition example (waveforms)
[Figure: waveforms of the original signal, the periodic part and the aperiodic part]
Decomposition example (spectrograms)
[Figure: spectrograms of the original signal, the periodic part and the aperiodic part]
Decomposition example (MFCC specs.)
[Figure: MFCC spectra of the original signal, the periodic part and the aperiodic part]
Parameterisations
[Figure: block diagrams of the six front ends; blocks listed per diagram]
BASE: waveform, MFCC, +Δ, +Δ2, features
SPLIT: PSHF, MFCC, +Δ, +Δ2, cat
PCA26, PCA78, PCA13, PCA39: PSHF, MFCC, +Δ, +Δ2, cat, PCA
Method
Speech Database: Aurora 2.0
• TIdigits database at 8 kHz, filtered with G.712 channel
• Connected English digit strings (male & female speakers)
Group   Condition                    Signal-to-Noise Ratio (dB)
Train   clean condition              --
Train   multi-condition              20 15 10 5
Test    set A (same noises)          20 15 10 5 0 -5
Test    set B (different noises)     20 15 10 5 0 -5
Test    set C (different channel)    20 15 10 5 0 -5
Description of the experiments
• Baseline experiment: [base]
standard parameterisation of the original waveforms (i.e., MFCC+D+A)
• Split experiments: [split]
adjustment of stream weights (voiced vs. unvoiced)
• PCA experiments: [pca26, pca78, pca13 and pca39]
decorrelation of the feature vectors, and reduction of the number of coefficients
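The baseline's MFCC+D+A vector appends delta (D) and acceleration (A) coefficients to the static MFCCs. A minimal sketch follows, assuming the common regression formula d_t = Σ_k k (c_{t+k} − c_{t−k}) / (2 Σ_k k²) with a two-frame window, edge frames repeated, and 13 static coefficients per frame; none of these settings is stated on the slides, and the function names are illustrative.

```python
def deltas(frames, window=2):
    """Regression-based time derivatives of a sequence of feature vectors."""
    last = len(frames) - 1
    denom = 2 * sum(k * k for k in range(1, window + 1))
    result = []
    for t in range(len(frames)):
        row = []
        for dim in range(len(frames[0])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, last)][dim]   # repeat last frame at the end
                behind = frames[max(t - k, 0)][dim]     # repeat first frame at the start
                num += k * (ahead - behind)
            row.append(num / denom)
        result.append(row)
    return result

def add_dynamic_features(frames, window=2):
    """Concatenate static, delta and acceleration coefficients per frame."""
    first = deltas(frames, window)
    second = deltas(first, window)
    return [s + d + a for s, d, a in zip(frames, first, second)]

# Hypothetical input: 10 frames of 13 coefficients rising by 1 per frame.
static = [[float(t)] * 13 for t in range(10)]
full = add_dynamic_features(static)
print(len(full[5]))  # 13 static + 13 delta + 13 acceleration
```

Under these assumptions each stream has 39 coefficients, so concatenating the periodic and aperiodic streams would give 78, which is consistent with the pca39 and pca78 labels.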
Results
Split experiments results
Summary of results
         Word Accuracy (%)          WER reduction (%)
         clean   multi   overall    abs.    rel.
base     52.6    78.3    65.4       --      --
split    77.9    89.1    83.0       17.6    50.9
pca26    71.2    88.8    78.8       13.4    38.7
pca78    61.9    88.1    74.7        9.3    26.9
pca13    72.6    87.6    79.7       14.3    41.3
pca39    70.9    87.5    78.8       13.4    38.7
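The abs. and rel. WER figures are consistent with reductions in word error rate relative to base, computed from the overall accuracies. The check below assumes WER = 100 − word accuracy; the function name `wer_reduction` is illustrative.

```python
# Overall word accuracies (%) copied from the summary table.
overall_accuracy = {
    "base": 65.4, "split": 83.0, "pca26": 78.8,
    "pca78": 74.7, "pca13": 79.7, "pca39": 78.8,
}

def wer_reduction(accuracy, baseline_accuracy):
    """Absolute and relative WER reduction (%), assuming WER = 100 - accuracy."""
    baseline_wer = 100.0 - baseline_accuracy
    absolute = accuracy - baseline_accuracy        # same as baseline WER minus new WER
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative

for name, acc in overall_accuracy.items():
    if name != "base":
        absolute, relative = wer_reduction(acc, overall_accuracy["base"])
        print(f"{name}: abs. {absolute:.1f}, rel. {relative:.1f}")
```

For split, the baseline WER of 34.6% drops by 17.6 points, which is 50.9% of the baseline error.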
Conclusions
• The PSHF module split Aurora’s speech waveforms into two synchronous streams (periodic and aperiodic).
• Used separately, the streams slightly degraded accuracy; used together, they substantially increased it in noisy conditions.
• Periodic speech segments provide robustness to noise.
Further Work
• Apply Linear Discriminant Analysis (LDA) to the two-stream feature vector.
• Evaluate the performance of this front end in a more general task, such as phoneme recognition.
• Test the technique for speaker recognition.
COLUMBO PROJECT: Harmonic Decomposition applied to ASR
David M. Moreno 1
Philip J.B. Jackson 2
Javier Hernando 1
Martin J. Russell 3
http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/
Pitch Optimisation: vowel /u/
Cost function
Spectrum derived from a 268-point DFT
Harmonic Decomposition: vowel /u/
Word accuracy results (%)
Observation probability, with stream weights
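A common way to apply stream weights is to raise each stream's observation probability to the power of its weight, which in the log domain is a weighted sum: log b(o) = Σ_s γ_s log b_s(o_s). The sketch below assumes that form and, for brevity, a single diagonal-covariance Gaussian per stream rather than a mixture. All values and names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def weighted_log_prob(streams, weights):
    """Sum of per-stream log probabilities, each scaled by its stream weight."""
    return sum(w * log_gaussian_diag(x, mean, var)
               for (x, mean, var), w in zip(streams, weights))

# Hypothetical two-dimensional observations for the two streams.
periodic_stream = ([0.2, -0.1], [0.0, 0.0], [1.0, 1.0])    # (observation, mean, variance)
aperiodic_stream = ([1.5, 0.3], [1.0, 0.0], [0.5, 2.0])
equal = weighted_log_prob([periodic_stream, aperiodic_stream], [1.0, 1.0])
voiced_only = weighted_log_prob([periodic_stream, aperiodic_stream], [1.0, 0.0])
print(equal, voiced_only)
```

Setting a weight to zero removes that stream from the score, so sweeping the two weights trades off the periodic and aperiodic evidence.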