112/04/19 1
Stress Detection
J.-S. Roger Jang (張智星 )
MIR Lab, CSIE Dept., National Taiwan Univ.
http://mirlab.org/jang
-2-
Intro to Stress Detection
Stress detection (SD) for English
– Given an English word and its pronunciation
– Detect the stress position of the pronunciation
Applications
– Computer-assisted pronunciation training (CAPT)
Similar to…
– Tone recognition in Mandarin Chinese
– Intonation scoring
-3-
Examples of Stress in English Words
Every multisyllabic English word has one syllable that carries the primary stress
Examples
– Dictionary: stressed at syllable 1
– Tomorrow: stressed at syllable 2
– International: stressed at syllable 3
-4-
Steps in Stress Detection
Preprocessing
– Use forced alignment to find vowel locations
Feature extraction
– Extract features for each vowel
Model construction
– Build a classifier for vowel-based stress detection
Post-processing
– Combine the vowel-level results into a word-based stress detection
-5-
Forced Alignment (1/2)
A process for aligning an utterance with its corresponding canonical phonetic transcription
Example: International
[Figure: forced alignment of “international” (ih_n_t_er_n_ae_sh_ah_n_ah_l). Top panel: waveform (amplitude -1 to 1, time 0.1 to 0.9) with per-phone alignment scores, Score=90.13. Bottom panel: pitch contour (pitch axis 44 to 52), shown as Pitch1: unbroken and Pitch2: segmented.]
Alignment scores: international 90; ih 66, n 64, t 49, er 100, n 100, ae 100, sh 100, ah 100, n 100, ah 100, l 100; sil -1
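The preprocessing step, locating vowels from a forced alignment, can be sketched as follows. This is only an illustration: the `(phone, start, end)` tuple layout, the vowel set, and the segment times are assumptions made for the example and are not the output format of ASRA. Only the phone sequence comes from the slide.

```python
# ARPAbet-style vowel labels. The slide's example uses ih, er, ae, ah;
# the rest of this set is an assumption for illustration.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}


def vowel_segments(alignment):
    """Return (syllable_index, phone, start, end) for every vowel segment.

    Each vowel is treated as the nucleus of one syllable, so the
    syllable index is simply the running count of vowels (1-based).
    """
    segments = []
    for phone, start, end in alignment:
        if phone in VOWELS:
            segments.append((len(segments) + 1, phone, start, end))
    return segments


# Hypothetical alignment of "international"; the times are made up.
alignment = [
    ("sil", 0.00, 0.10), ("ih", 0.10, 0.16), ("n", 0.16, 0.22),
    ("t", 0.22, 0.27), ("er", 0.27, 0.36), ("n", 0.36, 0.42),
    ("ae", 0.42, 0.54), ("sh", 0.54, 0.62), ("ah", 0.62, 0.68),
    ("n", 0.68, 0.73), ("ah", 0.73, 0.79), ("l", 0.79, 0.86),
    ("sil", 0.86, 0.95),
]
segments = vowel_segments(alignment)
for index, phone, start, end in segments:
    print(index, phone, start, end)
```

The five vowel segments correspond to the five syllables of the word, and the third one (ae) is the syllable that should be detected as stressed.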
-6-
Forced Alignment (2/2)
Applications of forced alignment
– Speech scoring (based on timbre only)
– Utterance verification
Our forced alignment engine
– ASRA (Automatic Speech Recognition & Assessment): for voice command recognition and speech assessment (scoring)
-7-
Corpora for Stress Detection
Merriam-Webster dictionary
– Website
– Some statistics
  • # pronunciations: 21950
  • Usable files: 14994
    – No. of syllables > 1
    – Available in our dictionary
    – Valid output from ASRA
In-house recordings
– Recordings from MSAR for several years
– Available upon request
-8-
Speech Corpus for Lexical Stress Detection
Merriam-Webster Online Dictionary’s lexical pronunciations
– http://www.merriam-webster.com
– All utterances are pronounced by native speakers

Stress        Number of syllables
position         2     3     4     5     6     7     8
1             5090  2421   465    36     0     1     0
2             1691  1654  1324   147     9     0     0
3                0   348   926   450    27     0     0
4                0     0    34   242    72     4     2
5                0     0     0     1    30    11     0
6                0     0     0     0     0     7     0
7                0     0     0     0     0     0     0
Total         6781  4423  2749   876   138    23     2

Total utterances        14992
Total syllables         43212
Stressed syllables      14992
Unstressed syllables    28220
Stressed : Unstressed   1 : 1.9
Sample rate             16000 Hz
Resolution              16 bits
Channel                 mono
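The summary figures follow directly from the count table, which is a useful sanity check when rebuilding the corpus. The sketch below recomputes them, assuming (as the totals suggest) that every utterance contributes exactly one stressed syllable. The variable names are illustrative.

```python
# counts[i][j]: number of utterances stressed at syllable i + 1
# among words with syllable_counts[j] syllables (values from the table).
syllable_counts = [2, 3, 4, 5, 6, 7, 8]
counts = [
    [5090, 2421, 465, 36, 0, 1, 0],
    [1691, 1654, 1324, 147, 9, 0, 0],
    [0, 348, 926, 450, 27, 0, 0],
    [0, 0, 34, 242, 72, 4, 2],
    [0, 0, 0, 1, 30, 11, 0],
    [0, 0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

# Utterances per word length (the "Total" row).
column_totals = [sum(row[j] for row in counts)
                 for j in range(len(syllable_counts))]
total_utterances = sum(column_totals)

# Each utterance of an n-syllable word contributes n syllables,
# exactly one of which is stressed.
total_syllables = sum(n * t for n, t in zip(syllable_counts, column_totals))
stressed = total_utterances
unstressed = total_syllables - stressed

print(column_totals)
print(total_utterances, total_syllables, stressed, unstressed)
print("Stressed : Unstressed = 1 : %.1f" % (unstressed / stressed))
```

The roughly 1 : 1.9 class ratio means a vowel classifier sees almost twice as many unstressed as stressed examples, which is worth keeping in mind when reading the accuracy figures.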
-9-
Stress Detection based on Vowel Classification
SD is based on vowel classification due to the following observations
– Each word has a stressed syllable
– Each syllable is usually composed of a consonant and a vowel
– Vowels are always voiced (have pitch)
Therefore
– Each vowel is classified as “unstressed” or “stressed”
To determine the stressed syllable in an utterance, choose the vowel with
– Max likelihood of the class “Stressed”
– Min likelihood of the class “Unstressed”
– Max difference of the above two
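The three selection criteria above can be compared side by side with a small sketch. It assumes the vowel classifier returns one score per class for each vowel (for example log-likelihoods). The function name, the criterion labels, and the numbers are hypothetical.

```python
def pick_stressed_vowel(likelihoods, criterion="difference"):
    """Pick the stressed vowel of a word; returns a 1-based position.

    likelihoods: one (stressed_score, unstressed_score) pair per vowel,
    in the order the vowels occur in the word.
    """
    if criterion == "max_stressed":
        scores = [s for s, _ in likelihoods]
    elif criterion == "min_unstressed":
        # Smallest "unstressed" score == largest negated score.
        scores = [-u for _, u in likelihoods]
    elif criterion == "difference":
        scores = [s - u for s, u in likelihoods]
    else:
        raise ValueError("unknown criterion: " + criterion)
    return scores.index(max(scores)) + 1


# Hypothetical log-likelihoods for the three vowels of one word.
likelihoods = [(-4.0, -2.5), (-3.0, -3.2), (-2.8, -1.0)]
for name in ("max_stressed", "min_unstressed", "difference"):
    print(name, pick_stressed_vowel(likelihoods, name))
```

With these made-up scores the first criterion picks vowel 3 while the other two pick vowel 2, which shows that the criteria are not interchangeable. Picking exactly one vowel per word also enforces the one-stress-per-word observation at the word level.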
-10-
Features for vowels
Vowel-based features
– Pitch: min, mean, max, range, std, slope, etc.
– Volume: min, mean, max, range, std, slope, etc.
– Duration (normalized by speech rate)
– Legendre polynomial fitting for pitch & volume
– Spectral-emphasized versions of the above
– …
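As a rough illustration of the statistical features listed above, the sketch below computes min, mean, max, range, std, and a least-squares slope for the frame-level pitch and volume contours of one vowel. Two choices here are assumptions rather than the slides' method: the slope is taken per frame, and duration is normalized by the word duration as a simple stand-in for speech-rate normalization.

```python
import statistics


def contour_stats(values, prefix):
    """Min, mean, max, range, std and least-squares slope of a contour."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    # Slope of the least-squares line through (frame index, value).
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    return {
        prefix + "_min": min(values),
        prefix + "_mean": y_mean,
        prefix + "_max": max(values),
        prefix + "_range": max(values) - min(values),
        prefix + "_std": statistics.pstdev(values),
        prefix + "_slope": sxy / sxx if sxx else 0.0,
    }


def vowel_features(pitch, volume, duration, word_duration):
    """Feature dictionary for one vowel segment."""
    features = contour_stats(pitch, "pitch")
    features.update(contour_stats(volume, "volume"))
    # Assumption: word duration is used as a proxy for speech rate.
    features["duration"] = duration / word_duration
    return features


# Hypothetical frame-level contours for one vowel.
pitch = [46.0, 47.0, 48.0, 49.0, 50.0]
volume = [0.2, 0.5, 0.6, 0.4, 0.1]
features = vowel_features(pitch, volume, duration=0.12, word_duration=0.76)
for name, value in features.items():
    print(name, round(value, 3))
```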
-11-
Lexical Stress Detection – Experiment 1
Feature Set
– E: Root Mean Square Energy
– D: Duration
– P: Pitch
– S: Root Mean Square Spectral Emphasis Energy
– PS: Pitch Slope
– CE: Legendre Coefficient of Root Mean Square Energy Contour
– CP: Legendre Coefficient of Pitch Contour
– CS: Legendre Coefficient of Spectral Emphasis Energy Contour
10-fold Cross Validation
Classifier: SVM
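The Legendre coefficient features (CE, CP, CS) summarize the shape of a contour with a few numbers. The sketch below fits the first three Legendre polynomials to a contour by least squares after mapping the frames onto [-1, 1]. The polynomial degree and the time normalization are assumptions, since the slides do not state them.

```python
def legendre_basis(x):
    """Values of the Legendre polynomials P0, P1, P2 at x."""
    return [1.0, x, 0.5 * (3.0 * x * x - 1.0)]


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def legendre_coefficients(contour):
    """Least-squares coefficients of P0, P1, P2 for a sampled contour."""
    n = len(contour)
    if n < 3:
        raise ValueError("need at least 3 frames for a degree-2 fit")
    # Map frame indices 0..n-1 onto the interval [-1, 1].
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    basis = [legendre_basis(x) for x in xs]
    k = 3
    # Normal equations: (A^T A) c = A^T y.
    ata = [[sum(b[i] * b[j] for b in basis) for j in range(k)]
           for i in range(k)]
    aty = [sum(b[i] * y for b, y in zip(basis, contour)) for i in range(k)]
    return solve(ata, aty)


# Synthetic pitch contour built from known coefficients (48, 2, -1).
xs = [-1.0 + 2.0 * i / 10 for i in range(11)]
contour = [48.0 + 2.0 * x - 1.0 * 0.5 * (3.0 * x * x - 1.0) for x in xs]
coeffs = legendre_coefficients(contour)
print([round(c, 6) for c in coeffs])
```

The three coefficients roughly capture the level, the tilt, and the curvature of the contour. Where NumPy is available, `numpy.polynomial.legendre.legfit(x, y, deg)` performs the same kind of fit.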
-12-
3-syllable words
       1st      2nd      3rd
1st    96.08%   3.10%    0.83%
2nd    8.28%    86.58%   5.14%
3rd    31.90%   5.75%    62.36%

4-syllable words
       1st      2nd      3rd      4th
1st    96.13%   2.37%    1.51%    0%
2nd    8.91%    87.76%   2.34%    0.98%
3rd    21.62%   2.46%    73.95%   0.97%
4th    38.24%   5.88%    2.94%    52.94%

5-syllable words
       1st      2nd      3rd      4th      5th
1st    100%     0%       0%       0%       0%
2nd    8.16%    88.44%   2.72%    0.68%    0%
3rd    19.33%   1.78%    76.67%   1.78%    0.44%
4th    13.64%   13.22%   2.48%    70.66%   0%
5th    100%     0%       0%       0%       0%

2-syllable words
       1st      2nd
1st    95.13%   4.87%
2nd    25.67%   74.33%
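Each row of the matrices above sums to 100%, so they read as row-normalized confusion matrices. The sketch below shows how such percentages, and an overall accuracy, can be derived from raw counts. It assumes rows are the reference stress position and columns the detected one, which the slides do not state, and the counts are made up rather than taken from the experiment.

```python
def row_percentages(confusion_counts):
    """Convert each row of a confusion matrix to percentages."""
    result = []
    for row in confusion_counts:
        total = sum(row)
        result.append([100.0 * c / total if total else 0.0 for c in row])
    return result


def accuracy(confusion_counts):
    """Fraction of samples on the diagonal (correctly detected)."""
    size = len(confusion_counts)
    correct = sum(confusion_counts[i][i] for i in range(size))
    total = sum(sum(row) for row in confusion_counts)
    return correct / total


# Hypothetical counts for 2-syllable words (rows: reference position).
counts = [[95, 5],
          [10, 30]]
percentages = row_percentages(counts)
print(percentages)
print(round(accuracy(counts), 4))
```

Because first-syllable stress dominates the corpus, the overall accuracy is weighted toward the first row, so per-row percentages are the more informative view for the rarer stress positions.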
-13-
Lexical Stress Detection – Experiment 2
10-fold Cross Validation
Classifier: SVM
Syllable Number-Independent Classifier vs. Syllable Number-dependent Classifier
Feature Set
Max. Root Mean Square Energy
Mean Root Mean Square Energy
Max. Pitch
Median Pitch
Duration
Max. Spectral Emphasis Root Mean Square Energy
Mean Spectral Emphasis Root Mean Square Energy
Pseudo-Slope of Pitch Contour
Legendre Polynomial Coefficients of Pitch Contour
Legendre Polynomial Coefficients of RMS Energy Contour
Legendre Polynomial Coefficients of Spectral Emphasis RMS Energy Contour
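The 10-fold cross validation used in these experiments can be sketched with a plain index splitter. This is a generic illustration, not the experiment's actual protocol: the shuffling seed and the function name are assumptions, and splitting by utterance rather than by vowel, so that vowels from one word never land in both training and test sets, is a suggestion of this sketch.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test


# Hypothetical corpus of 25 utterances.
splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

For the syllable-number-dependent setting, the same splitter could be applied separately to the utterances of each word length, training one classifier per length.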
-14-
Lexical Stress Detection – Experiment 3
GMMC: Gaussian Mixture Model Classifier
NBC: Naïve Bayes Classifier
QC: Quadratic Classifier
SVMC: Support Vector Machine Classifier
Feature Set
Max. Root Mean Square Energy
Mean Root Mean Square Energy
Max. Pitch
Median Pitch
Duration
Max. Spectral Emphasis Root Mean Square Energy
Mean Spectral Emphasis Root Mean Square Energy
Pseudo-Slope of Pitch Contour
Legendre Polynomial Coefficients of Pitch Contour
Legendre Polynomial Coefficients of RMS Energy Contour
Legendre Polynomial Coefficients of Spectral Emphasis RMS Energy Contour
10-fold Cross Validation
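Of the four classifiers compared, the naïve Bayes classifier is the simplest to write out. The sketch below is a minimal Gaussian naïve Bayes for the stressed/unstressed decision. It is an illustration only: the class name, the variance floor, and the two-feature training data (normalized duration, max pitch) are all hypothetical and do not reproduce the experiment.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Naive Bayes with one independent Gaussian per feature and class."""

    def fit(self, features, labels):
        grouped = defaultdict(list)
        for x, y in zip(features, labels):
            grouped[y].append(x)
        self.params = {}
        for label, rows in grouped.items():
            columns = list(zip(*rows))
            means = [sum(c) / len(c) for c in columns]
            # Small floor keeps the variance strictly positive.
            variances = [
                sum((v - m) ** 2 for v in c) / len(c) + 1e-6
                for c, m in zip(columns, means)
            ]
            log_prior = math.log(len(rows) / len(labels))
            self.params[label] = (log_prior, means, variances)
        return self

    def log_scores(self, x):
        """Unnormalized log posterior of each class for one sample."""
        scores = {}
        for label, (log_prior, means, variances) in self.params.items():
            log_likelihood = sum(
                -0.5 * math.log(2 * math.pi * var)
                - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(x, means, variances)
            )
            scores[label] = log_prior + log_likelihood
        return scores

    def predict(self, x):
        scores = self.log_scores(x)
        return max(scores, key=scores.get)


# Hypothetical training vowels: [normalized duration, max pitch].
train_x = [[0.30, 52.0], [0.28, 51.0], [0.33, 53.0],
           [0.12, 47.0], [0.15, 48.0], [0.10, 46.5], [0.14, 47.5]]
train_y = ["stressed"] * 3 + ["unstressed"] * 4
model = GaussianNaiveBayes().fit(train_x, train_y)
print(model.predict([0.29, 51.5]))
print(model.predict([0.11, 47.0]))
```

The per-class scores from `log_scores` are the kind of quantity the word-level decision needs: the vowel whose "stressed" score is highest, or whose score difference is largest, is taken as the stressed syllable.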
-15-
Lexical Stress Detection – Error Analysis
-16-
Lexical Stress Detection – Error Analysis
-17-
More on Stress Detection
ASRA
– Chapter 20 of online tutorial on Audio Signal Processing
– Demo
  • Recognition
    – goDemoVc.m in ASR
    – Web
  • Assessment
    – goDemoSa.m in ASR
    – Web
Stress detection
– Application note
– Demo