hmm-based speech synthesis erica cooper cs4706 spring 2011
TRANSCRIPT
![Page 1: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/1.jpg)
HMM-Based Speech Synthesis
Erica CooperCS4706
Spring 2011
![Page 2: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/2.jpg)
Concatenative Synthesis
![Page 3: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/3.jpg)
HMM Synthesis
A parametric model Can train on mixed data from many
speakers Model takes up a very small amount of
space Speaker adaptation
![Page 4: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/4.jpg)
HMMs
Some hidden process has generated some visible observation.
![Page 5: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/5.jpg)
HMMs
Some hidden process has generated some visible observation.
![Page 6: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/6.jpg)
HMMs
Hidden states have transition probabilities and emission probabilities.
![Page 7: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/7.jpg)
HMM Synthesis
Every phoneme+context is represented by an HMM.
The cat is on the mat.The cat is near the door.
< phone=/th/, next_phone=/ax/, word='the', next_word='cat', num_syllables=6, .... >
Acoustic features extracted: f0, spectrum, duration
Train HMM with these examples.
![Page 8: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/8.jpg)
HMM Synthesis
Each state outputs acoustic features (a spectrum, an f0, and duration)
![Page 9: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/9.jpg)
HMM Synthesis
Each state outputs acoustic features (a spectrum, an f0, and duration)
![Page 10: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/10.jpg)
HMM Synthesis
Many contextual features = data sparsity
Cluster similar-sounding phones e.g: 'bog' and 'dog'
the /aa/ in both have similar acoustic features, even though their context is a bit different
Make one HMM that produces both, and was trained on examples of both.
![Page 11: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/11.jpg)
Experiments: Google, Summer 2010
Can we train on lots of mixed data? (~1 utterance per speaker)
More data vs. better data
15k utterances from Google Voice Search as training data
ace hardware rural supply
![Page 12: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/12.jpg)
More Data vs. Better Data
Voice Search utterances filtered by speech recognition confidence scores
50%, 6849 utterances 75%, 4887 utterances 90%, 3100 utterances 95%, 2010 utterances 99%, 200 utterances
![Page 13: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/13.jpg)
Future Work
Speaker adaptation Phonetically-balanced training data Listening experiments Parallelization Other sources of data Voices for more languages
![Page 14: HMM-Based Speech Synthesis Erica Cooper CS4706 Spring 2011](https://reader036.vdocuments.net/reader036/viewer/2022062423/5697c01e1a28abf838cd0fe6/html5/thumbnails/14.jpg)
Reference
http://hts.sp.nitech.ac.jp