Voice Conversion - speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice...
TRANSCRIPT
![Page 1: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/1.jpg)
Voice Conversion
Hung-yi Lee
李宏毅
![Page 2: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/2.jpg)
Voice Conversion (VC)
speech (d × T) → Voice Conversion → speech (d × T’)
Sometimes T = T’
Vocoder: used in VC, TTS, denoising, etc. (not today)
• Rule-based: Griffin-Lim algorithm
• Deep Learning: WaveNet
![Page 3: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/3.jpg)
Categories
Parallel Data: source says “How are you?”, target says “How are you?”
Lack of training data:
• Model pre-training [Huang, et al., arXiv’19]
• Synthesized data! [Biadsy, et al., INTERSPEECH’19]
Unparallel Data: source says “The weather is really nice” (天氣真好), target says “How are you?”
• This is “audio style transfer”
• Borrowing techniques from image style transfer
![Page 4: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/4.jpg)
Categories
• Parallel Data
• Unparallel Data
  • Direct Transformation
  • Feature Disentangle
Content Encoder: phonetic information
Speaker Encoder: speaker information
![Page 5: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/5.jpg)
Feature Disentangle
Do you want to study PhD?
Content Encoder
Do you want to study PhD?
Decoder
Speaker Encoder
Do you ……
Do you want to study PhD?
![Page 6: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/6.jpg)
Feature Disentangle
Do you want to study PhD?
Good bye
Content Encoder
Do you want to study PhD?
Decoder
Speaker Encoder
Do you ……
Do you want to study PhD?
![Page 7: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/7.jpg)
Feature Disentangle
input audio → Content Encoder + Speaker Encoder → Decoder → reconstructed audio
• Pre-training encoders
• Adding discriminator
• Designing network architecture
Input audio and reconstructed audio: as close as possible (L1 or L2 distance)
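As a rough illustration of the "as close as possible (L1 or L2 distance)" objective, the sketch below compares a toy input feature sequence with its reconstruction. The function names and numbers are made up for illustration; a real system would compute such a loss on acoustic features (for example spectrogram frames) inside a deep learning framework.

```python
import math


def l1_distance(x, y):
    """Sum of absolute differences between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Euclidean distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy stand-ins for the input audio features and the decoder output.
input_audio = [0.0, 1.0, 2.0, 3.0]
reconstructed = [0.5, 1.0, 1.0, 3.0]

print(l1_distance(input_audio, reconstructed))  # 1.5
print(l2_distance(input_audio, reconstructed))
```

Either distance is zero only when the reconstruction matches the input exactly, which is what the autoencoder is trained toward.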
![Page 8: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/8.jpg)
Pre-training Encoders
input audio → Content Encoder + Speaker Encoder → Decoder → reconstructed audio
Speaker Encoder:
• One-hot vector for each speaker (issue: difficult to consider new speakers)
• Speaker embedding (i-vector, d-vector, x-vector)
Content Encoder:
• Speech recognition: W AH N P AH N CH …
![Page 9: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/9.jpg)
Pre-training Encoders
input audio (Speaker A) → Content Encoder → Decoder → reconstructed audio (Speaker A)
• One-hot vector for each speaker: Speaker A = [1, 0] (dimensions: A, B)
![Page 10: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/10.jpg)
Pre-training Encoders
input audio (Speaker B) → Content Encoder → Decoder → reconstructed audio (Speaker B)
• One-hot vector for each speaker: Speaker B = [0, 1] (dimensions: A, B)
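A minimal sketch of the one-hot speaker code, assuming a fixed list of known speakers. The helper `one_hot` is a name invented here, not part of any library. It also shows why new speakers are difficult: a speaker outside the list has no slot in the vector.

```python
def one_hot(speaker, speakers):
    """Return a one-hot list marking `speaker` within the known `speakers`."""
    if speaker not in speakers:
        # A speaker unseen at training time has no dimension of its own.
        raise ValueError(f"unknown speaker: {speaker!r}")
    return [1 if s == speaker else 0 for s in speakers]


speakers = ["A", "B"]
print(one_hot("A", speakers))  # [1, 0]
print(one_hot("B", speakers))  # [0, 1]
```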
![Page 12: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/12.jpg)
Adversarial Training
“How are you?” → Content Encoder + Speaker Encoder → Decoder → “How are you?”
Speaker Classifier (Discriminator): predicts the speaker from the Content Encoder output
Content Encoder: learn to fool the speaker classifier
Speaker classifier and encoder are learned iteratively
![Page 13: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/13.jpg)
Instance Normalization
“How are you?” → Content Encoder (with IN) + Speaker Encoder → Decoder → “How are you?”
IN = instance normalization (remove global information)
![Page 14: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/14.jpg)
Instance Normalization
IN = instance normalization (remove global information)
Phonetic Encoder
![Page 15: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/15.jpg)
Instance Normalization
[Figure: feature channels inside the Phonetic Encoder, before and after IN]
Normalize for each channel
Each channel has zero mean and unit variance
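The per-channel normalization can be sketched in plain Python as follows. This assumes the features of one utterance are stored as a list of channels, each holding values over time; the layout, the function name `instance_norm`, and the small `eps` term are choices made for this example.

```python
import math


def instance_norm(channels, eps=1e-5):
    """Normalize each channel to zero mean and unit variance over time.

    `channels` is a list of rows; each row holds one channel's values
    across all time steps of a single utterance.
    """
    normalized = []
    for row in channels:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        # eps avoids division by zero for a constant channel.
        normalized.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return normalized


features = [
    [1.0, 2.0, 3.0, 4.0],      # channel 0
    [10.0, 10.0, 30.0, 30.0],  # channel 1, different scale and offset
]
for row in instance_norm(features):
    print([round(v, 3) for v in row])
```

After normalization the two channels no longer differ in overall level or spread, which is the "global information" that IN removes.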
![Page 16: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/16.jpg)
Instance Normalization
“How are you?” → Content Encoder (with IN) → Decoder → “How are you?”
“How are you?” → Speaker Encoder → Decoder
IN = instance normalization (remove global information)
![Page 17: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/17.jpg)
Instance Normalization
“How are you?” → Content Encoder (with IN) → Decoder (with AdaIN) → “How are you?”
“How are you?” → Speaker Encoder → AdaIN
IN = instance normalization (remove global information)
AdaIN = adaptive instance normalization (only influence global information)
![Page 18: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/18.jpg)
AdaIN = adaptive instance normalization (only influence global information)
Decoder: IN → 𝑧1 𝑧2 𝑧3 𝑧4
Output of Speaker Encoder → 𝛾, 𝛽
Add Global: 𝑧𝑖′ = 𝛾 ⨀ 𝑧𝑖 + 𝛽, giving 𝑧1′ 𝑧2′ 𝑧3′ 𝑧4′
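A small sketch of AdaIN under stated assumptions: each 𝑧𝑖 is taken to be the feature vector at time step i, and 𝛾 and 𝛽 hold one value per channel, so the update scales and shifts every channel as a whole. The names `instance_norm` and `adain` and the numbers are illustrative only; in the model, 𝛾 and 𝛽 would come from the Speaker Encoder.

```python
import math


def instance_norm(channels, eps=1e-5):
    """Normalize each channel (row) to zero mean and unit variance."""
    normalized = []
    for row in channels:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        normalized.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return normalized


def adain(channels, gamma, beta, eps=1e-5):
    """Instance-normalize, then apply a per-channel scale and shift."""
    normalized = instance_norm(channels, eps)
    return [
        [g * v + b for v in row]
        for row, g, b in zip(normalized, gamma, beta)
    ]


decoder_features = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 10.0, 30.0, 30.0],
]
gamma = [2.0, 0.5]   # hypothetical Speaker Encoder output (scale)
beta = [1.0, -1.0]   # hypothetical Speaker Encoder output (shift)
styled = adain(decoder_features, gamma, beta)
for row in styled:
    print([round(v, 3) for v in row])
```

Each output channel ends up with mean close to its 𝛽 entry and standard deviation close to its 𝛾 entry, so the speaker code only changes global statistics.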
![Page 19: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/19.jpg)
Instance Normalization
Content Encoder (with IN) and Speaker Encoder: trained on VCTK
“How are you?” → Content Encoder → Speaker Classifier: which speaker?
Speaker Classifier accuracy (Acc.): 0.375 with IN, 0.658 without IN
![Page 20: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/20.jpg)
Instance Normalization
“How are you?” → Content Encoder (with IN), Speaker Encoder
Trained on VCTK
[Figure: unseen speaker utterances, labeled female and male]
For more results, see [Chou, et al., INTERSPEECH 2019]
![Page 21: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/21.jpg)
Categories
Parallel Data
Unparallel Data
Direct Transformation
Feature Disentangle
Voice Conversion
• Training without parallel data
• Using CycleGAN
![Page 22: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/22.jpg)
Cycle GAN
Data: audio of Speaker X and audio of Speaker Y
𝐺𝑋→𝑌: Speaker X → become similar to speaker Y
𝐷𝑌: audio → scalar (input audio belongs to speaker Y?)
![Page 23: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/23.jpg)
Cycle GAN
Data: audio of Speaker X and audio of Speaker Y
𝐺𝑋→𝑌: Speaker X → become similar to speaker Y
𝐷𝑌: audio → scalar (input audio belongs to speaker Y?)
Problem: 𝐺𝑋→𝑌 may ignore its input. Not what we want!
![Page 24: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/24.jpg)
Cycle GAN
Speaker X → 𝐺𝑋→𝑌 → 𝐺𝑌→𝑋 → reconstructed audio
Cycle consistency: input and reconstruction as close as possible (L1 or L2 distance)
𝐷𝑌: scalar (input audio belongs to speaker Y or not)
Speaker Y
identity
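The cycle consistency idea can be sketched with toy stand-ins for the two generators. Everything here (the shift functions, the constant generator, the numbers) is invented for illustration and is not how a trained CycleGAN behaves; it only shows that a generator which ignores its input is penalized by the cycle term.

```python
def l1_distance(x, y):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def cycle_loss(x, g_xy, g_yx):
    """Distance between x and its round trip X -> Y -> X."""
    return l1_distance(x, g_yx(g_xy(x)))


def shift_up(x):
    """Toy G_X->Y: keeps the content, changes the 'speaker' offset."""
    return [v + 1.0 for v in x]


def shift_down(y):
    """Toy G_Y->X: undoes the offset."""
    return [v - 1.0 for v in y]


def ignore_input(x):
    """Toy degenerate G_X->Y: always emits the same output."""
    return [5.0 for _ in x]


x = [0.0, 1.0, 2.0]
print(cycle_loss(x, shift_up, shift_down))      # 0.0
print(cycle_loss(x, ignore_input, shift_down))  # 3.0
```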
![Page 25: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/25.jpg)
Cycle GAN
X → 𝐺𝑋→𝑌 → 𝐺𝑌→𝑋: as close as possible
Y → 𝐺𝑌→𝑋 → 𝐺𝑋→𝑌: as close as possible
𝐷𝑌: scalar (belongs to speaker Y or not)
𝐷𝑋: scalar (belongs to speaker X or not)
![Page 26: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/26.jpg)
StarGAN
Speakers: 𝑠1, 𝑠2, 𝑠3, 𝑠4
𝐺: (audio of speaker x, speaker 𝑠𝑖) → audio of speaker 𝑠𝑖
![Page 27: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/27.jpg)
StarGAN
𝐺: (audio of speaker 𝑠𝑖, speaker 𝑠𝑗) → audio of speaker 𝑠𝑗
𝐷: (audio, speaker) → scalar (belongs to input speaker or not)
Each speaker is represented as a vector.
![Page 28: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/28.jpg)
CycleGAN: X → 𝐺𝑋→𝑌 → 𝐺𝑌→𝑋: as close as possible
𝐷𝑌: scalar (belongs to speaker Y or not)
StarGAN: audio of speaker 𝑠𝑘 → 𝐺 (speaker 𝑠𝑖) → 𝐺 (speaker 𝑠𝑘): as close as possible
𝐷: scalar (belongs to input speaker or not)
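To make the single-generator cycle concrete, here is a toy sketch in which one function plays the role of 𝐺 for every speaker pair. It assumes each speaker is a one-element vector holding a target mean and that 𝐺 simply swaps the mean of the audio features. This is a made-up stand-in, not the StarGAN architecture.

```python
def l1_distance(x, y):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def toy_generator(audio, speaker_vector):
    """Hypothetical G: replace the mean of `audio` with the target speaker's value."""
    mean = sum(audio) / len(audio)
    return [v - mean + speaker_vector[0] for v in audio]


s_i = [10.0]               # toy vector for speaker s_i
s_k = [2.0]                # toy vector for speaker s_k
audio_k = [1.0, 2.0, 3.0]  # toy audio of speaker s_k (its mean matches s_k)

converted = toy_generator(audio_k, s_i)  # s_k -> s_i
cycled = toy_generator(converted, s_k)   # s_i -> s_k, same G

print(converted)                     # [9.0, 10.0, 11.0]
print(l1_distance(audio_k, cycled))  # 0.0
```

The same generator handles both directions because the target speaker is an input, which is what lets one model cover many speakers instead of one generator per pair.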
![Page 29: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/29.jpg)
Categories
Parallel Data
Unparallel Data
Direct Transformation
Feature Disentangle
![Page 30: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,](https://reader035.vdocuments.net/reader035/viewer/2022081622/6134b69bdfd10f4dd73be80b/html5/thumbnails/30.jpg)
Reference
• [Huang, et al., arXiv’19] Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda, Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019
• [Biadsy, et al., INTERSPEECH’19] Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, Ye Jia, Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation, INTERSPEECH, 2019