Voice Conversion
Hung-yi Lee (李宏毅)
Voice Conversion (VC)
[Figure: speech (T frames, each of dimension d) → Voice Conversion → speech (T' frames, each of dimension d)]
Sometimes T = T'
Vocoder
Used in VC, TTS, Denoise, etc. (not today)
• Rule-based: Griffin-Lim algorithm
• Deep Learning: WaveNet
Categories
Parallel Data
[Example: source "How are you?" → target "How are you?"]
Lack of training data:
• Model Pre-training [Huang, et al., arXiv’19]
• Synthesized data! [Biadsy, et al., INTERSPEECH’19]
Unparallel Data
[Example: source "天氣真好" ("The weather is really nice") → target "How are you?"]
• This is “audio style transfer”
• Borrowing techniques from image style transfer
Categories
Parallel Data
Unparallel Data
Direct Transformation
Feature Disentangle
Content Encoder: phonetic information
Speaker Encoder: speaker information
Feature Disentangle
[Figure, training: "Do you want to study PhD?" → Content Encoder and Speaker Encoder → Decoder → reconstructed "Do you want to study PhD?"]
[Figure, conversion: "Do you want to study PhD?" → Content Encoder; "Good bye" from another speaker → Speaker Encoder; Decoder → "Do you want to study PhD?" in the other speaker's voice]
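Feature Disentangle
[Figure: input audio → Content Encoder and Speaker Encoder → Decoder → reconstructed audio]
The reconstructed audio should be as close as possible to the input audio (L1 or L2 distance).
Ways to make the encoders disentangle content and speaker:
• Pre-training encoders
• Adding discriminator
• Designing network architecture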
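The encode, decode and compare loop can be sketched in a few lines of plain Python. This is a toy under loud assumptions, not the networks used in practice: the "speaker encoder" is just the per-channel mean over time, the "content encoder" is what remains after removing it, and the "decoder" adds the two back together. All function names are hypothetical.

```python
def speaker_encoder(frames):
    """Toy speaker encoder (assumption): one global value per channel."""
    n_frames = len(frames)
    n_channels = len(frames[0])
    return [sum(f[c] for f in frames) / n_frames for c in range(n_channels)]


def content_encoder(frames):
    """Toy content encoder (assumption): frames with the global means removed."""
    means = speaker_encoder(frames)
    return [[v - m for v, m in zip(f, means)] for f in frames]


def decoder(content, speaker):
    """Toy decoder (assumption): add the speaker's global values to the content."""
    return [[v + s for v, s in zip(f, speaker)] for f in content]


def l1_distance(x, y):
    """Mean absolute difference between two sequences of frames."""
    diffs = [abs(a - b) for fx, fy in zip(x, y) for a, b in zip(fx, fy)]
    return sum(diffs) / len(diffs)


def l2_distance(x, y):
    """Mean squared difference between two sequences of frames."""
    diffs = [(a - b) ** 2 for fx, fy in zip(x, y) for a, b in zip(fx, fy)]
    return sum(diffs) / len(diffs)


# Training view: encode and decode the same utterance, then compare.
audio_a = [[1.0, 4.0], [3.0, 8.0]]  # T = 2 frames, d = 2 channels
reconstructed = decoder(content_encoder(audio_a), speaker_encoder(audio_a))
reconstruction_loss = l1_distance(reconstructed, audio_a)

# Conversion view: content from one utterance, speaker from another.
audio_b = [[10.0, 0.0], [20.0, 0.0]]
converted = decoder(content_encoder(audio_a), speaker_encoder(audio_b))
```

In a real system the three modules are neural networks and the reconstruction loss is minimized by gradient descent; the toy only shows where each piece sits.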
Pre-training Encoders
[Figure: input audio → Content Encoder and Speaker Encoder → Decoder → reconstructed audio]
Speaker Encoder:
• One-hot vector for each speaker (Issue: difficult to consider new speakers)
• Speaker embedding (i-vector, d-vector, x-vector)
Content Encoder:
• Speech recognition (e.g. phoneme output: W AH N P AH N CH …)
Pre-training Encoders
• One-hot vector for each speaker
[Figure: Speaker A is encoded as (1, 0); the Decoder takes the Content Encoder output and this vector and reconstructs Speaker A's input audio]
[Figure: Speaker B is encoded as (0, 1); the Decoder reconstructs Speaker B's input audio in the same way]
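A minimal sketch of the one-hot speaker code, in plain Python. The helper names and the choice to concatenate the code onto each content frame are assumptions made for illustration; the slides do not say how the decoder consumes the vector. The sketch also shows why a speaker outside the training set is a problem: there is no slot for them.

```python
SPEAKERS = ["A", "B"]  # hypothetical training speakers


def one_hot(speaker, speakers=SPEAKERS):
    """Return a vector with 1 at the speaker's index and 0 elsewhere."""
    if speaker not in speakers:
        # A speaker never seen in training has no position in the vector.
        raise ValueError(f"unknown speaker: {speaker!r}")
    return [1 if s == speaker else 0 for s in speakers]


def decoder_input(content_frame, speaker_code):
    """Assumed conditioning scheme: append the speaker code to the frame."""
    return list(content_frame) + list(speaker_code)


code_a = one_hot("A")
code_b = one_hot("B")
conditioned = decoder_input([0.5, -0.2], code_b)

try:
    one_hot("C")
    new_speaker_supported = True
except ValueError:
    new_speaker_supported = False
```

Swapping `code_a` for `code_b` at the decoder input is what would turn reconstruction into conversion in this setup.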
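Adversarial Training
[Figure: "How are you?" → Content Encoder and Speaker Encoder → Decoder → "How are you?"; the Content Encoder output also feeds a Speaker Classifier]
Speaker Classifier (Discriminator): which speaker does the Content Encoder output come from?
Content Encoder: learn to fool the speaker classifier
Speaker classifier and encoder are learned iteratively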
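The two opposing objectives can be written down as a small sketch. It assumes the classifier outputs one logit per speaker and that the encoder's adversarial term is simply the negated cross-entropy; that is one common choice, not necessarily the one used in the lecture. Function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_index):
    """Negative log-probability assigned to the true speaker."""
    return -math.log(softmax(logits)[true_index])


def classifier_loss(logits, true_speaker):
    """Speaker classifier step: get better at naming the speaker."""
    return cross_entropy(logits, true_speaker)


def encoder_adversarial_loss(logits, true_speaker):
    """Content encoder step (assumed form): make the classifier worse."""
    return -cross_entropy(logits, true_speaker)


confident = [2.0, 0.0]  # classifier is fairly sure it is speaker 0
confused = [0.0, 0.0]   # classifier cannot tell the speakers apart

# Iterative training alternates between the two steps: update the
# classifier on classifier_loss with the encoder fixed, then update the
# encoder on encoder_adversarial_loss with the classifier fixed.
```

The classifier's loss is lower on `confident`, while the encoder's loss is lower on `confused`, which is the tension the alternating updates exploit.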
Instance Normalization
IN = instance normalization (remove global information)
[Figure: "How are you?" → Content Encoder (with IN) → Decoder; "How are you?" → Speaker Encoder → Decoder → "How are you?"]
Instance Normalization
IN = instance normalization (remove global information)
[Figure: channels of the feature maps inside the Phonetic Encoder, before and after IN]
Normalize for each channel
Each channel has zero mean and unit variance
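Per-channel normalization is easy to state in code. The sketch below assumes the features are stored as a list of T frames with d channel values each, and adds a small `eps` for numerical stability; both the layout and the `eps` value are assumptions.

```python
import math


def instance_norm(frames, eps=1e-5):
    """Normalize each channel over time to zero mean and unit variance.

    frames: list of T frames, each a list of d channel values (assumed layout).
    """
    n_frames = len(frames)
    n_channels = len(frames[0])
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for c in range(n_channels):
        values = [f[c] for f in frames]
        mean = sum(values) / n_frames
        var = sum((v - mean) ** 2 for v in values) / n_frames
        std = math.sqrt(var + eps)
        for t in range(n_frames):
            out[t][c] = (frames[t][c] - mean) / std
    return out


# Two channels with very different offsets and scales.
features = [[1.0, 10.0], [3.0, 30.0]]
normalized = instance_norm(features)
```

After normalization both channels look the same, roughly -1 then +1: the per-utterance offset and scale of each channel are gone, which is the global information being removed.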
Instance Normalization
IN = instance normalization (remove global information)
AdaIN = adaptive instance normalization (only influence global information)
[Figure: "How are you?" → Content Encoder (with IN) → Decoder (with AdaIN) → "How are you?"; "How are you?" → Speaker Encoder → AdaIN]
AdaIN = adaptive instance normalization (only influence global information)
[Figure: in the Decoder, IN produces $z_1, z_2, z_3, z_4$; the output of the Speaker Encoder gives $\gamma$ and $\beta$, which add global information to produce $z_1', z_2', z_3', z_4'$]
$z_i' = \gamma \odot z_i + \beta$
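The formula $z_i' = \gamma \odot z_i + \beta$ can be sketched as follows. The code assumes each $z_i$ is one frame holding d channel values and that $\gamma$ and $\beta$ each have one entry per channel; how the Speaker Encoder produces them is left out. Function names and the `eps` value are hypothetical.

```python
import math


def instance_norm(frames, eps=1e-5):
    """Normalize each channel over time to zero mean and unit variance."""
    n_frames = len(frames)
    n_channels = len(frames[0])
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for c in range(n_channels):
        values = [f[c] for f in frames]
        mean = sum(values) / n_frames
        var = sum((v - mean) ** 2 for v in values) / n_frames
        std = math.sqrt(var + eps)
        for t in range(n_frames):
            out[t][c] = (frames[t][c] - mean) / std
    return out


def adain(frames, gamma, beta, eps=1e-5):
    """Apply IN, then z_i' = gamma * z_i + beta element-wise on every frame."""
    normalized = instance_norm(frames, eps)
    return [[g * z + b for z, g, b in zip(frame, gamma, beta)]
            for frame in normalized]


decoder_features = [[1.0, 10.0], [3.0, 30.0]]
gamma = [2.0, 0.5]   # stand-in for Speaker Encoder output (assumption)
beta = [5.0, -1.0]   # stand-in for Speaker Encoder output (assumption)
stylized = adain(decoder_features, gamma, beta)
```

Because the same `gamma` and `beta` are applied to every frame, they can only change each channel's overall scale and offset: after AdaIN, channel c has mean `beta[c]` and standard deviation close to `gamma[c]`.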
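Instance Normalization
[Figure: "How are you?" → Content Encoder (with or without IN) → Speaker Classifier: which speaker? Training from VCTK]
Speaker classifier accuracy on the Content Encoder output:
          With IN    Without IN
Acc.      0.375      0.658
Lower accuracy with IN suggests that less speaker information remains in the content representation.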
Instance Normalization
[Figure: Speaker Encoder outputs for unseen speaker utterances (training from VCTK), labelled female and male]
For more results [Chou, et al., INTERSPEECH 2019]
Direct Transformation
• Training without parallel data
• Using CycleGAN
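Cycle GAN
$G_{X \to Y}$: takes audio of Speaker X and makes it become similar to Speaker Y
$D_Y$: outputs a scalar: does the input audio belong to Speaker Y?
[Figure: audio of Speaker X → $G_{X \to Y}$ → $D_Y$ → scalar; $D_Y$ is trained with real audio of Speaker Y]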
Cycle GAN
[Figure: audio of Speaker X → $G_{X \to Y}$ → $D_Y$ → scalar, where $G_{X \to Y}$ ignores its input and still outputs audio that sounds like Speaker Y]
Not what we want!
Cycle GAN
$D_Y$: scalar: input audio belongs to Speaker Y or not
Cycle consistency: audio of Speaker X → $G_{X \to Y}$ → $G_{Y \to X}$ → output as close as possible to the input (L1 or L2 distance)
[Figure labels: Speaker Y, identity]
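Cycle GAN
Speaker X audio → $G_{X \to Y}$ → $G_{Y \to X}$ → as close as possible to the Speaker X audio
Speaker Y audio → $G_{Y \to X}$ → $G_{X \to Y}$ → as close as possible to the Speaker Y audio
$D_Y$: scalar: belongs to speaker Y or not
$D_X$: scalar: belongs to speaker X or not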
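A toy sketch of the cycle-consistency term, in plain Python. The generators here are hand-written stand-ins (adding or subtracting a constant), chosen only so the arithmetic is easy to follow; real generators are trained networks. It also includes a generator that ignores its input, to show that cycle consistency penalizes it. All names are hypothetical.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def g_x_to_y(x):
    """Toy G_{X->Y} (assumption): shift every value up by 2."""
    return [v + 2.0 for v in x]


def g_y_to_x(y):
    """Toy G_{Y->X} (assumption): shift every value down by 2."""
    return [v - 2.0 for v in y]


def g_ignore_input(x):
    """Degenerate G_{X->Y}: always emits the same 'Speaker Y' audio."""
    return [5.0 for _ in x]


def cycle_loss(audio, g_forward, g_backward):
    """Distance between the input and its round trip through both generators."""
    return l1_distance(g_backward(g_forward(audio)), audio)


audio_x = [1.0, 2.0, 3.0]
audio_y = [3.0, 4.0, 5.0]

# Both directions, as on the slide: X -> Y -> X and Y -> X -> Y.
good_cycle = (cycle_loss(audio_x, g_x_to_y, g_y_to_x)
              + cycle_loss(audio_y, g_y_to_x, g_x_to_y))

# A generator that ignores its input cannot be inverted back to the input.
bad_cycle = cycle_loss(audio_x, g_ignore_input, g_y_to_x)
```

In training this term is added to the adversarial losses from $D_X$ and $D_Y$, so a generator has to both sound like the target speaker and keep enough of the input to be reversible.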
StarGAN
Speakers: $s_1$, $s_2$, $s_3$, $s_4$
$G$: input is audio of speaker x and a target speaker $s_i$; output is audio of speaker $s_i$
StarGAN
$G$: input is audio of speaker $s_i$ and a target speaker $s_j$; output is audio of speaker $s_j$
$D$: input is audio and a speaker $s_i$; output is a scalar: belongs to input speaker or not
Each speaker is represented as a vector.
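Cycle GAN: audio → $G_{X \to Y}$ → $G_{Y \to X}$ → as close as possible to the input; $D_Y$: scalar: belongs to speaker Y or not
StarGAN: audio of speaker $s_k$ → $G$ (target speaker $s_i$) → $G$ (target speaker $s_k$) → as close as possible to the input; $D$ (given speaker $s_i$): scalar: belongs to input speaker or not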
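The difference from CycleGAN is that one generator serves every speaker pair, because the target speaker is passed in as a vector. The sketch below is a toy under stated assumptions: each speaker vector is a single number, and the "generator" just re-centres the audio so its mean matches that number. The speaker table and function names are hypothetical.

```python
# Each speaker is represented as a vector (here a 1-dimensional toy vector).
SPEAKER_VECTORS = {"s1": [0.0], "s2": [10.0], "s3": [-4.0], "s4": [2.5]}


def generate(audio, target_speaker):
    """Toy G (assumption): shift audio so its mean matches the target vector."""
    mean = sum(audio) / len(audio)
    return [v - mean + target_speaker[0] for v in audio]


def l1_distance(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


audio_s1 = [-1.0, 0.0, 1.0]  # toy utterance from speaker s1

# Convert s1 -> s2 with the single shared generator.
converted = generate(audio_s1, SPEAKER_VECTORS["s2"])

# Cycle consistency: convert back to the original speaker and compare.
cycled = generate(converted, SPEAKER_VECTORS["s1"])
cycle_loss_value = l1_distance(cycled, audio_s1)
```

With N speakers, a CycleGAN setup needs a separate pair of generators for each pair of speakers, whereas here the same `generate` function handles any source and target by changing only the speaker vector.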
Reference
• [Huang, et al., arXiv’19] Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda, Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019
• [Biadsy, et al., INTERSPEECH’19] Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, Ye Jia, Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation, INTERSPEECH, 2019