
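Speech synthesis theory and speech synthesis systems (Fundamentals of speech synthesis)

Shinnosuke Takamichi (高道慎之介)

Assistant Professor, Graduate School of Information Science and Technology, The University of Tokyo

NAIST (Nara Institute of Science and Technology), Sound Information Processing, lecture 6 (2019/11/15)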

Purpose of this talk

Review of the previous talk

– Speech production & speech analysis

Speech synthesis

– Text-to-speech

– Voice conversion

Recent developments in speech synthesis

– WaveNet, Tacotron, GAN, moment matching, etc.

How to artificially synthesize speech

Report

Python programming on Google Colab

Submit your code and results to the submission page. (I will announce the details after this talk.)

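REVIEW OF PREVIOUS TALK (SPEECH PRODUCTION & ANALYSIS)

Speech production

Pitch generation: open and close the vocal folds to excite the air flowing from the lungs.

Timbre: move the mouth and tongue to filter the signal with time-varying vocal tract shapes.

Convolving the source signal with the vocal tract filter gives the voice.

[Figure: the source signal and the vocal tract response are convolved into the voice waveform; horizontal axis: Time]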
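The convolution on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the sampling rate, the 100 Hz pitch, and the single damped-sinusoid resonance standing in for the vocal tract (`vocal_tract_ir`, 700 Hz) are all assumed values.

```python
import numpy as np

fs = 16000                      # sampling rate in Hz (assumed)
f0 = 100.0                      # pitch of the source in Hz (assumed)

# Source: one pulse every 1/F0 seconds over 0.1 s.
source = np.zeros(int(fs * 0.1))
source[::int(round(fs / f0))] = 1.0

# Vocal tract: a single damped resonance (hypothetical one-formant filter).
t = np.arange(400) / fs
vocal_tract_ir = np.exp(-np.pi * 100.0 * t) * np.sin(2 * np.pi * 700.0 * t)

# Voice = source convolved with the vocal tract impulse response.
voice = np.convolve(source, vocal_tract_ir)
```

Each pulse in `source` triggers one copy of the impulse response, so changing `f0` moves the pitch while changing the resonance changes the timbre.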

Structures of the spectrum of voice

[Figure: three power-versus-frequency plots. The frequency characteristics of voice consist of a detailed structure, determined by the fundamental frequency (F0), and a spectral envelope.]

Source excitation and vocal tract as acoustic tubes

The source signal is an impulse train or white noise; the vocal tract is a concatenation of acoustic tubes.

Voiced sound: periodic excitation (the pulse interval is the reciprocal of F0)

Unvoiced sound: aperiodic excitation (white noise)

[Figure: the vocal tract drawn as concatenated acoustic tubes, from the vocal cord side to the lip side]

Control pitch by the excitation signals. Control tone by the shapes of the vocal tract.

* Partially quoted from http://ml.cs.yamanashi.ac.jp/media/20151114/1114slide.pptx
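As a rough sketch of the two excitation types, the function below builds a pulse train for voiced frames and white noise for unvoiced frames from a per-frame F0 contour. The function name `make_excitation`, the convention that F0 = 0 marks an unvoiced frame, and the frame length are assumptions made for illustration.

```python
import numpy as np

def make_excitation(f0_per_frame, fs_hz=16000, frame_len=80, seed=0):
    """Pulse train in voiced frames (F0 > 0), white noise in unvoiced frames."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(f0_per_frame) * frame_len)
    next_pulse = 0.0
    for i, f0 in enumerate(f0_per_frame):
        start, end = i * frame_len, (i + 1) * frame_len
        if f0 > 0:
            period = fs_hz / f0              # pulse interval = 1 / F0
            next_pulse = max(next_pulse, float(start))
            while next_pulse < end:
                out[int(next_pulse)] = 1.0
                next_pulse += period
        else:
            out[start:end] = rng.standard_normal(frame_len)
    return out

# 20 voiced frames at 100 Hz followed by 20 unvoiced frames.
excitation = make_excitation([100.0] * 20 + [0.0] * 20)
```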

Example of a spectrogram (darker points indicate larger power)

[Figure: spectrogram; horizontal axis: Time, vertical axis: Frequency. Annotations: vocal tract resonance (formants); effects of F0]
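A spectrogram like the one described here can be computed with a short-time Fourier transform. The sketch below is one simple way to do it, assuming a Hann window, a 512-point FFT, and a 128-sample hop; none of these settings come from the slides.

```python
import numpy as np

def power_spectrogram(x, n_fft=512, hop=128):
    """Return an array of shape (frames, n_fft // 2 + 1) of power values."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000.0 * t)      # 1 kHz test tone
spec = power_spectrogram(tone)
peak_hz = spec.argmax(axis=1) * fs / 512   # strongest frequency per frame
```

For real speech, the rows of `spec` would show the harmonic stripes caused by F0 and the broader formant peaks caused by the vocal tract.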

SPEECH SYNTHESIS

Speech synthesis: a method to artificially synthesize voices

Speech synthesis in a narrow sense

– Text-to-speech (TTS)

Speech synthesis in a wide sense

– Text-to-speech (TTS)

– Voice conversion (VC)

– Concept-to-speech (CTS)

• Concept → language generation → speech synthesis

– Articulatory-to-speech (articulatory-acoustic mapping)

• Conversion between articulatory properties and speech

– Multi-modal speech synthesis

• Speech synthesis that includes video and other modalities

Text-to-speech & voice conversion

Text-to-speech (TTS)

– Synthesize speech from text and similar input

– For human-computer interaction

Voice conversion (VC)

– Convert para- and non-linguistic information while preserving linguistic information

– For human communication beyond the constraints of human speech production

[Figure: text → TTS → speech; speech → VC → speech]

Speech information

Linguistic information: information that can be written as text (phonetic properties). Recognized by speech recognition in a narrow sense (speech-to-text).

Para-linguistic information: information that cannot be written as text and that the speaker adds intentionally (e.g., emotion). Recognized by emotion recognition and similar tasks.

Non-linguistic information: information that cannot be written as text and that is added regardless of the speaker's intention (e.g., speaker individuality). Recognized by speaker recognition and similar tasks.

What information does VC preserve and convert?

Example: speaker conversion

Example: emotion conversion

Example: pronunciation conversion (/a/, /i/)

[Figure: for each example, the linguistic, para-linguistic, and non-linguistic information of the input and output speech, showing which are preserved and which are converted]

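What information does TTS preserve and convert?

Example: ultimate speech-to-speech translation

[Figure: the linguistic, para-linguistic, and non-linguistic information of the input speech is extracted by speech recognition, emotion recognition, and speaker recognition; the text is translated (text translation), and speech synthesis generates the output speech with its own linguistic, para-linguistic, and non-linguistic information]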

DEMO

Multi-lingual text-to-speech synthesis [Takamichi15, Kobayashi16]

[Audio demo: Telugu, Tamil, Marathi, Malayalam, Japanese, Bengali, Hindi; conventional vs. ours]

Multi-Japanese-dialect text-to-speech [Akiyama18]

[Figure: dialect text → multi-dialect speech synthesis → dialect speech; demo: Miyazaki-ben (Miyazaki dialect)]

English-read-by-Japanese text-to-speech synthesis [Oshima16]

[Figure: Japanese-accented English uttered by a Japanese undergraduate is used for voice building; text → text-to-speech; goal: make the voice fluent]

[Audio demo: “I can see that knife now.”; conventional vs. ours]

High-quality voice conversion

[Audio demo: conversion between the voices SAYAKA and HIKARI (conventional method); voices from http://voicetext.jp/voiceactor/]

PROCEDURES OF SPEECH SYNTHESIS

History of speech synthesis

1939: Voder (Bell Labs)

– Its predecessor was the vocoder (voice + coder)

1961: ‘Daisy Bell’ sung by speech synthesis (Bell Labs)

~1990: Formant synthesis

– Speech rules designed by experts

1990~: Concatenative (waveform-concatenation) synthesis

– Diphone synthesis, unit selection synthesis

1995~: Statistical parametric speech synthesis

– HMM-based and DNN-based speech synthesis

– 1998~: GMM-based and DNN-based voice conversion

Unit selection synthesis and statistical parametric speech synthesis are corpus-based synthesis using pre-recorded voices.

Sample-based approach (concatenative): unit selection synthesis

[Figure: speech segments in the speech database; the sequence of selected segments $u_{n-1}, u_n, u_{n+1}$; the speech parameter sequence $t_{n-1}, t_n, t_{n+1}$ predicted from the input text]

Target cost: $C_{\mathrm{t}}^{(\mathrm{us})}(t_n, u_n)$

Concatenation cost: $C_{\mathrm{c}}^{(\mathrm{us})}(u_{n-1}, u_n)$

Select segments by minimizing the sum of the concatenation cost and the target cost.
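The minimization over segment sequences is typically solved by dynamic programming. The sketch below is a simplified illustration: both costs are assumed to be squared Euclidean distances between feature vectors, and the weight `w_concat` and the function name `select_units` are hypothetical. Real systems use richer costs, for example distances measured at the join boundaries.

```python
import numpy as np

def select_units(targets, candidates, w_concat=1.0):
    """Pick one candidate per position, minimizing target + concatenation cost.

    targets: array (N, D); candidates: list of N arrays, each (K_n, D).
    Returns the list of selected candidate indices.
    """
    target_cost = [np.sum((c - t) ** 2, axis=1)
                   for c, t in zip(candidates, targets)]
    best = [target_cost[0]]      # best accumulated cost per candidate
    back = []                    # back-pointers for each position
    for n in range(1, len(targets)):
        diff = candidates[n - 1][:, None, :] - candidates[n][None, :, :]
        concat_cost = w_concat * np.sum(diff ** 2, axis=2)
        total = best[-1][:, None] + concat_cost
        back.append(np.argmin(total, axis=0))
        best.append(np.min(total, axis=0) + target_cost[n])
    path = [int(np.argmin(best[-1]))]
    for pointers in reversed(back):
        path.append(int(pointers[path[-1]]))
    return path[::-1]

targets = np.array([[0.0], [1.0], [2.0]])
candidates = [np.array([[0.0], [1.0], [2.0]])] * 3
follow_target = select_units(targets, candidates, w_concat=0.0)
smooth_path = select_units(targets, candidates, w_concat=100.0)
```

With no concatenation cost the selection simply follows the targets; with a heavy concatenation cost it prefers a sequence without jumps, which shows the trade-off between the two costs.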

Statistics-based approach: statistical parametric speech synthesis

[Figure: statistical models trained using the speech database; statistical models selected based on the text information; generated speech parameter sequence]

Generate speech parameters from statistical models trained on speech parameters.

Procedures of statistical parametric speech synthesis

Text analysis: extract speech-related features from the input text.

Speech analysis: extract speech parameters from the input speech.

Acoustic modeling: model the relationship between input and output features.

Speech parameter generation and waveform synthesis: synthesize speech from the generated speech parameters (cepstrum, F0).

[Figure: input text → text analysis → acoustic modeling → speech parameter generation → waveform synthesis → output speech; speech analysis feeds acoustic modeling]

Text analysis for text-to-speech

We want to read a text aloud. How do we read the text?

– Several components link text and speech

– ① Pronunciation & syllable

– ② Accent & stress

– ③ Rhythm & isochrony

① Pronunciation & syllable

Pronunciation

– Differences in phonemes; a phoneme is the smallest unit of pronunciation

Syllable

– Syllable: a language-dependent unit of speech (in Japanese, roughly one hiragana character)

• Open syllable: a syllable ending in a vowel, e.g., Japanese /か (k a)/

• Closed syllable: a syllable ending in a consonant, e.g., English /it (i t)/

– Consonant cluster: consecutive consonants within the same syllable, with no vowel between them

• Japanese: almost always CV (C: consonant, V: vowel)

• English: CCCV, CCV, VCC, VCCC, etc. appear frequently

– straight = stra + ight
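To make the C/V notation concrete, the sketch below labels each phoneme of a syllable as a consonant or a vowel. The five-symbol vowel set and the function name `cv_pattern` are simplifying assumptions for illustration; a real phoneme inventory is larger.

```python
# Simplified vowel inventory (assumption for illustration only).
VOWELS = {"a", "i", "u", "e", "o"}

def cv_pattern(phonemes):
    """Map a phoneme sequence to its consonant/vowel pattern."""
    return "".join("V" if p in VOWELS else "C" for p in phonemes)

open_syllable = cv_pattern(["k", "a"])         # Japanese /ka/
closed_syllable = cv_pattern(["i", "t"])       # English /it/
cluster = cv_pattern(["s", "t", "r", "a"])     # onset of "straight"
```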

② Accent & stress

Accent and stress of speech

– Language-dependent changes that appear in the spectrum and F0

Example: Japanese (accent; low and high F0): わたしはとしょかんへいきました。

Example: Chinese (accent: four tones; F0 changes): 我去图书馆

Example: English (stress): I went to the library to study for the exam.

③ Rhythm & isochrony

Isochrony of speech

– The postulated rhythmic division of time into equal portions by a language: a language-dependent speech unit appears at equal time intervals

Example 1: Japanese (mora-timed): わたしはとしょかんへいきました。

Example 2: Chinese (syllable-timed): 我去图书馆

Example 3: English (stress-timed): I went to the library to study for the exam.

[Figure: the marked points appear at equal time intervals.]

Who decides the accent rules?: the NHK accent dictionary

Revised in 2016

– The sixth revision, and the first in 18 years. The first edition was published in 1943.

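What changed from the previous version? [太田他 (Ota et al.), 2016.]

“At last, 「ク＼マ」 (kuma, "bear", with the pitch falling after the first mora) has appeared!”

– What is the accent of “クマが出た” ("a bear appeared")?

– Loanwords are becoming flat (unaccented)

– Compound words (e.g., 歩み＋寄る) are changing from flat to accented

– and so on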

Text-speech alignment in text-to-speech

The text feature sequence and the speech feature sequence usually differ in length.

– Align them, e.g., by phoneme-speech alignment using the results of speech recognition

[Figure: the text あらゆる・・・ with its accent phrase (Low/High) and phonemes a r a y u r u, aligned with the speech features (cepstrum, F0) of the utterance あらゆる]

Speech-speech alignment in voice conversion

Speech utterances (for example, by different speakers) also differ in sequence length.

– Align them using dynamic time warping (DTW) or a similar method

[Figure: the cepstrum and F0 sequences of two utterances of あらゆる, aligned frame by frame]
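A minimal DTW sketch follows, assuming Euclidean frame distances and the basic step pattern (diagonal, vertical, horizontal) without slope constraints. The function name `dtw_path` is illustrative; toolkits used in practice add windowing and other constraints.

```python
import numpy as np

def dtw_path(x, y):
    """Align feature sequences x (N, D) and y (M, D).

    Returns the list of aligned (i, j) frame pairs and the total cost.
    """
    n, m = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)    # accumulated cost, 1-indexed
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return path[::-1], float(acc[n, m])

source = np.array([[0.0], [1.0], [2.0]])
target = np.array([[0.0], [0.0], [1.0], [2.0]])   # first frame is held longer
path, cost = dtw_path(source, target)
```

The returned pairs give the frame-level correspondence that is then used to build training pairs for the conversion model.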

DNN-based modeling in text-to-speech

[Figure: for each frame t = 1, 2, …, T, a DNN maps text features to speech features.]

Text features (input): current phoneme (a, i, u, …; encoded as 0/1 values), accent, mora position (1, 2, 3, …), temporal position, and so on

Speech features (output): spectrum (timbre), F0 (pitch), voiced/unvoiced label

The DNN is trained to minimize the squared error between the natural and generated speech parameters.
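The training criterion can be sketched with a tiny network written directly in NumPy. Everything specific here is an assumption for illustration: one hidden layer of 16 tanh units, plain full-batch gradient descent, one-hot phoneme inputs, and synthetic two-dimensional targets standing in for real speech parameters.

```python
import numpy as np

def train_mlp(x, y, hidden=16, lr=0.05, steps=500, seed=0):
    """Train a one-hidden-layer network by minimizing the mean squared error."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.5, (x.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.5, (hidden, y.shape[1]))
    b2 = np.zeros(y.shape[1])
    losses = []
    for _ in range(steps):
        h = np.tanh(x @ w1 + b1)             # hidden activations
        pred = h @ w2 + b2                   # generated speech parameters
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        g_pred = 2.0 * err / err.size        # gradient of the MSE
        g_h = (g_pred @ w2.T) * (1.0 - h ** 2)
        w2 -= lr * (h.T @ g_pred)
        b2 -= lr * g_pred.sum(axis=0)
        w1 -= lr * (x.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return (w1, b1, w2, b2), losses

# Toy data: 3 phonemes, 10 frames each, one-hot text features per frame.
phoneme_ids = np.repeat([0, 1, 2], 10)
text_feats = np.eye(3)[phoneme_ids]
speech_feats = text_feats @ np.array([[1.0, 0.2], [0.0, -0.5], [-1.0, 0.8]])
params, losses = train_mlp(text_feats, speech_feats)
```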

DNN-based modeling in voice conversion

[Figure: for each frame t = 1, 2, …, T, a DNN maps the input speech features (including F0, pitch) to the output speech features (including F0, pitch).]

Synthesis procedure in text-to-speech

A separate duration model (e.g., a DNN) is prepared to predict durations.

[Figure: the text あらゆる・・・ → accent phrase (Low/High) and phonemes a r a y u r u → predict duration → text features + duration info. → predict speech params. (cepstrum, F0) → speech あらゆる]
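Once durations are predicted, phoneme-level text features have to be expanded to the frame level before the acoustic model can predict one speech parameter vector per frame. The sketch below shows one plausible way to do this; the function name `expand_to_frames` and the choice of a relative within-phoneme position as the extra duration feature are assumptions.

```python
import numpy as np

def expand_to_frames(phoneme_feats, durations):
    """Repeat each phoneme's features for its predicted number of frames.

    Appends the relative position within the phoneme (0 at its first frame).
    """
    durations = np.asarray(durations)
    frames = np.repeat(phoneme_feats, durations, axis=0)
    position = np.concatenate([np.arange(d) / d for d in durations])
    return np.column_stack([frames, position])

phoneme_feats = np.eye(2)                 # two phonemes, one-hot encoded
frame_feats = expand_to_frames(phoneme_feats, [2, 3])   # 2 and 3 frames
```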

Conversion process in voice conversion

[Figure: the speech features (cepstrum, F0) of the input utterance あらゆる are converted into the speech features of the output utterance, from which the output speech is synthesized.]

In some studies, a speech rate conversion model is also used.

Further improvements

Considering the temporal structure of speech parameters

– Dynamic features (temporal delta): $\Delta y_t = \frac{1}{2} y_{t+1} - \frac{1}{2} y_{t-1}$ ($t$: time)

– Recurrent DNNs (RNN, LSTM)

Methods other than DNNs

– HMM: hidden Markov model

– GMM: Gaussian mixture model

– GPR: Gaussian process regression

– NMF: nonnegative matrix factorization

– Hybrid with unit selection (sample-based synthesis)
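The delta formula above translates directly into array slicing. The only choice the formula leaves open is what to do at the first and last frames; repeating the edge frames, as done below, is an assumption.

```python
import numpy as np

def delta(y):
    """Compute delta_y[t] = 0.5 * y[t+1] - 0.5 * y[t-1] for y of shape (T, D)."""
    padded = np.pad(y, ((1, 1), (0, 0)), mode="edge")   # repeat edge frames
    return 0.5 * padded[2:] - 0.5 * padded[:-2]

static = np.array([[0.0], [1.0], [4.0], [9.0]])
dynamic = delta(static)
```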

RECENT STUDIES IN SPEECH SYNTHESIS

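WaveNet

In 2016, a neural network that directly generates waveforms was proposed.

– A single neural network that removes speech feature generation, waveform generation, and frame-wise analysis, and generates the waveform sample by sample

[Figure: conventional pipeline (あらゆる → a r a y u r u → … → waveform) vs. WaveNet (あらゆる → a r a y u r u → WaveNet → waveform)]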

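WaveNet-based TTS [Oord et al., 2016.]

[Figure: text features are upsampled (unpooling) and fed to the network.]

・Waveform discretization: the float signal is compressed to 256 levels (rather than the 65536 levels of 16-bit audio) before processing

・Autoregressive model: the current sample is predicted from the previous results, similarly to LPC

・Dilated convolution: convolutions that skip units enlarge the receptive field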
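Two of these points are easy to check numerically. The sketch below uses μ-law companding with μ = 255 for the 256-level discretization, which is the scheme described in the WaveNet paper, and computes the receptive field of a stack of dilated convolutions. The function names and the kernel size of 2 are assumptions for illustration.

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Compress a waveform in [-1, 1] to integer levels 0..mu."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.clip(((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64), 0, mu)

def mulaw_decode(q, mu=255):
    """Map integer levels 0..mu back to a waveform in [-1, 1]."""
    y = 2.0 * q.astype(np.float64) / mu - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def receptive_field(dilations, kernel_size=2):
    """Number of past samples seen by a stack of dilated causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

wave = np.linspace(-1.0, 1.0, 1001)
levels = mulaw_encode(wave)
restored = mulaw_decode(levels)
field = receptive_field([1, 2, 4, 8, 16, 32, 64, 128, 256, 512])
```

Doubling the dilation at every layer makes the receptive field grow exponentially with depth, which is why a modest number of layers can cover a long stretch of waveform.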

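Tacotron: sequence-to-sequence character-to-waveform conversion

In 2017, a model that replaces the text processing front end and the duration generation appeared.

– The key idea is the use of the attention model proposed for machine translation

– Attention model: one of the models that realize variable-length conversion (details omitted)

[Figure: conventional pipeline (あらゆる → a r a y u r u → … → speech) vs. Tacotron (あらゆる → Tacotron → speech)]

[Figure: Tacotron architecture [Wang et al., 2017.]]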
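The slides omit the details of attention, so the following is only a generic sketch of the idea: at each output step, a query is compared with every input position, and the softmax-normalized scores decide how much each input contributes. Dot-product scoring is used here for brevity and is an assumption; it is not claimed to be the exact scoring function used in Tacotron.

```python
import numpy as np

def attend(query, keys, values):
    """Return the context vector and attention weights for one output step."""
    scores = keys @ query                 # one score per input position
    scores = scores - scores.max()        # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ values, weights

keys = 5.0 * np.eye(3)                    # three encoded input characters
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([0.0, 1.0, 0.0])         # decoder state at one output frame
context, weights = attend(query, keys, values)
```

Because the weights are recomputed for every output frame, the input and output sequences may have different lengths, which is what removes the need for an explicit duration model.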

Speech synthesis that deceives anti-spoofing [Saito et al., 2017.]

Problem

– Can the differences between natural and synthetic speech be found automatically?

Anti-spoofing

– A technique to prevent voice spoofing by speech synthesis: a discriminator that distinguishes natural and synthetic speech

Training adversarial to anti-spoofing

– Update the speech synthesizer (acoustic model) so that it deceives the anti-spoofing

– Update the anti-spoofing (discriminator) so that it detects attacks by speech synthesis

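DNN-based TTS incorporating generative adversarial networks [Saito et al., 2017.]

[Figure: the text-to-speech network maps linguistic features, through parameter generation, to generated speech parameters $\hat{\boldsymbol{y}}$; the anti-spoofing network takes $\hat{\boldsymbol{y}}$ and outputs a feature function (1: natural). $\boldsymbol{y}$ denotes the natural speech parameters.]

$$L(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{G}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) + \omega_{\mathrm{D}} L_{\mathrm{D},1}(\hat{\boldsymbol{y}}) \rightarrow \text{minimize}$$

Here $L_{\mathrm{G}}$ is the generation loss and $L_{\mathrm{D},1}$ is the adversarial loss.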
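The combined loss can be written out as follows. This is a sketch under stated assumptions: the generation loss is taken to be the mean squared error, the adversarial loss is the cross-entropy of the discriminator output against the label "natural" (1), and `d_out` stands for the discriminator's probability that the generated parameters are natural. The function names are illustrative.

```python
import numpy as np

def generation_loss(y, y_hat):
    """L_G: mean squared error between natural and generated parameters."""
    return float(np.mean((y - y_hat) ** 2))

def adversarial_loss(d_out):
    """L_D,1: cross-entropy of the discriminator output against label 1."""
    return float(-np.mean(np.log(np.clip(d_out, 1e-12, 1.0))))

def total_loss(y, y_hat, d_out, w_d=1.0):
    """L = L_G + w_d * L_D,1, minimized with respect to the synthesizer."""
    return generation_loss(y, y_hat) + w_d * adversarial_loss(d_out)

y = np.array([[0.1, 0.2], [0.3, 0.4]])
fooled = total_loss(y, y, np.array([1.0, 1.0]))          # discriminator fooled
uncertain = total_loss(y, y, np.array([0.5, 0.5]), w_d=2.0)
```

The second term shrinks only when the discriminator believes the generated parameters are natural, so minimizing the sum pushes the synthesizer beyond simple error minimization.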

Generative Adversarial Network (GAN) [Goodfellow et al., 2014.]

– Minimizes an approximate Jensen-Shannon divergence between distributions

– The synthesizer (a generator that tries to fool) and the anti-spoofing (a discriminator that tries to distinguish natural from synthesized speech) are trained adversarially

[Figure: text + noise → text-to-speech → $\hat{\boldsymbol{y}}$ → anti-spoofing → 1: natural / 0: synthesized; natural speech $\boldsymbol{y}$ is also fed to the anti-spoofing]

Yet another generative model: moment-matching network

Moment-matching network (MMN) [Li et al., 2015.]

– Minimizes the squared distance between the moments (mean, variance, …) of the distributions

– In the actual implementation, it minimizes the difference between the norms of Gram matrices

[Figure: text + noise → text-to-speech → $\hat{\boldsymbol{y}}$, compared with natural speech $\boldsymbol{y}$]
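One common way to realize the Gram-matrix criterion is the squared maximum mean discrepancy (MMD) with a Gaussian kernel. The sketch below assumes that choice, a fixed bandwidth `sigma`, and the simple biased estimator; the slides do not specify these details.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix with entries k(a_i, b_j) for a Gaussian kernel."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between sample sets x and y."""
    return float(gaussian_gram(x, x, sigma).mean()
                 + gaussian_gram(y, y, sigma).mean()
                 - 2.0 * gaussian_gram(x, y, sigma).mean())

rng = np.random.default_rng(0)
natural = rng.standard_normal((200, 2))
same = mmd2(natural, natural)               # identical samples
shifted = mmd2(natural, natural + 3.0)      # clearly different distribution
```

A Gaussian kernel implicitly compares all moments at once, which is why matching Gram matrices amounts to matching moments.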

Multi-lingual & multi-speaker modeling [Fan et al., 2016.]

We want to build speech synthesis for multiple languages and multiple speakers at once.

– Split the DNN into layers dedicated to modeling each language and each speaker (language-specific and speaker-specific layers)

Figure source: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7472737

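Unsupervised feature extraction using PCA etc.

What are the latent variables of speech features?

– Phonemic and prosodic information (what is spoken):

• A factor shared across speakers when they speak the same sentence

– Speaker information (who is speaking):

• A factor shared within the same speaker regardless of the spoken content

[Figure: utterances of /ko n ni chi wa/; in one case the dominant factor is speaker information, in the other it is phonemic and prosodic information]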
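As a reminder of how PCA extracts a dominant factor without labels, the sketch below computes principal components through the singular value decomposition. The toy data, in which one direction dominates the variance, is synthetic and only stands in for real speech features.

```python
import numpy as np

def pca(x, n_components):
    """Project x (samples, dims) onto its top principal components."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components

# Toy features: large variation along [1, 1], small variation along [1, -1].
t = np.linspace(-1.0, 1.0, 50)
feats = np.outer(t, [1.0, 1.0]) + 0.01 * np.outer(np.sin(7.0 * t), [1.0, -1.0])
scores, components = pca(feats, n_components=1)
```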

Other materials (in Japanese)

Lecture materials of the UTokyo graduate school

– Takamichi, “信号処理論第二 音声合成・変換1,2” (Signal Processing II: speech synthesis and conversion 1, 2) (2016)

• Explains various speech synthesis methods and their applications, including the mathematical background

• http://www.sp.ipc.i.u-tokyo.ac.jp/~saruwatari/SP-Grad2016_05.pdf

• http://www.sp.ipc.i.u-tokyo.ac.jp/~saruwatari/SP-Grad2016_08.pdf

Materials of the reading group on the flagship conference ICASSP 2017

– State-of-the-art speech synthesis and recognition methods (Takamichi covered speech recognition): https://connpass.com/event/58327/

Tutorial article of the Acoustical Society of Japan

– Hashimoto et al., “深層学習に基づく統計的音声合成” (Statistical speech synthesis based on deep learning): https://www.jstage.jst.go.jp/article/jasj/73/1/73_55/_pdf

APPLICATIONS

Recent applications (short introduction)

Speech synthesis for speaking and hearing aids

– Make speech that cannot be spoken or heard speakable or audible

Speech synthesis for speech translation and language learning

– Convey the voice across language barriers

Speech synthesis for rare languages

– Quantify the culture of spoken languages

Synthesis from arbitrary voices to arbitrary voices

– Robustness to acoustic environments, etc.

Speech synthesis that considers listeners as well as speakers

– Go beyond speech synthesis that merely reads aloud or talks one-sidedly

The era of speech synthesis that only speaks is already over. We need to study how to support and augment humans, and how to define and reproduce the artistic quality of voices.

CONCLUSION

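Conclusion

Speech synthesis

– Text-to-speech: text features → speech features

– Voice conversion: speech features → speech features

Recent studies

– WaveNet: waveform generation from text features

– Tacotron: use of the attention model

– GAN: minimization of the distance between two distributions

– Moment matching: minimization of the difference between the moments of two distributions

How to artificially synthesize speech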
