TRANSCRIPT
Trends in Large-Scale Data Analysis
Mark Liberman, University of Pennsylvania
ABSTRACT:
For hundreds of years, scientists and engineers have solved problems of prediction, classification and optimization using physical and statistical models. Computing technology has brought an exponential explosion of data collection and storage, and corresponding changes in modeling and analysis methods. Some of these changes are just easier and faster versions of established techniques, but others represent the development of a fundamentally new concept of information processing, under evocative but vague headings like "Neural Networks", "Deep Learning" and "Artificial Intelligence". This talk will sketch the nature, promise and problems of these developments.
5/3/2019 PREM Symposium 2019 2
What I do: Science and technology of language and speech
What I don’t do: Science and technology of materials (innovative or otherwise)
So why am I here at the PREM 2019 Symposium on “Data Science for Innovative Materials”?
1. Jorge invited me
2. Some interesting aspects of “Data Science” are shared across application areas
(I think…)
Some (shared?) history:
Evolution (or oscillation?) from physical & mathematical modeling
… to statistical modeling
… to “deep learning”
… to ???
Thoughts → Words → Vocal Gestures → Sounds → Words → Thoughts
Vocal Gestures → Vocal Tract States = Acoustic Transfer Functions
Voice Source + Acoustic Transfer Function → Sounds
1940-1975: Focusing on the physics of speech communication
Mid-20th-century history of my field --
Chiba, T. and Kajiyama, M. "The Vowel: Its Nature and Structure", Tokyo-Kaiseikan Pub. Co., Ltd., Tokyo (1941)
“Formant” Model (Chiba & Kajiyama 1941, Fant 1951, …) –
Assumptions:
1. Source-filter independence
(source = larynx; filter = supra-laryngeal vocal tract)
2. Vocal tract acoustics = plane waves propagating along the axis of a radially-symmetrical tube
(closed at the larynx, open at the lips)
Results:
1. The filter transfer function is the sum of a set of complex resonances = “formants”
(caused by standing waves in the hypothesized tube)
2. Only 3 of these resonances are materially affected by (smooth) changes in the tube diameter
(and the effects of higher resonances can therefore be approximated by a single term)
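The standing-wave result can be checked with a one-line formula: a uniform tube closed at one end (larynx) and open at the other (lips) is a quarter-wavelength resonator, with resonances at odd multiples of c/4L. A minimal sketch (the tube length and sound speed are illustrative values, not measurements):

```python
# Resonances of a uniform tube, closed at one end and open at the other:
# a quarter-wavelength resonator with F_n = (2n - 1) * c / (4 * L).

def tube_formants(length_m=0.17, c=343.0, n=3):
    """First n resonance frequencies (Hz) of a uniform closed-open tube."""
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]

# For a ~17 cm vocal tract this gives roughly 500, 1500, 2500 Hz --
# close to the canonical formants of a neutral schwa vowel.
print(tube_formants())
```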
Dunn, Hugh K. "The calculation of vowel resonances, and an electrical vocal tract." The Journal of the Acoustical Society of America 22, no. 6 (1950): 740-753:
By treating the vocal tract as a series of cylindrical sections, or acoustic lines, it is possible to use transmission line theory in finding the resonances. With constants uniformly distributed along each section, resonances appear as modes of vibration of the tract taken as a whole. […] An electrical circuit based on the transmission line analogy has been made to produce acceptable vowel sounds. This circuit is useful in confirming the general theory and in research on the phonetic effects of articulator movements. The possibility of using such a circuit as a phonetic standard for vowel sounds is discussed.
Gunnar Fant. Transmission Properties of the Vocal Tract with Application to the Acoustic Specifications of Phonemes. Acoustics Laboratory, Massachusetts Institute of Technology, 1951.
Gunnar Fant. Acoustic Theory of Speech Production: With Calculations Based on X-ray Studies of Russian Articulations. No. 2. Walter de Gruyter, 1960.
Flanagan, James L., Kenzo Ishizaka, and Kathy L. Shipley. "Synthesis of speech from a dynamic model of the vocal cords and vocal tract." Bell System Technical Journal 54, no. 3 (1975): 485-506.
We describe a computer model of the human vocal cords and vocal tract that is amenable to dynamic control by parameters directly identified in the human physiology. The control format consequently provides an efficient, parsimonious description of speech information. The control parameters represent subglottal lung pressure, vocal-cord tension and rest opening, vocal-tract shape, and nasal coupling. Using these inputs, we synthesize vowel-consonant-vowel syllables to demonstrate the dynamic behavior of the cord/tract model. We show that inherent properties of the model duplicate phenomena observed in human speech; in particular, cord/tract acoustic interaction, cord vibration, and tract-wall radiation during occlusion, and voicing onset-offset behavior. Finally, we describe an approach to deriving the physiological controls automatically from printed text, and we present sentence-length synthesis obtained from a preliminary system.
Explorations of more complete dynamic physical models --
Problems:
1. Static sound-to-tube inversion is an underdetermined problem
(even for longitudinal plane waves in radially-symmetrical tubes without wall losses)
2. Solutions for rapidly-changing articulatory kinematics are harder
3. Dynamic models (robot talkers) are even harder
(and we have little idea how the physics and physiology really work)
Results:
1. Despite widely-held belief in the crucial role of dynamic articulatory models
(and many attempts to use them in speech technology),
there have never been any engineering applications.
2. Engineers’ interest in such models faded after 1980 or so.
The formant model has had a less negative history, because:
1. The model fits the data (sort of, sometimes)
2. It yields a big reduction in dimensionality --
formants are 3 slowly-varying inexact numbers, ~100×3 per second,
while digital audio for speech must be sampled at least 8000 times per second
3. The 3 formant dimensions “make sense” phonetically
But there are still many problems with this approach:
1. The underlying physical model leaves out many details
• The vocal tract is not radially symmetrical
• There are source-tract interactions
• The nasal cavity creates additional poles and zeros
• Near-closures create zeros
• There are subglottal resonances during the open phase of glottal oscillation
2. The underlying physical model may have more serious (non-linear) problems
• There are apparently complex spatially-separated swirling flows
(aerodynamic rather than acoustic phenomena)
• The expected longitudinal standing waves seem to be absent
3. Estimation of formant parameters from sound is catastrophically unstable
(even in synthetic data, where by construction the model fits perfectly,
arbitrarily small differences in input yield large differences in parameter estimates)
4. Similar problems exist for excitation parameters (e.g. “pitch” and “voice quality”)
Teager, H. M., and S. M. Teager. "Evidence for nonlinear sound production mechanisms in the vocal tract." In Speech production and speech modelling, pp. 241-261. Springer, Dordrecht, 1990.
Much of what speech scientists believe about the mechanisms of speech production and hearing rests less on an experimental base than on a centuries-old faith in linear mathematics. Based on experimental evidence we believe that the momentum waves, or the interactions of the inertia-laden flows leading to various modes of oscillation, within the vocal tract are neither passive nor acoustic. Measurements of flow within the vocal tract indicate that acoustic impedance, or the pressure-flow ratio, is violated. The pressure across any cross section of the tract is constant and does not exhibit the differentials expected from the markedly different separated flows across that same cross section.
“Consider a perfectly spherical cow, radiating milk isotropically…”
So after the mid-1970s:
Engineers mostly abandoned human-defined physical models (time functions of vocal tract parameters or formants) in favor of general spectral parameters
for which small input differences → small output differences
They chose overall statistical models that are simple enough for their (many) parameters to be learned from data,
and speech “perception” or “production” can be treated as global optimization of probabilities.
e.g. MFCCs = “Mel Frequency Cepstrum Coefficients”
Mel = nonlinear warping of the frequency scale
Modeled on human auditory psychophysics
Approximates the information density of speech
“Cepstrum” – a signal processing pun
= cosine transform of the log amplitude spectrum
The spectrum of the spectrum -- from the frequency domain to the quefrency domain.
Why?
Tends to remove correlations among nearby elements in smooth spectra --
allows use of diagonal covariance matrices in statistical modeling.
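The whole mel-plus-cepstrum pipeline is short enough to sketch in numpy: power spectrum → triangular mel filterbank → log → cosine transform. This is a minimal illustration, not a production front end; the filter count, FFT size, and sampling rate are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel warping: finer frequency resolution at low frequencies.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=8000, n_filters=20, n_ceps=13):
    """MFCCs for one windowed frame of audio samples."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for j in range(lo, mid):
            fbank[i, j] = (j - lo) / max(mid - lo, 1)
        for j in range(mid, hi):
            fbank[i, j] = (hi - j) / max(hi - mid, 1)
    log_energy = np.log(fbank @ power + 1e-10)         # log mel spectrum
    # Cosine transform (DCT-II) of the log spectrum: the "cepstrum" step
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * n + 1) / (2 * n_filters)))
    return dct @ log_energy
```

The final matrix multiply is the decorrelating step: for smooth log spectra, the resulting coefficients are approximately independent, which is what licenses the diagonal covariance matrices mentioned above.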
And time-series methods like “Hidden Markov Models” = stochastic functions of a Markov chain,
where we can infer the hidden state sequence from observations via Bayes’ Rule and Viterbi decoding…
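Viterbi decoding itself fits in a few lines. A minimal log-domain sketch for a discrete-output HMM (the toy transition and emission matrices are assumptions for illustration; real recognizers use continuous-density emission models):

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most likely hidden-state sequence for a discrete-output HMM.

    obs    : observed symbol indices, length T
    log_pi : log initial-state probabilities, shape (S,)
    log_A  : log transition matrix, shape (S, S)
    log_B  : log emission matrix, shape (S, V)
    """
    T, S = len(obs), len(log_pi)
    delta = log_pi + log_B[:, obs[0]]      # best log-prob ending in each state
    psi = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A    # rows: previous state; cols: state
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrace from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

Working in log probabilities turns the products of Bayes’ Rule into sums and avoids numerical underflow on long observation sequences.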
And there’s an efficient technique for learning system parameters from training data:
Baum, Leonard E., Ted Petrie, George Soules, and Norman Weiss. "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains." The Annals of Mathematical Statistics 41, no. 1 (1970): 164-171.
Liporace, L. A. PTAH on continuous multivariate functions of Markov chains. Technical Report 80193, Institute for Defense Analysis, Communication Research Department, 1976.
This approach worked – at a cost:
1. Over-simplified models (independence assumptions, etc.)
2. Enormous complexity (many millions of parameters)
3. Many detailed options for architectures and estimation methods
• Choice requires optimization over a complex algorithmic space
• Progress depends on thousands of small improvements
…but progress happened!
Hill-climbing in DARPA Speech-To-Text programs:
Four lessons from that experience:
1. Learning is better than programming;
2. Global optimization of gradient local decisions is crucial;
3. Top-down and bottom-up knowledge must be combined;
4. Metrics on shared benchmarks matter.
“Learning is better than programming” –
…but many aspects of early-2000s HLT systems were still “programmed”
via “feature engineering” at both ends,
and many structural and algorithmic choices in the middle…
1. Top-down language models rely on combinations of characters into “words” and “phrases”,
with pronunciations given by a dictionary and/or by letter-to-sound rules
2. Bottom-up acoustic models rely on MFCCs or similar
3. In the middle, there are many structural and algorithmic choices
SO… “Deep Learning” to the rescue!
F(x) = L(N(L(N(…L(x)))))
• where x is an arbitrary vector input
• L(x) is an affine function Ax + b (matrix A, bias vector b)
• N(x) is a non-linear function applied to each vector element separately
…plus some other goodies around the edges…
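The composed function F(x) = L(N(L(N(…L(x))))) can be written out directly. A minimal sketch with random weights (the layer sizes and the choice of ReLU as the non-linearity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def affine(W, b):
    """L(x) = Wx + b."""
    return lambda x: W @ x + b

def relu(x):
    """N(x): a non-linearity applied to each vector element separately."""
    return np.maximum(x, 0.0)

# A 3-layer "deep" function F(x) = L3(N(L2(N(L1(x)))))
sizes = [4, 8, 8, 2]   # input dim 4, two hidden layers of width 8, output dim 2
layers = [affine(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes, sizes[1:])]

def F(x):
    for L in layers[:-1]:
        x = relu(L(x))
    return layers[-1](x)     # no non-linearity on the final layer

y = F(rng.standard_normal(4))   # a 2-dimensional output vector
```

Everything a modern architecture adds -- convolutions, recurrence, attention -- is elaboration on this alternation of affine maps and elementwise non-linearities.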
This is a universal computing model,in the sense that such a system
can be programmed to computeany finite function.
And even better, general optimization techniquescan learn model parameters from training data.
(…in the limit, sort of, sometimes..)
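A sketch of the “learn the parameters from training data” claim: full-batch gradient descent teaching a tiny two-layer network the XOR function, which no single affine layer can compute. The layer width, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])              # XOR targets

W1 = rng.standard_normal((2, 8)); b1 = np.zeros(8)
W2 = rng.standard_normal(8);      b2 = 0.0

lr = 0.02
for step in range(10000):
    h = np.tanh(X @ W1 + b1)         # hidden layer, tanh non-linearity
    pred = h @ W2 + b2               # linear output layer
    err = pred - y                   # d(loss)/d(pred) for squared error
    # Backpropagate gradients and take a small gradient step
    gW2 = h.T @ err; gb2 = err.sum()
    dh = np.outer(err, W2) * (1 - h ** 2)
    gW1 = X.T @ dh;  gb1 = dh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
```

The same generic machinery (a loss, a gradient, a step size) scales from this toy up to the speech and text systems discussed here -- which is exactly the “…in the limit, sort of, sometimes” caveat: convergence to a good solution is typical but not guaranteed.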
So we can do away with “feature engineering” and design a “sequence-to-sequence” model
whose inputs are audio waveform samplesand whose outputs are text characters.
After all, MFCC analysis is just a bunch of inner products,so why not learn the basis functions and band definitions
rather than programming them?
And a more complicated version of the same story applies to text analysis/synthesis.
Deep Learning solutions do work better –
…but at a cost.
Deep Learning “programs” are increasingly complicated -- CNNs, RNNs, LSTMs, “transformers”, …
Avoiding “feature engineering” increases the number of parameters that need to be learned,
and the amount of training data and training time needed to learn them.
Why should our systems have to re-learn everything --
logic, mathematics, physics, acoustics, chemistry, dictionaries, etc. --
all over again for every new problem?
And we’re beginning to see a return to an old idea:systems that have pieces of relevant science “baked in”.
So people are starting to ask…
In other words, the old epistemological pendulum is starting to swing back from empiricism towards rationalism.
Metaphor: AI programming should be like video game programming.
Anumanchipalli, Gopala K., Josh Chartier, and Edward F. Chang. "Speech synthesis from neural decoding of spoken sentences." Nature 568, no. 7753 (2019):
Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators. Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferrable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences. These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.