future work & summary speech, music ... - academic commons

LabROSA

AIE @ MERL - Dan Ellis 2002-01-07 - 1

Audio Information Extraction

Dan Ellis<[email protected]>

Laboratory for Recognition and Organization of Speech and Audio(Lab

ROSA

)

Electrical Engineering, Columbia Universityhttp://labrosa.ee.columbia.edu/

Outline


Speech, music, and other

General sound organization

Future work & summary

1

2

3

4

LabROSA


Audio Information Extraction (AIE)

• Central operation:

- continuous sound mixture

→

distinct objects & events

• Perceptual impression is very strong

- but hard to ‘see’ in signal

1

0 2 4 6 8 10 12 time/s

frq/Hz

0

2000

1000

3000

4000

Voice (evil)

Stab

Rumble Strings

Choir

Voice (pleasant)

Analysis

LabROSA


Perceptual organization: Bregman’s lake

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”

(after Bregman’90)

• Received waveform is a mixture

- two sensors, N signals ...

• Disentangling mixtures as primary goal

- perfect solution is not possible- need knowledge-based

constraints

LabROSA


The information in sound

• A sense of hearing is evolutionarily useful

- gives organisms ‘relevant’ information

• Auditory perception is

ecologically

grounded

- scene analysis is preconscious (

→

illusions)- special-purpose processing reflects

‘natural scene’ properties- subjective

not

canonical (ambiguity)

freq

/ H

z

0 1 2 3 40

1000

2000

3000

4000

time / s0 1 2 3 4

Steps 1 Steps 2

LabROSA


Positioning AIE

• Domain

- text ... speech ... music ... general audio

• Operation

- recognize ... index/retrieve ... organize

Audio InformationExtraction

AIE

Unsupervised ClusteringComputational

Auditory Scene Analysis

CASA

Blind Source Separation

BSS

Independent Component

AnalysisICA

AudioContent-based

RetrievalAudio CBR

MultimediaInformationRetrieval

MMIR

MusicInformation

RetrievalMusic IR

AudioFingerprintingSpoken

DocumentRetrieval

SDR

AutomaticSpeechRecognition

ASR

TextInformation

RetrievalIR

TextInformation Extraction

IE

LabROSA


AIE Applications

• Multimedia access

- sound as complementary dimension- need all modalities for complete information

• Personal audio

- continuous sound capture quite practical- different kind of indexing problem

• Machine perception

- intelligence requires awareness- necessary for communication

• Music retrieval

- area of hot activity- specific economic factors

LabROSA


Outline



- Speech recognition - Multi-speaker processing- Music classification- Other sounds



1

2

3

4

LabROSA


Automatic Speech Recognition (ASR)

• Standard speech recognition structure:

• ‘State of the art ’ word-error rates (WERs):

- 2% (dictation) - 30% (telephone conversations)

• Can use multiple streams...

Featurecalculation

sound

Acousticclassifier

feature vectorsAcoustic model

parameters

HMMdecoder

Understanding/application...

phone probabilities

phone / word sequence

Word models

Language modelp("sat"|"the","cat")p("saw"|"the","cat")

s ah t

D A

T A

LabROSA


Tandem speech recognition

(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)

• Neural net estimates phone posteriors;but Gaussian mixtures model finer detail

• Combine them!

• Train net, then train GMM on net output

- GMM is ignorant of net output ‘meaning’

Speechfeatures

Featurecalculation

Inputsound

Neural netclassifier

Nowaydecoder

Phoneprobabilities

Words

s ah t

C0

C1

C2

Cktn

tn+w

h#pclbcltcldcl

Hybrid Connectionist-HMM ASR

Speechfeatures

Featurecalculation

Inputsound

Gauss mixmodels

HTKdecoder

Subwordlikelihoods

Words

s ah t

Conventional ASR (HTK)

Speechfeatures

Featurecalculation

Inputsound

Neural netclassifier

Phoneprobabilities

C0

C1

C2

Cktn

tn+w

h#pclbcltcldcl

Tandem modeling

Gauss mixmodels

HTKdecoder

Subwordlikelihoods

Words

s ah t

LabROSA


Tandem system results: Aurora ‘noisy digits ’

(with Manuel Reyes)

• 50% of word errors corrected over baseline

• Beat even ‘bells and whistles ’ systemusing intensive large-vocabulary techniques

Aurora 2 Eurospeech 2001 Evaluation

- 1 0

0

1 0

2 0

3 0

4 0

5 0

6 0

Avg. rel. improvementRe

l im

pro

vem

en

t %

-

Mu

ltic

on

dit

ion

Columbia

Philips

UPC BarcelonaBell Labs

IBM

Motorola 1Motorola 2

NijmegenICSI/OGI/Qualcomm

ATR/Gri f f i th

AT&TAlcatel

Siemens

UCLAMicrosoft

SloveniaGranada

LabROSA


Missing data recognition

(Cooke, Green, Barker... @ Sheffield)

• Energy overlaps in time-freq. hide features

- some observations are effectively missing

• Use missing feature theory...

- integrate over missing data dimensions

x

m

• Effective in speech recognition

- trick is finding good/bad data mask

p x q( ) p xp xm q,( ) p xm q( ) xmd∫=

"1754" + noise

Missing-dataRecognition

stationary noise estimate

"1754"

Mask based on-5 0 5 10 15 20 Clean0

10

20

30

40

50

60

70

80

90 AURORA 2000 - Test Set AW

ER

SNR (dB)

MD Discrete SNRMD Soft SNRHTK clean trainingHTK multi-condition

LabROSA


The Meeting Recorder project

(with ICSI, UW, SRI, IBM)

• Microphones in conventional meetings

- for summarization/retrieval/behavior analysis- informal, overlapped speech

• Data collection (ICSI, UW, ...):

- 100 hours collected, ongoing transcription- headsets + tabletop + ‘PDA’

LabROSA


Crosstalk cancellation

• Baseline speaker activity detection is hard:

• Noisy crosstalk model:

• Estimate subband C

Aa

from A ’s peak energy

- ... including pure delay (10 ms frames)- ... then linear inversion

120 125 130 135 140 145 150 155 time / secs

speakeractive

level/dB

mr-2000-06-30-1600

Spkr A

Spkr B

Spkr C

Spkr D

Spkr E

Tabletop

0

20

40

breathnoise

crosstalk

backchannel(signals desire to regain floor?) floor seizure

interruptions

speaker Bcedes floor

m C s⋅ n+=

LabROSA


Speaker localization(with Wei Hee Huan)

• Tabletop mics form an array;time differences locate speakers

• Ambiguity:- mic positions not fixed- geometric symmetry

• Detect speaker activity, overlap

-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5-5

-4

-3

-2

-1

0

1

2

3

4

5

LabROSA


Music analysis: Structure recovery(with Rob Turetsky)

• Structure recovery by similarity matrices(after Foote)

- similarity distance measure?

- segmentation & repetition structure

- interpretation at different scales:notes, phrases, movements

- incorporating musical knowledge:‘theme similarity’

LabROSA


Music analysis: Lyrics extraction(with Adam Berenzweig)

• Vocal content is highly salient, useful for retrieval

• Can we find the singing? Use an ASR classi fier:

• Frame error rate ~20% for segmentation based on posterior-feature statistics

• Lyric segmentation + transcribed lyrics→ training data for lyrics ASR...

time / sec time / sec time / sec0 1 2 0 1 2 0 1 2

freq

/ kH

z

0

2

4

phon

e cl

ass

speech (trnset #58)

Spe

ctro

gram

Pos

terio

gram

music (no vocals #1) singing (vocals #17 + 10.5s)

singing

LabROSA


Artist similarity

• Train network to discriminate speci fic artists:

• Focus on vocal segments for consistency

• .. then clustering for recommendation

XT

C

0 50 100 150 200 250 0 100 200 300

tT

ori

0 100 200 300 0 100 200 300 400 500

Bec

k

0 50 100 150 200 250 0 200 400 600

w60o40 stats based on LE plp12 2001-12-28M

ouse

0 100 200 3000 50 100 150

Art

ist c

lass

LabROSA


Alarm sound detection

• Alarm sounds have particular structure- people ‘know them when they hear them’

• Isolate alarms in sound mixtures

- representation of energy in time-frequency- formation of atomic elements- grouping by common properties (onset &c.)- classify by attributes...

• Key: recognize despite background

freq

/ H

z

1 1.5 2 2.50

1000

2000

3000

4000

5000

time / sec1 1.5 2 2.5

0

1000

2000

3000

4000

5000 1

1 1.5 2 2.50

1000

2000

3000

4000

5000

LabROSA


Sound textures(with Marios Athineos)

• Textures: Large class of sounds - no clear pitch, onsets, shape- fire, rain, paper, machines, ...- ‘bulk’ subjective properties

• Abstract & synthesize by:- project into low-dimensional parameter space- learn dynamics within space- generate endless versions

50 100 150 200 250 300 350

10

20

30

40

50 100 150 200 250 300 350

2

4

6

8

-100 -90 -80 -70 -60 020

40-30

-25

-20

-15

-10

-5

0

LabROSA


Outline



General sound organization- Computational Auditory Scene Analysis- Audio Information Retrieval


1

2

3

4

LabROSA


Computational AuditoryScene Analysis (CASA)

• Goal: Automatic sound organization ;Systems to ‘pick out ’ sounds in a mixture- ... like people do

• E.g. voice against a noisy background- to improve speech recognition

• Approach:- psychoacoustics describes grouping ‘rules’- ... just implement them?

CASAObject 1 descriptionObject 2 descriptionObject 3 description...

LabROSA


CASA front-end processing

• Correlogram:Loosely based on known/possible physiology

- linear filterbank cochlear approximation- static nonlinearity- zero-delay slice is like spectrogram- periodicity from delay-and-multiply detectors

Cochleafilterbank

sound

envelopefollower

short-timeautocorrelation

delay line

freq

uenc

y ch

anne

ls

Correlogramslice

freq

lag

time

LabROSA


Adding top-down cues

Perception is not directbut a search for plausible hypotheses

• Data-driven (bottom-up)...

vs. Prediction-driven (top-down) (PDCASA)

• Motivations- detect non-tonal events (noise & click elements)- support ‘restoration illusions’...

→ hooks for high-level knowledge+ ‘complete explanation’, multiple hypotheses, ...

inputmixture

signalfeatures

discreteobjects

Front end Objectformation

Groupingrules

Sourcegroups

inputmixture

signalfeatures

predictionerrors

hypotheses

predictedfeaturesFront end Compare

& reconcile

Hypothesismanagement

Predict& combinePeriodic

components

Noisecomponents

LabROSA


PDCASA and complex scenes

−70

−60

−50

−40

dB

200400

100020004000

f/Hz Noise1

200400

100020004000

f/Hz Noise2,Click1

200400

100020004000

f/Hz City

0 1 2 3 4 5 6 7 8 9

50100200400

1000

Horn1 (10/10)

Crash (10/10)

Horn2 (5/10)

Truck (7/10)

Horn3 (5/10)

Squeal (6/10)

Horn4 (8/10)Horn5 (10/10)

0 1 2 3 4 5 6 7 8 9time/s

200400

100020004000

f/Hz Wefts1−4

50100200400

1000

Weft5 Wefts6,7 Weft8 Wefts9−12

LabROSA


Audio Information Retrieval(with Manuel Reyes)

• Searching in a database of audio- speech .. use ASR- text annotations .. search them- sound effects library?

• e.g. Muscle Fish “ SoundFisher” browser- define multiple ‘perceptual’ feature dimensions- search by proximity in (weighted) feature space

- features are ‘global’ for each soundfile,no attempt to separate mixtures

Segmentfeatureanalysis

Sound segmentdatabase

Segmentfeatureanalysis

Seach/comparison

Results

Query example

Feature vectors

LabROSA


Audio Retrieval: Results

• Musclefish corpus- most commonly reported set

• Features- mfcc, brightness, bandwidth, pitch ...- no temporal sequence structure

• Results: - 208 examples, 16 classses, 84% correct- confusions:

Instr Spch Env Anim Mech

Musical instrs. 136 (14)

Speech 17 (7) 2

Eviron. 2 6 (1)

Animals 2 2 1 (0)

Mechanical 1 15 (2)

LabROSA


CASA for audio retrieval

• When audio material contains mixtures, global features are insufficient

• Retrieval based on element/object analysis:

- features are calculated over grouped subsets

Genericelementanalysis

Continuous audioarchive

Genericelementanalysis

Seach/comparison

Results

Query example

Element representations

Objectformation

(Objectformation)

Word-to-classmapping

Objects + properties

Properties aloneSymbolic query

rushing water

LabROSA


Outline





1

2

3

4

LabROSA


Automatic audio-video analysis(with Prof. Shih-Fu Chang, Prof. Kathy McKeown)

• Documentary archive management- huge ratio of raw-to-finished material- costly manual logging- missed opportunities for cross-fertilization

• Problem: term <-> signal mapping- training corpus of past annotations- interactive semi-automatic learning- need object-related features

A/V segmentation

and feature

extraction

text processing

concept discovery

A/V feature unsupervised

clustering A/V data

MM Concept Networkannotations

A/V features

terms

generic detectors

Multimedia Fusion:

Mining Concepts &

RelationshipsComplex Spatio-

Temporal Classification

Question Answering

LabROSA


The ‘Machine listener’

• Goal: An auditory system for machines- use same environmental information as people

• Signal understanding- monitor for particular sounds- real-time description

• Scenarios

- personal listener → summary of your day- future prosthetic hearing device- autonomous robots

LabROSA


LabROSA Summary

DO

MA

INS

AP

PLI

CA

TIO

NS

ROSA

• Broadcast• Movies• Lectures

• Meetings• Personal recordings• Location monitoring

• Speech recognition• Speech characterization• Nonspeech recognition

• Object-based structure discovery & learning

• Scene analysis• Audio-visual integration• Music analysis

• Structuring• Search• Summarization• Awareness• Understanding

LabROSA


Summary: Audio Info Extraction

• Sound carries information- useful and detailed- often tangled in mixtures

• Various important general classes- Speech: activity, recognition- Music: segmentation, clustering- Other: detection, description

• General processing framework- Computational Auditory Scene Analysis- Audio Information Retrieval

• Future applications- Ubiquitous intelligent indexing- Intelligent monitoring & description

LabROSA


Audio Information Extraction:panacea or punishment?

future work & summary speech, music ... - academic commons

Documents