future work & summary speech, music ... - academic commons
TRANSCRIPT
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 1
Audio Information Extraction
Dan Ellis<[email protected]>
Laboratory for Recognition and Organization of Speech and Audio(Lab
ROSA
)
Electrical Engineering, Columbia Universityhttp://labrosa.ee.columbia.edu/
Outline
Audio Information Extraction
Speech, music, and other
General sound organization
Future work & summary
1
2
3
4
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 2
Audio Information Extraction (AIE)
• Central operation:
- continuous sound mixture
→
distinct objects & events
• Perceptual impression is very strong
- but hard to ‘see’ in signal
1
0 2 4 6 8 10 12 time/s
frq/Hz
0
2000
1000
3000
4000
Voice (evil)
Stab
Rumble Strings
Choir
Voice (pleasant)
Analysis
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 3
Perceptual organization: Bregman’s lake
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Received waveform is a mixture
- two sensors, N signals ...
• Disentangling mixtures as primary goal
- perfect solution is not possible- need knowledge-based
constraints
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 4
The information in sound
• A sense of hearing is evolutionarily useful
- gives organisms ‘relevant’ information
• Auditory perception is
ecologically
grounded
- scene analysis is preconscious (
→
illusions)- special-purpose processing reflects
‘natural scene’ properties- subjective
not
canonical (ambiguity)
freq
/ H
z
0 1 2 3 40
1000
2000
3000
4000
time / s0 1 2 3 4
Steps 1 Steps 2
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 5
Positioning AIE
• Domain
- text ... speech ... music ... general audio
• Operation
- recognize ... index/retrieve ... organize
Audio InformationExtraction
AIE
Unsupervised ClusteringComputational
Auditory Scene Analysis
CASA
Blind Source Separation
BSS
Independent Component
AnalysisICA
AudioContent-based
RetrievalAudio CBR
MultimediaInformationRetrieval
MMIR
MusicInformation
RetrievalMusic IR
AudioFingerprintingSpoken
DocumentRetrieval
SDR
AutomaticSpeechRecognition
ASR
TextInformation
RetrievalIR
TextInformation Extraction
IE
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 6
AIE Applications
• Multimedia access
- sound as complementary dimension- need all modalities for complete information
• Personal audio
- continuous sound capture quite practical- different kind of indexing problem
• Machine perception
- intelligence requires awareness- necessary for communication
• Music retrieval
- area of hot activity- specific economic factors
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 7
Outline
Audio Information Extraction
Speech, music, and other
- Speech recognition - Multi-speaker processing- Music classification- Other sounds
General sound organization
Future work & summary
1
2
3
4
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 8
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
• ‘State of the art ’ word-error rates (WERs):
- 2% (dictation) - 30% (telephone conversations)
• Can use multiple streams...
Featurecalculation
sound
Acousticclassifier
feature vectorsAcoustic model
parameters
HMMdecoder
Understanding/application...
phone probabilities
phone / word sequence
Word models
Language modelp("sat"|"the","cat")p("saw"|"the","cat")
s ah t
D A
T A
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 9
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
• Neural net estimates phone posteriors;but Gaussian mixtures model finer detail
• Combine them!
• Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’
Speechfeatures
Featurecalculation
Inputsound
Neural netclassifier
Nowaydecoder
Phoneprobabilities
Words
s ah t
C0
C1
C2
Cktn
tn+w
h#pclbcltcldcl
Hybrid Connectionist-HMM ASR
Speechfeatures
Featurecalculation
Inputsound
Gauss mixmodels
HTKdecoder
Subwordlikelihoods
Words
s ah t
Conventional ASR (HTK)
Speechfeatures
Featurecalculation
Inputsound
Neural netclassifier
Phoneprobabilities
C0
C1
C2
Cktn
tn+w
h#pclbcltcldcl
Tandem modeling
Gauss mixmodels
HTKdecoder
Subwordlikelihoods
Words
s ah t
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 10
Tandem system results: Aurora ‘noisy digits ’
(with Manuel Reyes)
• 50% of word errors corrected over baseline
• Beat even ‘bells and whistles ’ systemusing intensive large-vocabulary techniques
Aurora 2 Eurospeech 2001 Evaluation
- 1 0
0
1 0
2 0
3 0
4 0
5 0
6 0
Avg. rel. improvementRe
l im
pro
vem
en
t %
-
Mu
ltic
on
dit
ion
Columbia
Philips
UPC BarcelonaBell Labs
IBM
Motorola 1Motorola 2
NijmegenICSI/OGI/Qualcomm
ATR/Gri f f i th
AT&TAlcatel
Siemens
UCLAMicrosoft
SloveniaGranada
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 11
Missing data recognition
(Cooke, Green, Barker... @ Sheffield)
• Energy overlaps in time-freq. hide features
- some observations are effectively missing
• Use missing feature theory...
- integrate over missing data dimensions
x
m
• Effective in speech recognition
- trick is finding good/bad data mask
p x q( ) p xp xm q,( ) p xm q( ) xmd∫=
"1754" + noise
Missing-dataRecognition
stationary noise estimate
"1754"
Mask based on-5 0 5 10 15 20 Clean0
10
20
30
40
50
60
70
80
90 AURORA 2000 - Test Set AW
ER
SNR (dB)
MD Discrete SNRMD Soft SNRHTK clean trainingHTK multi-condition
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 12
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis- informal, overlapped speech
• Data collection (ICSI, UW, ...):
- 100 hours collected, ongoing transcription- headsets + tabletop + ‘PDA’
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 13
Crosstalk cancellation
• Baseline speaker activity detection is hard:
• Noisy crosstalk model:
• Estimate subband C
Aa
from A ’s peak energy
- ... including pure delay (10 ms frames)- ... then linear inversion
120 125 130 135 140 145 150 155 time / secs
speakeractive
level/dB
mr-2000-06-30-1600
Spkr A
Spkr B
Spkr C
Spkr D
Spkr E
Tabletop
0
20
40
breathnoise
crosstalk
backchannel(signals desire to regain floor?) floor seizure
interruptions
speaker Bcedes floor
m C s⋅ n+=
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 14
Speaker localization(with Wei Hee Huan)
• Tabletop mics form an array;time differences locate speakers
• Ambiguity:- mic positions not fixed- geometric symmetry
• Detect speaker activity, overlap
-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5-5
-4
-3
-2
-1
0
1
2
3
4
5
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 15
Music analysis: Structure recovery(with Rob Turetsky)
• Structure recovery by similarity matrices(after Foote)
- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales:notes, phrases, movements
- incorporating musical knowledge:‘theme similarity’
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 16
Music analysis: Lyrics extraction(with Adam Berenzweig)
• Vocal content is highly salient, useful for retrieval
• Can we find the singing? Use an ASR classi fier:
• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics→ training data for lyrics ASR...
time / sec time / sec time / sec0 1 2 0 1 2 0 1 2
freq
/ kH
z
0
2
4
phon
e cl
ass
speech (trnset #58)
Spe
ctro
gram
Pos
terio
gram
music (no vocals #1) singing (vocals #17 + 10.5s)
singing
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 17
Artist similarity
• Train network to discriminate speci fic artists:
• Focus on vocal segments for consistency
• .. then clustering for recommendation
XT
C
0 50 100 150 200 250 0 100 200 300
tT
ori
0 100 200 300 0 100 200 300 400 500
Bec
k
0 50 100 150 200 250 0 200 400 600
w60o40 stats based on LE plp12 2001-12-28M
ouse
0 100 200 3000 50 100 150
Art
ist c
lass
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 18
Alarm sound detection
• Alarm sounds have particular structure- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
- representation of energy in time-frequency- formation of atomic elements- grouping by common properties (onset &c.)- classify by attributes...
• Key: recognize despite background
freq
/ H
z
1 1.5 2 2.50
1000
2000
3000
4000
5000
time / sec1 1.5 2 2.5
0
1000
2000
3000
4000
5000 1
1 1.5 2 2.50
1000
2000
3000
4000
5000
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 19
Sound textures(with Marios Athineos)
• Textures: Large class of sounds - no clear pitch, onsets, shape- fire, rain, paper, machines, ...- ‘bulk’ subjective properties
• Abstract & synthesize by:- project into low-dimensional parameter space- learn dynamics within space- generate endless versions
50 100 150 200 250 300 350
10
20
30
40
50 100 150 200 250 300 350
2
4
6
8
-100 -90 -80 -70 -60 020
40-30
-25
-20
-15
-10
-5
0
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 20
Outline
Audio Information Extraction
Speech, music, and other
General sound organization- Computational Auditory Scene Analysis- Audio Information Retrieval
Future work & summary
1
2
3
4
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 21
Computational AuditoryScene Analysis (CASA)
• Goal: Automatic sound organization ;Systems to ‘pick out ’ sounds in a mixture- ... like people do
• E.g. voice against a noisy background- to improve speech recognition
• Approach:- psychoacoustics describes grouping ‘rules’- ... just implement them?
CASAObject 1 descriptionObject 2 descriptionObject 3 description...
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 22
CASA front-end processing
• Correlogram:Loosely based on known/possible physiology
- linear filterbank cochlear approximation- static nonlinearity- zero-delay slice is like spectrogram- periodicity from delay-and-multiply detectors
Cochleafilterbank
sound
envelopefollower
short-timeautocorrelation
delay line
freq
uenc
y ch
anne
ls
Correlogramslice
freq
lag
time
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 23
Adding top-down cues
Perception is not directbut a search for plausible hypotheses
• Data-driven (bottom-up)...
vs. Prediction-driven (top-down) (PDCASA)
• Motivations- detect non-tonal events (noise & click elements)- support ‘restoration illusions’...
→ hooks for high-level knowledge+ ‘complete explanation’, multiple hypotheses, ...
inputmixture
signalfeatures
discreteobjects
Front end Objectformation
Groupingrules
Sourcegroups
inputmixture
signalfeatures
predictionerrors
hypotheses
predictedfeaturesFront end Compare
& reconcile
Hypothesismanagement
Predict& combinePeriodic
components
Noisecomponents
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 24
PDCASA and complex scenes
−70
−60
−50
−40
dB
200400
100020004000
f/Hz Noise1
200400
100020004000
f/Hz Noise2,Click1
200400
100020004000
f/Hz City
0 1 2 3 4 5 6 7 8 9
50100200400
1000
Horn1 (10/10)
Crash (10/10)
Horn2 (5/10)
Truck (7/10)
Horn3 (5/10)
Squeal (6/10)
Horn4 (8/10)Horn5 (10/10)
0 1 2 3 4 5 6 7 8 9time/s
200400
100020004000
f/Hz Wefts1−4
50100200400
1000
Weft5 Wefts6,7 Weft8 Wefts9−12
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 25
Audio Information Retrieval(with Manuel Reyes)
• Searching in a database of audio- speech .. use ASR- text annotations .. search them- sound effects library?
• e.g. Muscle Fish “ SoundFisher” browser- define multiple ‘perceptual’ feature dimensions- search by proximity in (weighted) feature space
- features are ‘global’ for each soundfile,no attempt to separate mixtures
Segmentfeatureanalysis
Sound segmentdatabase
Segmentfeatureanalysis
Seach/comparison
Results
Query example
Feature vectors
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 26
Audio Retrieval: Results
• Musclefish corpus- most commonly reported set
• Features- mfcc, brightness, bandwidth, pitch ...- no temporal sequence structure
• Results: - 208 examples, 16 classses, 84% correct- confusions:
Instr Spch Env Anim Mech
Musical instrs. 136 (14)
Speech 17 (7) 2
Eviron. 2 6 (1)
Animals 2 2 1 (0)
Mechanical 1 15 (2)
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 27
CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
- features are calculated over grouped subsets
Genericelementanalysis
Continuous audioarchive
Genericelementanalysis
Seach/comparison
Results
Query example
Element representations
Objectformation
(Objectformation)
Word-to-classmapping
Objects + properties
Properties aloneSymbolic query
rushing water
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 28
Outline
Audio Information Extraction
Speech, music, and other
General sound organization
Future work & summary
1
2
3
4
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 29
Automatic audio-video analysis(with Prof. Shih-Fu Chang, Prof. Kathy McKeown)
• Documentary archive management- huge ratio of raw-to-finished material- costly manual logging- missed opportunities for cross-fertilization
• Problem: term <-> signal mapping- training corpus of past annotations- interactive semi-automatic learning- need object-related features
A/V segmentation
and feature
extraction
text processing
concept discovery
A/V feature unsupervised
clustering A/V data
MM Concept Networkannotations
A/V features
terms
generic detectors
Multimedia Fusion:
Mining Concepts &
RelationshipsComplex Spatio-
Temporal Classification
Question Answering
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 30
The ‘Machine listener’
• Goal: An auditory system for machines- use same environmental information as people
• Signal understanding- monitor for particular sounds- real-time description
• Scenarios
- personal listener → summary of your day- future prosthetic hearing device- autonomous robots
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 31
LabROSA Summary
DO
MA
INS
AP
PLI
CA
TIO
NS
ROSA
• Broadcast• Movies• Lectures
• Meetings• Personal recordings• Location monitoring
• Speech recognition• Speech characterization• Nonspeech recognition
• Object-based structure discovery & learning
• Scene analysis• Audio-visual integration• Music analysis
• Structuring• Search• Summarization• Awareness• Understanding
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 32
Summary: Audio Info Extraction
• Sound carries information- useful and detailed- often tangled in mixtures
• Various important general classes- Speech: activity, recognition- Music: segmentation, clustering- Other: detection, description
• General processing framework- Computational Auditory Scene Analysis- Audio Information Retrieval
• Future applications- Ubiquitous intelligent indexing- Intelligent monitoring & description
LabROSA
AIE @ MERL - Dan Ellis 2002-01-07 - 33
Audio Information Extraction:panacea or punishment?