general soundtrack analysis rosa labdpwe/talks/fbis-2002-03.pdfsoundtrack @ fbis - dan ellis...
TRANSCRIPT
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 1
General Soundtrack Analysis
Dan Ellis<[email protected]>
Laboratory for Recognition and Organization of Speech and Audio(Lab
ROSA
)
Electrical Engineering, Columbia Universityhttp://labrosa.ee.columbia.edu/
Outline
LabROSA introduction
Broadcast soundtrack monitoring
Example technologies
Summary
1
2
3
4
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 2
LabROSA: Sound Organization
• Analyzing and describing complex sounds:
- continuous sound mixture
→
distinct objects & events
• Human listeners as the prototype
- strong subjective impression when listening- ..but hard to ‘see’ in signal
1
time / sec
freq
/ H
z0
2000
4000
6000
8000
0 5 10 15 20 25 30
alj0-20-50
Voice1Stab
Music1 Music2 Music2 (quiet)
Voice2
Analysis
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 3
The information in sound
• Hearing confers evolutionary advantage
- optimized to get ‘useful’ information from sound
• Enormous detail is available in familiar sounds
- ‘ecological’ influence on our sense of hearing
freq
/ H
z
0 1 2 3 40
1000
2000
3000
4000
time / s0 1 2 3 4
Steps 1 Steps 2
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 4
Audio Information Extraction
• Domain
- text ... speech ... music ... general audio
• Operation
- recognize ... index/retrieve ... organize
Audio InformationExtraction
AIE
Unsupervised ClusteringComputational
Auditory Scene Analysis
CASA
Blind Source Separation
BSS
Independent Component
AnalysisICA
AudioContent-based
RetrievalAudio CBR
MultimediaInformationRetrieval
MMIR
MusicInformation
RetrievalMusic IR
AudioFingerprintingSpoken
DocumentRetrieval
SDR
AutomaticSpeechRecognition
ASR
TextInformation
RetrievalIR
TextInformation Extraction
IE
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 5
LabROSA Summary
DO
MA
INS • Broadcast
• Movies• Lectures
• Meetings• Personal recordings• Location monitoring
• Music • HCIA
PP
LIC
ATIO
NS
• Multimedia access & search• Personal media management• Machine perception & awareness• Prostheses / human augmentation• Automatic judgments/recommendation
ROSA
• Object-based structure discovery & learning
• Speech recognition• Speaker description
• Nonspeech recognition
• Scene Analysis• Audio-visual integration
• Music analysis
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 6
Outline
LabROSA introduction
Broadcast soundtrack monitoring
- Available information- Information filtering- Intelligent analysis
Example technologies
Summary
1
2
3
4
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 7
Broadcast soundtrack monitoring:What information is available?
• Video and soundtrack reinforce, complement
- hard things in one domain can be easy in other
• Information at different scales:
2
seconds daystimescale
specific
generic
words
speaker ID
dialogprogramstylegenre
schedule
speakeremotion
changesstorycoverage
activitylevelsound
events
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 8
Information filtering
• Maximizing analyst utility:
• Automatic support for ‘triage’:
- segmentation- inferring schedules- identify repeated segments- generic categorization/labeling- task-specific flagging
informationselection
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 9
Automatic labeling/flagging
• Computer as vigilant monitor
• Range of tasks
- generic ‘unusual’ patterns- specific trigger events
• Issues
- false alarms vs. misses- combinations of simple detectors for higher-level
classes e.g. program type
MonitoringcomputerModels
Signal
Flaggedevents
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 10
Schedule summarization
• Learn the patterns of a particular broadcaster:
- 24/7 monitoring- automatic segmentation into programs- automatic classification of genres
• Support information extraction
- program types, repetitions, summarization- schedule changes
→
activity?
0000
2400
1200
0600
1800
News
Talk
Sports
Music
Other
M T W Th F S Su
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 11
Outline
LabROSA introduction
Broadcast soundtrack monitoring
Example technologies
- Sound enhancement- Locating repetitions- Labeling soundtracks
Summary
1
2
3
4
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 12
Example technologies:Sound enhancement
• Time scale modification
- speed up for rapid skimming
- slow down for close listening
• Noise reduction
3
freq
/ H
z
Original speech
2000
4000
6000
8000
Sped up by 50%
Slowed down to 50%
Noise reduced at pk - 20dB
time / sec
0
2000
4000
6000
8000
0 1 2 3 4 5 6 7 8 9
0
2000
4000
6000
8000
0
2000
4000
6000
8000
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 13
Repetition detection
• Matching signals is expensive
- but matching
strings
is fast
• Represent audio as simple string
- recognition into arbitrary classes- match detection becomes string matching
• Monitor for many ‘signatures’
soundtrack
Models
Recognizer
target Recognizer
String match
viol
v
v
v
perc
p
p
anim
a
a
a
a
mach
spee
s
cell
c
cp
detection
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 14
Soundtrack labeling
• Broad class: speech-music
• Specific events: alarms
time / sec
freq
/ H
zP
r(m
us)
0
2000
4000
6000
8000
0 5 10 15 20 25 300
0.5
1
alj0-20-50
time / sec
freq
/ H
zP
r(al
arm
)
0 1 2 3 4 5 6 7 8 9
0
2000
4000
0.5
1
Speech + alarm
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 15
Speaking style
• Recognizing information other than words
• Meeting Recorder project: Locate overlaps
• Speaker emotion
- depends on good baseline
120 125 130 135 140 145 150 155 time / secs
speakeractive
level/dB
mr-2000-06-30-1600
Spkr A
Spkr B
Spkr C
Spkr D
Spkr E
Tabletop
0
20
40
breathnoise
crosstalk
backchannel(signals desire to regain floor?) floor seizure
interruptions
speakecedes f
0
CAOMJL
5 10 15 20 25 30 35 40 45 50 55 60
mr-2000-11-02
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 16
Outline
LabROSA introduction
Broadcast soundtrack monitoring
Example technologies
Summary
1
2
3
4
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 17
Summary: Soundtrack analysis
• Soundtrack carries information
- useful and detailed- complementary to image- rapid processing
• Open source monitoring
- Need to find the interesting bits- Short-time: specific detectors- Long-time: schedule, genre classification
• Current techniques
- Signal enhancement- Detectors for speech, music, alarms etc.- Classification of interaction style, emotion
4
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 18
Extra slides
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 19
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
• ‘State of the art’ word-error rates (WERs):- 2% (dictation) - 30% (telephone conversations)
• Can use multiple streams...
Featurecalculation
sound
Acousticclassifier
feature vectorsAcoustic model
parameters
HMMdecoder
Understanding/application...
phone probabilities
phone / word sequence
Word models
Language modelp("sat"|"the","cat")p("saw"|"the","cat")
s ah t
D A
T A
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 20
The Meeting Recorder project(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings- for summarization/retrieval/behavior analysis- informal, overlapped speech
• Data collection (ICSI, UW, ...):
- 100 hours collected, ongoing transcription- headsets + tabletop + ‘PDA’
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 21
Crosstalk cancellation
• Baseline speaker activity detection is hard:
• Noisy crosstalk model:
• Estimate subband CAa from A’s peak energy
- ... including pure delay (10 ms frames)- ... then linear inversion
120 125 130 135 140 145 150 155 time / secs
speakeractive
level/dB
mr-2000-06-30-1600
Spkr A
Spkr B
Spkr C
Spkr D
Spkr E
Tabletop
0
20
40
breathnoise
crosstalk
backchannel(signals desire to regain floor?) floor seizure
interruptions
speaker Bcedes floor
m C s⋅ n+=
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 22
Alarm sound detection
• Alarm sounds have particular structure- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
- representation of energy in time-frequency- formation of atomic elements- grouping by common properties (onset &c.)- classify by attributes...
• Key: recognize despite background
freq
/ H
z
1 1.5 2 2.50
1000
2000
3000
4000
5000
time / sec1 1.5 2 2.5
0
1000
2000
3000
4000
5000 1
1 1.5 2 2.50
1000
2000
3000
4000
5000
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 23
Audio Information Retrieval(with Manuel Reyes)
• Searching in a database of audio- speech .. use ASR- text annotations .. search them- sound effects library?
• e.g. Muscle Fish “SoundFisher” browser- define multiple ‘perceptual’ feature dimensions- search by proximity in (weighted) feature space
- features are ‘global’ for each soundfile,no attempt to separate mixtures
Segmentfeatureanalysis
Sound segmentdatabase
Segmentfeatureanalysis
Seach/comparison
Results
Query example
Feature vectors
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 24
Audio Retrieval: Results
• Musclefish corpus- most commonly reported set
• Features- mfcc, brightness, bandwidth, pitch ...- no temporal sequence structure
• Results: - 208 examples, 16 classes, 84% correct- confusions:
Instr Spch Env Anim Mech
Musical instrs. 136 (14)
Speech 17 (7) 2
Eviron. 2 6 (1)
Animals 2 2 1 (0)
Mechanical 1 15 (2)
LabROSA
Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 25
CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
- features are calculated over grouped subsets
Genericelementanalysis
Continuous audioarchive
Genericelementanalysis
Seach/comparison
Results
Query example
Element representations
Objectformation
(Objectformation)
Word-to-classmapping
Objects + properties
Properties aloneSymbolic query
rushing water