general soundtrack analysis rosa labdpwe/talks/fbis-2002-03.pdfsoundtrack @ fbis - dan ellis...

LabROSA

Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 1

General Soundtrack Analysis

Dan Ellis<[email protected]>

Laboratory for Recognition and Organization of Speech and Audio(Lab

ROSA

)

Electrical Engineering, Columbia Universityhttp://labrosa.ee.columbia.edu/

Outline

LabROSA introduction

Broadcast soundtrack monitoring

Example technologies

Summary

1

2

3

4

LabROSA


LabROSA: Sound Organization

• Analyzing and describing complex sounds:

- continuous sound mixture

→

distinct objects & events

• Human listeners as the prototype

- strong subjective impression when listening- ..but hard to ‘see’ in signal

1

time / sec

freq

/ H

z0

2000

4000

6000

8000

0 5 10 15 20 25 30

alj0-20-50

Voice1Stab

Music1 Music2 Music2 (quiet)

Voice2

Analysis

LabROSA


The information in sound

• Hearing confers evolutionary advantage

- optimized to get ‘useful’ information from sound

• Enormous detail is available in familiar sounds

- ‘ecological’ influence on our sense of hearing

freq

/ H

z

0 1 2 3 40

1000

2000

3000

4000

time / s0 1 2 3 4

Steps 1 Steps 2

LabROSA


Audio Information Extraction

• Domain

- text ... speech ... music ... general audio

• Operation

- recognize ... index/retrieve ... organize

Audio InformationExtraction

AIE

Unsupervised ClusteringComputational

Auditory Scene Analysis

CASA

Blind Source Separation

BSS

Independent Component

AnalysisICA

AudioContent-based

RetrievalAudio CBR

MultimediaInformationRetrieval

MMIR

MusicInformation

RetrievalMusic IR

AudioFingerprintingSpoken

DocumentRetrieval

SDR

AutomaticSpeechRecognition

ASR

TextInformation

RetrievalIR

TextInformation Extraction

IE

LabROSA


LabROSA Summary

DO

MA

INS • Broadcast

• Movies• Lectures

• Meetings• Personal recordings• Location monitoring

• Music • HCIA

PP

LIC

ATIO

NS

• Multimedia access & search• Personal media management• Machine perception & awareness• Prostheses / human augmentation• Automatic judgments/recommendation

ROSA

• Object-based structure discovery & learning

• Speech recognition• Speaker description

• Nonspeech recognition

• Scene Analysis• Audio-visual integration

• Music analysis

LabROSA


Outline



- Available information- Information filtering- Intelligent analysis


Summary

1

2

3

4

LabROSA


Broadcast soundtrack monitoring:What information is available?

• Video and soundtrack reinforce, complement

- hard things in one domain can be easy in other

• Information at different scales:

2

seconds daystimescale

specific

generic

words

speaker ID

dialogprogramstylegenre

schedule

speakeremotion

changesstorycoverage

activitylevelsound

events

LabROSA


Information filtering

• Maximizing analyst utility:

• Automatic support for ‘triage’:

- segmentation- inferring schedules- identify repeated segments- generic categorization/labeling- task-specific flagging

informationselection

LabROSA


Automatic labeling/flagging

• Computer as vigilant monitor

• Range of tasks

- generic ‘unusual’ patterns- specific trigger events

• Issues

- false alarms vs. misses- combinations of simple detectors for higher-level

classes e.g. program type

MonitoringcomputerModels

Signal

Flaggedevents

LabROSA


Schedule summarization

• Learn the patterns of a particular broadcaster:

- 24/7 monitoring- automatic segmentation into programs- automatic classification of genres

• Support information extraction

- program types, repetitions, summarization- schedule changes

→

activity?

0000

2400

1200

0600

1800

News

Talk

Sports

Music

Other

M T W Th F S Su

LabROSA


Outline




- Sound enhancement- Locating repetitions- Labeling soundtracks

Summary

1

2

3

4

LabROSA


Example technologies:Sound enhancement

• Time scale modification

- speed up for rapid skimming

- slow down for close listening

• Noise reduction

3

freq

/ H

z

Original speech

2000

4000

6000

8000

Sped up by 50%

Slowed down to 50%

Noise reduced at pk - 20dB

time / sec

0

2000

4000

6000

8000

0 1 2 3 4 5 6 7 8 9

0

2000

4000

6000

8000

0

2000

4000

6000

8000

LabROSA


Repetition detection

• Matching signals is expensive

- but matching

strings

is fast

• Represent audio as simple string

- recognition into arbitrary classes- match detection becomes string matching

• Monitor for many ‘signatures’

soundtrack

Models

Recognizer

target Recognizer

String match

viol

v

v

v

perc

p

p

anim

a

a

a

a

mach

spee

s

cell

c

cp

detection

LabROSA


Soundtrack labeling

• Broad class: speech-music

• Specific events: alarms

time / sec

freq

/ H

zP

r(m

us)

0

2000

4000

6000

8000

0 5 10 15 20 25 300

0.5

1

alj0-20-50

time / sec

freq

/ H

zP

r(al

arm

)

0 1 2 3 4 5 6 7 8 9

0

2000

4000

0.5

1

Speech + alarm

LabROSA


Speaking style

• Recognizing information other than words

• Meeting Recorder project: Locate overlaps

• Speaker emotion

- depends on good baseline

120 125 130 135 140 145 150 155 time / secs

speakeractive

level/dB

mr-2000-06-30-1600

Spkr A

Spkr B

Spkr C

Spkr D

Spkr E

Tabletop

0

20

40

breathnoise

crosstalk

backchannel(signals desire to regain floor?) floor seizure

interruptions

speakecedes f

0

CAOMJL

5 10 15 20 25 30 35 40 45 50 55 60

mr-2000-11-02

LabROSA


Outline




Summary

1

2

3

4

LabROSA


Summary: Soundtrack analysis

• Soundtrack carries information

- useful and detailed- complementary to image- rapid processing

• Open source monitoring

- Need to find the interesting bits- Short-time: specific detectors- Long-time: schedule, genre classification

• Current techniques

- Signal enhancement- Detectors for speech, music, alarms etc.- Classification of interaction style, emotion

4

LabROSA


Extra slides

LabROSA


Automatic Speech Recognition (ASR)

• Standard speech recognition structure:

• ‘State of the art’ word-error rates (WERs):- 2% (dictation) - 30% (telephone conversations)

• Can use multiple streams...

Featurecalculation

sound

Acousticclassifier

feature vectorsAcoustic model

parameters

HMMdecoder

Understanding/application...

phone probabilities

phone / word sequence

Word models

Language modelp("sat"|"the","cat")p("saw"|"the","cat")

s ah t

D A

T A

LabROSA


The Meeting Recorder project(with ICSI, UW, SRI, IBM)

• Microphones in conventional meetings- for summarization/retrieval/behavior analysis- informal, overlapped speech

• Data collection (ICSI, UW, ...):

- 100 hours collected, ongoing transcription- headsets + tabletop + ‘PDA’

LabROSA


Crosstalk cancellation

• Baseline speaker activity detection is hard:

• Noisy crosstalk model:

• Estimate subband CAa from A’s peak energy

- ... including pure delay (10 ms frames)- ... then linear inversion

120 125 130 135 140 145 150 155 time / secs

speakeractive

level/dB

mr-2000-06-30-1600

Spkr A

Spkr B

Spkr C

Spkr D

Spkr E

Tabletop

0

20

40

breathnoise

crosstalk

backchannel(signals desire to regain floor?) floor seizure

interruptions

speaker Bcedes floor

m C s⋅ n+=

LabROSA


Alarm sound detection

• Alarm sounds have particular structure- people ‘know them when they hear them’

• Isolate alarms in sound mixtures

- representation of energy in time-frequency- formation of atomic elements- grouping by common properties (onset &c.)- classify by attributes...

• Key: recognize despite background

freq

/ H

z

1 1.5 2 2.50

1000

2000

3000

4000

5000

time / sec1 1.5 2 2.5

0

1000

2000

3000

4000

5000 1

1 1.5 2 2.50

1000

2000

3000

4000

5000

LabROSA


Audio Information Retrieval(with Manuel Reyes)

• Searching in a database of audio- speech .. use ASR- text annotations .. search them- sound effects library?

• e.g. Muscle Fish “SoundFisher” browser- define multiple ‘perceptual’ feature dimensions- search by proximity in (weighted) feature space

- features are ‘global’ for each soundfile,no attempt to separate mixtures

Segmentfeatureanalysis

Sound segmentdatabase

Segmentfeatureanalysis

Seach/comparison

Results

Query example

Feature vectors

LabROSA


Audio Retrieval: Results

• Musclefish corpus- most commonly reported set

• Features- mfcc, brightness, bandwidth, pitch ...- no temporal sequence structure

• Results: - 208 examples, 16 classes, 84% correct- confusions:

Instr Spch Env Anim Mech

Musical instrs. 136 (14)

Speech 17 (7) 2

Eviron. 2 6 (1)

Animals 2 2 1 (0)

Mechanical 1 15 (2)

LabROSA


CASA for audio retrieval

• When audio material contains mixtures, global features are insufficient

• Retrieval based on element/object analysis:

- features are calculated over grouped subsets

Genericelementanalysis

Continuous audioarchive

Genericelementanalysis

Seach/comparison

Results

Query example

Element representations

Objectformation

(Objectformation)

Word-to-classmapping

Objects + properties

Properties aloneSymbolic query

rushing water

general soundtrack analysis rosa labdpwe/talks/fbis-2002-03.pdfsoundtrack @ fbis - dan ellis...

Documents