

LabROSA

Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 1

General Soundtrack Analysis

Dan Ellis <[email protected]>

Laboratory for Recognition and Organization of Speech and Audio (LabROSA)

Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/

Outline

1 LabROSA introduction
2 Broadcast soundtrack monitoring
3 Example technologies
4 Summary


LabROSA: Sound Organization

• Analyzing and describing complex sounds:

- continuous sound mixture → distinct objects & events

• Human listeners as the prototype

- strong subjective impression when listening
- ...but hard to ‘see’ in signal

[Figure: spectrogram of soundtrack excerpt alj0-20-50, freq / Hz (0 to 8000) against time / sec (0 to 30). "Analysis" of the mixture yields the labeled events Voice1, Stab, Music1, Music2, Music2 (quiet) and Voice2.]


The information in sound

• Hearing confers evolutionary advantage

- optimized to get ‘useful’ information from sound

• Enormous detail is available in familiar sounds

- ‘ecological’ influence on our sense of hearing

[Figure: two spectrogram panels, "Steps 1" and "Steps 2", freq / Hz (0 to 4000) against time / s (0 to 4).]


Audio Information Extraction

• Domain

- text ... speech ... music ... general audio

• Operation

- recognize ... index/retrieve ... organize

[Figure: map of related research fields]

- Audio Information Extraction (AIE)
- Unsupervised Clustering
- Computational Auditory Scene Analysis (CASA)
- Blind Source Separation (BSS)
- Independent Component Analysis (ICA)
- Audio Content-based Retrieval (Audio CBR)
- Multimedia Information Retrieval (MMIR)
- Music Information Retrieval (Music IR)
- Audio Fingerprinting
- Spoken Document Retrieval (SDR)
- Automatic Speech Recognition (ASR)
- Text Information Retrieval (IR)
- Text Information Extraction (IE)


LabROSA Summary

DOMAINS

• Broadcast
• Movies
• Lectures
• Meetings
• Personal recordings
• Location monitoring
• Music
• HCI

APPLICATIONS

• Multimedia access & search
• Personal media management
• Machine perception & awareness
• Prostheses / human augmentation
• Automatic judgments/recommendation

ROSA

• Object-based structure discovery & learning
• Speech recognition
• Speaker description
• Nonspeech recognition
• Scene Analysis
• Audio-visual integration
• Music analysis


Outline

1 LabROSA introduction
2 Broadcast soundtrack monitoring
- Available information
- Information filtering
- Intelligent analysis
3 Example technologies
4 Summary


Broadcast soundtrack monitoring: What information is available?

• Video and soundtrack reinforce, complement

- hard things in one domain can be easy in other

• Information at different scales:

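[Figure: kinds of soundtrack information arranged by timescale (seconds to days) and by specificity (specific to generic): words, speaker ID, dialog, program, style, genre, schedule, speaker emotion, changes, story coverage, activity level, sound events.]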


Information filtering

• Maximizing analyst utility:

[Figure: information selection]

• Automatic support for ‘triage’:

- segmentation
- inferring schedules
- identifying repeated segments
- generic categorization/labeling
- task-specific flagging


Automatic labeling/flagging

• Computer as vigilant monitor

• Range of tasks

- generic ‘unusual’ patterns
- specific trigger events

• Issues

- false alarms vs. misses
- combinations of simple detectors for higher-level classes, e.g. program type

[Figure: block diagram. Signal → Monitoring computer (using Models) → Flagged events.]
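The "false alarms vs. misses" issue is a threshold trade-off: lowering a detector's threshold catches more true events but flags more non-events. A minimal sketch, assuming a detector that outputs one score per segment and that ground-truth labels are available for evaluation. The scores and labels are made-up illustration data, not results from the talk.

```python
def error_rates(scores, labels, threshold):
    """Return (false_alarm_rate, miss_rate) for a detector that flags score >= threshold."""
    false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    misses = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    n_neg = sum(1 for y in labels if not y)  # segments with no true event
    n_pos = sum(1 for y in labels if y)      # segments with a true event
    return false_alarms / max(1, n_neg), misses / max(1, n_pos)

# Hypothetical detector scores and ground truth (True = event present).
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [False, False, True, True, True, False]

for threshold in (0.3, 0.5):
    fa, miss = error_rates(scores, labels, threshold)
    print(f"threshold={threshold}: false alarms={fa:.2f}, misses={miss:.2f}")
```

Sweeping the threshold over a labeled sample like this is one way to pick an operating point that suits how costly a missed event is compared with a wasted analyst look.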


Schedule summarization

• Learn the patterns of a particular broadcaster:

- 24/7 monitoring
- automatic segmentation into programs
- automatic classification of genres

• Support information extraction

- program types, repetitions, summarization
- schedule changes

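[Figure: weekly schedule grid, time of day (0000 to 2400) against day of week (M T W Th F S Su), each slot marked by genre: News, Talk, Sports, Music, Other; one region is annotated "activity?".]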
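Learning a broadcaster's pattern and then spotting schedule changes can be sketched as two small steps: tally the genre most often seen in each (day, hour) slot, then compare a new week against that tally. This is only an illustration under assumed inputs. The `(day, hour, genre)` tuples stand in for the output of the segmentation and genre classification steps, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def summarize_schedule(segments):
    """Map each (day, hour) slot to the genre most often observed there."""
    counts = defaultdict(Counter)
    for day, hour, genre in segments:
        counts[(day, hour)][genre] += 1
    return {slot: c.most_common(1)[0][0] for slot, c in counts.items()}

def schedule_changes(typical, observed):
    """List slots whose observed genre differs from the learned typical genre."""
    return [(slot, typical[slot], genre)
            for slot, genre in observed.items()
            if slot in typical and typical[slot] != genre]

# Hypothetical classifier output accumulated over several weeks.
history = [("M", 6, "News"), ("M", 6, "News"), ("M", 6, "Talk"),
           ("T", 18, "Sports"), ("T", 18, "Sports")]
typical = summarize_schedule(history)

# Hypothetical observations for the current week.
this_week = {("M", 6): "Music", ("T", 18): "Sports"}
print(schedule_changes(typical, this_week))
```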


Outline

1 LabROSA introduction
2 Broadcast soundtrack monitoring
3 Example technologies
- Sound enhancement
- Locating repetitions
- Labeling soundtracks
4 Summary


Example technologies: Sound enhancement

• Time scale modification

- speed up for rapid skimming

- slow down for close listening

• Noise reduction

[Figure: four spectrogram panels, freq / Hz (0 to 8000) against time / sec (0 to 9): "Original speech", "Sped up by 50%", "Slowed down to 50%", "Noise reduced at pk - 20dB".]
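The slides do not say which time-scale modification algorithm was used. As a rough illustration of the idea, here is a plain overlap-add (OLA) sketch: frames are read from the input at one hop size and written to the output at another, so duration changes while each frame keeps its original pitch content. The function name and parameters are assumptions made for this example. Plain OLA produces audible artifacts on pitched sounds, which is why practical systems usually add waveform alignment (WSOLA) or use a phase vocoder.

```python
import math

def hann(n):
    """Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def ola_time_stretch(x, rate, frame_len=256, synth_hop=64):
    """Overlap-add time-scale modification.

    rate > 1 speeds playback up (shorter output); rate < 1 slows it down.
    """
    ana_hop = max(1, round(synth_hop * rate))   # input hop between frames
    window = hann(frame_len)
    n_frames = max(1, 1 + (len(x) - frame_len) // ana_hop)
    out_len = (n_frames - 1) * synth_hop + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len                      # summed window gain per sample
    for i in range(n_frames):
        a, s = i * ana_hop, i * synth_hop
        for k in range(frame_len):
            sample = x[a + k] if a + k < len(x) else 0.0
            out[s + k] += sample * window[k]
            norm[s + k] += window[k]
    # Divide out the overlapping window gain so the level stays constant.
    return [o / g if g > 1e-8 else 0.0 for o, g in zip(out, norm)]

# Hypothetical test tone: 0.5 s of a 220 Hz sine at 8 kHz.
tone = [math.sin(2 * math.pi * 220 * n / 8000) for n in range(4000)]
fast = ola_time_stretch(tone, 1.5)   # "sped up by 50%"
slow = ola_time_stretch(tone, 0.5)   # "slowed down to 50%"
print(len(tone), len(fast), len(slow))
```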


Repetition detection

• Matching signals is expensive

- but matching strings is fast

• Represent audio as simple string

- recognition into arbitrary classes
- match detection becomes string matching

• Monitor for many ‘signatures’

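[Figure: block diagram. The soundtrack and a target each pass through a Recognizer (driven by Models) that emits class labels (viol, perc, anim, mach, spee, cell), abbreviated to single letters (v, p, a, s, c); a String match stage compares the two strings and outputs a detection.]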
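Once each frame has been recognized into a class, repetition detection reduces to substring search. A minimal sketch, assuming one label per frame; the one-letter codes follow the figure where visible (v, p, a, s, c), while "m" for "mach" and the example label sequences are assumptions made for illustration.

```python
# One-letter codes per recognized class ("m" for "mach" is an assumption).
CLASS_CODES = {"viol": "v", "perc": "p", "anim": "a",
               "mach": "m", "spee": "s", "cell": "c"}

def to_string(labels):
    """Turn a sequence of frame-level class labels into a compact string."""
    return "".join(CLASS_CODES[label] for label in labels)

def find_repeats(stream, signature):
    """Return every start index at which the signature string occurs in the stream."""
    hits, start = [], stream.find(signature)
    while start != -1:
        hits.append(start)
        start = stream.find(signature, start + 1)
    return hits

# Hypothetical recognizer output for a soundtrack and for a target jingle.
soundtrack = to_string(["viol", "viol", "viol", "perc", "perc", "anim", "anim",
                        "spee", "cell", "viol", "viol", "viol", "perc", "perc"])
target = to_string(["viol", "viol", "viol", "perc", "perc"])
print(soundtrack, find_repeats(soundtrack, target))
```

Exact matching is the simplest case; since recognizer output is noisy, an approximate match (for example one based on edit distance) would likely be needed in practice.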


Soundtrack labeling

• Broad class: speech-music

• Specific events: alarms

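[Figure: two examples. Top: spectrogram of alj0-20-50, freq / Hz (0 to 8000) against time / sec (0 to 30), with the estimated Pr(mus) (0 to 1) plotted alongside. Bottom: spectrogram of "Speech + alarm", freq / Hz (0 to 4000) against time / sec (0 to 9), with the estimated Pr(alarm) (0 to 1) plotted alongside.]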
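A per-frame probability track such as Pr(mus) becomes a set of labeled segments after smoothing and thresholding. The sketch below assumes a moving-average smoother, a 0.5 threshold and one probability per second; all of these are illustrative choices rather than the settings used for the plots.

```python
def label_segments(probs, frame_sec=1.0, threshold=0.5, smooth=3):
    """Turn per-frame Pr(music) values into (start_sec, end_sec, label) segments."""
    half = smooth // 2
    smoothed = []
    for i in range(len(probs)):
        # Moving average over up to `smooth` neighboring frames.
        window = probs[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    segments = []
    for i, p in enumerate(smoothed):
        label = "music" if p >= threshold else "speech"
        if segments and segments[-1][2] == label:
            # Extend the current segment.
            segments[-1] = (segments[-1][0], (i + 1) * frame_sec, label)
        else:
            segments.append((i * frame_sec, (i + 1) * frame_sec, label))
    return segments

# Hypothetical Pr(music) values, one per second.
pr_music = [0.1, 0.2, 0.1, 0.9, 0.8, 0.9, 0.2, 0.1]
print(label_segments(pr_music))
```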


Speaking style

• Recognizing information other than words

• Meeting Recorder project: Locate overlaps

• Speaker emotion

- depends on good baseline

[Figure: speaker activity in meeting mr-2000-06-30-1600, level / dB (0 to 40) against time / secs (120 to 155), one trace each for Spkr A to Spkr E and the Tabletop microphone, with a "speaker active" indication. Annotated events: breath noise, crosstalk, backchannel (signals desire to regain floor?), floor seizure, interruptions, speaker B cedes floor. A second panel shows meeting mr-2000-11-02 with rows labeled CAOMJL over a 0 to 60 axis.]
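Given a per-speaker "active" decision for each frame, locating overlaps amounts to finding runs of frames in which two or more speakers are active. A minimal sketch, assuming binary activity tracks of equal frame rate; the speaker names and values are invented for illustration.

```python
def find_overlaps(activity):
    """Return (start_frame, end_frame) runs where at least two speakers are active.

    `activity` maps a speaker name to a list of 0/1 flags, one per frame;
    end_frame is exclusive.
    """
    n = min(len(track) for track in activity.values())
    regions, start = [], None
    for t in range(n):
        active = sum(1 for track in activity.values() if track[t])
        if active >= 2 and start is None:
            start = t                      # overlap begins
        elif active < 2 and start is not None:
            regions.append((start, t))     # overlap ends
            start = None
    if start is not None:
        regions.append((start, n))
    return regions

# Hypothetical activity tracks for three speakers.
activity = {
    "A": [1, 1, 1, 0, 0, 0, 1, 1],
    "B": [0, 0, 1, 1, 1, 0, 0, 1],
    "C": [0, 0, 0, 0, 0, 0, 0, 0],
}
print(find_overlaps(activity))
```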


Outline

1 LabROSA introduction
2 Broadcast soundtrack monitoring
3 Example technologies
4 Summary


Summary: Soundtrack analysis

• Soundtrack carries information

- useful and detailed
- complementary to image
- rapid processing

• Open source monitoring

- Need to find the interesting bits
- Short-time: specific detectors
- Long-time: schedule, genre classification

• Current techniques

- Signal enhancement
- Detectors for speech, music, alarms etc.
- Classification of interaction style, emotion


Extra slides


Automatic Speech Recognition (ASR)

• Standard speech recognition structure:

• ‘State of the art’ word-error rates (WERs):
- 2% (dictation)
- 30% (telephone conversations)

• Can use multiple streams...
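Word-error rate is conventionally the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and a reference transcript, divided by the number of reference words. A short sketch of that standard definition; the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    prev = list(range(len(hyp) + 1))       # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# One substitution (sat -> saw) and one deletion (the) against six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat saw on mat")
print(f"WER = {wer:.1%}")
```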

[Figure: ASR block diagram. sound → Feature calculation → feature vectors → Acoustic classifier → phone probabilities → HMM decoder → phone / word sequence → Understanding/application... Training DATA supplies the Acoustic model parameters, the Word models (e.g. "s ah t") and the Language model (e.g. p("sat"|"the","cat"), p("saw"|"the","cat")).]
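The language-model terms in the diagram, p("sat"|"the","cat") and p("saw"|"the","cat"), are trigram probabilities: how likely a word is given the two words before it. A minimal maximum-likelihood sketch from counts, using a tiny invented corpus; real systems train on far more text and add smoothing for unseen word sequences.

```python
from collections import Counter

def train_trigram(sentences):
    """Count word trigrams and the two-word contexts that precede them."""
    trigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts

def trigram_prob(trigrams, contexts, word, context):
    """Maximum-likelihood estimate of p(word | context), 0.0 for unseen contexts."""
    seen = contexts[context]
    return trigrams[context + (word,)] / seen if seen else 0.0

# Hypothetical training text.
corpus = ["the cat sat on the mat", "the cat sat down", "the cat saw the dog"]
trigrams, contexts = train_trigram(corpus)
print(trigram_prob(trigrams, contexts, "sat", ("the", "cat")))
print(trigram_prob(trigrams, contexts, "saw", ("the", "cat")))
```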


The Meeting Recorder project (with ICSI, UW, SRI, IBM)

• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech

• Data collection (ICSI, UW, ...):

- 100 hours collected, ongoing transcription
- headsets + tabletop + ‘PDA’


Crosstalk cancellation

• Baseline speaker activity detection is hard:

[Figure: speaker activity in meeting mr-2000-06-30-1600, level / dB (0 to 40) against time / secs (120 to 155), one trace each for Spkr A to Spkr E and the Tabletop microphone, with a "speaker active" indication. Annotated events: breath noise, crosstalk, backchannel (signals desire to regain floor?), floor seizure, interruptions, speaker B cedes floor.]

• Noisy crosstalk model:

$m = C \cdot s + n$

• Estimate subband $C_{Aa}$ from A’s peak energy

- ... including pure delay (10 ms frames)
- ... then linear inversion
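The "linear inversion" step can be illustrated for two channels: if the microphone signals follow m = C·s + n and the coupling matrix C has been estimated, the speaker signals are recovered approximately as C⁻¹·m, ignoring the noise term. This sketch works on a single subband energy value per channel and uses an invented coupling matrix; it is not the project's implementation, and the delay estimation is left out.

```python
def invert_crosstalk_2x2(m, C):
    """Solve m = C @ s for s in the two-channel case (noise term ignored)."""
    (a, b), (c, d) = C
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("coupling matrix is singular; channels cannot be separated")
    s0 = (d * m[0] - b * m[1]) / det
    s1 = (-c * m[0] + a * m[1]) / det
    return [s0, s1]

# Hypothetical coupling: each headset picks up 20-30% of the other speaker.
C = [[1.0, 0.2],
     [0.3, 1.0]]
# Only speaker 0 is talking (s = [2.0, 0.0]); both microphones still register energy.
m = [1.0 * 2.0 + 0.2 * 0.0, 0.3 * 2.0 + 1.0 * 0.0]
print(invert_crosstalk_2x2(m, C))
```

After inversion the second channel falls back to roughly zero, which is what makes a simple energy threshold usable for speaker activity detection.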


Alarm sound detection

• Alarm sounds have particular structure
- people ‘know them when they hear them’

• Isolate alarms in sound mixtures

- representation of energy in time-frequency
- formation of atomic elements
- grouping by common properties (onset &c.)
- classify by attributes...

• Key: recognize despite background

[Figure: three spectrogram panels, freq / Hz (0 to 5000) against time / sec (1 to 2.5).]
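The "grouping by common properties (onset &c.)" step can be sketched for the onset cue alone: atomic elements that start at nearly the same time are assumed to belong to the same sound. The element representation as (onset_sec, freq_hz) pairs and the 50 ms tolerance are assumptions for illustration; the slides do not specify either.

```python
def group_by_onset(elements, tol=0.05):
    """Group (onset_sec, freq_hz) elements whose onsets fall within `tol` of the previous one."""
    groups = []
    for onset, freq in sorted(elements):
        if groups and onset - groups[-1][-1][0] <= tol:
            groups[-1].append((onset, freq))   # close to the previous onset: same group
        else:
            groups.append([(onset, freq)])     # start a new group
    return groups

# Hypothetical elements: two harmonic alarm bursts plus one unrelated background element.
elements = [(1.00, 800), (1.02, 1600), (1.01, 2400),
            (1.50, 500),
            (2.00, 800), (2.03, 1600)]
for group in group_by_onset(elements):
    print(group)
```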


Audio Information Retrieval (with Manuel Reyes)

• Searching in a database of audio
- speech .. use ASR
- text annotations .. search them
- sound effects library?

• e.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
- features are ‘global’ for each soundfile, no attempt to separate mixtures

[Figure: retrieval block diagram. The Sound segment database and the Query example each pass through Segment feature analysis to give Feature vectors; Search/comparison of the two produces Results.]
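"Search by proximity in (weighted) feature space" can be sketched as a nearest-neighbor ranking under a weighted Euclidean distance. This is an illustration only and makes no claim about how SoundFisher computes its distances; the three feature dimensions and all values are hypothetical.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def search(query, database, weights, k=3):
    """Return the names of the k database sounds closest to the query."""
    ranked = sorted(database.items(),
                    key=lambda item: weighted_distance(query, item[1], weights))
    return [name for name, _ in ranked[:k]]

# Hypothetical global features per sound file: (brightness, bandwidth, pitch), scaled to 0..1.
database = {
    "violin": [0.8, 0.3, 0.9],
    "drum":   [0.4, 0.9, 0.1],
    "bird":   [0.9, 0.5, 0.8],
}
query = [0.85, 0.35, 0.9]
print(search(query, database, weights=[1.0, 1.0, 1.0], k=2))
```

Raising the weight of one dimension makes differences along it count for more, which is how a user could ask for "similar pitch, whatever the brightness".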


Audio Retrieval: Results

• Musclefish corpus
- most commonly reported set

• Features
- mfcc, brightness, bandwidth, pitch ...
- no temporal sequence structure

• Results:
- 208 examples, 16 classes, 84% correct
- confusions (columns: Instr, Spch, Env, Anim, Mech; entries listed in row order):

Musical instrs.: 136 (14)
Speech: 17 (7), 2
Environ.: 2, 6 (1)
Animals: 2, 2, 1 (0)
Mechanical: 1, 15 (2)


CASA for audio retrieval

• When audio material contains mixtures, global features are insufficient

• Retrieval based on element/object analysis:

- features are calculated over grouped subsets

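[Figure: CASA retrieval block diagram. The Continuous audio archive and the Query example each pass through Generic element analysis to give Element representations, then Object formation (shown in parentheses on the query side) to give Objects + properties; Search/comparison produces Results. A Symbolic query (e.g. "rushing water") enters through a Word-to-class mapping that supplies Properties alone.]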
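The symbolic-query path in the diagram (words mapped to classes, then matched on properties alone) can be sketched as a lookup followed by a ranking on property overlap. Everything concrete here is hypothetical: the property vocabulary, the word-to-class table and the object descriptions are invented to show the shape of the idea, not taken from the talk.

```python
# Hypothetical mapping from query words to sound-object properties.
WORD_TO_CLASS = {
    "rushing": {"noisy", "continuous"},
    "water": {"broadband", "continuous"},
}

def query_properties(text):
    """Union of the properties implied by each known word in the query."""
    wanted = set()
    for word in text.lower().split():
        wanted |= WORD_TO_CLASS.get(word, set())
    return wanted

def rank_objects(objects, wanted):
    """Rank object names by how many wanted properties each object has."""
    return sorted(objects, key=lambda name: -len(objects[name] & wanted))

# Hypothetical objects extracted from an archive, each with its properties.
objects = {
    "obj1": {"noisy", "continuous", "broadband"},
    "obj2": {"tonal", "periodic"},
    "obj3": {"continuous"},
}
wanted = query_properties("rushing water")
print(rank_objects(objects, wanted))
```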