labrosa research overview

LabROSA Overview - Dan Ellis 2010-09-10 /191

1. Real-World Sound2. Speech Separation3. Environmental Audio Classification4. Music Audio Analysis

LabROSA Research Overview

Dan EllisLaboratory for Recognition and Organization of Speech and Audio

Dept. Electrical Eng., Columbia Univ., NY USA

dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/

LabROSA Overview

InformationExtraction

MachineLearning

SignalProcessing

Speech

Music EnvironmentRecognition

Retrieval

Separation

1. Real-World Sound

• Sounds rarely occur in isolation.. so analyzing mixtures (“scenes”) is a problem.. for humans and machines

02_m+s-15-evil-goodvoice-fade

0 2 4 6 8 10 12 time/s

frq/Hz

Voice (evil)Stab

Rumble StringsChoir

Voice (pleasant)

Analysis

level / dB-60

Auditory Scene Analysis

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregmanʼ90)

• Received waveform is a mixture2 sensors, N sources - underconstrained

• Use prior knowledge (models) to constrain

2. Speech Separation

• Given models for sources, find “best” (most likely) states for spectra:

can include sequential constraints...

• E.g. stationary noise:

{i1(t), i2(t)} = argmaxi1,i2p(x(t)|i1, i2)p(x|i1, i2) = N (x;ci1+ ci2,Σ) combination

inference ofsource state

time / s

Original speech

In speech-shaped noise (mel magsnr = 2.41 dB)

VQ inferred states (mel magsnr = 3.6 dB)

Roweis ’01, ’03Kristjannson ’04, ’06

Eigenvoices• Idea: Find speaker model

parameter space

generalize without losing detail?

• Eigenvoice model:280 states x 320 bins= 89,600 dimensions10-30 dimensions

Weiss & Ellis ’09, ’10

Speaker modelsSpeaker subspace bases

µ = µ̄ + U w + B hadapted mean eigenvoice weights channel channelmodel voice bases bases weights

Mean Voice

b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aaaw ay ah aoowuw ax

Eigenvoice dimension 1

z)Eigenvoice dimension 2

Eigenvoice dimension 3

Speaker-Adapted Separation

• Eigenvoices for Speech Separation taskspeaker adapted (SA) performs midway between speaker-dependent (SD) & speaker-indep (SI)

3. Soundtrack Classification

• Short video clips as the evolution of snapshots10-100 sec, one location, no editingbrowsing?

• Need information for indexing...video + audioforeground + background

MFCC Covariance Representation

• Each clip/segment → fixed-size statisticssimilar to speaker ID and music genre classification

• Full Covariance matrix of MFCCs maps the kinds of spectral shapes present

• Clip-to-clip distances for SVM classifierby KL or 2nd Gaussian model

VTS_04_0001 - Spectrogram

1 2 3 4 5 6 7 8 9012345678

time / sec

level / dB

1 2 3 4 5 6 7 8 92468

101214161820

-20-15-10-505101520

MFCC dimension

MFCC covariance

5 10 15 20

Video Soundtrack

MFCCfeatures

MFCCCovariance

Matrix

Classification Results

• Wide range:

audio (music, ski) vs. non-audio (group, night)large AP uncertainty on infrequent classes

Animal

BabyBeach

Birthday

Group of 3

Group of 2

Museum

One person

ParkPicn

Playground

ShowSports

Sunset

Wedding

Dancing

Parade

SingingCheer

Concepts

1G + KL1G + MahGMM Hist. + pLSAGuess

Chang, Ellis et al. ’07Lee & Ellis ’10

Matching Videos via Fingerprints

• Landmark pairs are a noise-robust fingerprint

• Use to match distinct videos with same sound ambience

Cotton & Ellis ’10

VIdeo IMpLQaiHWbE at 195s

VIdeo Yi1hkNkqHBc at 218 s

195.5 196 196.5 197 197.5 198 198.5 199

218.5 219 219.5 220 220.5 221 221.5 2220

time / sec

4. Music Audio Analysis

• ... at all levels from notes to genres

162 164 166 168 170 172 174 time / s

time / beats

level / dB

Signal

Onsets& Beats

Per-framechroma

Per-beatnormalized

chroma

Melody

intensity0

0.50.25

Let it Be (final verse)

390 395 400 405 410 415

High-Resolution Transcription

• Extract notes from audiodespite overlapping voiceswith precise tuning, timing

• Use musical scoretime alignment & approximate match

• Find most likely parametersfundamental, amplitudes of all harmonics

Smit & Ellis ’09

Spectrogram of triplum track: YIN (red)

5 6 7 8 9 10200

Time (s)

Spectrogram of combined trackes: YIN (red), Prob (blue)

5 6 7 8 9 10200

Polyphonic Transcription

• Apply the Eigenvoice idea to musiceigeninstruments? • Subspace NMF

Grindlay & Ellis ’09

Chord Recognition

• StructSVM vs. HMM: better results

Weller, Ellis, Jebara ’09

Melodic-Harmonic Mining

• 100,000 tracks from morecowbell.djas Echo Nest Analyze

• Frequent clusters of 12 x 8 binarized event-chroma

#1 (3491) #2 (2775) #3 (2255) #4 (1241) #5 (1224) #6 (1218) #7 (1092) #8 (1084) #9 (1080) #10 (1035)

#11 (1021)

#1 (3491)

#12 (1005)

)#2 (2775)

#13 (974)

#3 (2255)

#14 (942)

)#4 (1241)

#15 (936)

)#5 (1224)

#16 (924)

)#6 (1218)

#17 (920)

)#7 (1092)

#18 (913)

)#8 (1084)

#19 (901)

)) #10 (1035)#9 (1080)

#20 (897)#

#21 (887)

#11 (1021) #

#21 (887)#21 (887) #22 (882)

)#12 (1005)

))#22 (882)#22 (882) #23 (881)

#13 (974)

#23 (881)#23 (881) #24 (881)

)#14 (942)

))#24 (881)#24 (881) #25 (879)

)#15 (936)

))#25 (879)#25 (879) #26 (875)

)#16 (924)

))#26 (875)#26 (875) #27 (875)

)#17 (920)

))#27 (875)#27 (875) #28 (874)

)#18 (913)

))#28 (874)#28 (874) #29 (868)

) #20 (897)#19 (901)

))#29 (868)#29 (868) #30 (844)

#31 (839) #32 (839) #33 (794) #34 (786) #35 (785) #36 (747) #37 (731) #38 (714) #39 (706) #40 (698)

#41 (682)

#31 (839)#31 (839)

#42 (678)

))#32 (839)#32 (839)

#43 (675)

#33 (794)#33 (794)

#44 (657)

))#34 (786)#34 (786)

#45 (656)

))#35 (785)#35 (785)

#46 (651)

))#36 (747)#36 (747)

#47 (647)

))#37 (731)#37 (731)

#48 (638)

))#38 (714)#38 (714)

#49 (610)

)) #40 (698)#40 (698)#39 (706)#39 (706)

#50 (593)

#51 (592)

#41 (682)#41 (682)

#52 (591)

))#42 (678)#42 (678)

#53 (589)

#43 (675)#43 (675)

#54 (572)

))#44 (657)#44 (657)

#55 (571)

))#45 (656)#45 (656)

#56 (550)

))#46 (651)#46 (651)

#57 (549)

))#47 (647)#47 (647)

#58 (534)

))#48 (638)#48 (638)

#59 (534)

)) #50 (593)#50 (593)#49 (610)#49 (610)

#60 (531)

#61 (528)

#51 (592)

#62 (525)

)#52 (591)

#63 (522)

#53 (589)

#64 (514)

)#54 (572)

#65 (510)

)#55 (571)

#66 (507)

)#56 (550)

#67 (500)

)#57 (549)

#68 (497)

)#58 (534)

#69 (486)

) #60 (531)#59 (534)

#70 (479)

#71 (476)

#61 (528)#61 (528)

#72 (468)

))#62 (525)#62 (525)

#73 (468)

#63 (522)#63 (522)

#74 (466)

))#64 (514)#64 (514)

#75 (463)

))#65 (510)#65 (510)

#76 (454)

))#66 (507)#66 (507)

#77 (453)

))#67 (500)#67 (500)

#78 (448)

))#68 (497)#68 (497)

#79 (441)

)) #70 (479)#70 (479)#69 (486)#69 (486)

#80 (440)

#81 (435)

#71 (476)#71 (476)

#82 (430)

))#72 (468)#72 (468)

#83 (430)

#73 (468)#73 (468)

#84 (425)

))#74 (466)#74 (466)

#85 (425)

))#75 (463)#75 (463)

#86 (419)

))#76 (454)#76 (454)

#87 (419)

))#77 (453)#77 (453)

#88 (417)

))#78 (448)#78 (448)

#89 (416)

)) #80 (440)#80 (440)#79 (441)#79 (441)

#90 (414)

#91 (411)

#81 (435)#81 (435)

#91 (411)#91 (411) #92 (410)

))#82 (430)#82 (430)

))#92 (410)#92 (410) #93 (408)

#83 (430)#83 (430)

#93 (408)#93 (408) #94 (406)

))#84 (425)#84 (425)

))#94 (406)#94 (406) #95 (401)

))#85 (425)#85 (425)

))#95 (401)#95 (401) #96 (398)

))#86 (419)#86 (419)

))#96 (398)#96 (398) #97 (397)

))#87 (419)#87 (419)

))#97 (397)#97 (397) #98 (396)

))#88 (417)#88 (417)

))#98 (396)#98 (396) #99 (396)

)) #90 (414)#90 (414)#89 (416)#89 (416)

))#99 (396)#99 (396) #100 (395)

Musicaudio

LocalitySensitive

Hash Table

Beattracking

Chromafeatures

Keynormalization

Landmarkidentification

Bertin-Mahieux et al. ’10

Results - Beatles• Over 86 Beatles tracks

• All beat offsets = 41,705 patchesLSH takes 300 sec - approx NlogN in patches?

• High-pass along time to avoid sustainednotes

• Song filterremove hitsin same track

02-I Should Have Known Better 92.4-97.7s

05-Here There And Everywhere 12.1-20.5s

09-Martha My Dear 90.9-98.6s

5 10 15 20

12-Piggies 22.0-29.6s

5 10 15 20

Summary

• LabROSA : getting information from sound

• Speechmonaural separation using eigenvoices

binaural + reverb using MESSL

• Environmentalclassification of consumer video

landmark-based events and matching

• Musictranscription of notes, chords, ...

large corpus mining19

labrosa research overview

Documents

research project overview

research data overview

demand response research center research overview

overview research

sound, mixtures, and learning: labrosa...

clinical research overview

research overview

watershed research-overview

experimental research overview

01 overview research

networking research overview

research information management - oclc research overview

exploring marine mammal bioacoustics overview of ongoing...

sound, mixtures, and learning: labrosa overview

caval research overview

bioinformatics research overview

labrosa research overviewdpwe/talks/rosaview-2011-09.pdf ·...

twitter research overview

research overview:

overview of research