Computer Vision for Music Identification (eugenew/publications/ke-cvmusicid05-talk.pdf)
Computer Vision for Music Identification
Computer Vision and Pattern Recognition (CVPR) 2005
Yan Ke, Derek Hoiem, and Rahul Sukthankar, CMU
Presented by Eugene Weinstein, April 4th, 2006
Music Identification
- Scenario:
  - User records a few seconds of audio
  - Need to match it to a database of songs
- Recording could be distorted due to
  - Noise: background, crosstalk, etc.
  - Transmission over limited channels (e.g., cell phone)
- Recording can come from any point of the song
  - Must align to correct position in reference recording
- Practical issues: must support >100,000 songs
Main Contributions
- Novel computer vision approach to an audio task
- Pairwise variant of boosting
- Functional music identification system with state-of-the-art performance
Background Topics To Be Covered
- Spectrograms
- Jones/Viola image features
- Boosting/AdaBoost
- Expectation Maximization (EM)
- Random Sample Consensus (RANSAC)
Spectrogram
- Graphical representation of frequency content of sound
- Based on short-time Fourier transform
  - Extract frequency content over a time window
  - Plot time against frequency “density”
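The windowing idea above can be sketched in a few lines of Python. This is a minimal illustration, not the front end used in the paper: the function name `spectrogram`, the Hann window, the window and hop sizes, and the naive DFT are all assumptions chosen to keep the example dependency-free.

```python
import cmath
import math

def spectrogram(signal, window_size=64, hop=32):
    """Return a list of frames, each a list of power values per frequency bin."""
    # Hann window reduces leakage at the frame edges (an illustrative choice).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (window_size - 1))
              for n in range(window_size)]
    n_bins = window_size // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(window_size)]
        # Naive DFT, O(window_size^2); an FFT would be used in practice.
        frame = []
        for k in range(n_bins):
            coeff = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / window_size)
                        for n in range(window_size))
            frame.append(abs(coeff) ** 2)
        frames.append(frame)
    return frames

# Example: a 1 kHz tone sampled at 8 kHz (hypothetical test signal).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(512)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` is one time step and each column one frequency bin, so the result can be treated as a grayscale image, which is what lets image features be applied to audio.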
Another Example
- The ship was torn apart on the sharp (reef)
Spectrograms of Music
- Identify music snippets by matching spectrogram of test recording to reference recording
- Comparing by correlation is inaccurate and slow
- Solution: match based on simple features
Viola/Jones Features
- Use rectangle features instead of pixels
- Can compute efficiently with integral image
- Compute sum of pixels within a box; features are combinations of box sums:
  - B, W: black, white regions
  - Two rectangles: W-B
  - Three: W1+W2-B
  - Four: W1+W2-(B1+B2)
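A minimal sketch of how an integral image makes box sums cheap, assuming the spectrogram is stored as a list of rows. The function names and the vertical white-over-black layout of the two-rectangle feature are illustrative assumptions, not details from the talk.

```python
def integral_image(img):
    """ii[r][c] = sum of img over rows < r and columns < c (zero-padded)."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of any box in four table lookups, independent of box size."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def two_rect_feature(ii, top, left, height, width):
    """W - B for two boxes stacked vertically: white on top, black below."""
    half = height // 2
    white = box_sum(ii, top, left, half, width)
    black = box_sum(ii, top + half, left, half, width)
    return white - black

# Hypothetical 4x3 "image" for illustration.
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [1, 1, 1]]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 4, 3)
feature = two_rect_feature(ii, 0, 0, 4, 3)
```

The three- and four-rectangle features follow the same pattern with more `box_sum` calls, so the cost per feature stays constant regardless of rectangle size.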
Viola/Jones Features
- Above feature classes model
  - Power differences across frequencies
  - Power differences across time
  - Up/downward drifts of dominant frequency
  - Power peaks across frequencies
  - Power peaks across time
Viola/Jones Features
- Each feature can vary in
  - Frequency location: 1 to 33
  - Frequency width: 1 to 33
  - Time width: 1 frame (11.6ms) to 82 frames (951ms)
- ~25,000 total possible features
Feature Selection
- Need to pick M features (out of ~25,000)
  - Matching song classifier composed of selected features
- Idea: Use boosting to select features that yield
  - Similar output when recording is a match
  - Differing output when recording is not a match
AdaBoost Review
- Standard AdaBoost scenario: boost classification performance of a “weak” classifier, e.g., perceptron
  - Apply to successively harder problems
  - Tweak parameters at each classification stage
- This work: use Jones/Viola features as weak classifiers
  - Find sequence of best features by boosting
Boosting Framework
- $x_1$, $x_2$: spectrogram images
- Want to find “strong” classifier $H(x_1, x_2)$ =
  - 1 if images derive from same audio source
  - -1 if they derive from different sources
- Find weak classifiers of the form
- Want matching images on same side of threshold
- Difference from AdaBoost: label assigned to pairs of images (pairwise boosting)
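One form consistent with "matching images on same side of threshold" is the sign of a product, $h(x_1, x_2) = \mathrm{sgn}[(f(x_1) - t)(f(x_2) - t)]$, for a feature $f$ and threshold $t$. The sketch below assumes that form; the helper name `weak_classifier` and the toy feature are hypothetical.

```python
def weak_classifier(feature, threshold):
    """Build h(x1, x2) = sign((f(x1) - t) * (f(x2) - t))."""
    def h(x1, x2):
        product = (feature(x1) - threshold) * (feature(x2) - threshold)
        # +1 when both responses fall on the same side of the threshold;
        # a response exactly on the threshold is treated as a non-match.
        return 1 if product > 0 else -1
    return h

# Toy feature for illustration: total energy of a tiny "image".
h = weak_classifier(feature=sum, threshold=2.5)
same_side = h([1, 2], [2, 2])      # responses 3 and 4, both above 2.5
opposite_side = h([1, 1], [2, 2])  # responses 2 and 4, on opposite sides
```

Because the label is computed from a pair, a single feature never has to say which song an image belongs to, only whether two images agree.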
Boosting Initialization
Given: n spectrogram image pairs:
Labels for each pair:
Initialize weights:
Boosting Training Loop
For m = 1, …, M
1. Select min-error classifier
2. If the i-th image pair is classified incorrectly and $y_i = 1$ (matching pair of images), adjust its weight up:
   - If $y_i = -1$, don’t do anything
3. Normalize the weights such that:
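The loop above can be sketched as follows. Several details are assumptions rather than content from the slides: the sign-of-product weak classifier, uniform initial weights, the AdaBoost-style confidence $\alpha = \log((1 - \mathrm{err})/\mathrm{err})$ used for the weight increase, and a normalization that gives each class half of the total weight. The function name `pairwise_boost` and the toy data are hypothetical.

```python
import math

def pairwise_boost(pairs, labels, candidates, rounds):
    """pairs: list of (x1, x2); labels: +1 match / -1 non-match;
    candidates: list of (feature_fn, threshold). Returns [(index, alpha)]."""
    n = len(pairs)
    weights = [1.0 / n] * n  # uniform start (assumed)
    chosen = []

    def predict(cand, pair):
        f, t = cand
        return 1 if (f(pair[0]) - t) * (f(pair[1]) - t) > 0 else -1

    for _ in range(rounds):
        # 1. Select the candidate with minimum weighted error.
        errors = [sum(w for w, p, y in zip(weights, pairs, labels)
                      if predict(c, p) != y) for c in candidates]
        best = min(range(len(candidates)), key=errors.__getitem__)
        err = min(max(errors[best], 1e-9), 1 - 1e-9)  # keep the log finite
        alpha = math.log((1 - err) / err)
        chosen.append((best, alpha))
        # 2. Raise weights only for misclassified matching pairs.
        for i, (p, y) in enumerate(zip(pairs, labels)):
            if y == 1 and predict(candidates[best], p) != y:
                weights[i] *= math.exp(alpha)
        # 3. Normalize so each class carries half of the total weight (assumed).
        for label in (1, -1):
            total = sum(w for w, y in zip(weights, labels) if y == label)
            if total > 0:
                weights = [w / (2 * total) if y == label else w
                           for w, y in zip(weights, labels)]
    return chosen

# Toy data: scalar "images", identity feature, two candidate thresholds.
identity = lambda x: x
pairs = [(1.0, 1.2), (3.0, 3.1), (1.0, 3.0), (3.2, 0.9)]
labels = [1, 1, -1, -1]
selected = pairwise_boost(pairs, labels, [(identity, 2.0), (identity, 5.0)], rounds=1)
```

On the toy data the threshold 2.0 separates every pair correctly, while 5.0 puts all values on one side and so mislabels both non-matching pairs, which is why the first candidate is selected.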
Final (Strong) Classifier
- Linear combination of weak classifiers
  - Weighted by performance of each classifier
- Note, if , classifier t does not contribute to combination
- Strong classifier apparently not used in final system, just for evaluating selected features
Differences From AdaBoost
- AdaBoost reweights all correctly learned points down and incorrect points up
- Our boosting algorithm cannot do that
  - Recall, our classifier has the form
  - Let us draw a pair of non-matching spectrogram images $x_1$, $x_2$ at random
  - Then let But then Thus,
  - Violates weak classifier criterion: correct at least ½ the time
- Solution: reweight only matching examples
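The argument can be written out as a short worked derivation. This is a sketch under two assumptions: the weak classifier has the sign-of-product form $h(x_1, x_2) = \mathrm{sgn}[(f(x_1) - t)(f(x_2) - t)]$, and the two images of a non-matching pair are independent draws from the same distribution. The symbol $p$ is introduced here for illustration.

```latex
\begin{align*}
p &= P\big(f(x_1) > t\big) = P\big(f(x_2) > t\big) \\
P\big(h(x_1, x_2) = 1\big) &= p^2 + (1 - p)^2
   = \tfrac{1}{2} + 2\left(p - \tfrac{1}{2}\right)^2 \;\ge\; \tfrac{1}{2} \\
P\big(h(x_1, x_2) = -1\big) &\le \tfrac{1}{2}
\end{align*}
```

Since the correct label for a non-matching pair is $-1$, no feature and threshold of this form can be right on such pairs more than half the time, which is why only the matching examples are reweighted.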
Occlusion Model
- Boosting classifier identifies distorted versions of the same song
- However, some parts of the recorded song might be mostly noise or interference
- Thus, need to model whether audio chunk is the song or some distraction (occlusion)
Occlusion Model
- Compute M weak features at each time step (11.6ms): “descriptor” (M-bit vector)
- Probability that current descriptor is caused by an occlusion depends on
  - Current descriptor: $x_i$
  - Whether previous descriptor was caused by occlusion: $y_{i-1}$
Occlusion Model Details
- Given
  - n vector descriptors from recorded song’s spectrogram:
  - Descriptors from original song:
  - Differences between recorded and original descriptors:
- Find: $y_i \in \{0, 1\}$, whether the i-th chunk is due to distortion
What’s the problem?
- Have to simultaneously estimate distributions for
  - $x_i^{r-o}$: data, with underlying distribution
  - $y_i$: occlusion labels
- Solution:
  - Model data, labels with Bernoulli distribution
  - Apply Expectation Maximization (EM) algorithm
EM: An Aside
- Given dependent random variables:
  - Observed variable x
  - Latent (unobserved) variable y that generates x
- Assume probability distributions: $P_\theta(x)$, $P_\theta(y)$
  - $\theta$ represents all parameters of distribution
- Repeat until convergence
  - E-step: Compute “expectation” of $\log P_\theta(y, x)$
    - $\theta'$, $\theta$: old, new distribution parameters
  - M-step: Find $\theta$ that maximizes above sum
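In the standard formulation of EM, the “expectation” is taken over the latent variable under the old parameters. The block below states that textbook form; the name $Q$ is a conventional label and is an assumption about the notation the slide used.

```latex
\begin{align*}
\text{E-step:}\quad & Q(\theta', \theta) = \sum_{y} P_{\theta'}(y \mid x)\, \log P_{\theta}(y, x) \\
\text{M-step:}\quad & \theta^{*} = \arg\max_{\theta}\; Q(\theta', \theta),
   \qquad \text{then set } \theta' \leftarrow \theta^{*}
\end{align*}
```

The posterior $P_{\theta'}(y \mid x)$ is held fixed during the M-step, so the maximization only involves the complete-data log-likelihood, which is often much easier to optimize than $\log P_\theta(x)$ itself.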
EM Derivation
- Lemma (Special case of Jensen’s Inequality): Let p(x), q(x) be probability distributions. Then
- Proof: rewrite as:
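The special case of Jensen's inequality usually invoked at this point is the statement that the cross term never exceeds the self term. The block below gives that standard statement and a one-line proof using $\log z \le z - 1$; that this matches the slide's exact wording is an assumption.

```latex
\begin{align*}
\text{Lemma:}\quad & \sum_{x} p(x) \log p(x) \;\ge\; \sum_{x} p(x) \log q(x) \\
\text{Proof:}\quad & \sum_{x} p(x) \log \frac{q(x)}{p(x)}
   \;\le\; \sum_{x} p(x) \left( \frac{q(x)}{p(x)} - 1 \right)
   = \sum_{x} q(x) - \sum_{x} p(x) = 0
\end{align*}
```

Equality holds only when $p(x) = q(x)$ wherever $p(x) > 0$.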
EM Derivation
EM Theorem:
If then
Proof:
By a lot of algebra and lemma on last slide,
So, if this quantity is positive, so is
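A sketch of the textbook form of this theorem and its proof, written with $Q(\theta', \theta) = \sum_y P_{\theta'}(y \mid x) \log P_\theta(y, x)$. The notation $Q$ is an assumption, and the algebra shown is the standard argument rather than a transcription of the slide.

```latex
\begin{align*}
\text{Theorem:}\quad & Q(\theta', \theta) > Q(\theta', \theta')
   \;\Longrightarrow\; P_{\theta}(x) > P_{\theta'}(x) \\[4pt]
\log P_{\theta}(x) - \log P_{\theta'}(x)
   &= \sum_{y} P_{\theta'}(y \mid x) \log \frac{P_{\theta}(y, x)}{P_{\theta}(y \mid x)}
    - \sum_{y} P_{\theta'}(y \mid x) \log \frac{P_{\theta'}(y, x)}{P_{\theta'}(y \mid x)} \\
   &= \big[ Q(\theta', \theta) - Q(\theta', \theta') \big]
    + \sum_{y} P_{\theta'}(y \mid x) \log \frac{P_{\theta'}(y \mid x)}{P_{\theta}(y \mid x)} \\
   &\ge Q(\theta', \theta) - Q(\theta', \theta')
\end{align*}
```

The last sum is non-negative by the inequality $\sum p \log p \ge \sum p \log q$ applied to the two posteriors, so any increase in $Q$ forces an increase in the likelihood of the observed data.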
EM Summary
- Repeat until convergence
  - E-step: Compute “expectation” of $\log P_\theta(x, y)$
    - $\theta'$, $\theta$: old, new distribution parameters
  - M-step: Find $\theta$ that maximizes (1)
- EM Theorem: If then
- Interpretation
  - As long as we can improve the “expectation” in (1), EM improves our model of observed variable x
EM Discussion
- Problems with EM?
  - Local maxima
  - Need to bootstrap training process (pick a $\theta$)
- When is EM most useful?
  - When model distributions easy to maximize
  - e.g., Gaussian mixture models
- EM is a meta-algorithm, needs to be adapted to particular application
Applying EM to Our Problem
- EM “score”:
  - $x_i^{r-o}$: data, $y_i$: labels
- Model $P(x_i^{r-o} \mid y_i)$ with 2M Bernoulli variables
  - Each $x_i$ consists of M=32 weak classifier outputs
- Model $P(y_i \mid y_{i-1})$ with 2 Bernoulli variables
- 2M+2=66 total parameters to estimate
- Repeat until convergence
  - E-step: Compute “expectation” of $\log P_\theta(x, y)$
    - $\theta'$, $\theta$: old, new distribution parameters
  - M-step: Find $\theta$ that maximizes (1)
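To make the parameter count concrete, the sketch below scores a sequence of difference vectors under a two-state chain (song vs. occlusion) with independent Bernoulli bits, using the forward algorithm. It only evaluates a likelihood for given parameters and does not perform the EM updates. The function name, the uniform initial state, and the parameter values are illustrative assumptions.

```python
import math

def log_likelihood(diffs, emit, stay):
    """Forward algorithm for a 2-state chain over M-bit difference vectors.

    diffs:      list of M-bit tuples (1 = recorded and original bits differ)
    emit[s][j]: P(bit j differs | state s), s = 0 (song) or 1 (occlusion)
    stay[s]:    P(next state = s | current state = s)
    Together these are the 2M + 2 parameters.
    """
    def emission(s, x):
        p = 1.0
        for bit, q in zip(x, emit[s]):
            p *= q if bit else 1.0 - q
        return p

    trans = [[stay[0], 1 - stay[0]], [1 - stay[1], stay[1]]]
    alpha = [0.5 * emission(s, diffs[0]) for s in (0, 1)]  # uniform start (assumed)
    log_like = 0.0
    for x in diffs[1:]:
        # Rescale at every step so long sequences do not underflow.
        scale = sum(alpha)
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [sum(alpha[r] * trans[r][s] for r in (0, 1)) * emission(s, x)
                 for s in (0, 1)]
    return log_like + math.log(sum(alpha))

# Hypothetical parameters for M = 3: few bit flips for the song, coin flips under occlusion.
emit = [[0.1, 0.1, 0.1], [0.5, 0.5, 0.5]]
stay = [0.9, 0.8]
clean_score = log_likelihood([(0, 0, 0)] * 5, emit, stay)
noisy_score = log_likelihood([(1, 1, 0)] * 5, emit, stay)
```

A well-aligned recording produces mostly zero difference bits and therefore a higher score than a poorly matching one, which is the property a score threshold relies on.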
EM For Song Matching
- Given recording $x^r$, find most likely original song $x^o$ that produced the recording
  - Reject any match where EM score less than threshold T: need
  - Unclear how T is determined
- So, now we can calculate the likelihood of recording snippet matching a given original song
  - But, matching against entire song database too slow
  - Solution: search database for near-neighbors of recording
Retrieval
- Calculate M-bit descriptor of each song in database at each time step (11.6ms)
  - Store in hash table (descriptor → song)
- To look up song, perturb descriptor vector
  - Try all flips of 1 bit, 2 bits, etc. = Hamming distance 1, 2, etc.
  - Look up perturbed vectors in hash table
- Get back near-neighbor candidates
- Now, need to align: use RANSAC
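The lookup scheme above can be sketched with a dictionary keyed by descriptor. This is an illustration under assumptions: descriptors are stored as Python integers, the table also records the frame index so that candidates can later be aligned, and the function names and toy songs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_index(songs):
    """songs: {song_id: [descriptor_int, ...]} -> {descriptor: [(song_id, frame)]}."""
    index = defaultdict(list)
    for song_id, descriptors in songs.items():
        for frame, d in enumerate(descriptors):
            index[d].append((song_id, frame))
    return index

def neighbors(descriptor, n_bits, max_dist):
    """Yield every n_bits-wide value within Hamming distance max_dist."""
    for dist in range(max_dist + 1):
        for positions in combinations(range(n_bits), dist):
            flipped = descriptor
            for p in positions:
                flipped ^= 1 << p  # flip one bit
            yield flipped

def lookup(index, descriptor, n_bits=32, max_dist=2):
    """Collect (song_id, frame) candidates for all perturbed descriptors."""
    hits = []
    for candidate in neighbors(descriptor, n_bits, max_dist):
        hits.extend(index.get(candidate, []))
    return hits

# Toy database with 4-bit descriptors.
index = build_index({"a": [0b1010, 0b1111], "b": [0b0000]})
candidates = lookup(index, 0b1011, n_bits=4, max_dist=1)
probes_32_bit = sum(1 for _ in neighbors(0, 32, 2))
```

For a 32-bit descriptor, distance 2 means 1 + 32 + 496 = 529 hash probes per query descriptor. The count grows combinatorially with distance, which is one reason small Hamming thresholds matter for fast retrieval.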
Another Aside: RANSAC
- Random Sample Consensus (Fischler & Bolles, 1981)
- Assumption: data to be modeled consists of
  - Mostly data points matching the model (“inliers”)
  - A few outliers
- Idea:
  - Keep picking random samples of data points
  - Eventually we will pick a set with few outliers
  - Improvement: pick data points intelligently
- What is “a few” outliers?
  - When selected data points are explained by model within a certain error tolerance
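A generic RANSAC sketch on the classic line-fitting problem, to show the sample, fit, and count-inliers cycle. It is not the alignment procedure used for songs; the function name, tolerance, iteration count, and data are illustrative assumptions.

```python
import random

def ransac_line(points, iterations=200, tolerance=0.5, seed=0):
    """Fit y = a*x + b by RANSAC; returns (a, b, inliers)."""
    rng = random.Random(seed)
    best = (0.0, 0.0, [])
    for _ in range(iterations):
        # Minimal sample: two points define a candidate line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # a vertical pair cannot define y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Consensus set: points explained within the error tolerance.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tolerance]
        if len(inliers) > len(best[2]):
            best = (a, b, inliers)
    return best

# Ten points on y = 2x + 1 plus two gross outliers.
points = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -20)]
a, b, inliers = ransac_line(points)
```

A least-squares fit over all twelve points would be pulled toward the outliers, whereas the consensus criterion ignores them once any outlier-free sample has been drawn.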
Applying RANSAC
- Given: Sequence of M-bit descriptors over test recording
- Iterate over all time alignments of test recording to candidate originals
  - Select alignments at random
  - Compute EM score over all descriptors for each alignment
- Pick candidate original with best EM score
  - Subject to
- <500 iterations usually “sufficient”
Experiments
- Data set: 1,861 songs from variety of genres
- First, learn “bootstrap” features and EM parameters on synthetically distorted data
  - This yields basic model good enough to align training data to original
- Training data: 78 songs played, recorded using low-quality equipment
- Test data
  - A: 71 songs played at low volume, recorded with distorted microphone
  - B: 220 songs recorded in “very noisy” setup
Testing Descriptor Performance
- Test data: ~100,000 (+) / 1,000,000 (-) examples
  - 15-second snippets of 71 songs from test set A
- Baseline: original and improved algorithm from other authors (Haitsma & Kalker, 2002)
- Vary Hamming distance threshold to generate ROC curve
Varying Hamming Threshold
- For fast retrieval, need HamDist ≤ 2
- For 10-sec query, need recall of a few %
  - Recall = TP/(TP+FN)
  - Precision = TP/(TP+FP)
- Recall rates table vs. HamDist Threshold
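A small sketch of the two metrics, plus a back-of-the-envelope reason why a per-descriptor recall of only a few percent can suffice. The counts are hypothetical, and the last function assumes descriptor hits are independent, which is a simplification rather than a claim from the talk.

```python
def recall(tp, fn):
    """Fraction of true matches that were retrieved."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of retrieved items that are true matches."""
    return tp / (tp + fp)

def prob_at_least_one_hit(per_descriptor_recall, n_descriptors):
    """P(at least one descriptor retrieves the right song), assuming independence."""
    return 1.0 - (1.0 - per_descriptor_recall) ** n_descriptors

# Hypothetical counts for illustration.
r = recall(tp=30, fn=970)
p = precision(tp=30, fp=10)
# A 10-second query has about 860 descriptors (one every 11.6 ms).
hit_probability = prob_at_least_one_hit(0.02, 860)
```

With hundreds of descriptors per query, even a 2% chance per descriptor makes it very likely that at least one of them lands on the correct song, after which alignment and scoring can confirm the match.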
Song Retrieval
- 10/15 seconds (860, 1290 descriptors)
- Both Test Sets A and B
- HamDist={0,1,2}
Interface Screenshot
Thank You! Any questions?
References:
- Lecture notes, MIT class 6.345: Automatic Speech Recognition
- F. Jelinek, Statistical Methods for Speech Recognition, 1997
- M. A. Fischler, R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Comm. of the ACM, Vol 24, pp 381-395, June 1981.
- Wikipedia