
FXPAL Technical Report FXPAL-TR-99-038

Methods for the Automatic Analysis of Music and Audio

Jonathan Foote [email protected]

Abstract

This report describes methods for automatically locating points of significant change in music or audio by analyzing local self-similarity. This method can find individual note boundaries or even natural segment boundaries such as verse/chorus or speech/music transitions, even in the absence of cues such as silence. The approach uses the signal to model itself, and thus does not rely on particular acoustic cues nor require training. We present a wide variety of applications, including indexing, segmenting, and beat tracking of music and audio. The method works well on a wide variety of audio sources.


Introduction

In video, the frame-to-frame difference is very useful as a measure of novelty. The frame difference can be used for automatic segmentation and key frame extraction, among other things. A similar measure for audio would have many useful applications; however, there have been no good approaches save for various attempts at silence detection. Computing audio novelty is significantly more difficult than computing video novelty. Straightforward approaches like measuring spectral differences are not useful because they typically give too many false alarms: typical spectra for speech and music are in constant flux.

We describe a method of finding points of maximum audio change by considering the self-similarity of the audio. For each instant in the audio, the self-similarity for past and future regions is computed, as well as the cross-similarity between the past and future. A significantly novel point will have high self-similarity in the past and future and low cross-similarity. The extent of the "past" and "future" can be varied to change the scale of the system so that, for example, individual notes can be found using a short time extent, while longer events, such as musical themes, can be found by considering windows further into the past or future. The result is a measure of how novel the source audio is at any time. Instances when this measure is large correspond to significant audio changes, and are thus good points to use for segmenting or indexing the audio. Periodic peaks correspond to periodicity in the music, such as rhythm, and so this method can be used for beat tracking, that is, finding the tempo and location of downbeats in music. Other applications of this method include:

• Automatic segmentation for audio classification and retrieval.

• Audio indexing/browsing: jump to segment start points

• Audio summarization: play only start of significantly new segments

• Audio "gisting": play only the segment that best characterizes the entire work

• Align music audio waveforms with MIDI notes for segmentation

• Indexing/browsing audio: link/jump to next novel segment

• Automatically find endpoints for audio "smart cut-and-paste"

• Aligning audio for non-linear time scale modification (“audio morphing”).

• Tempo extraction, beat tracking, and alignment

• “Auto DJ” for concatenating music with similar tempos.

• Finding time indexes in speech audio for automatic animation of mouth movements

• Analysis for structured audio coding such as MPEG-4


Technical Details

The methods of this paper produce a time series that is proportional to the acoustic novelty of a source audio at any instant. High values and peaks correspond to large audio changes. The novelty score can be thresholded to find these instances, which can be used as segment boundaries. Figure 1 shows the steps in computing the novelty score.

The first step is to parameterize the audio. This is typically done by windowing the audio waveform. Variable window widths and overlaps can be used; in the present system, windows ("frames") are 256 samples wide and are overlapped by 128 points. For audio sampled at 16 kHz, this results in a 16 ms frame width and a frame rate of 125 frames per second. Each frame is then parameterized using a standard analysis such as a Fourier transform or Mel-Frequency Cepstral Coefficients (MFCC) analysis. Other parameterizations include those based on linear prediction, psychoacoustic considerations [17], or even a combination (Perceptual Linear Prediction). In the examples presented here, a standard frequency analysis is used. Each analysis frame is windowed with a 256-point Hamming window, and a fast Fourier transform (FFT) estimates the spectral components in the window. The logarithm of the magnitude of the result is used as an estimate of the power spectrum of the signal in the window. High frequency components are discarded (typically, those above Fs/4) as they are not as useful for the similarity calculation as lower frequencies. The resulting vector characterizes the spectral content of a window. This is a standard audio processing technique. Many compression techniques, such as MPEG Layer 3 audio, use a similar spectral representation, which could be used for a distance measure, avoiding the need to decode and reparameterize the audio.

Figure 1: Segmentation system flow chart
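As a concrete illustration of this parameterization step, a minimal numpy sketch follows. The function name and defaults are ours, not from the report, and a real system might add refinements such as pre-emphasis:

```python
import numpy as np

def parameterize(audio, frame_len=256, hop=128):
    """Window the waveform and return one log-magnitude spectral
    vector per frame (256-sample Hamming windows, 128-sample hop,
    as described in the text for 16 kHz audio)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = []
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len] * window
        spectrum = np.fft.rfft(frame)
        # Log of the magnitude estimates the power spectrum;
        # a small floor avoids log(0) on silent frames.
        frames.append(np.log(np.abs(spectrum) + 1e-10))
    V = np.array(frames)
    # Discard high-frequency bins (roughly those above Fs/4,
    # i.e. the top half of the rfft bins), which are less
    # useful for the similarity calculation.
    return V[:, : V.shape[1] // 2]
```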


Regardless of the parameterization used, the result is a compact vector of parameters for every frame. Note that the actual parameterization is not crucial as long as "similar" sounds yield similar parameters. Different parameterizations may be very useful for different applications; for example, experiments have shown that the MFCC representation, which preserves the coarse spectral shape while discarding fine harmonic structure due to pitch, may be particularly appropriate for certain applications. A single pitch in the MFCC domain is represented by roughly the envelope of the harmonics, not the harmonics themselves. Thus MFCCs will tend to match similar timbres rather than exact pitches, though single-pitched sounds will match if they are present. The system is therefore very flexible and can subsume most any existing audio analysis method. This method works for any general measure of audio similarity. It might be possible to tune the similarity measure to the task, for example by adjusting parameters such as window size, to maximize the contrast of the resulting similarity matrix. Psychoacoustically motivated parameterizations, like those described by Slaney in [17], may be especially appropriate if they better reproduce the similarity judgments of human listeners.

Distance Matrix Embedding

Once the audio has been parameterized, it is then embedded in a 2-dimensional representation [6]. Figure 2 shows the embedding step schematically. A measure D of the (dis)similarity between feature vectors v_i and v_j is calculated between every pair of audio frames i and j. A simple distance measure is the Euclidean distance in the parameter space, that is, the square root of the sum of the squares of the differences of each vector parameter:

D_E(i, j) \equiv \lVert v_i - v_j \rVert

Figure 2: Distance matrix calculation. The similarity matrix S is computed from the waveform: element (i, j) of S is the distance D(i, j) between the feature vectors of frames i and j.


Another useful metric of vector similarity is the scalar (dot) product of the vectors. This will be large if the vectors are both large and similarly oriented:

D_d(i, j) \equiv v_i \cdot v_j

To remove the dependence on magnitude (and hence energy, given our features), the product can be normalized to give the cosine of the angle between the parameter vectors:

D_C(i, j) \equiv \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}

This measure has the property that it yields a large similarity score even if the vectors are small in magnitude, which is appropriate for most applications. Because of Parseval's relation, the norm of each spectral vector is proportional to the average signal energy in that window.

Using the cosine measure means that windows with low energy, such as those containing silence, will be spectrally similar, which is generally desirable. Because windows, hence feature vectors, occur at a rate much faster than typical musical events, a better similarity measure can be obtained by computing the vector correlation over a window w:

D(i, j; w) \equiv \frac{1}{w} \sum_{k=0}^{w-1} D(i + k, j + k)

This also captures the time dependence of the vectors. To yield a high similarity score, vectors in a window must not only be similar, but their sequence must be similar as well.

Considering a one-dimensional example, the scalar sequence (1, 2, 3, 4, 5) has a much higher cosine similarity score with itself than with the sequence (5, 4, 3, 2, 1). Note that the dot-product and cosine measures grow with increasing vector similarity while the Euclidean distance approaches zero; the final measure can be trivially inverted for the proper sense of similarity. Any other reasonable distance measure can be used, including statistical measures and weighted versions of the above metrics.

The distance measure is a function of two frames, hence instants in the source signal. It is convenient to consider the similarity between all possible instants in a signal. This is done by embedding the distance measure in a two-dimensional representation. The matrix S contains the similarity metric calculated for all frame combinations, hence time indexes i and j, such that the (i, j)th element of S is D(i, j). In general, S will have maximum values on the diagonal (because every window will be maximally similar to itself); furthermore, if D is symmetric then S will be symmetric as well. To simplify computation, the similarity can be represented in the "slanted" domain L(i, l), where l = i - j is the lag. This is particularly useful for the applications presented later, where the similarity is only required for relatively small lags and not all combinations of i and j. Computing L only for small, non-negative values of l can result in substantial reductions in computation and storage. S can be visualized as a square image such that each pixel (i, j) is given a gray scale value proportional to the similarity measure D(i, j), and scaled such that the maximum value is given the maximum brightness. These visualizations let us clearly see the structure of an audio file. Regions of high audio similarity, such as silence or long sustained notes, appear as bright squares on the diagonal. Repeated figures, such as themes, phrases, or choruses, will be visible as bright off-diagonal rectangles.
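A minimal sketch of the cosine-measure embedding and the slant representation, assuming the per-frame parameter vectors V from the previous step (function names are ours for illustration):

```python
import numpy as np

def cosine_similarity_matrix(V):
    """Given one parameter vector per row of V, return the matrix S
    whose (i, j) element is the cosine measure D_C(i, j)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    U = V / np.maximum(norms, 1e-10)  # unit-normalize each frame
    return U @ U.T                    # S[i, j] = cos(angle between v_i, v_j)

def slant(S, max_lag):
    """Slant representation L(i, l) = S(i, i + l) for small,
    non-negative lags l, which saves computation and storage
    relative to the full matrix."""
    n = S.shape[0]
    L = np.zeros((n, max_lag))
    for l in range(max_lag):
        # np.diag extracts the l-th superdiagonal of S.
        L[: n - l, l] = np.diag(S, k=l)
    return L
```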


If the music has a high degree of repetition, this will be visible as diagonal stripes or checkerboards, offset from the main diagonal by the repetition time.

Figure 3 shows the first seconds of Bach's Prelude No. 1 in C Major, from The Well-Tempered Clavier, BWV 846. (In this and following visualizations, image rather than matrix coordinate conventions are used; thus the origin is at the lower left, and time increases both with height and to the right.) The left image visualizes a 1963 piano performance by Glenn Gould. The visualization makes both the structure of the piece and details of the performance visible. The musical structure is clear from the repetitive motifs; multiples of the repetition time can be seen in the off-diagonal stripes parallel to the main diagonal. Figure 7 shows the first few bars of the score: the repetitive nature of the piece should be clear even to those unfamiliar with musical notation. The 34 notes can be seen as squares along the diagonal. The repetition time is visible in the off-diagonal stripes parallel to the main diagonal, as well as in the repeated C note at 0, 2, 4, and 6 seconds. The right image visualizes a similar excerpt, this time from a MIDI realization using a passable piano sample and a strict tempo. Beginning silence is also visible as a square at the lower left. Here, unlike the human performance, all notes have exactly the same duration and articulation.

Kernel Correlation

The structure of S is key to the novelty measure. Consider a simple "song" having two successive notes of different pitch, for example a cuckoo call. When visualized, S for this example will look like a 2 x 2 checkerboard. White squares on the diagonal correspond to the notes, which have high self-similarity; black squares on the off-diagonals correspond to regions of low cross-similarity. Assuming we are using the cosine metric D_C, similar regions will be close to 1 while dissimilar regions will be closer to -1. Finding the instant when the notes change is as simple as finding the center of the checkerboard. This can be done by correlating S with a kernel that itself looks like a checkerboard.

Figure 3: Similarity visualizations of Bach's Prelude No. 1. Left: a Glenn Gould performance. Right: an acoustic realization from MIDI data.


We will call this class "checkerboard" kernels. Perhaps the simplest is the 2 x 2 unit kernel:

C = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}

Other variations can be used; for example, the unit checkerboard kernel can be decomposed into "coherence" and "anticoherence" kernels as follows:

\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}

The first term measures the self-similarity on either side of the center point; this will be high when both regions are self-similar. The second term measures the cross-similarity between the two regions; this will be high when the regions are substantially similar, thus with little difference across the center point. The difference of the two values estimates the novelty of the signal at the center point; this will have a high value when the two regions are self-similar but different from each other. Larger kernels are easily constructed by forming the Kronecker product of a unit kernel with a matrix of ones, for example:

\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \otimes \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & 1 & -1 & -1 \\ -1 & -1 & 1 & 1 \\ -1 & -1 & 1 & 1 \end{bmatrix}

Figure 4: 64 x 64 checkerboard kernel with Gaussian taper
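This kernel construction translates directly into code. A sketch follows; the function name and argument conventions are ours, and `checkerboard_kernel(64, taper_sigma=32)` reproduces the kernel of Figure 4:

```python
import numpy as np

def checkerboard_kernel(W, taper_sigma=None):
    """W x W checkerboard kernel: the Kronecker product of the 2x2
    unit kernel with a (W/2 x W/2) matrix of ones, optionally
    multiplied by a radially symmetric Gaussian taper."""
    unit = np.array([[1.0, -1.0], [-1.0, 1.0]])
    C = np.kron(unit, np.ones((W // 2, W // 2)))
    if taper_sigma is not None:
        # Radially symmetric Gaussian centered on the kernel,
        # tapering towards zero at the edges.
        x = np.arange(W) - (W - 1) / 2.0
        xx, yy = np.meshgrid(x, x)
        C *= np.exp(-(xx**2 + yy**2) / (2 * taper_sigma**2))
    return C
```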


Kernels can be smoothed to avoid edge effects using windows (such as a Hamming) that taper towards zero at the edges. For the experiments presented here, a radially-symmetric Gaussian function is used. Figure 4 shows a 3-dimensional plot of a 64 x 64 checkerboard kernel with a radial Gaussian taper having sigma = 32.

Correlating a checkerboard kernel with the similarity matrix S results in a measure of novelty. To see how this works, imagine sliding C along the diagonal of our example and summing the element-by-element product of C and S. When C is over a relatively uniform region, such as a sustained note, the positive and negative regions will tend to sum to zero. Conversely, when C is positioned exactly at the crux of the checkerboard, the negative regions of the kernel will multiply the negative regions of low cross-similarity, and the overall sum will be large. Thus calculating this correlation along the diagonal of S gives a time-aligned measure of audio novelty N(i), where i is the frame number, hence time index, corresponding to the original source audio. By convention, the kernel C has a width (lag) of W and is centered on (0, 0):

N(i) = \sum_{m=-W/2}^{W/2} \; \sum_{n=-W/2}^{W/2} C(m, n)\, S(i + m, i + n)

For computation, S can be zero-padded to avoid undefined values, or, as in the present examples, computed only for the interior of the signal where the kernel overlaps S completely. Note that only regions of S with a lag of W or smaller are used; thus the slant representation is particularly helpful. Also, typically both S and C are symmetric, thus only one-half of the values under the double summation (those for m \ge n) need be computed.
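A sketch of the correlation itself, computed only over the interior of the signal as described (for clarity it works on the full matrix and ignores the symmetry shortcut; it assumes an even kernel width, and the names are ours):

```python
import numpy as np

def novelty_score(S, C):
    """Correlate kernel C along the main diagonal of S, computing
    N(i) only where the kernel fully overlaps S (no zero-padding)."""
    W = C.shape[0]
    h = W // 2
    n = S.shape[0]
    N = np.zeros(n)
    for i in range(h, n - h):
        # Element-by-element product of C with the W x W block of S
        # centered at (i, i), summed to give the novelty at frame i.
        N[i] = np.sum(C * S[i - h : i + h, i - h : i + h])
    return N
```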

The width of the kernel W directly affects the properties of the novelty measure. A small kernel detects novelty on a short time scale, such as beats or notes. Increasing the kernel size decreases the time resolution and increases the length of novel events that can be detected.

Figure 5: Novelty score for the Gould performance and different kernel widths (0.5 s and 2 s kernels)


Larger kernels average over short-time novelty and detect longer structure, such as musical transitions like that between verse and chorus, key modulations, or symphonic movements. Figure 5 shows the novelty score computed on the similarity matrix of the Gould performance from Figure 3. Results using two kernel widths are shown (the 2 s kernel plot was offset slightly upward for clarity). The shorter kernel clearly picks out the individual note events, though some notes are slurred and are not as distinct. In particular, in Gould's idiosyncratic performance, the last three notes in each phrase are emphasized with a staccato attack. Note how this system clearly identifies the onset of each note without analyzing explicit features such as pitch, energy, or silence. The longer kernel yields peaks at the boundaries of 8-note phrases (at 2, 4, and 6 seconds). Each peak occurs exactly at the downbeat of the first note in each phrase. The longer kernel shows secondary extrema at the note just following the start of each phrase, because it is pitched significantly lower than the other notes in the phrase. Note that this method has no a priori knowledge of musical phrases or pitch, but is finding perceptually and musically significant points.

Extracting Segment Boundaries

As mentioned above, extrema in the novelty score correspond to large changes in the audio characteristics. These points often serve as good boundaries for segmenting the audio, as the audio will be similar within boundaries and significantly different across them. These points also serve as useful indexes into the audio, as they indicate points of significant change. Finding the segment boundaries is a simple matter of finding the peaks in the novelty score. A simple approach is to find points where the score exceeds a local or global threshold. For more time precision, maximum or zero-slope points above the threshold locate peaks exactly. A useful way of organizing the index points is in a binary tree structure constructed by ranking all the index points by novelty score. The highest-scoring index point becomes the root of the tree and divides the signal into left and right sections. The highest-scoring index points in the left and right sections become the left and right children of the root node, and so forth recursively until no index points are left above the threshold. This data structure facilitates navigation of the index points, making it easy to find the nearest index from any arbitrary point by walking the tree. In addition, the tree can be truncated at any threshold level to yield a desired number of index points, and hence segments. An enhancement to this approach is to reduce the size of the kernel as the tree is descended, so lower-level index points reveal increasingly fine time granularity. Figure 6 shows the segment boundaries extracted using the 1/2-second kernel on the first 10 seconds of the Gould performance; boundaries are indicated in the figure. Individual notes are clearly isolated, save for the fourth note, which has been slurred with the third. This works well for musical notes, but can't be expected to segment speech into words unless they are spectrally different. This is because words are often not well-delineated acoustically; for example, the phrase "that's stupid" would be segmented into "that's-s" and "tupid" because there is little acoustic difference in the "s" sounds of the two words. Note that this would likely be the segmentation that a non-English speaker would choose.
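A minimal sketch of this peak picking and index-tree construction; a nested dictionary stands in for the binary tree, and the simple zero-slope test and names are ours:

```python
import numpy as np

def find_peaks(N, threshold):
    """Indexes where the novelty score has a local maximum above
    the threshold (zero-slope points above threshold)."""
    return [i for i in range(1, len(N) - 1)
            if N[i] > threshold and N[i] >= N[i - 1] and N[i] > N[i + 1]]

def build_index_tree(peaks, N, lo, hi):
    """Recursively place the highest-scoring index point in (lo, hi)
    at the root; its children index the left and right sections."""
    candidates = [p for p in peaks if lo < p < hi]
    if not candidates:
        return None
    root = max(candidates, key=lambda p: N[p])
    return {"index": root,
            "left": build_index_tree(peaks, N, lo, root),
            "right": build_index_tree(peaks, N, root, hi)}

# Usage: tree = build_index_tree(find_peaks(N, t), N, 0, len(N))
# Truncating the tree at any depth yields a desired number of segments.
```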


Figure 6: Gould performance showing note boundaries

Figure 7: First bars of Bach's Prelude No. 1 in C Major, BWV 846, from The Well-Tempered Clavier


Speech/Music and Speaker Segmentation

Besides music, these methods work for segmenting audio into speech and music components, and, to a lesser extent, for segmenting audio by speaker. Figure 8 shows the audio novelty for the first minute of Animals Have Young (video V14 from the MPEG-7 content set [24]). This segment contains 4 seconds of introductory silence, followed by a short musical segment with the production logo. At 17 seconds the titles start, and very different theme music commences. At 35 seconds, this fades into a short silence, followed by female speech over attenuated background music for the remainder of the segment. Figure 8 shows the novelty score computed with an 8-second window, which is long enough to average out the short-time spectral differences in speech. The largest peak occurs directly on the speech/music transition at 35 seconds. The two other major peaks occur at the transitions between silence and music at 4 seconds and between the introduction and theme music at 17 seconds. Using a naive spectral distance measure, the novelty score can't, in general, discriminate between speakers unless they have markedly different vocal spectra (such as between genders). Often, however, there are enough differences for the similarity measure to find. In particular, this method can often find differences in speaking style, rather than, as in previous approaches, differences between known vocal characteristics of particular speakers or speaker models. Though not demonstrated here, it is a straightforward matter to use a distance metric tuned to discriminate speakers by vocal characteristics, for example based on cepstral coefficients or using the methods of [19].

Figure 8: Novelty measure for video soundtrack excerpt, with regions labeled "music 1," "music 2," and "speech"


Automatic beat tracking and the "beat spectrum"

Another application of audio similarity is to derive both the periodicity and relative strength of beats in the music. We will call a measure of self-similarity as a function of the lag the beat spectrum B(l). Peaks in the beat spectrum correspond to periodicities in the audio. A simple estimate of the beat spectrum can be found by summing S along the diagonal as follows:

B(l) \approx \sum_{k \in R} S(k, k + l)

Here, B(0) is simply the sum along the main diagonal over some continuous range R, B(1) is the sum along the first superdiagonal, and so forth. Once again, the slant representation is especially convenient, as diagonal sums are simply the sums across columns, or the projection onto the lag axis. Figure 9 shows an example computed over a range of 3 seconds in the Gould performance. The periodicity of each note can be clearly seen, as well as the strong 8-note periodicity of the phrase (with a sub-harmonic at 16 notes). Especially interesting are the peaks at 3 and 5 notes. These come from the three-note periodicity within the eight-note phrase: in each phrase, notes 3 and 6, notes 4 and 7, and notes 5 and 8 are the same. A more robust definition of the beat spectrum is the autocorrelation of S:

B(k, l) = \sum_{i, j} S(i, j)\, S(i + k, j + l)

Because B(k, l) will be symmetrical, it is only necessary to sum over one variable, giving a one-dimensional result B(l). This approach works surprisingly well across a range of musical genres, tempos, and rhythmic structures.

Figure 9: "Beat spectrum" of Gould performance from diagonal sum
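A sketch of the diagonal-sum estimate of B(l), again assuming the similarity matrix S (the function name is ours):

```python
import numpy as np

def beat_spectrum(S, max_lag):
    """B(l) as the sum of S along each superdiagonal over the
    analysis range; equivalently, the column sums of the slant
    representation L(i, l). Note that diagonals shorten as the
    lag grows, so one may wish to normalize by their length."""
    return np.array([np.diag(S, k=l).sum() for l in range(max_lag)])
```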


Figure 10 shows the beat spectrum computed from the first 10 seconds of the jazz composition Take 5 by the Dave Brubeck Quartet. Besides being in the unusual 5/4 time signature, this rhythmically sophisticated piece requires some interpretation. The first thing to note is that there is no obvious periodicity at the actual beat tempo (marked by solid vertical lines in the figure). Rather, there is a marked periodicity at 5 beats, and a corresponding sub-harmonic at 10. Jazz aficionados know that "swing" is the subdivision of beats into non-equal periods rather than "straight" (equal) eighth-notes. The beat spectrum clearly shows that each beat is subdivided into near-perfect triplets. This is indicated with dotted lines spaced 1/3 of a beat apart between the second and third beats. A clearer illustration of "swing" would be difficult to provide by any other means.

Using the beat spectrum in combination with the narrow-kernel novelty score gives excellent estimates of musical tempo. Peaks in the beat spectrum give the fundamental rhythmic periodicity, while peaks in the novelty score give the precise downbeat time, or phase. Correlating the novelty score with a comb-like function with a period from the beat spectrum yields a signal that has strong peaks at every beat (a sketch appears below). Strong off-beats and syncopations can then be deduced from secondary peaks in the beat spectrum. This is a novel and robust approach to beat tracking that has fundamental advantages. Other approaches look for absolute acoustic attributes, for example energy peaks correlated across particular sub-bands. The method presented here is more robust in that the only signal attribute necessary is a repetitive change.

The beat spectrum has some interesting parallels with the frequency spectrum. Firstly, there is an inverse relation between the time accuracy and the beat spectral precision. This is because more periods of a repetitive signal are needed to estimate its frequency more precisely. The longer the summation range, the better the beat accuracy, but of course the worse the temporal accuracy.
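A sketch of the comb correlation for downbeat phase, assuming the beat period (in frames) has already been picked from a peak in the beat spectrum; the impulse-comb form is one plausible reading of the "comb-like function" above:

```python
import numpy as np

def beat_phase(novelty, period):
    """Correlate the novelty score with an impulse comb of the given
    period; the offset maximizing the correlation is the downbeat
    phase, placing a beat at every comb position."""
    scores = [novelty[phase::period].sum() for phase in range(period)]
    return int(np.argmax(scores))
```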

Figure 10: Beat spectrum of the jazz composition Take 5, with beats 1-5 and the "swing" subdivision marked


(Technically, the beat spectrum is a frequency operator, and hence does not commute with a time operator.) In particular, if the tempo changes over the measurement range, it will "smear" the beat spectrum. Analogously, changing a signal's frequency over the analysis window will result in a smeared frequency spectrum. Thus beat spectral analysis, just like frequency analysis, is a trade-off between spectral and temporal resolution.

Applications

The ability to reliably segment and beat-track audio has a large number of useful applications; several are presented here. Note that this approach can be used for any time-dependent media, such as video, where some measure of point-to-point similarity can be determined.

Audio segmentation and indexing

As demonstrated above, these methods give good estimates of audio segment boundaries. This is useful for any application where one might wish to play back only a portion of an audio file. For example, in an audio editing tool, selection operations can be constrained to segment boundaries so that the selection region does not contain fractional notes or phrases. This would be similar to "smart cut and paste" in a text editor that constrains selection regions to entire units such as words or sentences. The segment size can be adjusted to the degree of zoom so that the appropriate time resolution is available. When zoomed in, higher resolution (perhaps from the lower levels of the index tree) would allow note-by-note selection, while a zoomed-out view would allow selection by phrase or section. Similarly, segmenting audio greatly facilitates audio browsing: a "jump-to-next-segment" function allows audio to be browsed faster than real time. Because segments will be reasonably self-similar, listening to a small part of a segment will give a good idea of the entire segment.

Audio summarization and gisting

This approach can be extended to automatic audio summarization, for example by playing only the start of each segment, as in the "scan" feature on a CD player. In fact, segments can be clustered so that only significantly novel segments are included in the summary. Segments too similar to a segment already in the summary could be skipped without losing too much information. For example, when summarizing a popular song, repeated instances of the chorus could be excluded from the summary as they would be redundant. A further refinement of this could result in audio "gisting": finding the short segment that best characterizes the entire work. Most on-line music retailers offer a small clip of their wares for customers to audition. These are typically just a short interval taken near the beginning of each work and may not be representative of the entire piece. Simple processing of the similarity matrix can find the segment most similar to the entire work. Averaging the similarity matrix over the interval corresponding to each segment gives a measure of how well the segment represents the entire work. The highest-scoring segment would thus be the best candidate for a sample clip.
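A sketch of this gisting computation, assuming segment boundaries (in frames) obtained from the novelty score; scoring each segment by the mean of its rows of S is one plausible reading of the averaging described above:

```python
import numpy as np

def best_gist_segment(S, boundaries):
    """Score each segment by the average of S over the rows in its
    interval (its mean similarity to the entire work) and return
    the boundaries of the highest-scoring segment."""
    segments = list(zip(boundaries[:-1], boundaries[1:]))
    scores = [S[a:b, :].mean() for a, b in segments]
    return segments[int(np.argmax(scores))]
```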

Classification and Retrieval

Classification and retrieval will work better on shorter audio segments with uniform features than on long, heterogeneous data.


In streaming audio like broadcast radio or a TV soundtrack, it is rarely clear when one segment begins and ends. When the audio is segmented, each segment is guaranteed to be reasonably self-similar and hence homogeneous. Thus segments will be good units to cluster by similarity, or to classify as in the methods of [20]. These visualizations show how acoustically similar passages can be located in an audio recording. Similarity can also be found across recordings as well as within a single recording. As an immediate application, this would be useful wherever known music or audio needs to be located in a longer file. For example, it would be a simple matter to find the locations of the theme music in a news broadcast, or the times that advertisements occur in a TV broadcast, if the audio were available in advance. In this case, the similarity measure would be computed between all frames of the source commercial and the TV broadcast, resulting in a rectangular similarity matrix. Commercial onset times could be determined by thresholding the similarity matrix at some suitable value. Even if the commercials were not known in advance, they could be detected by their repetition. The structure of most music is sufficient to characterize the work. As proof by example, human experts can identify music and sound by visual structure alone: Victor Zue of MIT teaches a course in "reading" sound spectrographs, and in a double-blind test, Arthur G. Lintgen of Philadelphia was able to distinguish unlabeled classical recordings by identifying the softer and louder passages visible in the LP grooves [22]. These examples indicate that the visualization method presented here might be useful for music retrieval by similarity. Not only can acoustically similar audio be located, but structurally similar audio should be straightforward to find by comparing similarity matrices. For example, different performances of the same symphonic movement will have a similar structural visualization regardless of how or when they were performed or recorded, or indeed the instruments used. Figure 11 shows the self-similarity of the entire first movement of Beethoven's Symphony No. 5. Two visualizations are shown, each from a different performance featuring different conductors (Herbert von Karajan and Carlos Kleiber) and orchestras (the Berlin and Vienna Philharmonics, respectively). Because of the length of the piece (more than seven minutes), much fine detail is not observable. Each pixel represents nearly a second of music; thus the famous opening theme occurs in only the first few pixels. The primary visible structure is the alternation of softer string passages with louder tutti (all instruments playing) sections, for example the sustained climax near the end of the movement.

Figure 11: Self-similarity of Symphony No. 5. Left: von Karajan performance. Right: Carlos Kleiber performance.


This figure illustrates how the visualization captures both the essential structure of the piece and the variations between individual performances.

Automatic audio alignment

As shown above, a benefit of the novelty score is that it will be reasonably invariant across different realizations of the same music. Because it is based on self-similarity rather than particular acoustic features, different performances of the same music should have similar novelty scores. Thus Bach's Air on a G String should result in a novelty score that is similar whether played on a violin or a kazoo. An immediate application of this is that different realizations of the same music can be aligned. The time axis of one realization's novelty score can be warped to match the other using dynamic programming. The warping function then serves as a tempo map. Using non-linear time scale modification, one piece could be played back with the tempo of the other. (See Sprenger's excellent review of audio time and pitch scaling methods [23].) In general, segment boundaries are useful landmarks for time-scale modification, just as "control points" are used for image "morphing." Another application might be to play back an audio piece along with unpredictably timed events, such as progress through a video game. Longer segments could be associated with particular stages, such as a game level or virtual environment location. As long as the user stayed at that stage, the segment would be looped; moving to a different stage would cause another segment to start playing.
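A sketch of the dynamic-programming alignment, using a textbook dynamic time warping recursion on the two novelty scores; the report does not specify the exact formulation, so this is one plausible choice:

```python
import numpy as np

def align_novelty(na, nb):
    """Align two novelty scores by dynamic programming; the returned
    warping path maps time in `na` to time in `nb` and can serve as
    a tempo map."""
    A, B = len(na), len(nb)
    cost = np.full((A + 1, B + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, A + 1):
        for j in range(1, B + 1):
            d = abs(na[i - 1] - nb[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Trace back from the end to recover the warping path.
    path, i, j = [], A, B
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```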

Automatic tempo extraction

Knowing the time location of these landmarks allows synchronization of external events to the audio. For example, an animated character could beat time or dance to the musical tempo. Video clips could be automatically sequenced to an existing musical soundtrack. In another useful application, the lip movements of animated characters could be synchronized to speech or singing. Conversely, audio can be matched to an existing time sequence by warping segments so that audio landmarks (segment boundaries) occur at given times. Examples include creating a soundtrack for an existing animation or video sequence. Another application might be to arrange songs by similar tempo so that the transitions between them are smooth. This is a process that is done manually by expert DJs and by vendors of "environmental" music such as Muzak.

Music Analysis By Synthesis (and synthesis from analysis)

As another application of this principle, Figure 12 shows a similarity image of the Bach prelude derived directly from the MIDI data. Here, no acoustic information was used: matrix entries (i, j) were colored white if note i was the same pitch as note j, and left black otherwise. Comparing this image with the acoustic similarity images of Figure 3, the structures of all three visualizations are clearly highly similar, indicating that they do indeed capture the underlying structure of the music.¹


For example, Figure 3 shows two realizations of the same Bach piece, one by Glenn Gould and another which is a computer rendition of a MIDI file. Given the audio of a particular performance and a MIDI file representation of the same piece, as in Figures 3 and 12, it would be possible to warp the similarity matrix from the known-tempo MIDI rendition to match that of the original performance. The warping function would then serve as a tempo map, allowing the MIDI file to be played back with the tempo of the original. Other compelling applications result from the ability to get note-by-note or event-by-event indexes into the source audio.

Reliably segmenting music by note or phrase allows substantial compression. For example, a repeated series of notes can be represented by the first note and the repetition times. If the second chorus is nearly identical to the first, it need not be transmitted; just a code to indicate repetition of the first. The MPEG-4 structured audio standard supports exactly this kind of representation, but heretofore there have been few reliable methods to analyze the structure of existing audio.

Previous work

"Recurrence plots," which are thresholded versions of similarity matrices, have been used for time-series analysis [4]. Rather than any spectral-based measure, recurrence plots display the scalar distance between time-series points, thresholded such that only distances less than some threshold are plotted.

¹ Though this is perhaps beyond the scope of this paper, if a suitable measure of chord-wise similarity were used, the novelty score could be computed for an image like Figure 12, and hence this approach could be used to segment MIDI data.

Figure 12: Self-similarity of Prelude No. 1 computed from MIDI data only


Much work in audio analysis falls under the rubric "Auditory Scene Analysis," after the work by Bregman [5]. This approach typically looks for components in the frequency domain that are harmonically or temporally related, and is quite different from the work presented here. In particular, rules or assumptions are necessary to define what "related" means; these rules typically work well only in a limited domain. Many groups have reported work on musical beat tracking and analysis. A recent approach uses correlated energy peaks across sub-bands [1]. Another approach depends on restrictive assumptions, such as that the music must be in 4/4 time and have a bass drum on the downbeat [2].

Previous researchers have found useful applications for segmenting audio by speaker [13,14,15]. In this case, specialized models are trained on each speaker to be found, and the statistical likelihood is used to find segments spoken by the same speaker [9]. A group at Xerox PARC did audio segmentation by speaker by clustering audio segments and training a hidden Markov model, which was then used to segment the audio [9,10,11,12]. Another approach segments audio by finding similarity to a model trained on specific audio examples, using features such as spectral moments [7].

It is relatively easy to segment audio that has significant intersegment silences (sustained regions of low signal energy). This approach has been used for browsing recorded speech [16] as well as in commercial audio processing software (e.g. Sound Forge). Much interesting audio does not contain significant silence, however, including noisy or reverberant sources and most popular music. The novelty score approach is an improvement in that it does not use an absolute acoustic feature (silence) but rather the change in acoustic features. It should be noted that typical parameterizations and distance measures give silence a high self-similarity and a low cross-similarity with other sounds; thus the novelty score will robustly segment audio with intersegment silences if they exist.

References

[1] Scheirer, Eric D. (1998). "Tempo and Beat Analysis of Acoustic Musical Signals." J. Acoust. Soc. Am. 103(1) (Jan 1998), pp. 588-601.

[2] Goto, M., and Muraoka, Y. (1994). "A Beat Tracking System for Acoustic Signals of Music." In Proc. ACM Multimedia 1994, San Francisco, ACM.

[3] Scheirer, Eric D. (1998). "Using Musical Knowledge to Extract Expressive Performance Information from Audio Recordings." In Readings in Computational Auditory Scene Analysis, Rosenthal and Okuno (eds). Mahwah, NJ: Lawrence Erlbaum, 1998.

[4] Eckmann, J.-P., et al. "Recurrence Plots of Dynamical Systems." Europhys. Lett. 4, 973 (1 November 1987).

[5] Bregman, A. Auditory Scene Analysis: Perceptual Organization of Sound. Bradford Books, 1994.

[6] Foote, J. (1999). "Visualizing Music and Audio using Self-Similarity." In Proc. ACM Multimedia 99, Orlando, FL.

[7] US Patent 5918223: Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information.

[8] US Patent 5227892: Method and apparatus for identifying and selecting edit points in digital audio signals recorded on a record medium.

[9] Kimber, D., and Wilcox, L. "Acoustic Segmentation for Audio Browsers." In Proc. Interface Conference, Sydney, Australia, 1996.

[10] US Patent 5659662: Unsupervised speaker clustering for automatic speaker indexing of recorded audio data. Wilcox, L., and Kimber, D.

[11] US Patent 5598507: Method of speaker clustering for unknown speakers in conversational audio data. Kimber, D., Wilcox, L., Chen, F.

[12] US Patent 5655058: Segmentation of audio data for indexing of conversational speech for real-time or postprocessing applications. Balasubramanian, V., Chen, F., Chou, P., Kimber, D., Poon, A., Weber, K., Wilcox, L.

[13] Gish et al. "Segregation of Speakers for Speech Recognition and Speaker Identification." Proc. ICASSP, May 1991, vol. 2, pp. 873-876.

[14] Siu et al. "An Unsupervised Sequential Learning Algorithm for the Segmentation of Speech Waveforms with Multiple Speakers." Proc. ICASSP, Mar. 1992, vol. 2, pp. 189-192.

[15] Sugiyama et al. "Speech Segmentation and Clustering Based on Speaker Features." Proc. ICASSP, Apr. 1993, vol. 2, pp. 395-398.

[16] Arons, B. "SpeechSkimmer: A System for Interactively Skimming Recorded Speech." ACM Trans. on Computer Human Interaction, 4(1):3-38, March 1997.

[17] Slaney, M. (1998). "Auditory Toolbox." Technical Report #1998-010, Interval Research Corporation, Palo Alto, CA.

[18] Rabiner, L., and Juang, B.-H. Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1993.

[19] Foote, J. T., and Silverman, H. F. "A Model Distance Measure for Talker Clustering and Identification." In Proc. ICASSP '94, volume S1, pp. 317-320, Adelaide, Australia, April 1994. IEEE.

[20] Foote, J. (1997). "Content-Based Retrieval of Music and Audio." In Multimedia Storage and Archiving Systems II, Proc. SPIE, Vol. 3229, Dallas, TX.

[21] US Patent 5828994: Covell et al. "Non-uniform time scale modification of recorded audio." Oct. 27, 1998.

[22] Johnson, P. "sci.skeptic FAQ," Section 0.6.2, http://www.faqs.org/faqs/skeptic-faq/1, 1999.

[23] Sprenger, S. "Time and Pitch Scaling of Audio Signals," http://www.dspdimension.com/html/timepitch.html

[24] MPEG Requirements Group, "Description of MPEG-7 Content Set," Doc. ISO/MPEG N2467, MPEG Atlantic City Meeting, October 1998.