A Computational Framework for Sound Segregation in Music Signals using Marsyas
DESCRIPTION
Slides of my talk @ Google - Auditory Modeling Workshop, Nov. 19, 2010.
TRANSCRIPT
A Computational Framework for Sound Segregation in Music
Signals
Luís Gustavo Martins CITAR / Escola das Artes da UCP
Porto, Portugal
Auditory Modeling Workshop Google, Mountain View, CA, USA
19.11.2010
Acknowledgments
} This work is the result of a collaboration with:
} University of Victoria, BC, Canada
} George Tzanetakis, Mathieu Lagrange, Jennifer Murdock
} All the Marsyas team
} INESC Porto
} Luis Filipe Teixeira
} Jaime Cardoso
} Fabien Gouyon
} Technical University of Berlin, Germany
} Juan José Burred
} FEUP PhD Advisor
} Prof. Aníbal Ferreira
} Supporting entities
} Fundação para a Ciência e a Tecnologia - FCT
} Fundação Calouste Gulbenkian
} VISNET II, NoE European Project
Research Project
} FCT R&D Project (APPROVED FOR FUNDING)
} A Computational Auditory Scene Analysis Framework for Sound Segregation in Music Signals
} 3-year project (starting Jan. 2011)
} Partners:
} CITAR (Porto, Portugal) Luís Gustavo Martins (PI), Álvaro Barbosa, Daniela Coimbra
} INESC Porto (Porto, Portugal) Fabien Gouyon
} UVic (Victoria, BC, Canada) George Tzanetakis
} IRCAM (Paris, France) Mathieu Lagrange
} Consultants
} FEUP (Porto, Portugal) Prof. Aníbal Ferreira, Prof. Jaime Cardoso
} McGill University / CIRMMT (Montreal, QC, Canada) Prof. Stephen McAdams
Summary
} Problem Statement
} The Main Challenges
} Current State
} Related Research Areas
} Main Contributions
} Proposed Approach
} Results
} Software Implementation
} Conclusions and Future Work
Problem Statement
} Propose a computational sound segregation framework
} Focused on music signals
} But not necessarily limited to music signals
} Perceptually inspired
} So it can build upon the current knowledge of how listeners perceive sound events in music signals
} Causal
} So it mimics the human auditory system and allows online processing of sounds
} Flexible
} So it can accommodate different perceptually inspired grouping cues
} Generic
} So it can be used in different audio and MIR application scenarios
} Effective
} So it can improve the extraction of perceptually relevant information from musical mixtures
} Efficient
} So it can find practical use in audio processing and MIR tasks
MUSIC LISTENING
[Figure 2: The main types of auditory processing and their interactions (adapted from [McAdams and Bigand, 1993]). Blocks: transduction; extraction of attributes; auditory grouping processes; event structure processing; abstract knowledge structures; attentional processes; mental representation of the sound environment.]
possible to extract perceptual attributes which provide a representation of each element in
the auditory system.
These attributes can now be interpreted with respect to evoked abstracted knowledge
structures that allow the recognition and identification of events and sequences of events,
as well as the assignment of meaning and significance. All this is performed in the light
of the local stimulus context and the past experience of the listener (see the discussion of
schema-based grouping in Section 2.3.1).
These perceived relations among sound events can influence the perception of subsequent
events (and even revise past interpretations – e.g. the McAdams-Reynolds
oboe example [McAdams, 1984] or the “old-plus-new” organization principle described
in [Bregman, 1990, pp.261]). This event structure processing influences the establishment
of larger-scale structural relations.
The resulting elements from the event structure processing give rise to the progressive
creation of a mental representation of the sound environment involving the listener. Given
the transitory nature of the acoustic stimulus, information about acoustic events must
be somehow accumulated through time. This accumulated mental representation plays
a crucial part in the ability to dynamically focus attention on new incoming acoustic
} Human listeners are able to perceive individual sound events in complex mixtures
} Even if listening to:
} Monaural music recordings
} Unknown sounds, timbres or instruments
} Perception is influenced by several complex factors
} Listener’s prior knowledge, context, attention, …
} Based on both low-level and high-level cues
} Difficult to replicate computationally…
The Main Challenges
} Why Music Signals?
} Music sound is, in some senses, more challenging to analyse than non-musical sounds
} High time-frequency overlap of sources and sound events
Music composition and orchestration: sources often play simultaneously (polyphony) and favor consonant pitch intervals, so sound sources are highly correlated
} High variety of spectral and temporal characteristics
Musical instruments present a wide range of sound production mechanisms
} Techniques traditionally used for monophonic, non-musical or speech signals perform poorly
} Yet, music signals are usually well organized and structured
Current State
} Typical systems in MIR
} Represent the entire sound mixture statistically
} Analysis and retrieval performance reached a “glass ceiling” [Aucouturier and Pachet, 2004]
} New Paradigm
} Attempt to individually characterize the different sound events in a sound mixture
} Performance still quite limited when compared to the human auditory system
} But already provides alternative and improved approaches to common sound analysis and MIR tasks
Applications
} “Holy grail” applications
} “The Listening Machine”
} “The Robotic Ear”
} “Down to earth” applications
} Sound and Music Description
} Sound Manipulation
} Robust Speech and Speaker Recognition
} Object-based Audio Coding
} Automatic Music Transcription
} Audio and Music Information Retrieval
} Auditory Scene Reconstruction
} Hearing Prostheses
} Up-mixing
} …
Related Research Areas
} Sound and Music Computing (SMC) [Serra et al., 2007]
} Computational Auditory Scene Analysis (CASA) [Wang and Brown, 2006]
} Perception Research } Psychoacoustics [Stevens, 1957]
} Auditory Scene Analysis (ASA) [Bregman, 1990]
} Digital Signal Processing [Oppenheim and Schafer, 1975]
} Music Information Retrieval (MIR) [Downie, 2003]
} Machine Learning [Duda et al., 2000]
} Computer Vision [Marr, 1982]
Related Areas
} Auditory Scene Analysis (ASA) [Bregman, 1990]
} How do humans “understand” sound mixtures?
} Find packages of acoustic evidence such that each package has arisen from a single sound source
} Grouping Cues
} Integration: simultaneous vs. sequential; primitive vs. schema-based
} Cues: common amplitude and frequency (common fate), harmonicity, time continuity, …
Related Areas
} Computational Auditory Scene Analysis (CASA) [Wang and Brown, 2006]
} “Field of computational study that aims to achieve human performance in ASA by using one or two microphone recordings of the acoustic scene.” [Wang and Brown, 2006]
[Figure 3: System architecture of a typical CASA system: an acoustic mixture enters an analysis front-end that produces a mid-level representation; scene organization, guided by grouping cues and source models, yields segregated signals via stream resynthesis.]
reference in the development of sound source separation systems, since it is the only
existing system which can robustly separate sound sources in various circumstances. As
a result, the resynthesis of the separated signals makes it possible to assess the performance
of a CASA system by means of listening tests or by using objective measurements (see
Section 2.4.5 for a more thorough discussion of the evaluation approaches traditionally used
in CASA). The resynthesis of the segregated components in an acoustic mixture is also of
great interest to most CASA applications (see Section 1.4).
2.4.3 Mid-level Representations for Music Signals
One of the most common digital representations of an acoustical signal is the sampled
waveform, also known as the time-domain representation (see top panel of Figure 4).
Each sample describes the sound pressure level of the acoustic signal at a specific time,
allowing an adequate visualization of the variation of the energy of the signal along time.
Although this representation allows efficient visual inspection of some signal features
(e.g. it is easy to identify silence regions, the onsets of sound events, and fade-ins and
fade-outs) and is the input of many basic audio signal analysis and processing operations, it is
not directly suitable for some more challenging tasks, such as the segregation of complex
music mixtures.
As a result, some other type of representation is necessary to bring out more clearly
the important characteristics of the signal for the application at hand, usually leading to
a so-called mid-level representation. Mid-level representation is a concept mainly coming
Main Contributions
} Proposal and experimental validation of a flexible and efficient framework for sound segregation
} Focused on “real-world” polyphonic music
} Inspired by ideas from CASA
} Causal and data-driven
} Definition of a novel harmonicity cue
} Termed Harmonically Wrapped Peak Similarity (HWPS)
} Experimentally shown to be a good grouping criterion
} Software implementation of the proposed sound segregation framework
} Modular, extensible and efficient
} Made available as free and open source software (FOSS)
} Based on the MARSYAS framework
Proposed Approach
} Assumptions
} Perception primarily depends on the use of low-level sensory information
} Does not necessarily require prior knowledge (i.e. training)
} Still able to perform primitive identification and segregation of sound events in a sound mixture
} Prior knowledge and high-level information can still be used
} To award additional meaning to the primitive observations
} To consolidate primitive observations as relevant sound events
} To modify the listener’s focus of attention
Proposed Approach
} System overview
[Diagram: Sinusoidal Analysis extracts Spectral Peaks from 46 ms frames; peaks are accumulated over a Texture Window (e.g. 150 ms); a Similarity Computation stage feeds the Normalized Cut and Cluster Selection, and the selected clusters are resynthesized by Sinusoidal Synthesis.]
Analysis Front-end
} Sinusoidal Modeling
} Sum of the highest-amplitude sinusoids (peaks) at each frame
} Maximum of 20 peaks/frame
} Window = 46 ms; hop = 11 ms
} Parametric model: estimate amplitude, frequency, and phase of each peak
[Diagram: Sinusoidal Analysis of 46 ms frames yields Spectral Peaks in the time-frequency plane.]
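As a sketch of this analysis front-end, the per-frame peak picking might look like the following (illustrative Python, not the actual Marsyas C++ implementation; the simple local-maximum picker and the 44.1 kHz sampling rate are assumptions):

```python
import numpy as np

def spectral_peaks(frame, sr, max_peaks=20):
    """Return (freqs, amps) of up to `max_peaks` highest-amplitude local
    maxima of the magnitude spectrum of one windowed analysis frame."""
    win = np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(frame * win))
    # local maxima: bins larger than both neighbours
    idx = np.where((spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:]))[0] + 1
    # keep only the highest-amplitude peaks, strongest first
    idx = idx[np.argsort(spec[idx])[::-1][:max_peaks]]
    return idx * sr / len(frame), spec[idx]

# 46 ms analysis window (hop would be 11 ms), values from the slides
sr = 44100
n = int(0.046 * sr)
t = np.arange(n) / sr
frame = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
freqs, amps = spectral_peaks(frame, sr)   # strongest peak lands near 440 Hz
```

Each retained peak carries the (amplitude, frequency) parameters that the later similarity stages operate on.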
Time Segmentation
} Texture Windows
} Construct a graph over a texture window of the sound mixture
} Provides time integration
Partial tracking and source separation are approached jointly (traditionally two separate, consecutive stages)
[Diagram: Spectral Peaks from successive 46 ms Sinusoidal Analysis frames are accumulated over a Texture Window in the time-frequency plane.]
Time Segmentation
} Fixed-length texture windows
} E.g. 150 ms
} Dynamically adjusted texture windows
} Onset detector
} Perceptually more relevant
} 50 ms ~ 300 ms
[Plot: amplitude, spectrogram, and spectral flux of a signal over 0–1.6 s, with flux peaks delimiting texture windows 1–7.]
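A minimal sketch of how a spectral-flux onset detector could delimit dynamically adjusted texture windows (the half-wave rectification, threshold, and frame-count clamping are illustrative assumptions, not the system's exact detector):

```python
import numpy as np

def spectral_flux(frames):
    """Half-wave-rectified spectral flux between consecutive frames."""
    specs = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
    return np.sum(np.maximum(np.diff(specs, axis=0), 0.0), axis=1)

def texture_boundaries(flux, threshold, min_len=4, max_len=27):
    """Start a new texture window at each flux peak above `threshold`,
    clamping window lengths to [min_len, max_len] frames
    (roughly 50 ms .. 300 ms at an 11 ms hop)."""
    bounds, last = [0], 0
    for i in range(1, len(flux)):
        if (flux[i] > threshold and i - last >= min_len) or i - last >= max_len:
            bounds.append(i)
            last = i
    return bounds

flux = np.array([0.0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0])
bounds = texture_boundaries(flux, threshold=1.0, min_len=2, max_len=8)  # -> [0, 5]
```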
Perceptual Cues as Similarity Functions
Similarity Computation
[Diagram: Amplitude Similarity, Frequency Similarity, and Harmonic Similarity (HWPS) — plus possible future cues such as Azimuth Proximity, Common Onset/Offset, and Source Models — are merged by a Combiner into an Overall Similarity Matrix over the Spectral Peaks of a 150 ms Texture Window, which feeds the Normalized Cut.]
Perceptual Cues as Similarity Functions
} Grouping Cues (inspired by ASA)
} Similarity between time-frequency components in a texture window
} Frequency proximity
} Amplitude proximity
} Harmonicity proximity (HWPS)
} …
} Encode topological knowledge into a similarity graph/matrix
} Simultaneous integration (peaks within the same frame)
} Sequential integration over the texture window
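The normalized-cut step on such a similarity graph can be illustrated with a toy two-way Shi & Malik cut (thresholding the generalized Fiedler vector at its median is one common choice, assumed here; the real system clusters spectral peaks, not these hand-made points):

```python
import numpy as np

def ncut_bipartition(W):
    """Two-way normalized cut (Shi & Malik): threshold the second-smallest
    generalized eigenvector of (D - W) v = lambda D v."""
    d = W.sum(axis=1)
    d_isqrt = np.diag(1.0 / np.sqrt(d))
    # symmetric normalized Laplacian: I - D^-1/2 W D^-1/2
    lap = np.eye(len(W)) - d_isqrt @ W @ d_isqrt
    _, vecs = np.linalg.eigh(lap)           # eigenvalues in ascending order
    fiedler = d_isqrt @ vecs[:, 1]          # map back to the generalized problem
    return (fiedler > np.median(fiedler)).astype(int)

# two loosely coupled groups of "peaks": high similarity inside, low across
W = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
labels = ncut_bipartition(W)   # first two peaks in one cluster, last two in the other
```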
Similarity Matrix
[Figure: similarity matrix over peaks A0–A4 and B0–B4 of the Tones A+B example (A4 and B3 share a frequency), and the corresponding similarity graph with nodes x_i, x_j, x_k, x_l, x_p, x_q and symmetric edge weights w_ij = w_ji.]
Perceptual Cues as Similarity Functions
} Defining a Generic Similarity Function
} Fully connected graphs
} Gaussian similarity function: w_ij = w_ji = e^(−(d(x_i, x_j)/σ)^2)
} How to define the neighborhood width (σ)?
Local statistics from the data in a Texture Window
Use prior knowledge (e.g. JNDs)
Use σ as weights (after normalizing the similarity function to [0,1])
[Plot: w_ij as a function of d(x_i, x_j) for σ = 0.4, 1.0, and 1.2.]
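A minimal numeric sketch of the Gaussian similarity function above, taking σ from local statistics of the texture window (the standard deviation of the feature values is just one illustrative choice):

```python
import numpy as np

def gaussian_similarity(x, sigma):
    """w_ij = exp(-(d(x_i, x_j)/sigma)^2) over all pairs (fully connected graph)."""
    d = np.abs(x[:, None] - x[None, :])    # Euclidean distance in 1-D
    return np.exp(-((d / sigma) ** 2))

# e.g. peak frequencies in Hz; sigma taken from the window's own statistics
x = np.array([100.0, 110.0, 500.0, 510.0])
W = gaussian_similarity(x, sigma=x.std())
```

Nearby peaks (100 and 110 Hz) get a weight near 1, distant ones near 0, which is exactly the neighborhood behaviour the σ choice controls.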
Perceptual Cues as Similarity Functions
} Amplitude and Frequency Similarity
} Amplitude
} Gaussian function of the Euclidean distances (in dB: more perceptually relevant)
} Frequency
} Gaussian function of the Euclidean distances (in Barks: more perceptually relevant)
} Not sufficient to segregate harmonic events
} Nevertheless important to group peaks from:
Inharmonic or noisy frequency components in harmonic sounds
Non-harmonic (unpitched) sounds
3.7 Perceptual Cues as Grouping Criteria
Following the more generic considerations about the definition of a similarity function for
spectral clustering presented in Section 3.6.2, this section will focus on the specification
of grouping criteria that attempt to capture the perceptual proximity between sound
components (i.e. peaks) in accordance with some of the perceptual grouping principles
discussed in Section 2.3.1. These include frequency and amplitude proximity and a novel
harmonicity criterion, termed Harmonically Wrapped Peak Similarity (HWPS). Such a
harmonicity criterion is of the utmost importance, since amplitude and frequency cues
are not enough to segregate multiple overlapping harmonic sound sources.
Still, the proposed framework can easily accommodate future similarity functions
such as frequency or time masking [Lagrange et al., 2006], common fate, stereo location,
or other grouping principles, making it possible to model an increasing number of
perceptual mechanisms involved in human hearing (discussed in Section 2.3.1).
3.7.1 Amplitude and Frequency Similarity
Two of the most basic similarities explored by the auditory system are related to the
frequency and amplitude features of the sound components in a sound mixture (see
Section 2.3.1).
Accordingly, the edge weight connecting two peaks p_l^k and p_m^(k+n) will depend on their
frequency and amplitude proximities. Following the generic considerations discussed for
the definition of a similarity function for spectral clustering in Section 3.6.2, the amplitude
and frequency similarities, W_a and W_f respectively, are defined as follows:

W_a(p_l^k, p_m^(k+n)) = e^(−((a_l^k − a_m^(k+n)) / σ_a)^2)    (18)

W_f(p_l^k, p_m^(k+n)) = e^(−((f_l^k − f_m^(k+n)) / σ_f)^2)    (19)
where the Euclidean distances are modeled as two Gaussian functions, as previously defined
in Equation 8. The amplitudes are measured in decibels (dB) and the frequencies are
measured in Barks (a frequency scale approximately linear below 500 Hz and logarithmic
above), since these scales have been shown to better model the sensitivity response of
the human ear [Hartmann, 1998].
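Equations (18) and (19) can be sketched numerically as follows (the Traunmüller approximation of the Bark scale and the σ values are assumptions made for illustration; the text does not fix them here):

```python
import numpy as np

def hz_to_bark(f):
    """Traunmuller's approximation of the Bark scale (one common formula)."""
    return 26.81 * f / (1960.0 + f) - 0.53

def amp_similarity(a1, a2, sigma_a=6.0):
    """Eq. (18): Gaussian of the amplitude distance measured in dB
    (inputs are linear amplitudes; sigma_a = 6 dB is an assumed width)."""
    d = 20.0 * np.log10(a1) - 20.0 * np.log10(a2)
    return np.exp(-((d / sigma_a) ** 2))

def freq_similarity(f1, f2, sigma_f=1.0):
    """Eq. (19): Gaussian of the frequency distance measured in Barks
    (sigma_f = 1 Bark is an assumed width)."""
    d = hz_to_bark(f1) - hz_to_bark(f2)
    return np.exp(-((d / sigma_f) ** 2))
```

Working in dB and Barks makes equal numeric distances correspond roughly to equal perceptual distances, which is why those scales are preferred over linear amplitude and Hz.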
Perceptual Cues as Similarity Functions
} Harmonically Wrapped Peak Similarity (HWPS)
} Harmonicity is one of the most powerful ASA cues [Wang and Brown, 2006]
} Proposal of a novel harmonicity similarity function
} Does not rely on prior knowledge of the f0s in the signal
} Takes spectral information into account in a global manner (spectral patterns)
For peaks in the same frame or in different frames of a Texture Window
Takes into consideration the amplitudes of the spectral peaks
} 3-step algorithm: Shifted Spectral Pattern → Wrapped Frequency Space → Histogram computation → Discrete Cosine Similarity [0,1]
STEP 3 – Discrete Cosine Similarity
The last step is to correlate the two shifted and harmonically wrapped spectral patterns
(F_l^k and F_m^(k+n)) to obtain the HWPS measure between the two corresponding peaks.
This correlation can be done using an algorithmic approach as proposed in
[Lagrange and Marchand, 2006], but this was found not to be reliable or robust in practice.
Alternatively, the proposal is to discretize each shifted and harmonically wrapped
spectral pattern into an amplitude-weighted histogram, H_l^k, corresponding to each
spectral pattern F_l^k. The contribution of each peak to the histogram is equal to its
amplitude, and the range between 0 and 1 of the harmonically wrapped frequency is divided
into 20 equal-size bins (a 12- or a 24-bin histogram would provide a more musically
meaningful chroma-based representation, but preliminary empirical tests have shown better
results when using 20-bin histograms).
In addition, the harmonically wrapped spectral patterns are also folded into an octave
to form a pitch-invariant “chroma” profile. For example, in Figure 19, the energy of the
spectral pattern in wrapped frequency 1 (all integer multiples of the wrapping frequency)
is mapped to histogram bin 0.
The HWPS similarity between the peaks p_l^k and p_m^(k+n) is then defined based on the
cosine distance between the two corresponding discretized histograms as follows:

W_h(p_l^k, p_m^(k+n)) = HWPS(p_l^k, p_m^(k+n)) = e^(−(1 − c(H_l^k, H_m^(k+n)) / sqrt(c(H_l^k, H_l^k) · c(H_m^(k+n), H_m^(k+n))))^2)    (28)

where

c(H_a^b, H_c^d) = Σ_i H_a^b(i) × H_c^d(i).    (29)
One may notice that due to the wrapping operation of Equation 25, the size of the
histograms can be relatively small (e.g. 20 bins), thus being computationally efficient. A
Gaussian function is also used for controlling the neighborhood width of the harmonicity
cue, where σh = 1 is implicitly used in the current system implementation.
An Intuitive View on HWPS
The three steps that compose the HWPS algorithm presented in the previous section can
be more intuitively understood by looking at the following example scenarios.
As a first example, imagine that in a frame k there are six peaks from two harmonic
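The three HWPS steps can be condensed into a small sketch (a simplified reading of the algorithm using the conservative wrapping estimate h′ of Eq. (27) and 20 histogram bins; not the Marsyas implementation):

```python
import numpy as np

def hwps(freqs_l, amps_l, fl, freqs_m, amps_m, fm, n_bins=20):
    """HWPS between a peak at frequency fl (whose frame has the spectral
    pattern freqs_l/amps_l) and a peak at fm (pattern freqs_m/amps_m)."""
    h = min(fl, fm)                        # Eq. (27): conservative f0 estimate
    def wrapped_histogram(freqs, amps, f_peak):
        # STEP 1: shift the pattern so the peak lands on a multiple of h;
        # STEP 2: wrap frequencies into [0, 1) in units of h
        wrapped = ((freqs - (f_peak - h)) / h) % 1.0
        # STEP 3a: amplitude-weighted histogram of the wrapped pattern
        hist, _ = np.histogram(wrapped, bins=n_bins, range=(0.0, 1.0), weights=amps)
        return hist
    hl = wrapped_histogram(freqs_l, amps_l, fl)
    hm = wrapped_histogram(freqs_m, amps_m, fm)
    # STEP 3b: Eq. (28) with sigma_h = 1, a Gaussian of (1 - cosine similarity)
    cos = np.dot(hl, hm) / np.sqrt(np.dot(hl, hl) * np.dot(hm, hm))
    return np.exp(-((1.0 - cos) ** 2))

fa = np.array([100.0, 200.0, 300.0])       # harmonic series, f0 = 100 Hz
fb = np.array([130.0, 260.0, 390.0])       # harmonic series, f0 = 130 Hz
ones = np.ones(3)
same = hwps(fa, ones, 100.0, fa, ones, 200.0)   # A0 vs A1: same source, high
diff = hwps(fa, ones, 100.0, fb, ones, 130.0)   # A0 vs B0: different sources, lower
```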
Perceptual Cues as Similarity Functions
} HWPS
} Between peaks of the same harmonic “source”
} In the same frame
High similarity (~1.0)
[Figure: two harmonic series, A (f0A, 2f0A, 3f0A) and B (f0B, 2f0B, 3f0B), in frame k. After shifting by the frequencies of the peaks under comparison (A1 at 2f0A, A0 at f0A) and wrapping by h = f0A, the spectral patterns F_A1^k and F_A0^k coincide, so HWPS(A1, A0)|h=f0A is high.]
Perceptual Cues as Similarity Functions
} HWPS
} Between peaks of different harmonic “sources”
} In the same frame
Low similarity (~0.0)
[Figure: for peaks A1 (at 2f0A) and B0 (at f0B) in frame k, the shifted and wrapped spectral patterns F_A1^k and F_B0^k do not coincide, so HWPS(A1, B0)|h=f0A is low.]
Perceptual Cues as Similarity Functions
} HWPS
} Between peaks of the same harmonic “source”
} In different frames
Mid-high similarity
Interfering spectral content may differ between frames, which degrades the HWPS… only consider bin 0?
[Figure: peaks A1 in frame k and A0 in frame k+n (where an interfering source C with f0C is also active); the shifted and wrapped patterns F_A1^k and F_A0^(k+n) only partially coincide, so HWPS(A1^k, A0^(k+n))|h=f0A is mid-high.]
Perceptual Cues as Similarity Functions
} HWPS
} Impact of f0 estimates (h′)
} Ideal
} Min peak frequency
} Highest-amplitude peak
} Histogram-based f0 estimates (pitch estimates == nr. of sources?)
A FRAMEWORK FOR SOUND SEGREGATION IN MUSIC SIGNALS
wrapping operation would be perfect with prior knowledge of the fundamental frequency.
With this knowledge it would be possible to parametrize the wrapping operation h as:

h = min(f0_l^k, f0_m^(k+n))    (26)

where f0_l^k is the fundamental frequency of the source of the peak p_l^k. Without such a
prior, a conservative approach h′ is considered instead, although it will tend to
overestimate the fundamental frequency:

h′ = min(f_l^k, f_m^(k+n))    (27)
Notice that the value of the wrapping frequency function h is the same for both patterns
corresponding to the peaks under consideration. Therefore the resulting shifted and
wrapped frequency patterns will be more similar if the peaks belong to the same harmonic
“source”. The resulting shifted and wrapped patterns are pitch invariant and can be seen
in the middle plots of Figures 19 and 20.
Different approaches could have been taken for the definition of the fundamental frequency
estimation function h′. One possibility would be to select the highest-amplitude peak in
the union of the two spectral patterns under consideration as the f0 estimate (i.e.
h′ = f_i, with i = argmax_i(A_i), ∀i ∈ [1, #A], where A = A_l^k ∪ A_m^(k+n), #A is its
number of elements, and A_l^k is the set of amplitudes corresponding to the spectral
pattern F_l^k). The motivation for this approach is the fact that the highest-amplitude
partial in musical signals often corresponds to the fundamental frequency of the most
prominent harmonic “source” active in that frame, although this assumption will not
always hold.
A more robust approach, though more computationally expensive, would be to calculate
all the frequency differences between all peaks in each spectral pattern and compute a
histogram. The peaks in these histograms would be good candidates for the fundamental
frequencies in each frame (in order to avoid octave ambiguities, a second histogram with
the differences between all the candidate f0 values could be computed, where the
highest peaks would be selected as the final f0 candidates). The HWPS could then be
iteratively calculated using each f0 candidate in this short list, selecting the one with
the best value as the final choice. In fact, this technique could prove an interesting way to
robustly estimate the number of harmonic “sources” in each frame, including their pitches,
but experimental evaluations are still required to validate these approaches.
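The frequency-difference histogram idea described above could be sketched as follows (purely illustrative: the 10 Hz bin resolution and the candidate count are assumptions, and the second, octave-disambiguating histogram is omitted):

```python
import numpy as np

def f0_candidates(freqs, n_cand=2, resolution=10.0):
    """Candidate fundamental frequencies from the histogram of all pairwise
    peak-frequency differences in a spectral pattern (returns bin centres)."""
    diffs = np.abs(freqs[:, None] - freqs[None, :])
    diffs = diffs[diffs > 0]                       # drop the zero diagonal
    edges = np.arange(0.0, diffs.max() + resolution, resolution)
    hist, edges = np.histogram(diffs, bins=edges)
    top = np.argsort(hist)[::-1][:n_cand]          # most populated bins
    return np.sort(edges[top] + resolution / 2.0)

# two interleaved harmonic series with f0 = 100 Hz and f0 = 130 Hz
peaks = np.array([100.0, 130.0, 200.0, 260.0, 300.0, 390.0])
cands = f0_candidates(peaks)                       # centres near 100 and 130 Hz
```

As the text suggests, the number of strong histogram peaks also hints at the number of harmonic "sources" active in the frame.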
[Plot: magnitude spectrum (0–3000 Hz, amplitude 0–1) of the Tones A+B example, with peaks labeled A0–A4 and B0–B4; A4 and B3 coincide in frequency.]
Similarity Combination
[Diagram: the Similarity Computation block again — the individual cue similarities (amplitude, frequency, HWPS, and possible future cues) are combined into an Overall Similarity Matrix that feeds the Normalized Cut.]
Similarity Combination
} Combining cues
} Product operator [Shi & Malik, 2000]
High overall similarity only if all cues are high…
} More expressive operators?
would need to be conducted in order to validate their adequacy.
Plots f,g in Figures 17 and 18 show the segregation results when only using the HWPS
similarity function Wh. In the case of the Tones A+B example, it is clear that the HWPS
enabled the segregation algorithm to correctly separate the two harmonic sets, A and B
(note that the shared harmonic {A4, B3} was grouped with the components from harmonic
set B).
For the real-world example Jazz1, it is also clear that the HWPS allowed the grouping
algorithm to successfully group the most predominant harmonic content in the mixture
into a single cluster (see plot f), separating it from the harmonic background and the
noisier components in the sound mixture (see plot g)⁵.
Section 4.2.1 will present additional experimental evaluations as well as a comparative
study with some of the best-known approaches for the computation of harmonicity
similarity. Additionally, Appendix D describes the use of the HWPS as the basis of a
computationally efficient scheme for dominant harmonic source separation, further
demonstrating the ability of this novel similarity cue to correctly identify the harmonic
content in complex sound mixtures.
3.7.3 Combining Similarity Functions
The use of a single similarity cue for sound segregation was presented in Sections 3.7.1
and 3.7.2, and the results showed that each cue yields a different representation of the
complex audio mixture. As a result, each cue by itself provides only a limited ability
to represent the different sound events in a complex mixture. Therefore, combining
different similarity cues could exploit their individual grouping abilities, leading to
a more meaningful segregation of a sound mixture.
Following the work of Shi and Malik [Shi and Malik, 2000], who proposed to compute
the overall similarity function as the product of the individual similarity cues used for
image segmentation, the current system combines the amplitude, frequency and HWPS
grouping cues presented in the previous sections into a combined similarity function W as
follows:
W(pl, pm) = Wafh(pl, pm) = Wa(pl, pm) × Wf(pl, pm) × Wh(pl, pm)    (30)
Plots g in Figures 15 and 16 show the histogram of the values resulting from the combined similarity functions for the two sound examples, Tones A+B and Jazz1, respectively.
⁵Audio clips of the signals plotted in Figures 17 and 18 are available at http://www.inescporto.pt/~lmartins/Research/Phd/Phd.htm
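Equation (30) is straightforward to express over the per-cue similarity matrices; a minimal sketch (the toy 3-peak matrices are illustrative, not data from the thesis):

```python
import numpy as np

def combined_similarity(Wa, Wf, Wh):
    """Overall similarity as the element-wise product of the individual cue
    similarity matrices (Eq. 30): high only if ALL cues agree."""
    return Wa * Wf * Wh

# Toy 3-peak example: peaks 0 and 1 agree on every cue, peak 2 on none.
Wa = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
Wf = np.array([[1.0, 0.8, 0.7], [0.8, 1.0, 0.6], [0.7, 0.6, 1.0]])
Wh = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]])
W = combined_similarity(Wa, Wf, Wh)
```

Because the product is taken element-wise, a single low cue value vetoes the overall similarity of a peak pair.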
Wafh = [(Wf ∧ Wa) ∨ Wh] ∧ Ws
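As an example of a more expressive combination, the formula above can be realized with the element-wise product as the AND operator and the element-wise maximum as the OR operator (the `.*` and `max` composites of the Marsyas simNet slide). A hedged sketch, where Ws is assumed to be a stereo/azimuth similarity matrix:

```python
import numpy as np

def expressive_similarity(Wf, Wa, Wh, Ws):
    """Sketch of Wafh = [(Wf AND Wa) OR Wh] AND Ws, with AND as the
    element-wise product and OR as the element-wise maximum."""
    return np.maximum(Wf * Wa, Wh) * Ws

# Toy 2-peak example: frequency and amplitude agree, harmonicity
# disagrees, and azimuth does not veto the combination.
Wf = np.array([[1.0, 0.9], [0.9, 1.0]])
Wa = np.array([[1.0, 0.8], [0.8, 1.0]])
Wh = np.array([[1.0, 0.1], [0.1, 1.0]])
Ws = np.ones((2, 2))
W = expressive_similarity(Wf, Wa, Wh, Ws)
```

Unlike the pure product, the OR term lets a strong harmonicity cue rescue a pair that the other cues rate poorly.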
Segregating Sound Events
A Computational Framework for Sound Segregation in Music Signals 39
} Segregation task } Carried out by clustering components that are close in the similarity space } Novel method based on Spectral Clustering
} Normalized Cut (Ncut) criterion Originally proposed for Computer Vision Takes the cues as pair-wise similarities Clusters the peaks into groups, taking all cues into account simultaneously
Similarity Computation
AMPLITUDE SIMILARITY
FREQUENCY SIMILARITY
HARMONIC SIMILARITY (HWPS)
AZIMUTH PROXIMITY
COMMON ONSET/OFFSET
SOURCE MODELS
COMBINER
Spectral Peaks (over Texture Window)
150 ms
OVERALL SIMILARITY MATRIX
Normalized Cut
...
Segregating Sound Events
A Computational Framework for Sound Segregation in Music Signals 40
} Segregation Task } Normalized Cut criterion
} Achieves a balanced clustering of elements } Relies on the eigenstructure of a similarity matrix to partition points
into disjoint clusters Points in the same cluster have high similarity Points in different clusters have low similarity
[Graph-cut illustration: nodes xi, xj, xk, xl, xp, xq connected by symmetric weights wij = wji; the balanced "better cut" is contrasted with the minimum cut]
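A minimal sketch of the two-way Ncut step, following Shi & Malik: take the second-smallest generalized eigenvector of (D − W)x = λDx and threshold it at zero (one common discretization; real systems search over thresholds).

```python
import numpy as np

def ncut_bipartition(W):
    """Two-way Normalized Cut: split points into two clusters using the
    second-smallest generalized eigenvector of (D - W) x = lambda D x."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # Symmetric normalized Laplacian: D^(-1/2) (D - W) D^(-1/2)
    L = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    x = D_inv_sqrt @ vecs[:, 1]             # map back: x = D^(-1/2) y
    return x >= 0                           # boolean cluster indicator

# Two groups of points with high within-group, low between-group similarity.
W = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
labels = ncut_bipartition(W)
```

The sign of the eigenvector is arbitrary, so only the grouping (not which side is True) is meaningful.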
Segregating Sound Events
A Computational Framework for Sound Segregation in Music Signals 41
} Spectral Clustering } Alternative to the traditional EM and k-means algorithms:
} Does not assume convex-shaped data } Does not assume a Gaussian distribution of the data } Does not suffer from multiple minima in the log-likelihood
Avoids multiple restarts of the iterative process
} Correctly handles complex and unknown cluster shapes } Usual in audio signals [Bach and Jordan, 2004]
Segregating Sound Events
A Computational Framework for Sound Segregation in Music Signals 42
} Divisive clustering approach } Recursive two-way cut
} Hierarchical partition of the data Recursively partitions the data into two sets
Until a pre-defined number of clusters is reached (requires prior knowledge!) Until a stopping criterion is met
} Current implementation Requires the definition of the number of clusters [Martins et al., 2007]
Or alternatively partitions the data into 5 clusters and selects the 2 "denser" ones Segregation of the dominant clusters in the mixture [Lagrange et al., 2008a]
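The recursive two-way cut can be sketched as follows. This is illustrative only: it always splits the largest current cluster, whereas a real implementation would also stop on an Ncut-value criterion or select the denser clusters, as noted above.

```python
import numpy as np

def bipartition(W):
    """Spectral 2-way cut: threshold the second-smallest eigenvector of the
    symmetric normalized Laplacian at zero."""
    d = W.sum(axis=1)
    Dis = np.diag(1.0 / np.sqrt(d))
    L = Dis @ (np.diag(d) - W) @ Dis
    _, vecs = np.linalg.eigh(L)
    return (Dis @ vecs[:, 1]) >= 0

def recursive_ncut(W, k):
    """Divisive clustering: recursively bipartition until k clusters."""
    clusters = [np.arange(W.shape[0])]
    while len(clusters) < k:
        # Split the largest current cluster.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        mask = bipartition(W[np.ix_(members, members)])
        clusters.append(members[mask])
        clusters.append(members[~mask])
    return clusters

# Three tightly-bound pairs of peaks; pairs (0,1) and (2,3) are more
# similar to each other than either is to pair (4,5).
blocks = np.full((6, 6), 0.02)
blocks[:4, :4] = 0.1
for i, j in [(0, 1), (2, 3), (4, 5)]:
    blocks[i, j] = blocks[j, i] = 0.9
np.fill_diagonal(blocks, 1.0)
clusters = recursive_ncut(blocks, 3)
```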
Segregation Results
A Computational Framework for Sound Segregation in Music Signals 43
a) Jazz1
b) AMPLITUDE SIMILARITY / CLUSTER 1
c) AMPLITUDE SIMILARITY / CLUSTER 2
d) FREQUENCY SIMILARITY / CLUSTER 1
e) FREQUENCY SIMILARITY / CLUSTER 2
f) HWPS SIMILARITY / CLUSTER 1
g) HWPS SIMILARITY / CLUSTER 2
h) COMBINED SIMILARITIES / CLUSTER 1
i) COMBINED SIMILARITIES / CLUSTER 2
[All panels plot Frequency (Hz) vs. Time (secs)]
a) Tones A+B
b) AMPLITUDE SIMILARITY / CLUSTER 1
c) AMPLITUDE SIMILARITY / CLUSTER 2
d) FREQUENCY SIMILARITY / CLUSTER 1
e) FREQUENCY SIMILARITY / CLUSTER 2
f) HWPS SIMILARITY / CLUSTER 1
g) HWPS SIMILARITY / CLUSTER 2
h) COMBINED SIMILARITIES / CLUSTER 1
i) COMBINED SIMILARITIES / CLUSTER 2
[All panels plot Frequency (Hz) vs. Time (secs); harmonic tracks labeled A0-A4 and B0-B4, with the shared harmonic A4 + B3. Inset: amplitude spectrum (0-3000 Hz, normalized amplitude 0-1) of the mixture with the same labels]
Results
A Computational Framework for Sound Segregation in Music Signals 45
} Predominant Melodic Source Segregation } Dataset of real-world polyphonic music recordings
} Availability of the original isolated tracks (ground truth) } Results (the higher the better)
HWPS improves results When combined with other similarity features When compared with other state-of-the-art harmonicity features [Srinivasan and Kankanhalli, 2003]
[Virtanen and Klapuri, 2000]
[Bar chart: Mean SDR (dB) for a 10-song dataset (x-axis 0-7 dB), comparing the configurations A+F+HWPS, A+F+rHWPS, A+F+HV, A+F+HS and A+F]
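The SDR figure of merit used above is, in its basic form, the energy ratio between the ground-truth source and the separation error. A minimal sketch (the actual evaluation may use a BSS-EVAL-style decomposition with allowed distortions):

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-Distortion Ratio in dB between the ground-truth source
    track and the resynthesized (segregated) estimate. Higher is better."""
    reference = np.asarray(reference, dtype=float)
    noise = reference - np.asarray(estimate, dtype=float)
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# An estimate that is the reference scaled by 0.9 leaves 1% residual
# energy, i.e. an SDR of about 20 dB.
t = np.arange(1000)
reference = np.sin(0.01 * t)
score = sdr_db(reference, 0.9 * reference)
```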
Results
A Computational Framework for Sound Segregation in Music Signals 47
} Predominant Melodic Source Segregation } On the use of Dynamic Texture Windows
} Results (the higher the better)
Smaller improvement (0.15 dB) than expected Probably due to the cluster selection approach being used…
More computationally intensive (for longer texture windows)
Results
A Computational Framework for Sound Segregation in Music Signals 51
} Main Melody Pitch Estimation } Resynthesize the segregated main voice clusters } Perform pitch estimation using a well-known monophonic pitch estimation technique
(Praat)
} Comparison with two techniques: } Monophonic pitch estimation applied to mixture audio (from Praat) } State-of-the-Art multi-pitch and main melody estimation algorithm applied to mixture
audio [Klapuri, 2006]
} Results (the lower the better)
Results
A Computational Framework for Sound Segregation in Music Signals 56
} Voicing Detection } Identifying portions of a music file containing vocals
} Evaluated three feature sets: MFCC features extracted from the polyphonic signal; MFCC features extracted from the segregated main voice; Cluster Peak Ratio (CPR) feature
extracted from the segregated main voice clusters
Results
A Computational Framework for Sound Segregation in Music Signals 57
} Timbre Identification in polyphonic music signals [Martins et al., 2007]
} Polyphonic, multi-instrumental audio signals } Artificial mixtures of 2-, 3- and 4-notes from real instruments
} Automatic separation of the sound sources } Sound sources and events are reasonably captured, corresponding in
most cases to played notes
} Matching of the separated events to a collection of 6 timbre models
[Block diagram: audio → Sinusoidal Analysis → Peak Picking → Sound Source Formation → notes 1…n → Timbre Models Matching → note 1 / inst 1 … note n / inst i]
Results
A Computational Framework for Sound Segregation in Music Signals 58
} Timbre Identification in polyphonic music signals [Martins et al., 2007] } Sound sources and events are reasonably captured,
corresponding in most cases to played notes
Results
A Computational Framework for Sound Segregation in Music Signals 59
} Timbre Identification in polyphonic music signals [Martins et al., 2007] } 6 instruments modeled [Burred et al., 2006]:
} Piano, violin, oboe, clarinet, trumpet and alto sax } Modeled as a set of time-frequency templates
Describes the typical evolution in time of the spectral envelope of a note Matched against the salient peaks of the spectrum
[Figure: time-frequency timbre templates for PIANO and OBOE; Amplitude (dB, -80 to 0) over Frequency (2000-10000 Hz) and normalized time (0-1)]
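A hedged sketch of the matching step: evaluate a prototype spectral envelope at a segregated cluster's peak frequencies and score the amplitude error. This is simplified to a single static frame (the actual templates evolve over normalized note time), and all names are illustrative.

```python
import numpy as np

def template_match_cost(peak_freqs, peak_amps_db, template_freqs, template_amps_db):
    """Match a cluster's peaks against a timbre template: interpolate the
    prototype spectral envelope at the peak frequencies and accumulate
    the squared amplitude error. Lower cost = stronger match."""
    env = np.interp(peak_freqs, template_freqs, template_amps_db)
    return float(np.mean((np.asarray(peak_amps_db) - env) ** 2))

# Toy envelope falling from -10 dB at 100 Hz to -60 dB at 5000 Hz.
tf, ta = [100.0, 5000.0], [-10.0, -60.0]
good = template_match_cost([100.0, 2550.0, 5000.0], [-10.0, -35.0, -60.0], tf, ta)
bad = template_match_cost([100.0, 2550.0, 5000.0], [-40.0, -5.0, -20.0], tf, ta)
```

A "strong matching" cluster (e.g., piano cluster vs. piano prototype) yields a low cost; a "weak matching" one a high cost.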
Results
A Computational Framework for Sound Segregation in Music Signals 60
} Timbre Identification in polyphonic music signals [Martins et al., 2007]
} Instrument presence detection in mixtures of notes } 56% of instrument occurrences correctly detected, with a precision of
64% [Martins et al., 2007]
Weak Matching
Alto sax cluster vs. piano prototype
Strong Matching
Piano cluster vs. piano prototype
Software Implementation
A Computational Framework for Sound Segregation in Music Signals 62
} Modular, flexible and efficient software implementation } Based on Marsyas
} Free and Open Source framework for audio analysis and processing http://marsyas.sourceforge.net
> peakClustering myAudio.wav
Software Implementation
A Computational Framework for Sound Segregation in Music Signals 63
} Marsyas } peakClustering Overview
Series/mainNet
frameMaxNumPeaks
totalNumPeaks
PeakViewSink/peSink
PeakLabeler/labeler
PeakConvert/conv
Accumulator/textWinNet
... ... ...
1 FlowThru/clustNet
... ... ...
Shredder/synthNet
... ... ...
2 3
nTimes
A B
peakLabels
nTimes totalNumPeaks
frameMaxNumPeaks
innerOut
B
Software Implementation
A Computational Framework for Sound Segregation in Music Signals 64
} Marsyas } Sinusoidal analysis front-end
Accumulator/textWinNet
Series/analysisNet
Series/peakExtract
ShiftInput/si
Fanout/stereoFo
Series/stereoSpkNet
Parallel/LRnet
Series/spkL
Windowing/win
Spectrum/spk
Series/spkR
Windowing/win
Spectrum/spk
EnhADRessStereoSpectrum/stereoSpk
EnhADRess/ADRess
Series/spectrumNet
Stereo2Mono/s2m
Shifter/sh
Windowing/wi
Parallel/par
Spectrum/spk1
Spectrum/spk2
FlowThru/onsetdetector
... ... ...
1a
FanOutIn/mixer
+
Series/mixSeries
Delay/noiseDelay
SoundFileSource/src
Gain/noiseGain
Series/oriNet
SoundFileSource/src
Gain/oriGain
A
1
onsetDetected
flush
FlowThru/onsetdetector
Windowing/wi
Spectrum/spk
PowerSpectrum/pspk
Flux/flux
ShiftInput/sif
Filter/filt1
Filter/filt2
Reverse/rev1
Reverse/rev2
PeakerOnset/peaker
1a
onsetDetected
I
S
Software Implementation
A Computational Framework for Sound Segregation in Music Signals 65
} Marsyas } Onset detection
(Same analysis network diagram as the previous slide; detail of the onset detector sub-network:)
FlowThru/onsetdetector
Windowing/wi
Spectrum/spk
PowerSpectrum/pspk
Flux/flux
ShiftInput/sif
Filter/filt1
Filter/filt2
Reverse/rev1
Reverse/rev2
PeakerOnset/peaker
1a
onsetDetected
I
S
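The onset-detection chain in the diagram (magnitude spectra, half-wave-rectified spectral flux, smoothing, then PeakerOnset peak picking) can be sketched as follows. The FFT size, smoothing kernel and parameter names are assumptions; the filt/rev pairs in the diagram suggest zero-phase smoothing, approximated here with a symmetric kernel.

```python
import numpy as np

def onset_curve(frames, n_fft=1024):
    """Novelty curve for onset detection: per-frame magnitude spectrum ->
    half-wave-rectified spectral flux -> smoothing. The result would then
    be peak-picked (PeakerOnset) to flag onsets."""
    mags = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), n=n_fft, axis=1))
    diff = np.diff(mags, axis=0)
    flux = np.sum(np.maximum(diff, 0.0), axis=1)  # count rises in energy only
    kernel = np.hanning(5)
    kernel = kernel / kernel.sum()                # symmetric smoothing kernel
    return np.convolve(flux, kernel, mode="same")

# 10 silent frames followed by 10 frames of a steady tone: the flux
# peaks at the silence-to-tone transition (between frames 9 and 10).
t = np.arange(1024)
tone = np.sin(2 * np.pi * 440 * t / 44100)
frames = np.zeros((20, 1024))
frames[10:] = tone
curve = onset_curve(frames)
```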
Software Implementation
A Computational Framework for Sound Segregation in Music Signals 66
} Marsyas } Similarity matrix computation and Clustering
PeakConvert/conv
FlowThru/clustNet
frameMaxNumPeaks
totalNumPeaks
FanOutIn/simNet
x
Series/freqSim
SimilarityMatrix/FREQsimMat
Metric/FreqL2Norm
RBF/FREQrbf
Series/ampSim
SimilarityMatrix/AMPsimMat
Metric/AmpL2Norm
RBF/AMPrbf
Series/HWPSim
SimilarityMatrix/HWPSsimMat
HWPS/hwps
RBF/HWPSrbf
Series/panSim
SimilarityMatrix/PANsimMat
Metric/PanL2Norm
RBF/PANrbf
PeakFeatureSelect/FREQfeatSelect
2
B
D
D
Series/NCutNet
Fanout/stack
NormCut/NCut
Gain/ID
PeakClusterSelect/clusterSelect
E
innerOut
PeakLabeler/labeler
B
labels
D
D
D
PeakFeatureSelect/AMPfeatSelect
PeakFeatureSelect/PANfeatSelect
PeakFeatureSelect/HWPSfeatSelect
F
C1
C2
C3
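Each simNet branch follows the same pattern: per-peak feature selection (PeakFeatureSelect), a pairwise distance metric (Metric/…L2Norm), then an RBF mapping distances to similarities (RBF/…rbf). A minimal sketch; the Gaussian form of the RBF and the sigma handling are assumptions about the actual Marsyas modules.

```python
import numpy as np

def cue_similarity(features, sigma=1.0):
    """One simNet branch: pairwise Euclidean distances between per-peak
    feature vectors, mapped through a Gaussian RBF into (0, 1]."""
    f = np.asarray(features, dtype=float)
    if f.ndim == 1:
        f = f[:, None]
    # Pairwise Euclidean distance matrix between peak feature vectors.
    d = np.linalg.norm(f[:, None, :] - f[None, :, :], axis=2)
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))

# Peaks with close frequencies get similarity near 1; distant ones near 0.
W = cue_similarity([440.0, 441.0, 880.0], sigma=10.0)
```

The per-cue matrices produced this way are what the FanOutIn combiner multiplies (or combines more expressively) into the overall similarity matrix.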
Software Implementation
A Computational Framework for Sound Segregation in Music Signals 67
} Marsyas } More flexible Similarity expression
FanOutIn/simNet
Series/panSim
SimilarityMatrix/PANsimMat
Metric/PanL2Norm
RBF/PANrbf
PeakFeatureSelect/PANfeatSelect
.*
FanOutIn/ORnet
FanOutIn/ANDnet
.*
Series/freqSim
SimilarityMatrix/FREQsimMat
Metric/FreqL2Norm
RBF/FREQrbf
PeakFeatureSelect/FREQfeatSelect
Series/ampSim
SimilarityMatrix/AMPsimMat
Metric/AmpL2Norm
RBF/AMPrbf
PeakFeatureSelect/AMPfeatSelect
max
Series/HWPSim
SimilarityMatrix/HWPSsimMat
HWPS/hwps
RBF/HWPSrbf
PeakFeatureSelect/HWPSfeatSelect
Software Implementation
A Computational Framework for Sound Segregation in Music Signals 68
} Marsyas } Cluster Resynthesis
Shredder/synthNet
Series/postNet
Gain/outGain
PeakSynthOsc/pso
Windowing/wiSyn
OverlapAdd/ov
SoundFileSink/dest
3
B
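Cluster resynthesis (PeakSynthOsc followed by windowing and OverlapAdd) amounts to additive synthesis from the selected cluster's peaks. A minimal single-frame sketch; function and parameter names are illustrative, and the windowing/overlap-add of successive frames is omitted.

```python
import numpy as np

def resynth_frame(freqs, amps, phases, n, sr=44100):
    """Oscillator-bank resynthesis of one frame from a cluster's peaks.
    Successive frames would then be windowed and overlap-added."""
    t = np.arange(n) / sr
    out = np.zeros(n)
    for f, a, p in zip(freqs, amps, phases):
        out += a * np.cos(2.0 * np.pi * f * t + p)
    return out

# Resynthesize a frame from two peaks (440 Hz and 880 Hz).
frame = resynth_frame([440.0, 880.0], [0.5, 0.25], [0.0, 0.0], 512)
```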
Software Implementation
A Computational Framework for Sound Segregation in Music Signals 69
} Marsyas } Data structures
A: per-frame spectral data over the texture window frames: complex spectrum 1 (N points: Re(0), Re(1), Im(1), ..., Re(N/2-1), Im(N/2-1), Re(N/2)), complex spectrum 2 (N points), and stereo spectrum (N/2+1 points: Pan(0), Pan(1), ..., Pan(N/2))
B: peak matrices over the texture window frames (frameMaxNumPeaks rows each): peaks' FREQUENCY, AMPLITUDE, PHASE, GROUP and TRACK
C1: peaks' frequency (f1 ... f6) over the total number of peaks in the texture window
C2: peaks' amplitude (a1 ... a6) over the total number of peaks in the texture window
C3: peak spectral pattern (NumPeaks in frame × total number of peaks in the texture window)
D: SIMILARITY MATRIX (total number of peaks in the texture window × total number of peaks in the texture window)
E: SIMILARITY MATRIX with per-peak NCUT indicator (e.g., 3 2 2 1 1 3)
F: cluster selection indicator (e.g., 3 -1 -1 1 1 3)
I: audio frame (N+1 samples), with interleaved Ch1/Ch2 samples and an analysis window of N samples
S: ShiftInput buffers: audio samples and shifted audio samples
Software Implementation
A Computational Framework for Sound Segregation in Music Signals 70
} Marsyas } Data structures
(Same data structure diagrams as the previous slide.)
Conclusions
A Computational Framework for Sound Segregation in Music Signals 71
} Proposal of a framework for sound source segregation } Inspired by ideas of CASA } Focused on “real-world” music signals } Designed to be causal and efficient } Data-driven
} Does not require any training or prior knowledge about the audio signals under analysis
} Approaches partial tracking and source separation jointly } Flexible enough to include new perceptually motivated auditory cues
} Based on a Spectral Clustering technique } Shows good potential for applications
} Source segregation/separation } Monophonic or polyphonic instrument classification } Main melody estimation } Pre-processing for polyphonic transcription, ...
Conclusions
A Computational Framework for Sound Segregation in Music Signals 72
} Definition of a novel harmonicity cue } Termed Harmonically Wrapped Peak Similarity (HWPS)
} Experimentally shown to be: } A good grouping criterion for sound segregation in polyphonic music signals
} Compares favorably to other state-of-the-art harmonicity cues
} Software development of the sound segregation framework } Used for validation and evaluation } Made available as Free and Open Source Software (FOSS)
} Based on Marsyas
} Free for everyone to try, evaluate, modify and improve
Future Work
A Computational Framework for Sound Segregation in Music Signals 73
} Analysis front-end } Evaluate alternative analysis front-ends
} Perceptually-informed filterbanks } Sinusoid+transient representations } A different auditory front-end (as long as it is invertible) ...
} Evaluate alternative frequency estimation methods for spectral peaks } Parabolic interpolation } Subspace methods } …
} Use of a beat-synchronous approach } Based on the use of onset detectors and beat estimators for dynamic
adjustment of texture windows } Perceptually motivated
Future Work
A Computational Framework for Sound Segregation in Music Signals 74
} Grouping Cues } Improve HWPS
} Better f0 candidate estimation } Reduce negative impact of sound events in different audio frames
} Inclusion of new perceptually motivated auditory cues } Time and frequency masking } Stereo placement of spectral components (for stereo signals) } Timbre models as a priori information } Peak tracking as a pre- and post-processing step } Common fate (onsets, offsets, modulation)
Future Work
A Computational Framework for Sound Segregation in Music Signals 75
} Implement Sequential integration } between texture windows
} Cluster segregated clusters?
} Timbre similarity [Martins et al. 2007]
Cluster 1
Cluster 2
Future Work
A Computational Framework for Sound Segregation in Music Signals 76
} Clustering } Definition of the neighborhood width (σ) in similarity
functions } JNDs?
} Define and evaluate more expressive combinations of similarity functions
} Automatic estimation of the number of clusters in each texture window
} Extraction of new descriptors directly from segregated cluster parameters (e.g., CPR): } Pitch, spectral features, frequency tracks, timing information
Future Work
A Computational Framework for Sound Segregation in Music Signals 77
} Creation of a sound/music evaluation dataset } Simple and synthetic sound examples
} For preliminary testing, fine tuning, validation
} “real-world” polyphonic recordings } More complex signals, for final stress-test evaluations
} To be made publicly available
} Software Framework } Analysis and processing framework based on Marsyas
} FOSS, C++, multi-platform, real-time
} Feature-rich software visualization and sonification tools
Related Publications
A Computational Framework for Sound Segregation in Music Signals 78
} PhD Thesis: } Martins, L. G. (2009). A Computational
Framework for Sound Segregation in Music Signals. PhD thesis, FEUP.
} Book: } Martins, L. G. (2009). A Computational
Framework for Sound Segregation in Music Signals – An Auditory Scene Analysis Approach for Modeling Perceptual Grouping in Music Listening. Lambert Academic Publishing.
} Book Chapter: } Martins, L. G., Lagrange, M., and Tzanetakis, G.
(2010). Modeling grouping cues for auditory scene analysis using a spectral clustering formulation. Machine Audition: Principles, Algorithms and Systems. IGI Global.
Related Publications
A Computational Framework for Sound Segregation in Music Signals 79
} Lagrange, M., Martins, L. G., Murdoch, J., and Tzanetakis, G. (2008). Normalized cuts for predominant melodic source separation. IEEE Transactions on Audio, Speech, and Language Processing, 16(2). Special Issue on MIR.
} Martins, L. G., Burred, J. J., Tzanetakis, G., and Lagrange, M. (2007). Polyphonic instrument recognition using spectral clustering. In Proc. International Conference on Music Information Retrieval (ISMIR), Vienna, Austria.
} Lagrange, M., Martins, L. G., and Tzanetakis, G. (2008). A computationally efficient scheme for dominant harmonic source separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, Nevada, USA.
} Tzanetakis, G., Martins, L. G., Teixeira, L. F., Castillo, C., Jones, R., and Lagrange, M. (2008). Interoperability and the Marsyas 0.2 runtime. In Proc. International Computer Music Conference (ICMC), Belfast, Northern Ireland.
} Lagrange, M., Martins, L. G., and Tzanetakis, G. (2007). Semi-automatic mono to stereo up-mixing using sound source formation. In Proc. 122nd Convention of the Audio Engineering Society, Vienna, Austria.
Thank you
A Computational Framework for Sound Segregation in Music Signals 80
Questions?
[email protected] http://www.artes.ucp.pt/citar/