music signal separation by supervised nonnegative matrix...
TRANSCRIPT
Music Signal Separation by Supervised Nonnegative
Daichi Kitamura, Hiroshi Saruwatari, Kiyohiro Shikano
(Nara Institute of Science and Technology, Nara, Japan)
Overview of research
• Automatic music transcription
• Sound augmented reality (AR)
• 3D audio system, etc.
Applications
To cope with the problem of supervised NMF, we propose advanced
supervised NMF algorithm that employs a deformable capability for
the trained spectral bases and penalty terms for making the bases fit
into the real instrumental sound.
n Previous research
• Nonnegative matrix factorization (NMF) [1] Sparse representation and decomposition algorithm
NMF attempts to separate instrumental sources using spectral characteristics
Many techniques have been proposed using NMF as an unsupervised method [2] ~ [6],
It is difficult to cluster the decomposed spectral patterns into a specific target sound
• Supervised NMF (SNMF) [7], [8] Use some sample sound of the target instrumental signal in a priori training
A mismatch between the spectra trained in advance and the target sound reduces the
accuracy of source separation
Purpose of our research
1. Introduction
n We address a music signal separation problem, and propose a new
supervised algorithm for real instrumental signal separation.
n Nonnegative matrix factorization (NMF) is a sparse representation
algorithm.
n Supervised NMF can extract the target instrumental sources.
• The instrumental sounds differ according to, e.g., individual styles
of playing and the timbre individuality for each instrument.
• How do we get the supervision sound of the target instrument
recorded in CD?
Problem of conventional supervised method
n We propose a new advanced supervised method that adapts the
supervision sound (that we can get easily) to the target instrumental
sound as “Supervised NMF with basis deformation.”
n From experimental results, our proposed method separated the real
instrumental sources using supervision sound generated by MIDI
synthesizer.
Extract!
To solve this inherent problem…
• Recently, music signal separation technologies have received much
attention.
Mixture of instruments
Target signal Supervision sound
Slightly different
Proposed SNMF adapts supervision
sound to the target signal
Sample sound of piano
Proposed SNMF
Extracted piano signal
2. NMF NMF for source separation
: Observed spectrogram
: Basis matrix that includes frequently-appearing
spectral patterns in as column vectors
: Activation matrix that involves activation
information of each basis of
: Number of frequency bins
: Number of frames
: Number of bases
3. Conventional method Penalized Supervised NMF: PSNMF [8]
Update Supervised bases
Observed spectrogram that consists
multiple instrumental sources
Spectrogram of the solo-played
target signal for training
Training process
Separation process
Fix trained bases
and update
: Supervised basis matrix that involves spectral
patterns of the target instrumental sound
: Basis matrix that involves residual spectral
patterns that cannot be expressed by
: Activation matrix that corresponds to
: Activation matrix that corresponds to
• Ideally, represents the target instrumental components, and
represents other different components from the target sounds after the
decomposition.
In addition, to prevent the simultaneous formulation of
similar spectral patterns in the matrices and , a
specific penalty is imposed between and . Force to become
mostly different
Cost function
subject to for all ,
where is the weighting parameter for penalty term .
60
40
20
0
-20
Am
plit
ude [
dB
]
3.02.52.01.51.00.50.0
Frequency [kHz]
Real sound Artificial sound by MIDI
These spectral
differences reduce
separation performance.
4. Proposed method
A mismatch between the bases trained in advance and the target
actual sound reduces the accuracy of separation.
It is impossible to provide perfect supervision or to predict more
realistic supervision in practice.
Problem of supervised method
• In this research, we assume that we can obtain specific solo-played
instrumental sounds, which is the target of the separation task.
• We use the target supervision sound generated by MIDI synthesizer
because we can easily generate the training sound via MIDI.
Target signal
(piano)
Supervision
sound
MIDI data of the same type of target instrument
Slightly different
Generate
• In this method, we need not prepare a lot of real sound samples of the
target instrument to learn the variance of the instrumental sounds.
All entries of these matrix are nonnegative value.
Constrains
Matrix Factorization with Basis Deformation
Kazunobu Kondo, Yu Takahashi
(Yamaha Corporate Research & Development Center, Shizuoka, Japan)
In proposed method, supervised bases composed in training process
can be deformed to adapt to the target spectra.
Strategy
Supervision sound Target sound
Adapt to the target
Supervised bases
Separation process
: Basis matrix for the deformation and
shares the activation matrix with
Proposed SNMF with basis deformation
n Decomposition model
• We employed a deformable term to adapt the supervision bases into the
target sound that cannot be represented by .
Update Deformable term
• To deform the supervised bases , deformation term should have both
positive and negative value, unlike normal NMF decomposition.
• However, the target spectral bases after deformation must be
nonnegative.
• The deformation matrix should be constructed under the following
constrains:
: entry of matrix ,which probably has positive and negative values
: Hyper parameter that controls the allowable range of the negative deformation of
Cost function
subject to for all ,
where and are the weighting parameters for penalty terms.
Force to become mostly
different with each other
Supervised bases
Other bases
Deformation bases
represents undesired
instrumental components.
represents spectral
difference between supervision
sound and the target sound.
Deform
For the case of It is possible to decrease
until 30% of the total.
If , the deformed bases
has a risk to become .
What is the hyper parameter ?
5. Evaluation experiment
Update rules
• The update rules that minimize the cost function are defined as follows:
Target instruments Flute, Clarinet, Piano, Trombone
Observed signal Mixing two sources selected from four sources with the input SNR of 0dB
Supervision sound
(MIDI)
Artificial MIDI sounds of the target instruments that consists two octave
notes, which cover all notes of the target signal
Number of bases Supervision bases: 100, Other bases: 50
Number of iterations
of NMF Training process: 500, Separation process: 400
Parameters Experimentally determined
Evaluation scores [9]
Signal to distortion ratio (SDR: quality of extracted signal),
Source to interference ratio (SIR: degree of separation),
Sources to artifact ratio (SAR: absence of distortion)
Clari t, Pia Tr bo
Experimental condition
• To confirm the effectiveness of the proposed algorithm, we compared the
conventional method (SNMF) and our SNMF with basis deformation.
where,
Target signals
Average scores
Experimental results
4
3
2
1
0
Fre
qu
en
cy [
kH
z]
43210
Time [s]
4
3
2
1
0
Fre
qu
en
cy [
kH
z]
43210
Time [s]
4
3
2
1
0
Fre
qu
en
cy [
kH
z]
43210
Time [s]
4
3
2
1
0
Fre
qu
en
cy [
kH
z]
43210
Time [s]
Observed signal consisting of piano and trombone Oracle signal of target piano signal
Piano signal extracted by conventional method Piano signal extracted by proposed method
The target signals were
recorded with actual
musical instruments.
The supervision sounds
were generated by MIDI
synthesizer.
4
Example of spectrograms
References D. D. Lee, H. S. Seung, “Algorithms for non-negative matrix factorization,” Neural Info. Process. Syst., vol.13, pp.556–562, 2001.
T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,”
IEEE Trans. Audio, Speech and Language Processing, vol.15, pp.1066–1074, 2007.
S. A. Raczynski, N. Ono, S. Sagayama, “Multipitch analysis with harmonic nonnegative matrix approximation,” Proc. ISMIR, pp.381–386,
2007.
A. Ozerov, C. Fevotte, M. Charbit, “Factorial scaled hidden Markov model for polyphonic audio representation and source separation,”
Proc. WASPAA, 2009.
W. Wang, A. Cichocki, J. A. Chambers, “A multiplicative algorithm for convolutive non-negative matrix factorization based on squared
Euclidean distance,” IEEE Trans. on Signal Processing, vol.57, no.7, pp.2858–2864, 2009.
H. Kameoka, M. Nakano, K. Ochiai, Y. Imoto, K. Kashino, S. Sagayama, “Constrained and regularized variants of non-negative matrix
factorization incorporating music-specific constraints,” Proc. ICASSP, pp.5365–5368, 2012.
P. Smaragdis, B. Raj, M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” Proc.
LVA/ICA 2010, LNCS 6365, pp.140–148, 2010.
K. Yagi, Y. Takahashi, H. Saruwatari, K. Shikano, K. Kondo, “Music signal separation by orthogonality and maximum-distance constrained
nonnegative matrix factorization with target signal information,” Proc. Audio Engineering Society 45th International Conference, 2012.
E. Vincent, R. Gribonval, C. Fevotte. “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech and
Language Processing, vol.14, no.4, pp.1462–1469, 2006.
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
0
2
4
6
8
10
SNMF Proposedmethod
SD
R [
dB
]
0
5
10
15
20
SNMF Proposedmethod
SIR
[d
B]
0
2
4
6
8
10
SNMF Proposedmethod
SA
R [d
B]
Target
sound
Other
sound
Conventional method Proposed method
SDR SIR SAR SDR SIR SAR
Piano Clarinet 1.0 6.1 3.5 7.4 13.0 9.1
Piano Trombone 1.6 13.7 2.1 12.3 23.5 12.7
Clarinet Flute 0.1 1.8 7.2 0.6 2.4 7.3
Clarinet Trombone -0.5 12.8 0.0 9.4 21.5 9.7
Flute Piano 4.2 14.2 4.8 6.4 16.6 6.9
Trombone Clarinet 0.7 12.5 1.2 5.4 17.0 5.8