
Music genre classification with multilinear and sparse techniques

Constantine Kotropoulos∗†, Yannis Panagakis∗, and Gonzalo R. Arce†

∗ Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, GREECE

† Department of Electrical & Computer Engineering, University of Delaware, Newark, DE 19716, USA

Greek Signal Processing Jam, Athens, October 17th, 2009


Outline

1 Introduction

2 Auditory Spectro-temporal Modulations

3 Ensemble Discriminant Sparse Projections

4 Sparse Representation-based Classification (SRC)

5 Locality Preserving Non-negative Tensor Factorization within SRC

6 Outlook

1 Introduction

3 Ensemble Discriminant Sparse Projections
Overcomplete Dictionaries for Sparse Representations
Dual Linear Discriminant Analysis of Sparse Representations
Experimental Assessment

4 Sparse Representation-based Classification (SRC)
Experimental Assessment

5 Locality Preserving Non-negative Tensor Factorization within SRC
Locality Preserving Non-negative Tensor Factorization
Experimental Assessment

6 Outlook
Comparison with the State of the Art
Conclusions-Future Work

Introduction

Music Genre
The most popular description of music content, despite the lack of a commonly agreed definition.
Depends on cultural, artistic, or market factors, etc.

Problem Definition
To classify music recordings into distinguishable genres using information extracted from the audio signal.

Music Genre Classification Algorithms
Model the music signals by the long-term statistics of short-time features, such as timbral texture, rhythmic, and pitch content-related features, or their combinations.

Introduction

Motivation
The appealing properties of slow temporal and spectro-temporal modulations from the human perceptual point of view [a];
The strong theoretical foundations of sparse representations [b], [c].

[a] K. Wang and S. A. Shamma, “Spectral shape analysis in the central auditory system,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 5, pp. 382-396, 1995.

[b] E. J. Candes, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Information Theory, vol. 52, no. 2, pp. 489-509, February 2006.

[c] D. L. Donoho, “Compressed sensing,” IEEE Trans. Information Theory, vol. 52, no. 4, pp. 1289-1306, April 2006.


Introduction

First approach: Ensemble Discriminant Sparse Projections (1)
Each music recording is represented by its slow temporal modulations, the so-called auditory temporal modulation representation.
Given a training set of auditory temporal modulations, the dictionary that best represents each member of the training set under sparsity constraints is extracted by means of the K-SVD algorithm [a].

[a] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Processing, vol. 54, no. 11, pp. 4311-4322, Nov. 2006.


Introduction

First approach: Ensemble Discriminant Sparse Projections (2)
Discriminant Sparse Projections: The most discriminating features (MDFs) [a] are extracted by applying dual linear discriminant analysis (LDA) [b] to the two principal subspaces of the within-class and between-class covariance matrices of the sparse coefficient vectors.
Classifier Ensemble: Majority voting is applied to the decisions taken by multiple individual dual LDA classifiers.

[a] D. L. Swets and J. Weng, “Using discriminant eigenfeatures for image retrieval,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 831-836, August 1996.

[b] X. Wang and X. Tang, “Dual space linear discriminant analysis for face recognition,” in Proc. IEEE Computer Society Conf. CVPR, 2004, vol. 2, pp. 564-569.


Introduction

Second approach: Sparse Representation-based Classifier
Each music recording is again represented by its slow auditory temporal modulations.
The vectorized training auditory temporal modulations form a dictionary of basis signals for music genres.
Any test representation is expressed as a compact linear combination of the dictionary atoms of the genre to which it belongs.
Classification is performed by the sparse representation-based classifier (SRC) [a].

[a] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, Feb. 2009.

Introduction

Third approach: Locality preserving non-negative tensor factorization within a sparse representation-based classifier (1)

Cortical representation: A given music recording is mapped to a three-dimensional (3D) representation of its slow spectral and temporal modulations [a].
Each cortical representation is modeled as a sparse weighted sum of the basis elements (atoms) of an overcomplete dictionary, which stems from the cortical representations associated with training music recordings whose genre is known.

[a] I. Panagakis, E. Benetos, and C. Kotropoulos, “Music genre classification: A multilinear approach,” in Proc. 7th Int. Symp. Music Information Retrieval, Philadelphia, USA, 2008.


Introduction

By vectorizing a typical 3D cortical representation of 6 scales, 10 rates, and 128 frequency bands, one obtains a vector of 6 · 10 · 128 = 7680 dimensions.
Multilinear dimensionality reduction techniques do not guarantee that two data points, which are close in the intrinsic geometry of the original space, are also close in the data space after multilinear dimensionality reduction.
A novel algorithm is proposed, where the geometrical information of the original data space is incorporated into the objective function optimized by non-negative tensor factorization (NTF).


Auditory Spectro-temporal Modulations

Computational Auditory Model
The computational auditory model is inspired by psychoacoustical and neurophysiological investigations in the early and central stages of the human auditory system.

[Figure: stages of the auditory model. The early auditory model yields the Auditory Spectrogram; the central auditory model yields the Auditory Temporal Modulations and the Auditory Spectro-Temporal Modulations (Cortical Representation).]

Early Auditory System
Auditory Spectrogram: time-frequency distribution of energy along a tonotopic (logarithmic frequency) axis.

[Figure: the early auditory model maps the audio signal to the Auditory Spectrogram.]

Central Auditory System - Temporal Modulations

[Figure: Auditory Spectrogram and the derived auditory temporal modulations.]

Temporal Modulation Parameters
Temporal Modulations: ω ∈ {2, 4, 8, 16, 32, 64, 128, 256} (Hz).
96 frequency channels covering 4 octaves.

[Figure: Auditory Temporal Modulations across 10 Music Genres (Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop, Reggae, Rock).]

Central Auditory System - Spectro-temporal Modulations

[Figure: Auditory Spectrogram and Auditory Spectro-Temporal Modulations.]

Cortical Representation
A bank of 2D spectrotemporal filters is applied to the auditory spectrogram. The filters are selective to different spectrotemporal modulation parameters, ranging from slow to fast rates temporally (in Hz) and from narrow to broad scales spectrally (in Cycles/Octave).
Each point in the auditory spectrogram has a 2D (hidden) rate-scale representation, which indicates the modulation strength for all rates and scales for that channel and time instant.

[Figure: cortical representation. Axes: temporal modulations ω (Hz), frequency channels f, spectral modulations Ω (c/o).]

Cortical Representation Parameters
Spectral Modulations: Ω ∈ {0.25, 0.5, 1, 2, 4, 8} (Cycles/Octave).
Temporal Modulations: positive and negative ω ∈ {2, 4, 8, 16, 32} (Hz).
128 frequency channels covering 5 1/3 octaves.


Overcomplete Dictionaries for Sparse Representations

Mathematical Modeling of Auditory Temporal Modulations
The auditory temporal modulations of a set of music recordings (i.e. a dataset) are represented by a 3rd-order nonnegative real-valued tensor $\mathcal{Y} \in \mathbb{R}_+^{N_\omega \times N_f \times N_s}$, where $N_\omega = 8$, $N_f = 96$, and $N_s$ denotes the number of music recordings.

Let $Y_{(3)} \in \mathbb{R}_+^{N_s \times (N_f \cdot N_\omega)}$ be the 3rd mode matrix unfolding of $\mathcal{Y}$.

Data matrix: $Y = Y_{(3)}^T = [y_1 | y_2 | \cdots | y_{N_s}]$, where $T$ denotes matrix transposition.

$y_j \in \mathbb{R}_+^{768}$, $j = 1, 2, \ldots, N_s$, is downsampled to yield a vector of size $M \in \{12, 48, 85, 192\}$.
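The unfolding step can be sketched in NumPy as follows. This is only an illustration: the function name `mode3_data_matrix` is hypothetical, and the column ordering assumes the common convention in which the first mode varies fastest within each vectorized slice; the slides do not state which ordering was used.

```python
import numpy as np

def mode3_data_matrix(Y):
    """Return the data matrix whose column j is the vectorized slice Y[:, :, j].

    Y has shape (N_omega, N_f, N_s). The mode-3 unfolding Y3 has shape
    (N_s, N_omega * N_f); the data matrix is its transpose.
    Assumption: first-mode-fastest (column-major) ordering inside each slice.
    """
    n_samples = Y.shape[2]
    # Bring the sample mode to the front, then flatten the other two modes.
    Y3 = np.moveaxis(Y, 2, 0).reshape(n_samples, -1, order="F")
    return Y3.T

# Toy tensor with the sizes quoted on the slide: 8 rates, 96 channels, 5 recordings.
toy = np.random.default_rng(0).random((8, 96, 5))
data = mode3_data_matrix(toy)   # shape (768, 5): one 768-dimensional column per recording
```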

Sparse Approximation
A downsampled representation of auditory temporal modulations $y_j \in \mathbb{R}_+^{M}$, $j = 1, 2, \ldots, N_s$, admits a sparse approximation over a dictionary $D \in \mathbb{R}^{M \times K}$, when $y_j = D x_j$ or $\|y_j - D x_j\|_p \leq \gamma$, where $\|\cdot\|_p$ denotes the $\ell_p$ vector norm for $p = 1, 2$, and $\infty$.
To learn $D$ with a fixed number of atoms $K$, the K-SVD is used.
K-SVD iteratively alternates between sparse coding of the training samples based on the current dictionary and dictionary updating to better fit the training set.

K-SVD (1)

The following problem is solved: $\min_{x_j, D} \sum_{j=1}^{N_{st}} \|y_j - D x_j\|_2^2$ subject to $\|x_j\|_0 \leq L$, where $N_{st} < N_s$ is the number of training samples and $\|\cdot\|_0$ counts the number of nonzero vector elements.
For a given $D$ with $K$ atoms, where $M \leq K \leq N_{st}$, the optimal sparse coefficient vector of the $j$th training sample is found by solving $x_j^* \triangleq \arg\min_x \|y_j - D x\|_2^2$ subject to $\|x\|_0 \leq L$ with any pursuit algorithm, e.g. OMP [a] or FOCUSS [b].
Let $Y = [y_1 | y_2 | \cdots | y_{N_{st}}]$, where $y_j$ is a compact notation for $y_{:j}$.

[a] G. Davis, S. Mallat, and Z. Zhang, “Adaptive time-frequency decompositions,” Optical Engineering, vol. 33, no. 7, pp. 2183-2191, July 1994.

[b] I. F. Gorodnitsky and B. D. Rao, “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted norm minimization algorithm,” IEEE Trans. Signal Processing, vol. 45, no. 3, pp. 600-616, March 1997.
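The sparse coding step for a fixed dictionary can be illustrated with a minimal orthogonal matching pursuit (OMP) sketch. It is a textbook version written for this note, not the implementation used in the experiments; the function name `omp` and the stopping tolerance are assumptions.

```python
import numpy as np

def omp(D, y, L, tol=1e-12):
    """Greedy OMP: approximate y with at most L columns (atoms) of D.

    Assumes the atoms of D have comparable (ideally unit) norm.
    Returns a coefficient vector x with at most L nonzero entries.
    """
    K = D.shape[1]
    x = np.zeros(K)
    support = []
    coef = np.zeros(0)
    residual = np.asarray(y, dtype=float).copy()
    for _ in range(L):
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0      # never select an atom twice
        k = int(np.argmax(correlations))
        if correlations[k] < tol:            # residual already explained
            break
        support.append(k)
        # Re-fit all selected coefficients by least squares (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x
```

With an orthonormal dictionary the routine simply keeps the L largest coefficients, which makes it easy to check by hand.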


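K-SVD (2)

Let also $x_{k:}^T = [x_{k1}, x_{k2}, \ldots, x_{kN_{st}}]$ be the $k$th row of $X \in \mathbb{R}^{K \times N_{st}}$.

If $\|\cdot\|_F^2$ denotes the squared Frobenius norm of a matrix,

$$
\|Y - D X\|_F^2 = \Big\| \underbrace{\Big( Y - \sum_{\kappa = 1,\, \kappa \neq k}^{K} d_\kappa x_{\kappa:}^T \Big)}_{E_k} - d_k x_{k:}^T \Big\|_F^2 = \|E_k - d_k x_{k:}^T\|_F^2 .
$$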


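K-SVD (3)

Restricting the error to the training samples that actually use the atom $d_k$ (the columns selected by the matrix $\Omega_k$):

$$
\| \underbrace{E_k \Omega_k}_{E_k^R} - d_k \underbrace{x_{k:}^T \Omega_k}_{[x_{k:}^R]^T} \|_F^2 = \|E_k^R - d_k [x_{k:}^R]^T\|_F^2 .
$$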

K-SVD (4)

Let $E_k^R = \Upsilon \Delta V^T$ be the Singular Value Decomposition (SVD) of $E_k^R$.

The updated $k$th dictionary atom is the first column of matrix $\Upsilon$;
The updated coefficient vector $x_{k:}^R$ corresponds to the first column of matrix $V$ multiplied by $\Delta_{11}$.
The projections to the principal component analysis subspace that precede LDA (e.g. in face recognition [a]) are replaced by the sparse approximations over the overcomplete dictionary $D^*$.

[a] P. N. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, July 1997.
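The atom update described above can be sketched in NumPy as below. This is a hedged illustration: the function name `ksvd_update_atom` is hypothetical, unused atoms are simply left unchanged (the slides do not say how they are handled), and the matrices are modified in place.

```python
import numpy as np

def ksvd_update_atom(Y, D, X, k):
    """One K-SVD dictionary update for atom k (modifies D and X in place).

    Y: (M, N) training samples, D: (M, K) dictionary, X: (K, N) sparse codes.
    """
    users = np.flatnonzero(X[k, :])        # samples whose code uses atom k
    if users.size == 0:
        return                             # assumption: leave unused atoms as they are
    X[k, users] = 0.0                      # remove atom k's contribution
    E_R = Y[:, users] - D @ X[:, users]    # restricted error matrix E_k^R
    U, s, Vt = np.linalg.svd(E_R, full_matrices=False)
    D[:, k] = U[:, 0]                      # new atom: first left singular vector
    X[k, users] = s[0] * Vt[0, :]          # new coefficients: Delta_11 times first column of V
```

Because the rank-1 term is the best Frobenius-norm approximation of the restricted error, the overall approximation error cannot increase, and the sparsity pattern of the codes is preserved.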


Dual Linear Discriminant Analysis of Sparse Representations

Definitions
Let the training set contain $N_g$ genres and each genre class $\mathcal{Y}_i$ have $n_i$ samples whose sample mean vector is denoted by $m_i$, $i = 1, 2, \ldots, N_g$.
The within-class sample covariance matrix is defined as $S_w = \frac{1}{N_{st}} \sum_{i=1}^{N_g} \sum_{y_j \in \mathcal{Y}_i} (y_j - m_i)(y_j - m_i)^T$.
The between-class sample covariance matrix is given by $S_b = \frac{1}{N_{st}} \sum_{i=1}^{N_g} n_i (m_i - m)(m_i - m)^T$, where $m$ is the gross sample mean vector of the whole training set.

Discriminant Sparse Projections (1)
The most discriminating features (MDFs) are obtained by projecting the training samples on the columns of the matrix

$$
W^* = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|} .
$$

We propose to apply LDA in the space of the sparse representations defined by the matrix $D^*$.
Let $\tilde{m}_i = \frac{1}{n_i} \sum_{j:\, y_j \in \mathcal{Y}_i} x_j$ be the sample mean vector of the sparse coefficients associated with the training samples that belong to the $i$th class.

Discriminant Sparse Projections (2)

$$
S_w \approx D^* \Big[ \frac{1}{N_{st}} \sum_{i=1}^{N_g} \sum_{y_j \in \mathcal{Y}_i} (x_j - \tilde{m}_i)(x_j - \tilde{m}_i)^T \Big] [D^*]^T = D^* \tilde{S}_w [D^*]^T ,
$$

where $\tilde{S}_w$ is the within-class sample covariance matrix of the sparse coefficients.
$S_b \approx D^* \tilde{S}_b [D^*]^T$, where $\tilde{S}_b$ is the between-class sample covariance matrix of the sparse coefficients.
Let $\tilde{W} \triangleq [D^*]^T W$. The optimization problem can be recast as

$$
\max_{\tilde{W}} \frac{|\tilde{W}^T \tilde{S}_b \tilde{W}|}{|\tilde{W}^T \tilde{S}_w \tilde{W}|} .
$$

If $\tilde{W}^*$ is the solution of the recast optimization problem, the solution of the original LDA problem is then $W^* = \big[ [D^*]^\dagger \big]^T \tilde{W}^*$,
which suggests that $z_j = [W^*]^T y_j = [\tilde{W}^*]^T [D^*]^\dagger y_j = [\tilde{W}^*]^T x_j$, i.e. LDA is applied to the coefficients of the sparse representation.
Most expressive features (MEFs): $x_j$; MDFs: $z_j$.
Discriminant Sparse Projection: Cascade of sparse representation and LDA.

Dual LDA: Principal subspace of $\tilde{S}_w$

$\tilde{S}_w$ is a real symmetric matrix decomposable as $\Phi_w G_w \Phi_w^T$. It has rank $\rho_w \leq \min(K, N_{st} - N_g) \leq K$. It is assumed that $G_w$ is the diagonal matrix confined to the $\rho_w$ largest eigenvalues of $\tilde{S}_w$. The principal subspace of $\tilde{S}_w$ is defined by the eigenvectors, which are associated with its $\rho_w$ largest eigenvalues, $\Phi_w = [\phi_1 | \phi_2 | \cdots | \phi_{\rho_w}]$.

$\tilde{S}_b$ is transformed to $Q_b = G_w^{-\frac{1}{2}} \Phi_w^T \tilde{S}_b \Phi_w G_w^{-\frac{1}{2}}$.

$Q_b$ is a real symmetric matrix of size $\rho_w \times \rho_w$ having rank $N_g - 1$, decomposable as $Q_b = \Psi_b H_b \Psi_b^T$. Let $\Psi_b$ have as columns the $N_g - 1$ eigenvectors associated with the non-zero eigenvalues of $Q_b$.

The $N_g - 1$ discriminative vectors in the principal subspace of $\tilde{S}_w$ are the columns of $U_F^* = \Phi_w G_w^{-\frac{1}{2}} \Psi_b$.

Dual LDA: Principal subspace of $\tilde{S}_b$

$\tilde{S}_b$ is a real symmetric matrix decomposable as $\tilde{S}_b = \Phi_b G_b \Phi_b^T$. It has rank $\rho_b \leq \min(K, N_g - 1) = N_g - 1$. Let $G_b$ be the diagonal matrix of the $\rho_b$ non-zero eigenvalues of $\tilde{S}_b$. The principal subspace of $\tilde{S}_b$ is defined by the eigenvectors, which are associated with the $\rho_b$ non-zero eigenvalues of $\tilde{S}_b$, i.e. $\Phi_b = [\phi_1 | \phi_2 | \cdots | \phi_{\rho_b}]$.

$\tilde{S}_w$ is transformed to $Q_w = G_b^{-\frac{1}{2}} \Phi_b^T \tilde{S}_w \Phi_b G_b^{-\frac{1}{2}}$.

$Q_w$ is a real symmetric matrix of size $\rho_b \times \rho_b$ decomposable as $Q_w = \Psi_w H_w \Psi_w^T$.
The $N_g - 1$ discriminative vectors in the principal subspace of $\tilde{S}_b$ are the columns of $U_{\bar{F}}^* = \Phi_b G_b^{-\frac{1}{2}} \Psi_w H_w^{-\frac{1}{2}}$.

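Dual LDA Classifier
At the test stage, the sparse coefficient vector $x$ of any test sample $y$ and the class centers $\tilde{m}_i$ are projected to the discriminant vectors in the two principal subspaces:

$$
D(y, m_i) = \big\| [U_F^*]^T (x - \tilde{m}_i) \big\|^2 + \varrho \, \big\| [U_{\bar{F}}^*]^T (x - \tilde{m}_i) \big\|^2 ,
$$

where $\varrho = \dfrac{\mathrm{tr}\big[ U_F^* [U_F^*]^T \big]}{\mathrm{tr}\big[ U_{\bar{F}}^* [U_{\bar{F}}^*]^T \big]}$ and $\mathrm{tr}[\,\cdot\,]$ stands for the trace of the matrix enclosed in brackets.
The test sample $y$ is classified to genre $i^* = \arg\min_i D(y, m_i)$.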


Classifier Ensemble
Classifier combination has been an active research topic in Machine Learning and Pattern Recognition [a].

Stratified 10-fold cross-validation splits the dataset into 10 folds. For each training fold $\tau = 1, 2, \ldots, 10$, the overcomplete dictionary $[D^*]_\tau$ and the projection matrices $[U_F^*]_\tau$ and $[U_{\bar{F}}^*]_\tau$ are learned.
For each test sample, a vote is taken among the classification labels assigned to it by the aforementioned 10 discriminant sparse projections.
The test sample is classified to the class that received the most votes.

[a] J. J. Rodríguez, L. I. Kuncheva, and C. J. Alonso, “Rotation forest: A new classifier ensemble method,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1619-1630, October 2006.
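The voting step can be sketched in a few lines of Python. The function name `majority_vote` and the tie-breaking rule (the label seen first among the tied ones wins) are assumptions made for illustration; the slides do not specify how ties are resolved.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label chosen by most classifiers.

    Assumption: ties go to the tied label that appears first in `labels`.
    """
    return Counter(labels).most_common(1)[0][0]

# Ten fold-specific classifiers voting on one test recording (made-up labels).
votes = ["rock", "metal", "rock", "rock", "pop", "rock", "metal", "rock", "rock", "rock"]
decision = majority_vote(votes)
```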


Experimental Assessment

GTZAN dataset
1000 audio recordings, each 30 seconds long [a];
10 genre classes: Blues, Classical, Country, Disco, HipHop, Jazz, Metal, Pop, Reggae, and Rock;
Each genre class contains 100 audio recordings.
The recordings are converted to monaural wave format at 16 kHz sampling rate with 16 bits and normalized, so that they have zero mean amplitude with unit variance.

[a] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Trans. Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, July 2002.

Feature Extraction
The auditory temporal modulations representation is computed over a segment of 30 sec duration, vectorized, and normalized to unit length.
Each fold delivers a raw training pattern matrix of size 768 × 900 and a raw test pattern matrix of size 768 × 100, which undergo downsampling with ratios 1/8, 1/4, 1/3, 1/2 in the rate-frequency domain. Downsampled training pattern matrix $Y \in \mathbb{R}^{M \times N_{st}}$, $M \in \{12, 48, 85, 192\}$ and $N_{st} = 900$.

Multidimensional scaling (MDS)
MDS with locality preserving indexing [a]

[a] D. Cai, X. He, and J. Han, “Document clustering using locality preserving indexing,” IEEE Trans. Knowledge and Data Engineering, vol. 17, no. 12, pp. 1624-1637, December 2005.

[Figure: Auditory temporal modulations for the 1st test fold for M = 192. MDS scatter plot (horizontal axis: 1st MDS coordinate) of the test samples and of the ten genres (Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop, Reggae, Rock).]

[Figure: Sparse coefficients for K = 400 and L = 20. MDS scatter plot (horizontal axis: 1st MDS coordinate).]

[Figure: Statistics of the number of non-zero coefficients, when $[D^*]_\tau$, τ = 1, 2, ..., 10 are employed in OMP. Horizontal axis: test sample index j.]

[Figure: Projections of the sparse coefficient vectors to the principal subspaces of $[\tilde{S}_w]_1$ and $[\tilde{S}_b]_1$. Two MDS scatter plots (horizontal axis: 1st MDS coordinate).]

[Figure: Classifier ensemble decisions. Horizontal axis: test sample index j; vertical axis: genre (Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop, Reggae, Rock).]

Average classification accuracy (in %) per downsampling ratio

Classifier                                   1/8     1/4     1/3     1/2
dual LDA                                     34.2    42.3    49.3    54.3
ensemble dual LDA                            34.8    44.1    52.4    57.5
DKL                                          34.4    42.3    49.3    54.4
ensemble dual DKL                            34.9    44.1    52.4    57.5
discriminant sparse projections              42.2    43.03   55.03   57.64
ensemble discriminant sparse projections     44.64   59.9    75.33   84.96

and its 95% confidence interval (in %) per downsampling ratio

Classifier                                   1/8     1/4     1/3     1/2
dual LDA/DKL                                 2.95    3.07    3.11    3.10
ensemble dual LDA/DKL                        2.96    3.09    3.11    3.07
discriminant sparse projections              3.07    3.08    3.10    3.07
ensemble discriminant sparse projections     3.09    3.05    2.68    2.22

Cumulative Confusion Matrix (in %)

Genre      Blues  Classical  Country  Disco  Hiphop  Jazz  Metal  Pop  Reggae  Rock
Blues         91          1        2      0       5     0      0    0       1     0
Classical      0         96        0      1       0     0      1    0       0     2
Country        2          0       88      1       0     0      0    1       3     5
Disco          0          0        2     89       2     0      0    4       0     3
Hiphop         0          0        0      9      78     0      3    9       0     1
Jazz           3          0        2      1       0    92      0    0       0     2
Metal          0          3        3      0       0     0     88    0       0     6
Pop            0          0        1      6       2     0      0   86       1     4
Reggae         4          1        0      5      11     1      0    2      70     6
Rock           6          0        1      5       3     2      8    3       2    70

Sparse Representation-based Classification (SRC)

Main Idea (1)

Let us denote by $A_i = [a_{i,1} | a_{i,2} | \ldots | a_{i,n_i}] \in \mathbb{R}_+^{768 \times n_i}$ the (sub)dictionary that has as columns the $n_i$ auditory modulation representations stemming from the $i$th genre (i.e., atoms).
Given a test auditory representation $y \in \mathbb{R}_+^{768}$ that belongs to the $i$th genre, it can be expressed as $y = A_i c_i$, where $c_i = [c_{i,1}, c_{i,2}, \ldots, c_{i,n_i}]^T \in \mathbb{R}^{n_i}$.


Main Idea (2)

Let $D = [A_1 | A_2 | \ldots | A_N] \in \mathbb{R}_+^{768 \times n}$ be formed by concatenating the $n$ auditory modulation representations distributed across $N$ genres.
The test auditory representation $y$ can be equivalently rewritten as $y = D c$, where $c = [0^T | \ldots | 0^T | c_i^T | 0^T | \ldots | 0^T]^T$.
$c$ contains information about the genre the test auditory representation $y$ belongs to.
We can find such a $c$ by seeking the sparsest solution to the linear system of equations $y = D c$.

Problem formulation
Given $D$ and $y$, solve for

$$
c^* = \arg\min_c \|c\|_0 \quad \text{subject to} \quad D c = y .
$$

The aforementioned problem is NP-hard due to the nature of the underlying combinatorial optimization.
An approximate solution can be obtained by replacing the $\ell_0$ norm with the $\ell_1$ norm: $c^* = \arg\min_c \|c\|_1$ subject to $D c = y$.
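For reference, the $\ell_1$ problem can be written as a linear program, which is why linear programming solvers apply. The derivation below is the standard textbook reformulation rather than something stated on the slides; the auxiliary variables $u$ and $v$ are introduced here only for illustration.

```latex
% Split c into its positive and negative parts: c = u - v with u, v >= 0.
% At the optimum u and v have disjoint supports, so ||c||_1 = 1^T (u + v).
\begin{aligned}
\min_{u,\,v \in \mathbb{R}^{n}} \quad & \mathbf{1}^T (u + v) \\
\text{subject to} \quad & D\,(u - v) = y, \\
                        & u \geq 0, \quad v \geq 0 .
\end{aligned}
% The sparse code is recovered as c^* = u^* - v^*.
```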

Dimensionality Reduction
For overcomplete dictionaries derived from the auditory temporal modulation representations, the dimensionality of atoms must be $\ll \min(768, n)$.
Thus, we reformulate the optimization problem under study as:

$$
c^* = \arg\min_c \|c\|_1 \quad \text{subject to} \quad W^T D c = W^T y ,
$$

where $W \in \mathbb{R}^{768 \times k}$ with $k \ll \min(768, n)$ is a projection matrix.
W is obtained by
non-negative matrix factorization (NMF);
principal component analysis (PCA);
independently sampling its entries from a zero-mean normal distribution and normalizing each column to unit length.

Downsampling the entries of D is another option.

Reasons for Dimensionality Reduction
The computational cost of linear programming solvers is reduced.
The creation of a redundant dictionary from the training auditory temporal modulation representations is facilitated.

Why a Redundant Dictionary?
Enables treatment of
missing data;
outliers;
noise.

Classification
A test auditory modulation representation is classified as follows.

1 $y$ is projected onto the reduced-dimension space: $\hat{y} = W^T y$.
2 $c^* = \arg\min_c \|c\|_1$ subject to $W^T D c = \hat{y}$.
3 Ideally, $c^*$ contains non-zero entries associated with the columns of $W^T D$ stemming from a single genre.
4 Due to modeling errors, there are small non-zero entries in $c^*$ that are associated with multiple genres.
5 Each auditory modulations representation is classified to the genre that minimizes the $\ell_2$ norm residual between $\hat{y}$ and $\hat{y}_i = W^T D \, \vartheta_i(c)$, where $\vartheta_i(c) \in \mathbb{R}^n$ is a new vector whose nonzero entries are those in $c$ associated with the $i$th genre.
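Step 5 can be sketched as follows, assuming the sparse code has already been computed by some $\ell_1$ solver. The function name `src_classify` and its arguments are hypothetical; `A` stands for the reduced dictionary $W^T D$ and `labels` holds the genre of each atom (column).

```python
import numpy as np

def src_classify(A, labels, y_hat, c):
    """Assign y_hat to the genre whose atoms alone reconstruct it best.

    A: (k, n) reduced dictionary, labels: length-n genre of each column,
    y_hat: (k,) reduced test vector, c: (n,) sparse code of y_hat over A.
    Returns the winning genre and the dictionary of per-genre residuals.
    """
    labels = np.asarray(labels)
    residuals = {}
    for genre in np.unique(labels):
        c_i = np.where(labels == genre, c, 0.0)   # theta_i(c): keep genre-i entries only
        residuals[str(genre)] = float(np.linalg.norm(y_hat - A @ c_i))
    best = min(residuals, key=residuals.get)
    return best, residuals

# Toy example: two genres with two atoms each; the code is dominated by "blues" atoms.
A_toy = np.array([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])
atom_genres = ["blues", "blues", "jazz", "jazz"]
y_toy = np.array([2.0, 1.0, 0.0])
c_toy = np.array([2.0, 1.0, 0.1, 0.0])   # small spurious entry on a "jazz" atom
genre, res = src_classify(A_toy, atom_genres, y_toy, c_toy)
```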

[Figure: Sparse coefficients and residuals for a test auditory temporal modulations representation of the blues genre.]

Datasets
1 GTZAN dataset
2 ISMIR 2004 Genre Dataset
1458 full audio recordings;
6 genre classes: Classical (640), Electronic (229), Jazz/Blues (52), Metal/Punk (90), Rock/Pop (203), World (244).

Protocol
GTZAN dataset: stratified 10-fold cross-validation. Each training set consists of 900 audio recordings, yielding a training matrix $A_{GTZAN} \in \mathbb{R}_+^{768 \times 900}$.
ISMIR 2004 Genre dataset: The ISMIR2004 Audio Description Contest protocol defines training and evaluation sets, which consist of 729 audio files each.

Parameters
$W \in \mathbb{R}^{768 \times k}$ is derived from $A_{GTZAN}$ and $A_{ISMIR}$ by employing
NMF or PCA with $k \in \{12, 48, 85, 192\}$;
a random projection matrix for the same $k$.

Downsampling the auditory temporal modulations with ratios 1/8, 1/4, 1/3, and 1/2.

Classifiers
SRC
linear SVMs
Nearest Neighbor (NN) with cosine similarity measure (CSM)

[Figure: SRC accuracy on the GTZAN and ISMIR2004 datasets versus feature dimension, for NMF, PCA, Random, and Downsample.]

[Figure: Linear SVM accuracy on the GTZAN and ISMIR2004 datasets versus feature dimension.]

[Figure: NN accuracy on the GTZAN and ISMIR2004 datasets versus feature dimension.]


Locality Preserving Non-negative Tensor Factorization

Multilinear Dimensionality Reduction Techniques
Unsupervised ones: Non-Negative Tensor Factorization (NTF) [a], Multilinear Principal Component Analysis (MPCA) [b], Non-negative MPCA (NMPCA) [c];
Supervised ones: General Tensor Discriminant Analysis (GTDA) [d], Discriminant Non-Negative Tensor Factorization (DNTF) [e].

[a] E. Benetos and C. Kotropoulos, “A tensor-based approach for automatic music genre classification,” in Proc. XVI European Signal Processing Conf., Lausanne, Switzerland, 2008.

[b] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “MPCA: Multilinear principal component analysis of tensor objects,” IEEE Trans. Neural Networks, vol. 19, no. 1, pp. 18-39, 2008.

[c] Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Audio, Speech, and Language Processing, to appear.

[d] D. Tao, X. Li, X. Wu, and S. J. Maybank, “General tensor discriminant analysis and Gabor features for gait recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1700-1715, 2007.

[e] S. Zafeiriou, “Discriminant nonnegative tensor factorization algorithms,” IEEE Trans. Neural Networks, vol. 20, no. 2, pp. 217-235, 2009.

Local structure in a nonlinear manifold
Let $\{\mathcal{A}_q\}_{q=1}^{Q}$ be a set of $Q$ non-negative tensors of order $N$, which lie in a nonlinear manifold $\mathcal{A}$ embedded into the tensor space. The set can be represented as an $(N+1)$-order tensor $\mathcal{A} \in \mathbb{R}_+^{I_1 \times I_2 \times \ldots \times I_N \times I_{N+1}}$ with $I_{N+1} = Q$.
The local structure of $\mathcal{A}$ can be modeled by the nearest neighbor graph $\mathcal{G}$ whose weight matrix $S$ has elements

$$
s_{qp} =
\begin{cases}
e^{-\frac{\|\mathcal{A}_q - \mathcal{A}_p\|^2}{\tau}} & \text{if } \mathcal{A}_q \text{ and } \mathcal{A}_p \text{ belong to the same class} \\
0 & \text{otherwise}
\end{cases}
$$

with $\|\cdot\|$ denoting the tensor norm [a].
The Laplacian matrix is $L = \Gamma - S$, where $\Gamma$ is a diagonal matrix with elements $\gamma_{qq} = \sum_p s_{qp}$, i.e. the column sums of $S$.

[a] T. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, to appear.
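A small NumPy sketch of the weight matrix and the Laplacian follows. It is an illustration under stated assumptions: the function name `heat_kernel_laplacian` is hypothetical, the samples are stacked along the first axis (the slides stack them along the last mode), the tensor norm is taken as the Frobenius norm of the flattened tensor, and all pairwise differences are formed at once, which is only practical for small sets.

```python
import numpy as np

def heat_kernel_laplacian(samples, labels, tau=1.0):
    """Return (S, L) for same-class heat-kernel weights.

    samples: array of shape (Q, ...) holding Q tensors, labels: length-Q class labels.
    S[q, p] = exp(-||A_q - A_p||^2 / tau) if q and p share a class, else 0.
    L = Gamma - S with Gamma the diagonal matrix of the sums of S.
    """
    labels = np.asarray(labels)
    flat = np.asarray(samples, dtype=float).reshape(len(labels), -1)
    sq_dist = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    same_class = labels[:, None] == labels[None, :]
    S = np.where(same_class, np.exp(-sq_dist / tau), 0.0)
    Gamma = np.diag(S.sum(axis=1))
    return S, Gamma - S

# Four tiny 2x2 "tensors" from two classes.
rng = np.random.default_rng(1)
toy_samples = rng.random((4, 2, 2))
S_toy, L_toy = heat_kernel_laplacian(toy_samples, ["a", "a", "b", "b"], tau=1.0)
```

Since S is symmetric, row sums and column sums coincide, and every row of the Laplacian sums to zero.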

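Problem statement
Let $Z^{(i)} = U^{(N+1)} \odot \ldots \odot U^{(i+1)} \odot U^{(i-1)} \odot \ldots \odot U^{(1)}$, where $\odot$ denotes the Khatri-Rao product.
Subject to $U^{(i)} \geq 0$, $i = 1, 2, \ldots, N+1$, minimize

$$
f_{LPNTF}\big( U^{(i)} \big|_{i=1}^{N+1} \big) = \big\| A_{(i)} - U^{(i)} [Z^{(i)}]^T \big\|^2 + \lambda \, \mathrm{tr}\big\{ [U^{(N+1)}]^T L \, U^{(N+1)} \big\} ,
$$

where $U^{(i)} \in \mathbb{R}_+^{I_i \times k}$, $k$ is the desirable number of rank-1 tensors approximating $\mathcal{A}$ when linearly combined, and $\lambda > 0$.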


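Gradients

$$
\nabla_{U^{(i)}} f_{LPNTF} =
\begin{cases}
\underbrace{U^{(i)} [Z^{(i)}]^T Z^{(i)}}_{\nabla^{+}_{U^{(i)}} f_{LPNTF}} - \underbrace{A_{(i)} Z^{(i)}}_{\nabla^{-}_{U^{(i)}} f_{LPNTF}} & \text{for } i = 1, 2, \ldots, N \\[2ex]
\underbrace{U^{(N+1)} [Z^{(N+1)}]^T Z^{(N+1)} + \lambda \, \Gamma \, U^{(N+1)}}_{\nabla^{+}_{U^{(N+1)}} f_{LPNTF}} - \underbrace{\big( A_{(N+1)} Z^{(N+1)} + \lambda \, S \, U^{(N+1)} \big)}_{\nabla^{-}_{U^{(N+1)}} f_{LPNTF}} & \text{for } i = N+1 .
\end{cases}
$$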

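Robust multiplicative update rules
Extending (Lin, 2007) [a], it is proven:

$$
U^{(i)}_{[t+1]} = U^{(i)}_{[t]} - \frac{\bar{U}^{(i)}_{[t]}}{\nabla^{+}_{U^{(i)}_{[t]}} f_{LPNTF} + \delta} \ast \nabla_{U^{(i)}_{[t]}} f_{LPNTF} ,
\qquad
\bar{U}^{(i)}_{[t]} =
\begin{cases}
U^{(i)}_{[t]} & \text{if } \nabla_{U^{(i)}_{[t]}} f_{LPNTF} \geq 0 \\
\sigma & \text{otherwise}
\end{cases}
$$

for $\sigma$, $\delta$ small positive numbers, typically $10^{-8}$. The division is elementwise, $\ast$ denotes the elementwise product, and $t$ denotes the iteration index.

[a] C.-J. Lin, “On the convergence of multiplicative update algorithms for nonnegative matrix factorization,” IEEE Trans. Neural Networks, vol. 18, no. 6, pp. 1589-1596, 2007.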
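One step of the update rule can be sketched for a single factor matrix as below. This is a minimal illustration, not the authors' code: the function name `robust_multiplicative_step` is hypothetical, the positive and negative gradient parts are passed in by the caller, and the toy example uses the plain matrix case (no Laplacian term) with the other factor held fixed.

```python
import numpy as np

def robust_multiplicative_step(U, grad_plus, grad_minus, sigma=1e-8, delta=1e-8):
    """Apply one robust multiplicative update to the factor matrix U.

    grad_plus and grad_minus are the non-negative parts of the gradient,
    so that the gradient equals grad_plus - grad_minus.
    """
    grad = grad_plus - grad_minus
    # U_bar equals U where the gradient is non-negative and sigma elsewhere,
    # which keeps entries that are exactly zero from getting stuck.
    U_bar = np.where(grad >= 0, U, sigma)
    return U - U_bar / (grad_plus + delta) * grad

# Toy matrix case A ~ U Z^T with Z fixed: grad_plus = U Z^T Z, grad_minus = A Z.
rng = np.random.default_rng(2)
Z = rng.random((5, 2))
U_true = rng.random((4, 2))
A = U_true @ Z.T
U0 = rng.random((4, 2))
U1 = robust_multiplicative_step(U0, U0 @ (Z.T @ Z), A @ Z)
```

Where the gradient is non-negative the step reduces to the familiar multiplicative form $U \ast (\nabla^{-} + \delta) / (\nabla^{+} + \delta)$, so non-negativity is preserved for non-negative data.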

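Impact on the SRC

Data tensor $\mathcal{A} \in \mathbb{R}_+^{I_1 \times I_2 \times I_3 \times I_4}$, where $I_1 = I_{scales} = 6$, $I_2 = I_{rates} = 10$, $I_3 = I_{frequencies} = 128$, and $I_4 = I_{samples}$.
SRC problem: $c^* = \arg\min_c \|c\|_1$ subject to $W D c = W y$, where $W \in \mathbb{R}^{k \times 7680}$ with $k \ll \min(7680, I_{samples})$ is a projection matrix.
When LPNTF, NTF, or DNTF is applied to $\mathcal{A}$, four factor matrices $U^{(i)} \in \mathbb{R}_+^{I_i \times k}$, $i = 1, 2, 3, 4$, are obtained, which are associated with the scale, rate, frequency, and sample modes, respectively.
$W = (U^{(3)} \odot U^{(2)} \odot U^{(1)})^T$ or $W = (U^{(3)} \odot U^{(2)} \odot U^{(1)})^\dagger$, where $\odot$ denotes the Khatri-Rao product.
$D = A_{(4)}^T = W^T [U^{(4)}]^T$.

For MPCA or GTDA, three factor matrices $U^{(i)} \in \mathbb{R}^{I_i \times J_i}$, with $J_i < I_i$, $i = 1, 2, 3$, are obtained. $W = (U^{(3)} \otimes U^{(2)} \otimes U^{(1)})^T$ or $W = (U^{(3)} \otimes U^{(2)} \otimes U^{(1)})^\dagger$. The columns of $D$ are obtained by applying $W$ to the vectorized training tensors $\mathrm{vec}(\mathcal{A}_q)$.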


Experimental setup
The training tensor is constructed by stacking the cortical representations: $\mathcal{A}_{GTZAN} \in \mathbb{R}_+^{6 \times 10 \times 128 \times 900}$; $\mathcal{A}_{ISMIR} \in \mathbb{R}_+^{6 \times 10 \times 128 \times 729}$.
$\lambda = 0.5$ and $\tau = 1$ (heat kernel) are empirically set for LPNTF; need for cross-validation.
To determine the dimensionality of the factor matrices, the ratio of the sum of eigenvalues retained over the sum of all eigenvalues for each mode-n tensor unfolding is employed.
The same $J_1 = J_{scales}$, $J_2 = J_{rates}$, and $J_3 = J_{frequencies}$ are used in MPCA and GTDA; $k = J_1 J_2 J_3$ for LPNTF, NTF, DNTF, and random projection.

[Figure: Total number of retained principal components in each mode (rate, scale, and frequency subspaces) for the GTZAN and ISMIR2004 datasets, versus the portion of total scatter retained (%).]

[Figure: Feature dimension for the GTZAN and ISMIR2004 datasets, versus the portion of total scatter retained (%).]

Experimental Evaluation of SRC on Cortical Representations

[Figure: SRC accuracy on the GTZAN and ISMIR2004 datasets for LPNTF, NTF, DNTF, MPCA, GTDA, and Random, versus the portion of total scatter retained (%).]

[Figure: Linear SVM accuracy on the GTZAN and ISMIR2004 datasets, versus the portion of total scatter retained (%).]


Comparison with the State of the Art

GTZAN dataset

Method                                                    Accuracy (in %)
Topology preserving NTF + SRC                             93.7
LPNTF + SRC [a]                                           92.4 ± 2
NMF + SRC [b]                                             91 ± 1.76
Ensemble discriminant sparse projections                  84.96 ± 2.22
NMPCA + SVM-RBF [c]                                       84.3
Adaboost [d]                                              82.5
Daubechies wavelet coefficient histograms + SVM [e]       78.5
Daubechies wavelet coefficient histograms + LDA           71.3

[a] Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification using locality preserving non-negative tensor factorization and sparse representations,” in Proc. 2009 Int. Conf. Music Information Retrieval, Kobe, Japan, October 2009.

[b] Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification via sparse representations of auditory temporal modulations,” in Proc. 17th European Signal Processing Conf., Glasgow, August 2009.

[c] Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,” IEEE Trans. Audio, Speech, and Language Processing, to appear.

[d] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, “Aggregate features and AdaBoost for music classification,” Machine Learning, vol. 65, no. 2-3, pp. 473-484, 2006.

[e] T. Li, M. Ogihara, and Q. Li, “A comparative study on content-based music genre classification,” in Proc. 26th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, Toronto, Canada, 2003, pp. 282-289.

Comparison with the State of the Art

ISMIR2004 dataset

Method                                    Accuracy (in %)
Topology preserving NTF + SRC             94.93
NTF + SRC (ISMIR 2009)                    94.38 ± 1.68
LPNTF + SRC (ISMIR 2009)                  94.25 ± 1.70
PCA + SRC (EUSIPCO 2009)                  93.56 ± 1.79
NMF + GMM [a]                             83.5
NTF + SVM-RBF (IEEE TSLP 2009)            83.15
Adaboost                                  82.3
NMPCA + SVM-RBF (IEEE TSLP 2009)          82.19

[a] A. Holzapfel and Y. Stylianou, “Musical genre classification using nonnegative matrix factorization-based features,” IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 424-434, 2008.


Conclusions-Future Work

Summary
A robust music genre classification framework has been proposed by taking into account the properties of human auditory perception.
2D auditory temporal modulations and 3D cortical representations yield rich tensors for feature extraction, while sparse concepts have been employed for feature selection and classification.
The best classification accuracies reported exceed any accuracy previously obtained by state-of-the-art music genre classification algorithms on either the GTZAN or the ISMIR2004 Genre dataset.

Future Work
The dependence of the SRC on the dimensionality reduction technique deserves further research.
The design of discriminant overcomplete dictionaries within the classifier ensemble and/or the substitution of the dual LDA by sparse LDA, in order to enforce sparsity on the columns of W, could be pursued.
Efficient implementations using incremental update rules are needed.
In many commercial and private applications, the number of available audio recordings per genre is limited. Thus, it is desirable that the music genre classification algorithm perform well on such small sample sets.

Thank You!
Questions?
