
Page 1:

Identifying Words that are Musically Meaningful

David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet

Computer Audition Lab

UC San Diego

ISMIR

September 25, 2007

Page 2:

Introduction

Our Goal: Create a content-based music search engine for natural language queries.

– CAL Music Search Engine [SIGIR07]

Problem: How do we pick a vocabulary of musically meaningful words?
– A word is meaningful if its presence corresponds to a pattern in the audio content.

Solution: find words that are correlated with a set of acoustic signals

Page 3:

Two-View Representation

Consider a set of annotated songs. Each song is represented by:

1. Annotation vector in a Semantic Space

2. Audio feature vector(s) in an Acoustic Space

[Figure: each song appears as a point in both views. Three examples ('Mustang Sally' by The Commitments, 'Riverdance' by Bill Whelan, 'Hot Pants' by James Brown) are plotted in a 2D semantic space with axes 'funky' and 'Ireland', and in a 2D acoustic space with axes x and y.]

Page 4:

Semantic Representation

Vocabulary of words:
1. CAL500: 174 phrases from a human survey
   • Instrumentation, genre, emotion, usages, vocal characteristics
2. LastFM: ~15,000 tags from a social music site
3. Web Mining: 100,000+ words mined from text documents

Annotation vector, denoted s:
1. Each element represents the 'semantic association' between a word and the song.
2. Dimension (D_S) = size of vocabulary
3. Example: Frank Sinatra's 'Fly Me to the Moon'
   • Vocabulary = {funk, jazz, guitar, female vocals, sad, passionate}
   • Annotation (s_i) = [0/4, 3/4, 4/4, 0/4, 2/4, 1/4]

Data is represented by an $N \times D_S$ matrix whose rows are the songs' annotation vectors:

$$S = \begin{bmatrix} s_1^T \\ \vdots \\ s_i^T \\ \vdots \\ s_N^T \end{bmatrix}$$
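A minimal sketch of how such an annotation vector can be computed by averaging binary word labels across four listeners; the listener responses below are hypothetical, chosen to reproduce the 'Fly Me to the Moon' example above:

```python
import numpy as np

vocabulary = ["funk", "jazz", "guitar", "female vocals", "sad", "passionate"]

# Hypothetical survey: each row is one listener's binary labeling of the song.
listener_labels = np.array([
    [0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 0],
])

# Annotation vector = fraction of listeners who applied each word to the song.
s = listener_labels.mean(axis=0)
print(dict(zip(vocabulary, s)))  # {'funk': 0.0, 'jazz': 0.75, 'guitar': 1.0, ...}
```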

Page 5:

Acoustic Representation

Each song is represented by an audio feature vector a that is automatically extracted from the audio content.

Data is represented by an $N \times D_A$ matrix

$$A = \begin{bmatrix} a_1^T \\ \vdots \\ a_i^T \\ \vdots \\ a_N^T \end{bmatrix}$$

[Figure: as on Page 3 — 'Mustang Sally' by The Commitments plotted in both the 2D semantic space ('funky', 'Ireland') and the 2D acoustic space (x, y).]

Page 6:

Canonical Correlation Analysis (CCA)

CCA is a technique for exploring dependencies between two related spaces.
– Generalization of PCA to multiple spaces
– Constrained optimization problem

• Find weight vectors w_s and w_a:
  – 1-D projection of the data in the semantic space: Sw_s
  – 1-D projection of the data in the acoustic space: Aw_a
• Maximize the correlation of the projections: max (Sw_s)^T(Aw_a)
• Constrain w_s and w_a so the correlation cannot be made arbitrarily large by rescaling

$$\max_{w_s, w_a}\ (S w_s)^T (A w_a) \quad \text{subject to: } (S w_s)^T (S w_s) = 1,\quad (A w_a)^T (A w_a) = 1$$
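As a sketch (not the authors' implementation), the first canonical pair for the problem above can be computed by whitening both views and taking an SVD of the whitened cross-covariance; the small ridge term is a numerical-stability assumption, not part of the slide's formulation:

```python
import numpy as np
from scipy.linalg import sqrtm

def cca_first_pair(S, A, reg=1e-6):
    """First canonical pair (w_s, w_a): maximize (S w_s)^T (A w_a)
    subject to unit-variance projections in each view."""
    S = S - S.mean(axis=0)                      # center each view
    A = A - A.mean(axis=0)
    Css = S.T @ S + reg * np.eye(S.shape[1])    # small ridge for stability
    Caa = A.T @ A + reg * np.eye(A.shape[1])
    Csa = S.T @ A                               # cross-covariance
    Css_isqrt = np.linalg.inv(sqrtm(Css).real)  # whitening transforms
    Caa_isqrt = np.linalg.inv(sqrtm(Caa).real)
    U, sig, Vt = np.linalg.svd(Css_isqrt @ Csa @ Caa_isqrt)
    w_s = Css_isqrt @ U[:, 0]                   # gives (S w_s)^T (S w_s) = 1
    w_a = Caa_isqrt @ Vt[0, :]                  # gives (A w_a)^T (A w_a) = 1
    return w_s, w_a, sig[0]                     # sig[0] = canonical correlation
```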

Page 7:

CCA Visualization

[Figure: a worked toy example. Data points a, b, c, d appear in both the semantic space (axes 'funky' and 'Ireland') and the audio feature space (axes x and y). Toy matrices S and A are shown together with the resulting weight vectors: CCA finds a sparse solution in which w_s places all of its weight on 'funky' and none on 'Ireland', and the projections achieve (Sw_s)^T(Aw_a) = 4.]

Page 8:

What Sparsity means…

In the previous example,

• ws,’funky’ 0

‘funky’ is correlated w/ audio signals a musically meaningful word

• ws,’Ireland’ = 0

‘Ireland’ is not correlated No linear relationship with the acoustic representation

In practice, ws is dense even if most words are uncorrelated

– ‘dense’ means many non-zero values – due to random variability in the data

Key Idea: reformulate CCA to produce a sparse solution.

Page 9:

Introducing Sparse CCA [ICML07]

Plan: penalize the objective function for each non-zero semantic dimension.
• Pick a penalty function f(w_s) that penalizes each non-zero dimension.
• Take 1: cardinality of w_s: f(w_s) = ||w_s||_0
  – Combinatorial problem: NP-hard
• Take 2: L1 relaxation: f(w_s) = ||w_s||_1
  – Non-convex, not a very tight approximation
• Take 3: SDP relaxation
  – Prohibitively expensive for large problems
• Solution: f(w_s) = Σ_i log |w_{s,i}|
  – Non-convex problem, but it can be solved efficiently with a DC (difference of convex functions) program
  – Tight approximation of cardinality
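To see why the log penalty behaves like cardinality where the L1 norm may not, here is a small numeric comparison; the toy vectors are chosen to have equal L1 norm, and the eps floor is an implementation assumption to avoid log(0):

```python
import numpy as np

# Two toy weight vectors with the SAME L1 norm (3.0): one spreads weight
# over many dimensions, the other concentrates it in two.
dense  = np.full(10, 0.3)
sparse = np.array([2.7, 0.3] + [0.0] * 8)

def l0(w):  return np.count_nonzero(w)   # cardinality
def l1(w):  return np.abs(w).sum()       # L1 relaxation
def log_pen(w, eps=1e-3):                # eps floor avoids log(0)
    return np.log(np.abs(w) + eps).sum()

for name, w in [("dense", dense), ("sparse", sparse)]:
    print(f"{name}: L0={l0(w)}, L1={l1(w):.1f}, log-penalty={log_pen(w):.1f}")
# dense:  L0=10, L1=3.0, log-penalty ~ -12
# sparse: L0=2,  L1=3.0, log-penalty ~ -55
# The L1 penalty cannot tell these vectors apart, but the log penalty,
# like L0 (and since it is subtracted from the objective), favors sparsity.
```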

Page 10:

Introducing Sparse CCA [ICML07]

Plan: penalize the objective function for each non-zero semantic dimension.

• Pick a penalty function f(w_s) that penalizes each non-zero dimension: f(w_s) = Σ_i log |w_{s,i}|
• Use a tuning parameter (written η below) to control the importance of sparsity.
• Increasing η → a smaller set of 'musically relevant' words.

$$\max_{w_s, w_a}\ (S w_s)^T (A w_a) - \eta\, f(w_s) \quad \text{subject to: } (S w_s)^T (S w_s) = 1,\quad (A w_a)^T (A w_a) = 1$$
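The DC-programming solver from [ICML07] is not reproduced here. As a rough stand-in that still shows how sparsity in w_s prunes words, the sketch below uses a simpler alternating soft-thresholding scheme in the spirit of L1-penalized matrix decomposition; the threshold plays the role of the tuning parameter η:

```python
import numpy as np

def soft_threshold(v, t):
    """L1 proximal step: shrink entries toward zero; exact zeros appear."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_cca_sketch(S, A, thresh=0.3, n_iter=200, seed=0):
    """Alternating maximization with soft-thresholding on w_s only.
    NOT the paper's log-penalty DC program; an illustrative relaxation."""
    S = S - S.mean(axis=0)
    A = A - A.mean(axis=0)
    C = S.T @ A                                  # cross-covariance matrix
    rng = np.random.default_rng(seed)
    w_a = rng.standard_normal(A.shape[1])
    w_a /= np.linalg.norm(w_a)
    w_s = np.zeros(S.shape[1])
    for _ in range(n_iter):
        w_s = soft_threshold(C @ w_a, thresh)    # sparsity enters here
        n = np.linalg.norm(w_s)
        if n == 0:                               # threshold too aggressive
            break
        w_s /= n
        w_a = C.T @ w_s
        w_a /= np.linalg.norm(w_a)
    return w_s, w_a   # zero entries of w_s correspond to pruned words
```

Raising thresh mimics increasing η: more entries of w_s hit exactly zero, shrinking the vocabulary.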

Page 11:

Experimental Setup

CAL500 Data Set [SIGIR07]
– 500 songs by 500 artists
– Semantic representation
  • 173 words: genre, instrumentation, usages, emotions, vocals, etc.
  • Annotation vector is the average over 3+ listeners
  • Word agreement score: measures how consistently listeners apply a word to songs
– Acoustic representation
  • Bag of Dynamic MFCC vectors [McKinney03]
    – 52-D vectors of spectral modulation intensities
    – 160 vectors per minute of audio content
  • The annotation vector is duplicated for each Dynamic MFCC vector
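The precise Dynamic MFCC recipe is given in [McKinney03]; the sketch below is only a plausible reading of it, with assumed parameters (13 MFCCs, 4 modulation bands, hence 13 × 4 = 52 dimensions per analysis window), using librosa:

```python
import numpy as np
import librosa

def dynamic_mfccs(path, n_mfcc=13, win=64, hop=32, n_bands=4):
    """Sketch of Dynamic MFCCs: modulation-spectrum energies of the MFCC
    trajectories, pooled into coarse bands (13 x 4 = 52-D per window)."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (13, T)
    vectors = []
    for start in range(0, mfcc.shape[1] - win + 1, hop):
        seg = mfcc[:, start:start + win]
        mod = np.abs(np.fft.rfft(seg, axis=1))       # modulation spectrum
        bands = np.array_split(mod, n_bands, axis=1)  # coarse modulation bands
        vectors.append(np.concatenate([b.sum(axis=1) for b in bands]))
    return np.vstack(vectors)   # one 52-D feature vector per analysis window
```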

Page 12:

Experiment 1: Qualitative Results

Words with high acoustic correlation

hip-hop, arousing, sad, drum machine, heavy beat, at a party, rapping

Words with no acoustic correlation

classic rock, normal, constant energy, going to sleep, falsetto

Page 13:

Experiment 2: Vocabulary Pruning

AMG2131 Text Corpus [ISMIR06]
– AMG Allmusic song reviews for most of the CAL500 songs
– 315-word vocabulary
– Annotation vector based on the presence or absence of a word in the review
– Noisier word-song relationships than CAL500

Experimental design:
1. Merge vocabularies: 173 + 315 = 488 words
2. Prune noisy words as we increase the amount of sparsity in CCA

Hypothesis:
– AMG words will be pruned before CAL500 words

Page 14:

Experiment 2: Vocabulary Pruning

Experimental design:
1. Merge vocabularies: 488 words
2. Prune noisy words as we increase the amount of sparsity in CCA

Result: as Sparse CCA becomes more aggressive, AMG words are pruned disproportionately.

Vocabulary size   # CAL500 words   # AMG2131 words   Fraction AMG2131
      488              173              315              0.64
      249              118              131              0.52
      149               85               64              0.42
       50               39               11              0.22
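A sketch of how such a pruning sweep could be tallied, assuming the sparse_cca_sketch function from Page 10 and hypothetical boolean masks marking which merged-vocabulary columns came from CAL500 vs. AMG2131:

```python
import numpy as np

# Assumes: S (N x 488), A (N x D_A), sparse_cca_sketch from Page 10, and
# boolean masks is_cal500 / is_amg over the 488 merged vocabulary columns.
def pruning_sweep(S, A, is_cal500, is_amg, thresholds=(0.1, 0.3, 0.5, 0.7)):
    for t in thresholds:
        w_s, _ = sparse_cca_sketch(S, A, thresh=t)
        kept = w_s != 0                      # surviving (unpruned) words
        print(f"thresh={t}: vocab={kept.sum()}, "
              f"CAL500={np.sum(kept & is_cal500)}, AMG={np.sum(kept & is_amg)}")
```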

Page 15:

Experiment 3: Vocabulary Selection

Experimental design:
1. Rank words by
   • how aggressive Sparse CCA must be before the word gets pruned;
   • how consistently humans use the word across the CAL500 corpus.
2. As we decrease the vocabulary size, calculate the average AROC.

Result: Sparse CCA does predict words that have better AROC.

[Plot: average AROC vs. vocabulary size (173, 120, 70, 20 words). Average AROC rises from 0.68 with the full 173-word vocabulary toward 0.76 as the vocabulary is pruned.]

Page 16:

Recap

Constructing a 'meaningful vocabulary' is the first step in building a content-based, natural-language search engine for music.

Given a semantic representation and an acoustic representation, Sparse CCA can be used to find 'musically meaningful' words, i.e., semantic dimensions linearly correlated with audio features.

Automatically pruning words is important when using noisy sources of semantic information, e.g., LastFM tags or web documents.

Page 17:

Future Work

Theory: moving beyond linear correlation with kernel methods

Application: Sparse CCA can be used to find ‘musically meaningful’ audio features by imposing sparsity in the acoustic space

Practice: handling large, noisy semantically annotated music corpora

Page 18:

Identifying Words that are Musically Meaningful | David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet | Computer Audition Lab, UC San Diego | ISMIR, September 25, 2007

Page 19:

Experiment 3: Vocabulary Selection

Our content-based music search engine rank-orders songs given a text-based query [SIGIR07].
– Area under the ROC curve (AROC) measures the quality of each ranking:
  • 0.5 is random, 1.0 is perfect
  • 0.68 is the average AROC over all 1-word queries
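As a concrete illustration (hypothetical labels and scores, not data from the paper), the AROC of a single 1-word query's ranking can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical example for the query 'funky': y_true marks which songs
# humans labeled 'funky'; y_score is the search engine's relevance score.
y_true  = np.array([1, 0, 1, 1, 0, 0, 0, 1])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.1, 0.8])

print(roc_auc_score(y_true, y_score))  # 1.0 = perfect ranking, 0.5 = random
```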

Can Sparse CCA pick words that will have higher AROC?
– Idea: words with high correlation have more signal in the audio representation and will be easier to model.
– How does it compare with picking words that humans consistently use to label songs?