
Page 1: Deep learning architectures for music audio classification: a personal (re)view (jordipons.me/media/UPC-2018.pdf · 2019-05-27)

Deep learning architectures for music audio classification: a personal (re)view

Jordi Pons

jordipons.me – @jordiponsdotme

Music Technology Group, Universitat Pompeu Fabra, Barcelona

Page 2

Acronyms

MLP: multi-layer perceptron ≡ feed-forward neural network
RNN: recurrent neural network
LSTM: long short-term memory
CNN: convolutional neural network
BN: batch normalization

..the following slides assume you know these concepts!

Page 3

Outline

Chronology: the big picture

Audio classification: state-of-the-art review

Music audio tagging as a study case

Page 4

Outline

Chronology: the big picture

Audio classification: state-of-the-art review

Music audio tagging as a study case

Page 5

“Deep learning & music” papers: milestones

[Figure: number of “deep learning & music” papers per year (y-axis: # papers, 0–80)]

Page 6


“Deep learning & music” papers: milestones

RNN from symbolic data for automatic music composition (Todd, 1988)

MLP from symbolic data for automatic music composition (Lewis, 1988)


Page 7


“Deep learning & music” papers: milestones

RNN from symbolic data for automatic music composition (Todd, 1988)

MLP from symbolic data for automatic music composition (Lewis, 1988)


LSTM from symbolic data for automatic music composition (Eck and Schmidhuber, 2002)

Page 8


“Deep learning & music” papers: milestones

RNN from symbolic data for automatic music composition (Todd, 1988)

MLP from symbolic data for automatic music composition (Lewis, 1988)

MLP learns from spectrogram data for note onset detection (Marolt et al., 2002)


LSTM from symbolic data for automatic music composition (Eck and Schmidhuber, 2002)

Page 9


“Deep learning & music” papers: milestones

RNN from symbolic data for automatic music composition (Todd, 1988)

MLP from symbolic data for automatic music composition (Lewis, 1988)

MLP learns from spectrogram data for note onset detection (Marolt et al., 2002)


LSTM from symbolic data for automatic music composition (Eck and Schmidhuber, 2002)

CNN learns from spectrograms for music audio classification (Lee et al., 2009)

Page 10


“Deep learning & music” papers: milestones

RNN from symbolic data for automatic music composition (Todd, 1988)

MLP from symbolic data for automatic music composition (Lewis, 1988)

MLP learns from spectrogram data for note onset detection (Marolt et al., 2002)


LSTM from symbolic data for automatic music composition (Eck and Schmidhuber, 2002)

CNN learns from spectrograms for music audio classification (Lee et al., 2009)

End-to-end learning for music audio classification (Dieleman et al., 2014)


Page 12


“Deep learning & music” papers: data trends

[Figure: number of papers per year (y-axis: # papers, 0–80), split by input data type: symbolic data, spectrogram data, raw audio data]

Page 13

“Deep learning & music” papers: some references

Dieleman et al., 2014 – End-to-end learning for music audio, in International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Lee et al., 2009 – Unsupervised feature learning for audio classification using convolutional deep belief networks, in Advances in Neural Information Processing Systems (NIPS)

Marolt et al., 2002 – Neural networks for note onset detection in piano music, in Proceedings of the International Computer Music Conference (ICMC)

Eck and Schmidhuber, 2002 – Finding temporal structure in music: Blues improvisation with LSTM recurrent networks, in Proceedings of the Workshop on Neural Networks for Signal Processing

Todd, 1988 – A sequential network design for musical applications, in Proceedings of the Connectionist Models Summer School

Lewis, 1988 – Creation by Refinement: A creativity paradigm for gradient descent learning networks, in International Conference on Neural Networks

Page 14

Outline

Chronology: the big picture

Audio classification: state-of-the-art review

Music audio tagging as a study case

Page 15

Which is our goal / task?

input → machine learning (deep learning model) → output

input: waveform, or any audio representation!
output: phonetic transcription / describe music with tags / event detection

Page 16

The deep learning pipeline

input → front-end → back-end → output

input: waveform, or any audio representation!
output: phonetic transcription / describe music with tags / event detection

Page 17

The deep learning pipeline: input?

input → ?

Page 18

How to format the input (audio) data?

Waveform → end-to-end learning

Pre-processed waveform, e.g.: spectrogram

Page 19

The deep learning pipeline: front-end?

input (waveform / spectrogram) → front-end: ?

Page 20

[Decision tree for choosing a CNN front-end: input signal? (waveform / pre-processed waveform) · based on domain knowledge? · filters config?]

Page 21

CNN front-ends for audio classification

Waveform → end-to-end learning: sample-level front-end (stacked 3x1 filters: 3x1 → 3x1 → ... → 3x1)

Pre-processed waveform (e.g. spectrogram): small-rectangular filters (stacked 3x3 filters: 3x3 → 3x3 → ... → 3x3)

Page 22

[Decision tree: input signal? waveform / pre-processed waveform. Based on domain knowledge? no → minimal filter expression: sample-level (3x1) or small-rectangular (3x3) filters.]

Page 23

Domain knowledge to design CNN front-ends

Waveform → end-to-end learning

Pre-processed waveform, e.g.: spectrogram

Page 24

Domain knowledge to design CNN front-ends

Waveform → end-to-end learning: frame-level filters (filter length: 512 ≈ window length? stride: 256 ≈ hop size?)

Pre-processed waveform (e.g. spectrogram): vertical or horizontal filters

Explicitly tailoring the CNN towards learning temporal or timbral cues
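The window/hop analogy above can be sketched in plain numpy: a frame-level 1-D convolution with filter length 512 and stride 256 amounts to framing the waveform (like an STFT's window and hop) and projecting each frame onto a bank of filters. `frame_level_conv` is a hypothetical name, and the random filter bank is a placeholder for learned weights:

```python
import numpy as np

def frame_level_conv(waveform, filters, length=512, stride=256):
    """Slice the waveform into frames of `length` samples every `stride`
    samples (like an STFT window/hop) and project each frame onto every
    filter: a frame-level 1-D convolution."""
    n_frames = 1 + (len(waveform) - length) // stride
    frames = np.stack([waveform[i * stride: i * stride + length]
                       for i in range(n_frames)])   # (n_frames, length)
    return frames @ filters.T                       # (n_frames, n_filters)

rng = np.random.default_rng(0)
waveform = rng.standard_normal(22050)       # 1 second of audio at 22.05 kHz
filters = rng.standard_normal((32, 512))    # 32 frame-level filters (placeholders)
features = frame_level_conv(waveform, filters)      # shape: (85, 32)
```

With a 512-sample filter and a 256-sample stride, each output row plays the role of one "frame" of a learned time-frequency representation.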

Page 25

[Decision tree: input signal? waveform / pre-processed waveform. Based on domain knowledge? no → minimal filter expression: sample-level or small-rectangular filters; yes → single filter shape in 1st CNN layer: frame-level, or vertical OR horizontal.]

Page 26

DSP wisdom to design CNN front-ends

Waveform → end-to-end learning: frame-level filters (many shapes!)

Pre-processed waveform (e.g. spectrogram): vertical and/or horizontal filters

Explicitly tailoring the CNN towards learning temporal and timbral cues

Efficient way to represent 4 periods!

Page 27

[Decision tree: input signal? waveform / pre-processed waveform. Based on domain knowledge? no → minimal filter expression: sample-level or small-rectangular filters; yes → filters config? single filter shape in 1st CNN layer (frame-level / vertical OR horizontal) or many filter shapes in 1st CNN layer (frame-level / vertical AND/OR horizontal).]

Page 28

CNN front-ends for audio classification

Sample-level: Lee et al., 2017 – Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms, in Sound and Music Computing Conference (SMC)

Small-rectangular filters: Choi et al., 2016 – Automatic tagging using deep convolutional neural networks, in Proceedings of the ISMIR (International Society of Music Information Retrieval) Conference

Frame-level (single shape): Dieleman et al., 2014 – End-to-end learning for music audio, in International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Vertical: Lee et al., 2009 – Unsupervised feature learning for audio classification using convolutional deep belief networks, in Advances in Neural Information Processing Systems (NIPS)

Horizontal: Schluter & Bock, 2014 – Improved musical onset detection with convolutional neural networks, in International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Frame-level (many shapes): Zhu et al., 2016 – Learning multiscale features directly from waveforms, arXiv:1603.09509

Vertical and horizontal (many shapes): Pons et al., 2016 – Experimenting with musically motivated convolutional neural networks, in 14th International Workshop on Content-Based Multimedia Indexing

Page 29

The deep learning pipeline: back-end?

input (waveform / spectrogram) → front-end (several CNN architectures) → back-end: ?

Page 30

What is the back-end doing?

The back-end adapts a variable-length feature map to a fixed output size

[Diagram: front-end → latent feature-map → back-end → same-length output, shown for two inputs of different lengths]

Page 31

Back-ends for variable-length inputs

..music is generally of variable length!

● Temporal pooling: max-pool or average-pool the temporal axis

Pons et al., 2017 – End-to-end learning for music audio tagging at scale, in Proceedings of the ML4Audio Workshop at NIPS.

● Attention: weighting latent representations according to what is important

C. Raffel, 2016 – Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD thesis.

● RNN: summarization through a deep temporal model

Vogl et al., 2018 – Drum transcription via joint beat and drum modeling using convolutional recurrent neural networks, in Proceedings of the ISMIR Conference.
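The first bullet (temporal pooling) is simple to sketch: pooling over the temporal axis maps a feature map of any length to one fixed-size vector. A minimal numpy sketch, with `temporal_pooling` as a hypothetical name:

```python
import numpy as np

def temporal_pooling(feature_map):
    """Map a (time, channels) feature map of ANY temporal length to a
    fixed-size vector: concatenate max- and average-pooling over time."""
    return np.concatenate([feature_map.max(axis=0), feature_map.mean(axis=0)])

rng = np.random.default_rng(0)
short_clip = temporal_pooling(rng.standard_normal((100, 64)))   # short song
long_clip = temporal_pooling(rng.standard_normal((900, 64)))    # long song
# both summaries have the same fixed size (2 * 64 = 128), ready for an MLP
```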

Page 32

Back-ends for fixed-length inputs

Common trick: let’s assume a fixed-length input ..such trick works very well!

● Fully convolutional stacks: adapting the input to the output with a stack of CNNs & pooling layers.

Choi et al., 2016 – Automatic tagging using deep convolutional neural networks, in Proceedings of the ISMIR Conference.

● MLP: map a fixed-length feature map to a fixed-length output

Schluter & Bock, 2014 – Improved musical onset detection with convolutional neural networks, in Proceedings of the ICASSP.
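The fixed-length trick can be sketched as slice-and-average inference: cut the variable-length input into fixed-length patches, run the fixed-input model on each patch, and average the patch predictions. Both `fixed_length_inference` and the toy model below are hypothetical stand-ins:

```python
import numpy as np

def fixed_length_inference(spec, patch_len, predict_patch):
    """Slice a variable-length (bands, time) spectrogram into fixed-length
    patches, run a fixed-input model on each patch, average predictions."""
    n_patches = spec.shape[1] // patch_len
    preds = [predict_patch(spec[:, i * patch_len:(i + 1) * patch_len])
             for i in range(n_patches)]
    return np.mean(preds, axis=0)

def toy_model(patch):
    # stand-in for a trained fixed-input network: mean energy per band
    return (patch ** 2).mean(axis=1)

spec = np.random.default_rng(0).standard_normal((40, 500))
song_pred = fixed_length_inference(spec, patch_len=96, predict_patch=toy_model)
# song_pred has a fixed size (40,) regardless of the song's length
```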

Page 33

The deep learning pipeline: output

input (waveform / spectrogram) → front-end (several CNN architectures) → back-end (MLP / RNN / attention) → output (phonetic transcription / describe music with tags / event detection)


Page 35

Outline

Chronology: the big picture

Audio classification: state-of-the-art review

Music audio tagging as a study case

Pons et al., 2017 – End-to-end learning for music audio tagging at scale, in ML4Audio Workshop at NIPS. Summer internship @ Pandora.

Page 36

The deep learning pipeline: input?

? → front-end → back-end → output: describe music with tags

Page 37

How to format the input (audio) data?

waveform: NO pre-processing! (already zero-mean & one-variance)

log-mel spectrogram:
– STFT & mel mapping: reduces the size of the input by removing perceptually irrelevant information
– logarithmic compression: reduces the dynamic range of the input
– zero-mean & one-variance
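The three log-mel steps can be sketched end-to-end in plain numpy. This is a simplified stand-in for a library routine such as librosa's: the triangular mel filterbank below is a rough textbook construction, not necessarily the exact mapping used in the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(y, sr=22050, n_fft=512, hop=256, n_mels=40):
    # 1) STFT magnitude: frame, window, FFT
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))       # (n_frames, n_fft//2 + 1)

    # 2) mel mapping: triangular filters shrink the input by dropping
    #    perceptually irrelevant frequency resolution
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    mel = mag @ fb.T                                # (n_frames, n_mels)

    # 3) logarithmic compression reduces the dynamic range
    log_mel = np.log10(1.0 + mel)

    # 4) standardize to zero mean and unit variance
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)

y = np.sin(2 * np.pi * 440.0 * np.arange(22050) / 22050)  # 1 s of a 440 Hz tone
S = log_mel_spectrogram(y)                                # shape: (85, 40)
```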

Page 38

The deep learning pipeline: input?

input (waveform / log-mel spectrogram) → front-end → back-end → output: describe music with tags

Page 39

The deep learning pipeline: front-end?

input (waveform / log-mel spectrogram) → front-end: ? → back-end → output: describe music with tags

Page 40

[Decision tree: input signal? waveform / pre-processed waveform. Based on domain knowledge? no → minimal filter expression: sample-level or small-rectangular filters; yes → filters config? single filter shape in 1st CNN layer (frame-level / vertical OR horizontal) or many filter shapes in 1st CNN layer (frame-level / vertical AND/OR horizontal).]

Page 41

Studied front-ends: waveform model

sample-level (Lee et al., 2017)
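A minimal numpy sketch of the sample-level idea, assuming the usual recipe of tiny 3-sample convolutions each followed by max-pooling of 3 (`sample_level_frontend` is a hypothetical name, weights are random placeholders; only the shapes matter):

```python
import numpy as np

def sample_level_frontend(x, n_layers=4, seed=0):
    """Stack of tiny 3-sample convolutions (stride 1) + ReLU, each followed
    by max-pooling of 3: the receptive field grows with depth even though
    every filter only ever sees 3 values. Weights are random placeholders."""
    rng = np.random.default_rng(seed)
    for _ in range(n_layers):
        k = rng.standard_normal(3)                          # one 3x1 filter
        conv = np.array([x[i: i + 3] @ k for i in range(len(x) - 2)])
        x = np.maximum(0.0, conv)                           # ReLU
        x = x[: (len(x) // 3) * 3].reshape(-1, 3).max(axis=1)  # max-pool 3
    return x

x = np.random.default_rng(1).standard_normal(729)   # a short waveform excerpt
features = sample_level_frontend(x)                 # shape: (8,)
```

Each pooling stage divides the temporal resolution by 3, so the stack summarizes 729 raw samples into 8 activations without any hand-designed frames.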

Page 42

Studied front-ends: spectrogram model

vertical and horizontal: musically motivated CNNs

(Pons et al., 2016 – 2017)
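The vertical/horizontal idea can be sketched with a minimal 2-D convolution: a vertical filter spans the frequency axis (timbral cues), a horizontal one spans the time axis (temporal cues). Filter weights are random placeholders and `valid_conv2d` is a hypothetical helper:

```python
import numpy as np

def valid_conv2d(x, k):
    """Minimal 'valid' 2-D correlation of spectrogram x with one filter k."""
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i: i + h, j: j + w] * k)
    return out

rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 300))       # (mel bands, time frames)

vertical = rng.standard_normal((40, 1))     # spans frequency: timbral cues
horizontal = rng.standard_normal((1, 60))   # spans time: temporal cues

timbre_map = valid_conv2d(spec, vertical)     # shape: (1, 300)
rhythm_map = valid_conv2d(spec, horizontal)   # shape: (40, 241)
```

The filter shapes make the inductive bias explicit: the vertical filter collapses the frequency axis per frame, while the horizontal filter responds to patterns that unfold over time within each band.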

Page 43

The deep learning pipeline: front-end?

input → front-end → back-end → output: describe music with tags

waveform → sample-level front-end
log-mel spectrogram → vertical and horizontal front-end

Page 44

The deep learning pipeline: back-end?

input → front-end → back-end: ? → output: describe music with tags

waveform → sample-level front-end
log-mel spectrogram → vertical and horizontal front-end

Page 45

Studied back-end: music is of variable length!

Temporal pooling (Dieleman et al., 2014)

Page 46

The deep learning pipeline: back-end?

input → front-end → back-end (temporal pooling) → output: describe music with tags

waveform → sample-level front-end
log-mel spectrogram → vertical and horizontal front-end

Page 47

MagnaTagATune (25k songs) · Million Song Dataset (250k songs) · 1M songs (private)

Page 48

MagnaTagATune (25k songs) · Million Song Dataset (250k songs) · 1M songs (private)

spectrograms > waveforms

Page 49

MagnaTagATune (25k songs) · Million Song Dataset (250k songs) · 1M songs (private)

spectrograms > waveforms

waveforms > spectrograms

Page 50

Let’s listen to some music: our model in action

acoustic

string ensemble

classical music

period baroque

compositional dominance of lead vocals

major

Page 51

Deep learning architectures for music audio classification: a personal (re)view

Jordi Pons

jordipons.me – @jordiponsdotme

Music Technology Group, Universitat Pompeu Fabra, Barcelona