+Advances in the Parallelization
of Music and Audio Applications
Eric Battenberg, David Wessel
& Juan Colmenares
+Overview
Parallelism today in the popular interactive music languages
Parallel Partitioned Convolution
Accelerating Non-Negative Matrix Factorization (NMF) for use in audio source separation and music information retrieval, and the importance of Selective, Embedded Just-In-Time Specialization (SEJITS)
Real-time in the Tessellation OS
A plea for more flexible I/O with GPUs
+Current Support for Parallelism is Copy-Based
The widely used languages for music and audio applications are fundamentally sequential in character; this includes Max/MSP, PD, SuperCollider, and ChucK, among others.
Limited multithreading: one approach to exploiting multi-core processors is to run copies of the application on separate cores. Max/MSP provides a useful multi-threading mechanism called poly~. PD provides pd~, each instance of which runs in a separate thread inside a PD patch.
+Partitioned Convolution
First real-time app in the Par Lab.
Partitioned Convolution – an efficient way to do low-latency filtering with a long (> 1 sec) impulse response.
Important in real-time reverb processing for environment simulation.
Sound examples: acoustic guitar, dry; the same guitar in a giant mausoleum (convolved with its impulse response); the impulse response itself, measured with a sine sweep.
+Partitioned Convolution
Convolution: a way to do linear filtering with a finite impulse response (FIR) filter.
Direct convolution: For length L filter, O(L) ops per output point, zero delay. L can be greater than 100,000 samples (> 3 sec of audio)
Block FFT Convolution: Only O(log(L)) ops per output point, but delay of L.
How can we trade off between complexity and latency?
y[n] = \sum_k h[k]\, x[n-k]
[Diagram: Block FFT convolver — x → FFT → complex multiply by H = FFT(h) → IFFT → y]
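To make the complexity/latency trade-off concrete, here is a minimal offline sketch in Python/NumPy (our own illustration, not code from the talk): direct convolution can emit each output sample as soon as its inputs arrive, while FFT convolution must buffer an entire block before producing anything.

```python
import numpy as np

def direct_convolve(x, h):
    # O(L) multiply-adds per output sample; each y[n] depends only on
    # inputs up to x[n], so there is no algorithmic latency.
    return np.convolve(x, h)

def fft_convolve(x, h):
    # O(log L) ops per output sample (amortized over the block), but the
    # whole block of x must be available before any output is produced.
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()   # round up to a power of two
    X = np.fft.rfft(x, nfft)
    H = np.fft.rfft(h, nfft)
    return np.fft.irfft(X * H, nfft)[:n]
```

Both functions compute the same FIR filter output; only the operation count and buffering differ.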
+Uniform Partitioned Convolution
We would like the latency to be less than 10ms (512 samples)
Cut an impulse response up into equal-sized blocks.
Then we can use a parallel layout of Block FFT convolvers with delays to implement the filter.
The latency is now N, and we still get complexity savings.
[Diagram: the length-L impulse response is cut into equal partitions 1–5 of N samples each. The input x feeds a parallel bank of Block FFT convolvers, one per partition, with partition k preceded by a chain of k−1 delay(N) elements; the convolver outputs are summed (+) to produce y.]
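The partitioning idea can be sketched offline in a few lines (our own illustration, not the real-time implementation): cut h into length-N blocks, convolve each block with x, and sum the results after delays of k·N samples.

```python
import numpy as np

def uniform_partitioned_convolve(x, h, N):
    # Split the length-L impulse response into ceil(L/N) partitions of
    # (at most) N samples each.
    L = len(h)
    P = -(-L // N)                      # ceil(L / N)
    y = np.zeros(len(x) + L - 1)
    for k in range(P):
        h_k = h[k*N:(k+1)*N]            # k-th partition
        y_k = np.convolve(x, h_k)       # one block convolver (FFT-based in practice)
        y[k*N : k*N + len(y_k)] += y_k  # delay by k*N samples, then sum
    return y
```

By linearity, the sum of the delayed per-partition convolutions equals convolution with the full impulse response, while each partition only ever needs length-N FFTs.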
+Frequency Delay Line Convolution
We can also exploit the linearity of the FFT so that only one FFT/IFFT is required.
The parallel Block FFT convolver above then becomes a Frequency Delay Line (FDL) convolver:
[Diagram, top: the parallel Block FFT convolver — x passes through a chain of delay(N) elements feeding convolvers 1–3 (FFT → complex multiply by H1 → IFFT), and the outputs are summed (+) to produce y.
Diagram, bottom: the Frequency Delay Line convolver — x passes through a single FFT; the spectrum flows down a chain of delay(N) elements; each tap is complex-multiplied by its partition spectrum H1, H2, H3; the products are summed (+) and pass through a single IFFT to produce y.]
+Multiple FDL Convolution
If L is big (e.g., > 100,000) and N is small (e.g., < 1,000), our FDL will have hundreds of partitions to handle.
We can connect multiple FDLs in parallel to get the best of both worlds.
[Diagram: the input x feeds FDL 1 directly and, through successively longer delays, FDL 2 and FDL 3 (which use larger block sizes); the three FDL outputs are summed (+) to produce y. Externally the structure still behaves as a single filter: x → FDL → y.]
+Scheduling Multiple FDLs
FDLs run in separate threads. Each is allowed to compute for a length of time corresponding to its block size. Synchronization is performed at the block boundaries (the vertical lines in the timing diagram).
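The compute budgets implied above can be sketched as follows (a toy illustration under our own assumptions, not the actual scheduler): an FDL with block size B must produce a new output block every B samples, so its per-cycle compute budget is B divided by the sample rate, and synchronization points occur wherever any FDL reaches a block boundary.

```python
SAMPLE_RATE = 44100  # assumed; any fixed audio rate works

def fdl_budgets(block_sizes, sample_rate=SAMPLE_RATE):
    # Each FDL produces one output block every B samples, so it may
    # compute for at most B / sample_rate seconds per cycle.
    return {B: B / sample_rate for B in block_sizes}

def sync_points(block_sizes, horizon):
    # Synchronization happens whenever some FDL hits a block boundary,
    # i.e. at sample indices that are multiples of a block size.
    return sorted({n for B in block_sizes for n in range(0, horizon + 1, B)})
```

A larger-block FDL thus gets a proportionally longer deadline, which is what allows it to handle the later, longer partitions without missing real-time constraints.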
+Auto-Tuning for Real-Time
We are not trying to only maximize throughput.
We are trying to improve our ability to make real-time guarantees.
For now, we estimate a Worst-Case Execution Time (WCET) for each size of FDL. Then we combine the FDLs that are most likely to meet their scheduling deadlines.
In the future, we will use a notion of predictability along with more robust scheduling.
We are finishing development on a Max/MSP object, Audio Unit plugin, and a portable standalone version of this.
+Accelerating Non-Negative Matrix Factorization (NMF)
NMF is widely used in audio source separation. The idea is to factor the time/frequency representation (spectrogram) into per-source spectral basis (W) and gain (H) matrices.
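As a sketch of what the factorization computes, here are the standard Lee–Seung multiplicative updates for the Frobenius-norm objective ||V − WH||²; the talk's accelerated implementation may use a different variant, and all names here are ours.

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-9, seed=0):
    # V: (F x T) non-negative magnitude spectrogram.
    # W: (F x r) spectral basis vectors; H: (r x T) per-frame gains.
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, r)) + eps
    H = rng.random((r, T)) + eps
    for _ in range(iters):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Each column of W acts as a learned spectral template for one source, and the corresponding row of H says how loudly that template plays in each frame, which is what makes the factorization useful for source separation.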
+The Importance of SEJITS in Developing a Music Information Retrieval (MIR) Application
Rather than using a domain-restricted language, developers write in a full-blown scripting language such as Python or Ruby.
Performance-critical functions are selected by annotation.
If efficiency-layer implementations of these functions are available, appropriate code is generated and JIT-compiled.
If not, the selected function is executed in the scripting language itself.
The scripted implementation remains the portable reference implementation.
With this simple computer music application, we expect to initially show that Tessellation can provide acceptable performance and time predictability.
In cooperation with the OS group.
[Diagram: a real-time application in Tessellation. The music program runs in Cell A under 2nd-level RT scheduler A; the audio processing & synthesis engine — most of the engine's functionality, including a parallel version of the partition-based convolution filter — runs in Cell B under 2nd-level RT scheduler B. Audio input arrives from the sound card via the initial cell and a shell; cells communicate over channels, with an intermediate deadline inside the engine and an end-to-end deadline from input to output. Additional cells provide 2nd-level scheduling on top of the Tessellation kernel (partition support).]
(*) Bottom part of the diagram was adapted from Liu and Asanovic, “Mitosys: ParLab Manycore OS Architecture,” Jan. 2008.
+Cell and Space Partitioning
A spatial partition (or cell) comprises a group of processors acting within a hardware boundary.
Each cell receives a vector of basic resources: some number of processors, a portion of physical memory, a portion of shared cache memory, and potentially a fraction of memory bandwidth.
A cell may also receive:
– Exclusive access to other resources (e.g., certain hardware devices and a raw storage partition)
– Guaranteed fractional services (i.e., QoS guarantees) from other partitions (e.g., network service and file service)
[Diagram: six tiles, each containing a CPU with an L1 cache, connected through an L1 interconnect to L2 cache banks and, via the DRAM & I/O interconnect, to DRAM; (+) a fraction of memory bandwidth.]
[Diagram (preliminary): example of a music application partitioned into cells — the music program; a time-sensitive network subsystem and network service (net partition) handling communication with other audio-processing nodes; an input device and an output device (pinned/TT partitions); a graphical interface (GUI partition) with its GUI subsystem; and the audio-processing / synthesis engine (pinned/TT partition).]
+Tessellation in Server Environment
(Tessellation OS, November 12th, 2009)
[Diagram: Tessellation in a server environment — several nodes, each running disk I/O drivers, other devices, a network-QoS service, a monitor-and-adapt service, persistent storage with a parallel file system, a large compute-bound application, and a large I/O-bound application. The nodes exchange QoS guarantees with one another and with cloud storage (bandwidth QoS).]