Introduction to Audio Content Analysis


Page 1: Introduction to Audio Content Analysis - TUNIsgn24006/PDF/L01-audio-analysis-intro.pdfIntroduction to Audio Content Analysis *Slides for this lecture were created by Anssi Klapuri

Audio analysis 1 SGN-24006

Introduction to Audio Content Analysis
*Slides for this lecture were created by Anssi Klapuri

Contents:
- challenges, applications
- speech recognition
- music transcription
- auditory scene analysis
- sources of information: acoustic signal, internal models
- mid-level data representations
- decomposing polyphonic signals along various dimensions

Audio analysis 2 SGN-24006
Understanding audio

Human listeners are very skilled at making sense of complex audio signals
- On a busy city street: noticing cars passing, footsteps approaching, people talking nearby, etc.
- In music: the ability to focus on a certain instrument in a polyphony
- Speech recognition: recognizing speech despite speaker-dependent variation, environmental noise, etc.

Audio analysis 3 SGN-24006
Computational analysis of audio

Audio content analysis is hard for many reasons:
- Polyphony: natural sounds do not occur in isolation; often several active sound sources combine into a complex mix
- Real-world signals (speech, music, environmental audio) are "dirty": analysis is a less controlled situation than synthesis (think of 3D graphics vs. machine vision, or speech synthesis vs. speech recognition)
- Perceptual attributes of sounds (pitch, timbre, loudness, etc.) have a non-trivial connection to acoustic properties
- Cognition is ultimately AI-complete: humans interpret sounds using the context and a high-level model of the world (artificial intelligence)

Audio analysis 4 SGN-24006
Audio analysis is of current interest

- Computational power allows more complex problems and larger amounts of data to be processed
- The amount of digital information increases rapidly
  - the information is futile without efficient management tools
  - it should be processed by content, not just as a pile of bits
- Computers lack perception
  - communication between humans and a computer is unnatural and inflexible (the figure shows how deaf-blind people communicate)
  - let's bring computers to the real world, not humans to the computers' world


Audio analysis 5 SGN-24006
Applications

- Speech recognition: Siri, Google voice search
- Music search: Shazam, SoundHound
- Music transcription: GuitarBots game, Wavetick lighting control
- Audio classification: movie soundtrack segmentation
- Auditory scene analysis: mobile devices that adapt to situational context, hearing aids (still in the making)

Audio analysis 6 SGN-24006
Automatic speech recognition

- Ultimate goal: to accurately convert an acoustic signal into a word sequence, independent of speaker and environment
- To improve accuracy, assumptions are often made
  - Speaker-dependent (or adapted) vs. speaker-independent
  - Clean-speech vs. environmentally robust recognition
  - Isolated-word vs. continuous-speech recognition
- Applications
  - Speech interfaces
  - Spoken document retrieval
  - Speech-to-speech translation

[Figure text: "He knew what taboos he was violating"]

Audio analysis 7 SGN-24006
Music transcription

Excerpt from Song 34 in the RWC popular music database

Figures, top-down:
1. time-domain signal
2. spectrogram
3. musical notation
4. piano roll

Applications include
- music retrieval
- intelligent processing
- music tutors, games
- auto-accompaniment for a soloist, etc.

Audio analysis 8 SGN-24006
Music transcription: subtopics

Beat tracking and meter analysis (beat/tactus = foot-tapping rate)

Multi-pitch estimation (potentially several concurrent pitches)

Drum and percussion transcription (+ instrument recognition)


Audio analysis 9 SGN-24006
Auditory scene analysis

- Analysis of sounds from our living environment
- Recognition of the context (home, street, restaurant, shop, train, office, ...) and detection of individual sound sources and events
- Applications
  - context-aware mobile devices
  - hearing aids
  - movie soundtrack segmentation and analysis (football goal detection etc.)

Audio analysis 10 SGN-24006
Auditory scene analysis vs. audio classification

- Auditory scene analysis (of a car crash): very hard in the general case
- Audio segmentation and classification (movie soundtrack): more straightforward

Audio analysis 11 SGN-24006
Sources of information: acoustic signal and internal models

- Internal models are crucial for robust analysis
  - Speech recognition systems depend on language models (e.g. probabilities of different word sequences)
  - Musicological models are equally important for music transcription
  - Auditory scene analysis employs event probabilities and sequential dependencies in different contexts
- Internal models can be learned from training material, and further adapted

[Figure: the acoustic signal and the internal models both feed the analysis, which produces the result]
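As a rough illustration of an internal model, the sketch below trains a tiny bigram language model and uses it to rank two word sequences that an acoustic front end alone might confuse. The training sentences, the function names (`train_bigram`, `log_prob`) and the add-one smoothing are assumptions made for this example, not part of the lecture material.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    vocab = set()
    for words in sentences:
        tokens = ["<s>"] + words          # <s> marks the sentence start
        vocab.update(words)
        contexts.update(tokens[:-1])      # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def log_prob(words, contexts, bigrams, vocab):
    """Add-one smoothed bigram log-probability of a word sequence."""
    tokens = ["<s>"] + words
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, cur)] + 1
        denominator = contexts[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total

# Toy training text, made up for this illustration.
corpus = [
    "we recognize speech".split(),
    "machines recognize speech with a language model".split(),
    "the beach is nice".split(),
]
contexts, bigrams, vocab = train_bigram(corpus)

# Two hypotheses of equal length that differ only in the last word.
score_speech = log_prob("we recognize speech".split(), contexts, bigrams, vocab)
score_beach = log_prob("we recognize beach".split(), contexts, bigrams, vocab)
print(score_speech > score_beach)
```

In a recognizer, a score of this kind would be combined with the acoustic score of each hypothesis; here only the language-model part is shown.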

Audio analysis 12 SGN-24006
Mid-level data representations

- Signal analysis can be viewed as a sequence of representations from low (audio signal) to high (recognition result)
- Intermediate (mid-level) representations are necessary since the high-level information is usually not directly visible in the input audio signal
- The figure shows some example mid-level representations: 1) spectrogram, 2) sinusoid tracks, 3) critical-band energies
- More about these in the coming lectures
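A minimal sketch of the first mid-level representation mentioned above, the spectrogram, computed with a plain discrete Fourier transform over Hann-windowed frames. The frame length, hop size, sample rate and test tone are assumptions chosen so that the example stays small; a practical system would use an FFT library instead of this direct sum.

```python
import cmath
import math

def stft_magnitude(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: one list of frame_len // 2 + 1 bins per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT sum for bin k (slow but dependency-free)
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

sample_rate = 8000   # Hz, assumed for the example
frame_len = 64
# A 1000 Hz test tone, which falls exactly on bin 8 (8 * 8000 / 64 Hz)
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(512)]

spec = stft_magnitude(tone, frame_len=frame_len, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
print(len(spec), peak_bin, peak_hz)
```

Each row of `spec` is one time frame, so a sustained tone shows up as the same peak bin in every frame.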


Audio analysis 13 SGN-24006
Breaking up polyphonic audio signals

There are various dimensions along which an audio signal can be decomposed
- Time (temporal segmenting)
- Frequency (filtering)
- Space (angle of arrival)
- Sinusoids vs. noise
- Sound source separation (various approaches)

The various dimensions allow extracting layers of sound to some extent

[Figure labels, in order: fundamentals; intermediate difficulty, but straightforward; ultimate goal, very difficult]
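The simplest of these dimensions, time, can be sketched as energy-based segmentation: split the signal into frames and keep the runs of frames whose mean energy exceeds a threshold. The function name `active_segments`, the frame length and the threshold are assumptions for this illustration only.

```python
import math

def active_segments(signal, frame_len=100, threshold=0.01):
    """Return (start, end) sample ranges whose mean frame energy exceeds threshold."""
    segments = []
    start = None
    n_frames = len(signal) // frame_len
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold and start is None:
            start = i * frame_len                    # a sound event begins
        elif energy <= threshold and start is not None:
            segments.append((start, i * frame_len))  # the event has ended
            start = None
    if start is not None:                            # event runs to the end
        segments.append((start, n_frames * frame_len))
    return segments

# Synthetic test signal: silence, a 400-sample tone burst, silence.
silence = [0.0] * 300
burst = [math.sin(2 * math.pi * 0.05 * n) for n in range(400)]
signal = silence + burst + silence

segments = active_segments(signal)
print(segments)
```

On real recordings the threshold would have to adapt to the background noise level; a fixed value works here only because the silence is exactly zero.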

Audio analysis 14 SGN-24006

Audio analysis 15 SGN-24006
Spatial information (angle of arrival)

- Important for human auditory scene analysis (natural environments)
- Usability of spatial information for music analysis depends on genre

[Figure: two panels, each with left and right channels]
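One way to obtain an angle of arrival from a two-channel recording is to estimate the time delay between the channels by cross-correlation and convert it to an angle with a far-field model, sin(angle) = c * delay / d. The sketch below does this for a synthetic noise source; the microphone spacing, sample rate, speed of sound and function names are all assumptions for the example.

```python
import math
import random

SPEED_OF_SOUND = 343.0   # m/s, assumed
MIC_SPACING = 0.2        # metres between the two microphones, assumed
SAMPLE_RATE = 16000      # Hz, assumed

def estimate_delay(left, right, max_lag):
    """Lag in samples by which `right` trails `left`, via cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(left)):
            m = n + lag
            if 0 <= m < len(right):
                score += left[n] * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def angle_of_arrival(delay_samples):
    """Far-field angle in degrees, measured from straight ahead (broadside)."""
    path_difference = SPEED_OF_SOUND * delay_samples / SAMPLE_RATE
    ratio = max(-1.0, min(1.0, path_difference / MIC_SPACING))
    return math.degrees(math.asin(ratio))

# Synthetic source: noise that reaches the right channel 3 samples late.
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(400)]
left = source
right = [0.0] * 3 + source[:-3]

delay = estimate_delay(left, right, max_lag=10)
angle = angle_of_arrival(delay)
print(delay, round(angle, 1))
```

A positive delay means the sound reached the left channel first, so the source lies on the left side. In studio-mixed music the channels often differ in level rather than in delay, which is one reason the usefulness of spatial cues depends on the material.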

Audio analysis 16 SGN-24006
Tonal vs. noise-like spectral components

- Perceptually and musically, it is meaningful to make a distinction between tonal and noisy spectral elements
  - noise masks a tone more efficiently than vice versa, for example
- In a music spectrogram, horizontal (tonal) and vertical (noisy, percussive) structures are often evident

[Figure: spectrogram, Brentwood jazz quartet]
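The horizontal/vertical observation suggests a simple way to tell the two kinds of structure apart: median-filter the spectrogram along time to keep horizontal ridges, along frequency to keep vertical ones, and compare the results. The sketch below applies this median-filtering idea to a toy spectrogram; it is one common approach, not necessarily the one used in the lecture demos, and the function names and filter width are assumptions.

```python
from statistics import median

def median_filter_rows(matrix, width):
    """Median-filter every row of a 2-D list with a centred window."""
    half = width // 2
    return [[median(row[max(0, i - half):i + half + 1])
             for i in range(len(row))]
            for row in matrix]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def tonal_mask(spec, width=5):
    """spec[freq][time] -> True where the tonal estimate dominates."""
    # Smoothing along time keeps sustained (horizontal) ridges
    tonal = median_filter_rows(spec, width)
    # Smoothing along frequency keeps broadband (vertical) ridges
    noisy = transpose(median_filter_rows(transpose(spec), width))
    return [[tonal[f][t] > noisy[f][t] for t in range(len(spec[0]))]
            for f in range(len(spec))]

# Toy spectrogram: low background, one sustained tone, one percussive hit.
n_freq, n_time = 9, 9
spec = [[0.1] * n_time for _ in range(n_freq)]
for t in range(n_time):
    spec[4][t] = 1.0      # horizontal ridge at frequency bin 4
for f in range(n_freq):
    spec[f][6] = 1.0      # vertical ridge at time frame 6

mask = tonal_mask(spec)
print(mask[4][2], mask[1][6])
```

Multiplying the spectrogram by the mask (or by its complement) would give rough tonal and noise-like layers; ties are assigned to the noise-like layer in this sketch.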


Audio analysis 17 SGN-24006
More demos

http://arg.cs.tut.fi/demos