Computer Music Analysis ∗

David Gerhard
School of Computing Science
Simon Fraser University
Burnaby, BC V5A 1S6
email: [email protected]

Simon Fraser University, School of Computing Science
Technical Report CMPT TR 97-13

Keywords: Automatic Music Transcription, Musical Grammars, Computer Music, Psychoacoustics.

July 11, 2002

Abstract

Computer music analysis is investigated, with specific reference to the current research fields of automatic music transcription, human music perception, pitch determination, note and stream segmentation, score generation, time-frequency analysis techniques, and musical grammars. Human music perception is investigated from two perspectives: the computational model perspective desires an algorithm that perceives the same things that humans do, regardless of how the program accomplishes this, and the physiological model perspective desires an algorithm that models exactly how humans perceive what they perceive.

∗This research is partially supported by The Natural Sciences and Engineering Research Council of Canada and by a grant from The BC Advanced Systems Institute.

1 Introduction

A great deal of work in computer music deals with synthesis—using computers to compose and perform music. An interesting example of this type of work is Russell Ovens' thesis, titled "An Object-Oriented Constraint Satisfaction System Applied to Music Composition" [Oven88]. In contrast, this report deals with analysis—using computers to analyze performed and recorded music. The interest in this field comes from the realization that decoding a musical representation into sound is fairly straightforward, but translating an arbitrary sound into a score is a much more difficult task. This is the problem of Automatic Music Transcription (AMT), which has been studied since the late 1970s. AMT consists of translating an unknown and arbitrary audio signal into a fully notated piece of musical score. A subset of this problem is monophonic music transcription, where a single melody line played on a single instrument in controlled conditions is translated into a single note sequence, often stored in a MIDI track (MIDI is discussed in Section 6.1). Monophonic music transcription was solved in 1986 with the publication of Martin Piszczalski's Ph.D. thesis, which is discussed in Section 3.0.1.

AMT is a subset of Music Perception, a field of research attempting to model the way humans hear music. Psychological studies have shown that many of the processing elements in the human auditory perceptual system do the same thing as processing elements in the human visual perceptual system, and researchers have recently begun to draw from past research in computer vision to develop computer audition. Not every processing element is transferable, however, and the differences between vision and audition are clear. There are millions of each of the four types of vision sensors (L, M, and S cones, and rods), while there are only two auditory sensors, the left and right ears. Some work has been done on using visual processing techniques to aid in audition (see Section 2.1 and Section 4.2.2), but there is more to be gained from cautiously observing the connections between these two fields.

Many research areas relate to computer music and AMT. Most arise from breaking AMT down into more manageable sub-problems, while some sprouted from other topics and studies. Six areas of current research work will be presented in this report:

• Transcription

– Music perception

– Pitch determination

– Segmentation

– Score generation

• Related Topics

– Time-frequency analysis

– Musical grammars

Part I

Transcription

The ultimate goal of much of the computer music analysis research has been the development of a system which would take an audio file as input and produce a full score as output. This is a task that a well-trained human can perform, but not in real time. The person, unless extremely gifted, requires access to the same audio file many times, paying attention to a different instrument each time. A monophonic melody line would perhaps require a single pass, while a full symphony might not be fully transcribable even with repeated auditions. Work presented in [Tang95] suggests that if two instruments of similar timbres play "parallel" musical passages at the same time, these instruments will be inseparable. An example of this is the string section of an orchestra, which is often heard as a single melody line when all the instruments are playing together.

Some researchers have decided to work on a complementary problem, that of extracting errors from performed music [Sche95]. Instead of listening to the music and writing the score, the computer listens to the music, follows the score, and identifies the differences between what the ideal music should be and what the performed music is. These differences could be due to expression in the performance, such as vibrato, or to errors in the performance, such as incorrect notes. A computer system that could do this would be analogous to a novice music student who knows how to read music and can listen to a piece and follow along in the music, but cannot yet transcribe a piece. Such a system gives us a stepping stone to the full problem of transcription.

2 Music Perception

There have been two schools of thought concerning automatic music transcription, one revolving around computational models and one revolving around psychological models.

The psychological model researchers take the "clear-box" approach, assuming that the ultimate goal of music transcription research is to figure out how humans hear, perceive and understand music. A working system is not as important as little bits of system that accurately (as close as we can tell) model the human perceptual system. This attitude is valuable because by modeling a working system – the human auditory system – we can develop a better artificial system, and gain philosophical insight into how it is that we hear things.

In contrast, the computational model researchers take the "black-box" approach, assuming that a music transcription system is acceptable as long as it transcribes music. They are more concerned with making a working machine and less concerned with modeling the human perceptual system. This attitude is valuable because in making a working system we can then work backward and say "How is this like or unlike the way we hear?" The problem is that the use of self-evolving techniques like neural nets and genetic algorithms limits our ability as humans to understand what the computer is doing.

These two fields of research rarely work together and are more often at odds with each other. Each does have valuable insights to gain from the other, and an interdisciplinary approach, using results from both fields, would be more likely to succeed.

Aside: Computational Models. A point of dispute in interdisciplinary research has often been the notion of a computational model for explaining human psychological phenomena. Computer scientists and engineers have been using computational models for many years to explain natural phenomena, but when we start going inside the mind, it hits a little closer to home.

When a scientific theory tries to explain some natural phenomenon, the goal is to be able to predict that phenomenon in the future. If we can set up a model that will predict the way the world works, and we use a computer algorithm to do this prediction, we have a computational model. The argument is that these algorithms do not do the same thing that is happening in the world, even though they predict what is happening in the world. Kinematics models do not take into account quantum effects in their calculations, and since they do not model the way the world really works, some would say they are not valuable. These models do explain and predict motion accurately within a given domain, and whether or not they model the complete nature of the universe, they do explain and predict natural phenomena.

This is less easy to accept when dealing with our own minds. We want to know the underlying processes that are going on in the brain, so how useful is a theory, even if it is good at predicting the way we work, if we don't know how close it is to our underlying processes? Wouldn't it be better to develop a theory of the mind from the opposite viewpoint and say that a model is valid if it simulates the way we work from the inside – if it concurs with how the majority of people react to a certain stimulus? This is very difficult because psychological testing cannot isolate individual processes in the brain; it can only observe input and output of the brain as a whole, and that is a system we cannot hope to model at present.

The other problem with this is one of pragmatics. Occam's razor says that when theories explain something equally well, the simpler theory is more likely to be correct. A simpler theory is also likely to be easier to program, and so simpler theories tend to come out of computational models. Of course, we must be careful that the theories we compare do in fact explain the phenomena equally well, and we must be aware that the simplest computational model we get is not necessarily the same processing that goes on in the mind.

This brings up another advantage of computational models: their testability. Psychological models can be tested, but it is much more difficult to control such experiments, because all systems in the brain are working together at the same time. In computational models, only the processing we are interested in is present, and that bit of processing can be tested with completely repeatable results.

2.1 Auditory Scene Analysis

Albert Bregman's landmark book in 1990 [Breg90] presented a new perspective in human music perception. Until then, much work had been done in the organization of human visual perception, but little had been done on the auditory side of things, and what little there was concentrated on general concepts like loudness and pitch. Bregman realized that there must be processes going on in our brains that determine how we hear sounds, how we differentiate between sounds, and how we use sound to build a "picture" of the world around us. The term he used for this picture is "the auditory scene".

The classic problem in auditory scene analysis is the "cocktail party" situation, where you are in a room with many conversations going on, some louder than the one you are engaged in, and there is background noise such as music, clanking glasses, and pouring drinks. Amid all this cacophony, humans can readily filter out what is unimportant and pay attention to the conversation at hand. Humans can track a single auditory stream, such as a person speaking, through frequency changes and amplitude changes. The noise around may be much louder than your conversation, and still you have little trouble understanding what the other person is saying. For a recent attempt to solve this problem, see [GrBl95].

An analogy that shows just how much processing is done in the auditory system is the lake analogy. Imagine digging two short trenches up from the shore of a lake, and then stretching handkerchiefs across the trenches. The human auditory system is then like determining how many boats are on the lake, what kind of engines are running in them, which direction they are going, which one is closer, if any large objects have been recently thrown in the lake, and almost anything else, merely from observing the motion of the handkerchiefs. When we bring the problem out to our conscious awareness, it seems impossible, and yet we do this all the time every day without thinking about it.

Figure 1: The Face-Vase Illusion.

Bregman shows that there are many phenomena going on in the processing of auditory signals that are similar to those in visual perception. Exclusive allocation indicates that properties belong to only one event. When it is not clear which event that property applies to, the system breaks down and illusions are perceived. The most common visual example of this is the famous "face-vase" illusion, Figure 1, where background and foreground are ambiguous, and it is not clear whether the boundary belongs to the vase or the two faces. This phenomenon occurs in audition as well. In certain circumstances, musical notes can be ambiguous. Depending on what follows a suspended chord, the chord can be perceived as both major and minor, until the ambiguity is removed by resolving the chord.

Apparent motion occurs in audition as it does in vision. When a series of lights is flashed on and off in a particular sequence, it seems as if there is a single light traveling along the line. If the lights are flashed too slowly or they are too far apart, the illusion breaks down, and the individual lights are seen turning on and off. In audition, a similar kind of streaming occurs, in two dimensions. If a series of notes are of a similar frequency, they will tend to stream together, even if there are notes of dissimilar frequencies interspersed. A sequence that goes "Low-High-Low-High..." will be perceived as two streams, one high and one low, in two circumstances. If the tempo is fast enough, the notes seem closer together and clump into streams. Also, if the difference between the "Low" and the "High" frequencies is large enough, the streaming will occur. If the tempo is slow and the frequencies do not differ by much, however, the sequence will be perceived as one stream going up and down in rapid succession.

There are more examples of the link between vision and audition in Bregman's book, as well as further suggestions for a model of human auditory perception. Most of the book explains experiments and results that reinforce his theories.

2.2 Axiomatization of Music Perception

In 1995, Andranick Tanguiane presented the progress of a model of human music perception that he had been working on for many years [Tang95]. The model attempts to explain human music perception in terms of a small set of well-defined axioms. The axioms that Tanguiane presents are:

Axiom 1 (Logarithmic Pitch) The frequency axis is logarithmically scaled.

Axiom 2 (Insensitivity to the phase of the signal) Only discrete power spectra are considered.

Axiom 3 (Grouping Principle) Data can be grouped with respect to structural identity.

Axiom 4 (Simplicity Principle) Data are represented in the least complex way in the sense of Kolmogorov (least memory stored).

Axiom 1 stems from the well-recognized fact that a note with a frequency twice that of some reference note is an octave higher than the reference note. An octave above "A" at 440 Hz would be "A" at 880 Hz, and then "A" at 1760 Hz, and so on. This is a logarithmic scale, and has been observed and documented exhaustively.

Axiom 2 is a bit harder to recognize. Two different time signals will produce the same auditory response if the power frequency spectra are the same. The phase spectrum of a signal indicates where in the cycle each individual sinusoid starts. As long as the sinusoidal components are the same, the audio stimulus will sound the same whether the individual components are in phase or not.

The third axiom is an attempt to describe the fact that humans group audio events in the same gestalt manner as other perceptual constructs. Researchers in musical grammars have identified this, and there is more material on this phenomenon in Section 2.1.

The last axiom suggests that while we may hear very complicated rhythms and harmonies in a musical passage, this is done in mental processing, and the mental representation used is that which uses the least memory.

Tanguiane argues that all of these axioms are necessary because without any one of them, it would be impossible for humans to recognize chords as collections of tones. The fact that we are able to perceive complex tones as individual acoustical units, claims Tanguiane, argues for the completeness of the axiom set. This perception is only possible when the musical streams are not parallel.

2.3 Discussion

Research has approached the problem of music perception and transcription from two different directions. In order to fully understand music enough to develop a computational model, we must recognize the psychological processing that is going on. Early work in automatic music transcription was not concerned with the psychological phenomena, but with getting some sort of information from one domain to another. This was a valuable place to start, but in its present state, the study of music perception and transcription has much to gain from the study of the psychology of audition.

3 Beginnings

In the mid-1970s, a few brave researchers tried to tackle the whole transcription problem by insisting on a limited domain. Piszczalski and Galler presented a system that tried to transcribe musical sounds to a musical score [PiGa77]. The intent was to take the presented system and develop it toward a full transcription system. In order to make the system functional, they required the input audio to be monophonic, from a flute or a recorder, producing frequency spectra which are easily analyzed.

3.0.1 Piszczalski and Galler

At this early point in the research, Piszczalski and Galler recognized the importance of breaking the problem down into stages. They divided their system into three components, working bottom-up. First, a signal processing component identifies the amplitudes and starting and stopping times of the component frequencies. The second stage takes this information and formulates note hypotheses for the identified intervals. Finally, this information is analyzed in the context of notation to identify beam groups, measures, etc., and a score is produced.

A windowed Fourier transform is used to extract frequency information. Three stages of heuristics are used to identify note starts and stops. The system first recognizes fundamental frequencies below or above the range of hearing as areas of silence. More difficult boundaries are then identified by fluctuations in the lower harmonics and abrupt changes in fundamental frequency amplitude. The second phase averages the frequencies within each perceived "note" to determine the pitch, which is then compared to a base frequency. In the final phase, a new base frequency is determined from fluctuations in the notes, and grouping, beaming and other notational tasks are performed.

Since this early paper, Piszczalski has proposed a computational model of music transcription in his 1986 Ph.D. thesis [Pisz86]. The thesis describes the progress and completion of the system started in 1977. There is also a thorough history of AMT up to 1986. This thesis is a good place to start for those just getting into computer music research. It is now more than ten years old and much has been accomplished since then in some areas, but monophonic music transcription remains today more or less where Piszczalski left it in 1986.

3.0.2 Moorer

In 1977, James Moorer presented a paper [Moor77] on work that would turn out to be the first attempt at polyphonic music transcription. Piszczalski's work required the sound input to be a single instrument and a single melody line, while Moorer's new system allowed for two instruments playing together. Restrictions on the input to this system were tighter than in Piszczalski's work: there could only be two instruments, they must both be of a type such that the pitches are voiced (no percussive instruments) and piecewise constant (no vibrato or glissando), and the notes being played together must be such that the fundamental frequencies and harmonics of the notes do not overlap. This means that no note pairs are allowed where the fundamental frequencies are small whole number ratios of each other.

Moorer claims that these restrictions, apart from the last one, are merely for convenience and can easily be removed in a larger system. He recognized that the fundamental frequency restriction is difficult to overcome, as most common musical intervals are on whole number frequency ratios. The other restrictions he identified have been shown to be more difficult to overcome than he first thought. Research into percussive musical transcription has shown that it is sufficiently different from voiced transcription to merit independent work, and glissando and vibrato have proved to be difficult problems. Allowing more than two instruments in Moorer's system would require a very deep restriction on the frequencies.

These restrictions cannot be lifted without a redesign of the system, because notes an octave apart, or even a major third apart, have fundamental frequencies related by small whole number ratios, and thus cannot be separated by his method.

Moorer uses an autocorrelation function instead of a Fourier transform to determine the period of the signal, which is used to determine the pitch ratio of the two notes being played (pitch and period are not necessarily interchangeable; see Section 4.1.2). The signal is then separated into noise segments and harmonic segments, by assigning a quality measure to each part of the signal. In the harmonic portions, note hypotheses are formed based on fundamental frequencies and their integer multiples. Once hypotheses of the notes are confirmed, the notes are grouped into two melody lines (which do not cross) by first finding areas where the two notes overlap completely, and then filling in the gaps by heuristics and trial and error. Moorer then uses an off-the-shelf program to create a score out of the two melody lines.

3.1 Breakdown

The early attempts at automatic music transcription have shown that the problem must be limited and restricted to be solved, and that the partial solutions cannot easily be expanded to a full transcription system. Many researchers have chosen to break the problem down into more manageable sub-problems. Each researcher has her own ideas as to how the problem should be decomposed, and three prominent ones are presented here.

3.1.1 Piszczalski and Galler

In [PiGa77], discussed above, Piszczalski and Galler propose three components, breaking the process down temporally as well as computationally (a code skeleton of these stages is sketched after the list):

1. (low-level) Determine the fundamental frequency of the signal at every point, as well as where the frequencies start and stop.

2. (intermediate-level) Infer musical notes from frequencies determined in stage 1.

3. (high-level) Add notational information such as key signature and time signature, as well as bar lines and accidentals.
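
As an illustration only, the three stages can be thought of as a simple processing pipeline. The sketch below is not Piszczalski and Galler's implementation; the function names and data types are placeholders chosen for this report.

    # Illustrative skeleton of the three-stage breakdown above.
    # The stage functions are placeholders, not Piszczalski and
    # Galler's actual algorithms.
    from typing import List, Tuple

    Frame = Tuple[float, float, float]   # (time_s, frequency_hz, amplitude)
    Note = Tuple[float, float, float]    # (onset_s, duration_s, frequency_hz)

    def low_level_analysis(samples: List[float], sample_rate: int) -> List[Frame]:
        """Stage 1: estimate frequency and amplitude for each time window."""
        raise NotImplementedError

    def infer_notes(frames: List[Frame]) -> List[Note]:
        """Stage 2: group frames into note hypotheses with pitch and duration."""
        raise NotImplementedError

    def notate(notes: List[Note]) -> str:
        """Stage 3: add key/time signature, bar lines, accidentals; emit a score."""
        raise NotImplementedError

    def transcribe(samples: List[float], sample_rate: int) -> str:
        return notate(infer_notes(low_level_analysis(samples, sample_rate)))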

Piszczalski's 1986 thesis proposes a larger and more specific breakdown, in terms of the intermediate representations. In this breakdown, there are eight representations, suggesting seven processing stages. The proposed data representations are, from lowest level to highest level:

• Time waveform: the original continuous series of analog air pressures representing the sound.

• Sampled signal: a series of discrete voltages representing the time waveform at every sample time.

• Digital spectrogram: the sinusoid spectrum of the signal at each time window.

• Partials: the estimation of the frequency position of each peak in the digital spectrogram.

• Pitch candidates: possibilities for the pitch of each time frame, derived from the partials.

• Pitch and amplitude contours: a description of how the pitch and amplitude vary with time.

• Average pitch and note duration: discrete acoustic events which are not yet assigned a particular chromatic note value.

• Note sequence: the final representation.

The first four representations fit into stage 1 of the above breakdown, the next three fit into stage 2, and the last one fits into stage 3. In his system, Piszczalski uses pre-programmed notation software to take the note sequence and create a graphical score. The problem of inferring high-level characteristics, such as the key signature, from the note sequence has been researched and will be covered in Section 6.2.

3.1.2 Moorer

James Moorer's proposed breakdown is similar to that of Piszczalski and Galler, in that it separates frequency determination (pitch detection) from note determination, but it assigns a separate processing segment to the identification of note boundaries, which Piszczalski and Galler group together with pitch determination. Moorer uses a pre-programmed score generation tool to do the final notation. Indeed, this is a part of automatic music transcription which has been, for the most part, solved since the time of Moorer and his colleagues, and there exist many commercial software programs today that will translate a MIDI signal (essentially a note sequence) into a score. For more on MIDI see Section 6.1.

3.1.3 Tanguiane

In 1988, Andranick Tanguiane published a paper on recognition of chords [Tang88]. Before this, chord recognition had been approached from the perspective of polyphonic music—break the chord down into its component notes. Tanguiane didn't believe that humans actually did this dissection for individual chords, and so his work on music recognition concentrated on the subproblems that were parts of music. He did work on chord recognition, separate from rhythm recognition, separate again from melody recognition. The division between the rhythm component and the tone component has been identified in later work on musical grammars, and will be discussed in Section 8.

3.2 An Adopted Breakdown

For the purposes of this report, AMT will be broken down into the following sub-categories:

1. Pitch determination: identification of the pitch of the note or notes in a piece of music. Work has been done on instantaneous pitch detection as well as pitch tracking.

2. Segmentation: breaking the music into parts. This includes identification of note boundaries, separation of chords into notes, and dividing the musical information into rhythm and tone information.

3. Score generation: taking the segmented information and producing a score. Depending on how much processing was done in the segmentation section, this could be as simple as taking fully defined note combinations and sequences and printing out the corresponding score. It could also include identification of key and time signature.

4 Pitch Determination

Pitch determination has been called Pitch Extraction and Fundamental Frequency Identification, among a variety of other titles; however, pitch and frequency are not exactly the same thing, as will be discussed later. The task is this: given an audio signal, what is the musical pitch associated with the signal at any given time? This problem has been applied to speech recognition as well, since some languages such as Chinese rely on pitch as well as phonemes to convey information. Indeed, spoken English relies somewhat on pitch to convey emotional or insinuated information: a sentence whose pitch increases at the end is interpreted as a question.

In monophonic music, the note being played has a pitch, and that pitch is related to the fundamental frequency of the quasi-periodic signal that is the musical tone. In polyphonic music, there are many pitches acting at once, and so a pitch detector may identify one of those pitches, or a pitch that represents the combination of tones but is not present in any of them separately. While pitch is indispensable information for transcription, more features must be considered when polyphonic music is being transcribed.

Pitch following and spectrographic analysis deal with the continuous time-varying pitch across time. As with instantaneous pitch determination, many varied algorithms exist for pitch tracking. Some of these are modified image processing algorithms, since a time-varying spectrum has three dimensions (frequency, time and amplitude) and thus can be considered an image, with time corresponding to width, frequency corresponding to height, and amplitude corresponding to pixel value.

Pitch determination techniques have been understood for many years, and while improvements to the common algorithms have been made, few new techniques have been identified.

4.1 Instantaneous Pitch Techniques

Detecting the pitch of a signal is not as easy as detecting the period of oscillation. Depending on the instrument, the fundamental frequency may not be the pitch, or the lowest frequency component may not have the highest amplitude.

4.1.1 Period Detectors

Natural music signals are pseudo-periodic, and can be modeled by a strictly periodic signal time-warped by an invertible function [ChSM93]. They repeat, but each cycle is not exactly the same as the previous, and the cycles tend to change in a smooth way over time. It is still meaningful to discuss the period of such a signal, because while each cycle is not the exact duplicate of the previous, they differ only by a small amount (within a musical note), and the distance from one peak to the next can be considered one cycle. The period of a pseudo-periodic signal is how often the signal "repeats" itself in a given time window.

Period detectors seek to estimate exactly how fast a pseudo-periodic signal is repeating itself. The period of the signal is then used to estimate the pitch, through more complex techniques described below.

Fourier Analysis. The "old standard" when discussing the frequency of a signal. A signal is decomposed into component sinusoids, each of a particular frequency and amplitude. If enough sinusoids are used, the signal can be reconstructed within a given error limit. The problem is that the discrete Fourier transform centers sinusoids around a given base frequency. The exact period of the signal must then be inferred by examining the Fourier components. The common algorithm used to calculate the Fourier spectrum is the fast Fourier transform (FFT).
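
As a concrete (and deliberately naive) illustration, the sketch below picks the strongest FFT bin as the frequency estimate. It assumes NumPy and a monophonic signal, and it ignores the refinements (interpolation between bins, harmonic checks) that a real system would need; the bin spacing noted in the comment is exactly the approximation problem described at the end of this subsection.

    # Naive FFT-based frequency estimate: pick the strongest bin.
    # Precision is limited to sample_rate / len(signal) Hz, and the
    # strongest bin is not always the fundamental.
    import numpy as np

    def fft_peak_frequency(signal, sample_rate):
        windowed = signal * np.hanning(len(signal))
        spectrum = np.abs(np.fft.rfft(windowed))
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
        return freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin

    # Example: a 440 Hz sine sampled at 44.1 kHz for 0.1 s.
    sr = 44100
    t = np.arange(int(0.1 * sr)) / sr
    print(fft_peak_frequency(np.sin(2 * np.pi * 440 * t), sr))  # ~440, within one bin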

Chirp Z Transform. A method presented in [Pisz86], it uses a charge-coupled device (CCD) as an input. The output is a frequency spectrum of the incoming signal, and in theory should be identical to the output of a Fourier transform on the same signal. The method is extremely fast when implemented in hardware, but performs much slower than the FFT when simulated in software. The CCD acts as a delay line, creating a variable filter. The filter coefficients in the delay line can be set up to produce a spectrum, or used as a more general filter bank.

Cepstrum Analysis. This technique uses the Fourier transform described above, with another layer of processing. The log magnitude of the Fourier coefficients is taken, and then inverse Fourier-transformed. The result is, in theory, a large peak at the quefrency (period) of the original signal. This technique sometimes needs tweaking as well.
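
A minimal sketch of the cepstral method, again assuming NumPy; restricting the peak search to a plausible pitch range is one example of the "tweaking" mentioned above.

    # Cepstrum pitch estimate: inverse-transform the log magnitude
    # spectrum and look for a peak at the fundamental period.
    # Works best on harmonic-rich tones.
    import numpy as np

    def cepstrum_pitch(signal, sample_rate, fmin=50.0, fmax=1000.0):
        spectrum = np.fft.rfft(signal * np.hanning(len(signal)))
        log_mag = np.log(np.abs(spectrum) + 1e-12)
        cepstrum = np.fft.irfft(log_mag)
        lo = int(sample_rate / fmax)      # shortest period (samples) considered
        hi = int(sample_rate / fmin)      # longest period (samples) considered
        peak = lo + np.argmax(cepstrum[lo:hi])
        return sample_rate / peak         # period in samples -> frequency in Hz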

Filter Banks. Similar to Fourier analysis, this technique uses small bandpass filters to determine how much of each frequency band is in the signal. By varying the center frequency of the filters, one can accurately determine the frequency that passes the largest component and therefore the period of the signal. This is the most psychologically faithful model, because the inner ear acts as a bank of filters, providing output to the brain through a number of orthogonal frequency channels.

Autocorrelation. A communication systems technique, this consists of seeing how similar a signal is to itself at each point. The process can be visualized as follows: take a copy of the signal and hold it up to the original. Move it along, and at each point make a measurement of how similar the signals are. There will be a spike at "0", meaning that the signals are exactly the same when there is no time difference, but after a little movement, there should be another spike where one cycle lines up with the previous cycle. The period can be determined by the location of the first spike from 0.
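
This description translates almost directly into code. The sketch below assumes NumPy and a single monophonic frame; practical detectors add normalization and interpolation around the chosen lag.

    # Autocorrelation pitch estimate: find the strongest peak after
    # lag 0 within a plausible period range.
    import numpy as np

    def autocorr_pitch(signal, sample_rate, fmin=50.0, fmax=1000.0):
        signal = signal - np.mean(signal)
        ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
        lo = int(sample_rate / fmax)            # shortest period considered
        hi = int(sample_rate / fmin)            # longest period considered
        lag = lo + np.argmax(ac[lo:hi])         # lag of the dominant repeat
        return sample_rate / lag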

The problem with most of these techniques is that they assume a base frequency, and all higher components are multiples of the first. Thus, if the frequency of the signal does not lie exactly on the frequency of one of the components (for example, on one of the frequency channels in a bank of filters), then the result is a mere approximation, and not an exact value for the period of the signal.

4.1.2 Pitch from Period

A common assumption is that the pitch of a signal is directly given by its period. For simple signals such as sinusoids, this is correct, in that the tone we hear is directly related to how fast the sinusoid cycles. In natural music, however, many factors influence the period of the signal apart from the actual pitch of the tone within the signal. Such factors include the instrument being played, reverberation, and background noise. The difference between period and pitch is this: a periodic signal at 440 Hz has a pitch of "A", but a period of about 0.00227 seconds.
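
For concreteness, the 440 Hz example can be worked in a few lines: the measured frequency is mapped onto the logarithmic scale of Axiom 1 and rounded to the nearest equal-tempered note. This snippet is illustrative and not part of any system discussed here.

    # Map a frequency in Hz to the nearest equal-tempered note name,
    # using A4 = 440 Hz as the reference (MIDI note 69).
    import math

    NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

    def note_name(freq_hz, a4=440.0):
        midi = 69 + round(12 * math.log2(freq_hz / a4))
        return NAMES[midi % 12] + str(midi // 12 - 1)

    print(note_name(440.0), 1.0 / 440.0)  # A4 0.00227... (pitch "A", period ~0.00227 s)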

A technique proposed in [Pisz86] to extract the pitch from the period consists of formulating hypotheses and then scoring them, selecting the highest score as the fundamental frequency of the note. Candidates are selected by comparing pairs of frequency components to see if they represent a small whole number ratio with respect to other frequency components. All pairs of partials are processed in this way, and the result is a measure of pitch strength versus fundamental frequency.
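
One loose reading of this kind of hypothesis scoring is sketched below; it is not Piszczalski's exact procedure. Each candidate fundamental is scored by how much partial amplitude lies near its integer multiples.

    # Score candidate fundamentals by how well the measured partials
    # line up with integer multiples of each candidate.
    def pitch_strength(partials, candidates, tolerance=0.03):
        """partials: list of (frequency_hz, amplitude); candidates: list of Hz."""
        scores = {}
        for f0 in candidates:
            score = 0.0
            for freq, amp in partials:
                ratio = freq / f0
                harmonic = round(ratio)
                if harmonic >= 1 and abs(ratio - harmonic) < tolerance * harmonic:
                    score += amp
            scores[f0] = score
        return max(scores, key=scores.get)
        # Note: sub-harmonics of the true fundamental score equally well,
        # which is why real systems add octave-error handling.

    # Example: partials of a 220 Hz tone with a weak fundamental.
    print(pitch_strength([(440, 1.0), (660, 0.8), (880, 0.6)],
                         [220, 330, 440]))   # -> 220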

4.1.3 Recent Research

Xavier Rodet has been doing work with music transcription and speech recognition since before 1987. His fundamental frequency estimation work has been done with Boris Doval, and they have used techniques such as Hidden Markov Models and neural nets. They have worked with frequency tracking as well as estimation.

In [DoRo93], Doval and Rodet propose a system for the estimation of the fundamental frequency of a signal, based on a probabilistic model of pseudo-periodic signals previously proposed by them in [DoRo91]. They consider pitch and fundamental frequency to be two different entities which sometimes hold the same value.

The problem they address is the misidentification of the fundamental frequency by the computer in cases where a human can easily identify it. There are cases where a human observer has difficulty identifying the fundamental frequency of a signal, and in such cases they do not expect the algorithm to perform well. The set of partials that the algorithm observes is estimated by a Fourier transform, and consists of signal partials (making up the pseudo-periodic component) and noise partials (representing room noise or periodic signals not part of the main signal being observed).

Doval and Rodet's probabilistic model consists of a number of random variables, including a fundamental frequency, an amplitude envelope, the presence or absence of specific harmonics, the probability density of specific partials, and the number and probability of other partials and noise partials. Partials are first classified as harmonic or not, and then a likelihood for each fundamental frequency is calculated based on the classification of the corresponding partials. The fundamental frequency with maximum likelihood is chosen as the fundamental frequency for that time frame. This paper also presented work on frequency tracking; see Section 4.2.

In 1994, Quiros and Enríquez published a work on loose harmonic matching for pitch estimation [QuEn94]. Their paper describes a pitch-to-MIDI converter which searches for evenly spaced harmonics in the spectrum. While the system by itself works well, the authors present a pre-processor and a post-processor to improve performance. The pre-processor minimizes noise and aberrant frequencies, and the post-processor uses fuzzy neural nets to determine the pitch from the fundamental frequency.

The system uses a "center of gravity" type of measurement to more accurately determine the location of the spectral peaks. Since the original signal is only pseudo-periodic, an estimation of the spectrum of the signal is used, based on a given candidate frequency. The error between the spectrum estimation and the true spectrum will be minimal where the candidate frequency is most likely to be correct.

Ray Meddis and Lowel O'Mard presented a system for extracting pitch that attempts to do the same thing as the ears do [MeOM95]. Their system observes the auditory input at many frequency bands simultaneously, the same way that the inner ear transforms the sound wave into frequency bands using a filter bank. The information that is present in each of these bands can then be compared, and the pitch extracted. This is a useful method because it allows auditory events to be segmented in terms of their pitch, using onset characteristics. Two channels whose input begins at the same time are likely to be recognizing the same source, and so information from both channels can be used to identify the pitch.

4.1.4 Multi-Pitch Estimation for Speech

Dan Chazan, Yoram Stettiner and David Malah presented a paper on multi-pitch estimation [ChSM93]. The goal of their work was to segment a signal containing multiple speakers into individuals, using the pitch of each speaker as a hook. They represent the signal using a sum of quasiperiodic signals, with a separate warping function for each quasiperiodic signal, or speaker.

It is unclear if this work can be extended to music recognition, because only the separation of the speakers was the goal. Octave errors were not considered, and the actual pitch of the signal was secondary to the signal separation. Work could be done to augment the separation procedure with a more robust or more accurate pitch estimation algorithm. The idea of a multi-pitch estimator is attractive to researchers in automatic music transcription, as such a system would be able to track and measure the overlapping pitches of polyphonic music.

4.1.5 Discussion

There is work currently being done on pitch detection and frequency detection techniques, but most of this work is merely applying new numerical or computational techniques to the original algorithms. No really new ideas seem pending, and the work being done now consists of increasing the speed of the existing algorithms.

If a technique that could accurately determine the fundamental frequency without requiring an estimation from the period or the spectrum could be found, it would change this field of research considerably. As it stands, the prevalent techniques are estimators, and require checking a number of candidates for the most likely frequency.

Frequency estimators and pitch detectors work well only on monophonic music. Once a signal has two or more instruments playing at once, determining the pitch from the frequency becomes much more difficult, and monophonic techniques such as spectrum peak detection fail. Stronger techniques such as multi-resolution analysis must be used here, and these topics will be discussed in Section 7.

4.2 Pitch Tracking

Determining the instantaneous frequency or pitch of a signal may be a more difficult problem than needs to be solved. No time frame is independent of its neighbors, and for pseudo-periodic signals within a single note, very little change occurs from one time frame to the next. Tracking algorithms use the knowledge acquired in the last frame to help estimate the pitch or frequency in the present frame.

4.2.1 From the Spectrogram

Most of the pitch tracking techniques that are in use or under development today stem from pitch determination techniques, and these use the spectrogram as a basis. Individual time frames are linked together and information is passed from one to the next, creating a pitch contour. Windowing techniques smooth the transition from one frame to the next, and interpolation means that not every time frame needs to be analyzed. Index frames may be considered, and frames between these key frames should be processed only if changes occur between the key frames. These key frames must be close enough together not to miss any rapid changes.

While improvements have been made on this idea (see [DoNa94]), the basic premise remains the same: use the frequency obtained in the last frame as an initial approximation for the frequency in the present frame.
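
That premise can be stated very compactly in code. The sketch below assumes some per-frame estimator with the same signature as the autocorrelation example earlier; the two-semitone search window is an arbitrary illustrative choice.

    # Frame-by-frame pitch tracking: the previous estimate constrains
    # the search range for the current frame.
    def track_pitch(frames, estimate, sample_rate, max_jump=1.12):
        """frames: list of sample arrays; estimate(frame, sr, fmin, fmax) -> Hz."""
        contour, prev = [], None
        for frame in frames:
            if prev is None:
                f0 = estimate(frame, sample_rate, 50.0, 1000.0)
            else:
                # Search only within about two semitones of the last
                # frame's estimate (2^(2/12) is roughly 1.12).
                f0 = estimate(frame, sample_rate, prev / max_jump, prev * max_jump)
            contour.append(f0)
            prev = f0
        return contour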

In [DoRo93], presented earlier, a section on fundamental frequency tracking is presented where the authors suggest the use of Hidden Markov Models. Their justification is that their fundamental frequency model is probabilistic. A discrete-time, continuous-state HMM is used, with the optimal state sequence being found by the Viterbi algorithm. In their model, a state corresponds to an interval of the histogram. The conclusion they come to is that it is possible to use HMMs on a probabilistic model to track the frequency across time frames. HMMs are also used in [DeGR93], where partials are tracked instead of the fundamental frequency, and the ultimate goal is sound synthesis. A natural sound is analyzed using Fourier methods, and noise is stripped. The partials are identified and tracked, and a synthetic sound is generated. This application is not directly related to music transcription, but rather to music compression; however, the tracking of sound partials instead of fundamental frequency could prove a useful tool.

4.2.2 From Image Processing

The time-varying spectrogram can be considered an image, and thus image processing techniques can be applied. This analogy has its roots in psychology, where the similarity between visual and audio processing has been observed in human perception. This is discussed further in Section 2.

In the spectrum of a single time frame, the pitch is represented as a spike or a peak for most of the algorithms mentioned above. If the spectra from consecutive frames were lined up, forming the third dimension of time, the result would be a ridge representing the time-varying pitch. Edge following and ridge following techniques are common in image processing, and could be applied to the time-varying spectra to track the pitch. The reader is referred to [GoWo92] for a treatment of these algorithms in image processing. During the course of this research, no papers were found indicating the application of these techniques to pitch tracking. This may be a field worthy of exploration.
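
Since no published application was found, the following is only a toy sketch of the idea, under the assumption that the spectrogram is stored as a frames-by-bins array: start at the strongest bin of the first frame and greedily follow the ridge a few bins up or down in each successive frame.

    # Greedy ridge following on a spectrogram (frames x bins): from the
    # current bin, move to the strongest nearby bin in the next frame.
    import numpy as np

    def follow_ridge(spectrogram, neighbourhood=3):
        spec = np.asarray(spectrogram)
        path = [int(np.argmax(spec[0]))]
        for frame in spec[1:]:
            lo = max(0, path[-1] - neighbourhood)
            hi = min(len(frame), path[-1] + neighbourhood + 1)
            path.append(lo + int(np.argmax(frame[lo:hi])))
        return path   # one frequency-bin index per time frame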

5 Segmentation

There are two types of segmentation in music transcription. A polyphonic music piece is segmented into parallel pitch streams, and each pitch stream is segmented into sequential acoustic events, or notes. If there are five instruments playing concurrently, then five different notes should be identified for each time frame. For convenience, we will refer to the note-by-note segmentation in time simply as segmentation, and we will refer to instrument melody line segmentation as separation. Thus, we separate the polyphonic music into monophonic melody streams, and then we segment these melody streams into notes.

Separation is the difference between monophonic and polyphonic music transcription. If a reliable separation system existed, then one could simply separate the polyphonic music into monophonic lines and use monophonic techniques. Research has been done on source separation using microphone arrays, identifying the different sources by the delay between microphones; however, it is possible to segment polyphonic sound even when all of the sound comes from one source. This happens when we hear the flute part or the oboe part of a symphony stored on CD and played through a set of speakers. For this reason, microphone array systems will not be presented in this report; however, arrays consisting of exactly two microphones could be considered physiologically correct, since the human system is binaural.

5.1 Piszczalski

The note segmentation section in Piszczalski's thesis takes the pitch sequence generated by the previous section as input. Several heuristics are used to determine note boundaries. The system begins with the boundaries easiest to perceive, and if unresolved segments exist, moves on to more computationally complex algorithms.

The first heuristic for note boundaries is silence. This is perceived by the machine as a period of time where the associated amplitude of the pitch falls below a certain threshold. Silence indicates the beginning or ending of a note, depending on whether the pitch amplitude is falling into the silence or rising out of the silence.

The next heuristic is pitch change. If the perceived pitch changes rapidly from one time frame to the next, it is likely that there is a note boundary there. Piszczalski's system uses a logarithmic scale independent of absolute tuning, with a change of one half of a chromatic step over 50 milliseconds indicating a note boundary.

These two heuristics are assumed to identify the majority of note boundaries. Other algorithms are put in place to prevent inaccurate boundary identifications. Octave pitch jumps are subjected to further scrutiny because they are often the result of instrument harmonics rather than note changes. Other scrutinizing heuristics include the rejection of frequency glitches and amplitude crevices, where the pitch or the amplitude changes sufficiently to register a note boundary but then rapidly changes back to the original level.

The next step is to decide on the pitch and duration for each note. Time frames with the same pitch are grouped together, and a region growing algorithm is used to pick up any stray time frames containing pitch. Aberrant pitches in these frames are associated with the preceding note and the frequency is ignored. The pitch of the note is then determined by averaging the pitch of all time frames in the note, and the duration is determined by counting the number of time frames in the note and finding the closest appropriate note duration (half, quarter, etc.). Piszczalski's claim is that the system generates less than ten percent false positives or false negatives.
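
A compact sketch of the first two heuristics (silence and pitch change) is given below. It assumes per-frame pitch and amplitude arrays with frames roughly 50 ms apart; the threshold values are illustrative, not Piszczalski's.

    # Note-boundary detection from per-frame pitch and amplitude,
    # using the silence and pitch-change heuristics described above.
    import math

    def note_boundaries(pitches, amps, silence_thresh=0.01, jump_semitones=0.5):
        bounds = []
        for i in range(1, len(pitches)):
            silent_now = amps[i] < silence_thresh
            silent_before = amps[i - 1] < silence_thresh
            if silent_now != silent_before:          # entering or leaving silence
                bounds.append(i)
            elif not silent_now and pitches[i] > 0 and pitches[i - 1] > 0:
                jump = abs(12 * math.log2(pitches[i] / pitches[i - 1]))
                if jump >= jump_semitones:           # half a chromatic step or more
                    bounds.append(i)
        return bounds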

5.2 Smith

In 1994, Leslie Smith presented a paper to the Journal of New Music Research discussing his work on sound segmentation, inspired by physiological research and auditory scene analysis [Smit94]. The system is not confined to the separation of musical notes, and uses onset and offset filters, searching for the beginnings and endings of sounds. The system works on a single audio stream, which corresponds to monophonic music.

Smith's system is based on a model of the human audio system and, in particular, the cochlea. The implications are that while the system is more difficult to develop, the final goal is less a working system and more an understanding of how the human system works.

The first stage of Smith's system is to filter the sound and acquire the spectra. This closely models the human process: it is known that the cochlea is an organ that converts time waveforms into frequency information.

One might ask why not use a model of the human system as a first stage in any computational pitch perception algorithm. The reason is that the cochlea uses 32 widely spaced frequency channels [Smit94]. The processing necessary to go from 32 channels to an exact pitch is very complicated, more so than the algorithms that approximate a pitch from the hundreds of channels in a Fourier spectrum. Until we know how the brain interprets the information in these channels, pitch extraction might as well use the greater amount of information available in modern spectrographic techniques.

The outputs of the 32 filter bands are summed to give an approximation of the total signal energy on the auditory nerve, and this signal is used to do the onset/offset filtering. Theories have been stated that human onset/offset perception is based on frequency and amplitude, either excitatory or inhibitory. Smith's simplification is to interpret all the frequencies at once, and even he considers this too simple. It is evident, however, that until we know how the brain interprets the different frequencies to produce a single onset/offset signal, this simplification is acceptable, and instructive.

The onset/offset filters themselves are drawn from image processing, and use a convolution function across the incoming signal. This requires some memory of the signal, but psychological studies have shown that human audio perception does rely on memory more than on the instantaneous value of the pressure on the eardrum. The beginning of a sound is identified when the output of this convolution rises above a certain threshold, but the end of the sound is more difficult to judge. Sounds that end sharply are easy, but as a sound drifts off, the boundary is less obvious. Smith suggests placing the end of the sound at the next appropriate sound beginning; however, this disagrees with Bregman's theory that a boundary can correspond to exactly one sound, and is ambiguous if applied to more than one sound.
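
A rough illustration of onset detection by convolution and thresholding is sketched below, operating on a summed-energy envelope. The "recent minus past" kernel is a generic edge detector, not Smith's actual onset filter, and the threshold is arbitrary.

    # Onset detection on an energy envelope: convolve with a kernel
    # that compares recent energy against past energy, then report
    # rising crossings of a threshold.
    import numpy as np

    def detect_onsets(envelope, width=10, threshold=0.2):
        kernel = np.concatenate([np.full(width, 1.0 / width),    # recent energy
                                 np.full(width, -1.0 / width)])  # minus past energy
        edge = np.convolve(envelope, kernel, mode="same")
        onsets = []
        for i in range(1, len(edge)):
            if edge[i] >= threshold and edge[i - 1] < threshold:
                onsets.append(i)       # index where the rise crosses the threshold
        return onsets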

Smith's work is intended to model the human perceptual system and to be useful on any sound. He mentions music often, because it is a sound that is commonly separated into events (notes), but the work is not directly applicable to monophonic music segmentation yet. Further study on the frequency-dependent nature of the onset/offset filters of the human could lead to much more accurate segmentation procedures, as well as a deeper understanding of our own perceptual systems.

5.3 Neural Oscillators

A number of papers have recently been presented using neural nets for segmentation, specifically [Wang95], [NGIO95], and [BrCo95], as well as others in that volume. The neural net model commonly used for this task is the neural oscillator. The hypothesis is that neural oscillators are one of the structures in the brain that help us to pay attention to only one stream of audition when there is much auditory noise going on. The model is built from single oscillators, consisting of a feedback loop between an excitatory neuron and an inhibitory neuron. The oscillator output quickly alternates between high values and low values.

The inputs to the oscillator network are the frequency channels that are employed within the ear. Delay is introduced within the lines connecting the inputs to the oscillator network, and throughout the network itself.

When an example stream of tones "High-Low-High-Low..." is presented to the network, the high tones trigger one set of frequency channels, and the low tones trigger another set of channels. If the tones are temporally close enough together, the oscillators do not have time to relax back to the original state from the previous high input and are triggered again, thus following the auditory stream. If the time between pulses is long enough, then the oscillators relax from the high tone and are excited by the low tone, making the stream seem to oscillate between high and low tones.
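
The oscillators used in these models are relaxation oscillators. The toy below integrates a FitzHugh-Nagumo-style excitatory/inhibitory pair, which produces repeated high/low swings while its input is held high and settles to rest when the input is removed; it is meant only to illustrate that behaviour, not the network architecture of [Wang95].

    # A FitzHugh-Nagumo-style relaxation oscillator: x is the fast
    # (excitatory) variable, y the slow (inhibitory) recovery variable.
    # With sustained input the pair produces repeated spikes.
    def simulate(input_current, steps=5000, dt=0.05, a=0.7, b=0.8, tau=12.5):
        x, y, trace = -1.0, 1.0, []
        for _ in range(steps):
            dx = x - x ** 3 / 3 - y + input_current
            dy = (x + a - b * y) / tau
            x, y = x + dt * dx, y + dt * dy
            trace.append(x)
        return trace

    spikes = simulate(0.5)   # oscillates: repeated high/low swings
    quiet = simulate(0.0)    # settles to a resting value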

6 Score Generation

Once the pitch sequence has been determined and the note boundaries established, it seems an easy task to place those notes on a staff and be finished. Many commercial software programs exist to translate a note sequence, usually a MIDI file, into a musical score, but a significant problem which is still not completely solved is determining the key signature, time signature, measure boundaries, accidentals and dynamics that make a musical score complete.

6.1 MIDI

MIDI was first made commercially available in 1983 and since then has become a standard for transcribing music. The MIDI protocol was developed in response to the large number of independent interfaces that keyboard and electrical instrument manufacturers were coming up with. As the saying goes, "The nice thing about standards is that there are so many to choose from." In order to reduce the number of interfaces in the industry, MIDI, yet another standard, was introduced. It did, however, become widely accepted, and while most keyboards still use their own internal interface protocol, they often have MIDI as an external interface option as well.

MIDI stands for Musical Instrument Digital Interface, and is both an information transfer protocol and a hardware specification. The communications protocol of the MIDI system represents each musical note transition as a message. Messages for note beginnings and note endings are used, and other messages include instrument changes, voice changes and other administrative messages. Messages are passed from a controlling sequencer to MIDI instruments or sound modules over serial asynchronous cables. Polyphonic music is represented by a number of overlapping monophonic tracks, each with its own voice.
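
As a concrete illustration of the message format, the short sketch below assembles raw note-on and note-off messages. The three-byte layout (a status byte carrying the message type and channel, followed by a key number and a velocity) is part of the MIDI 1.0 specification; the helper names and the toy melody are invented for this example.

    def note_on(key, velocity=64, channel=0):
        """3-byte MIDI note-on message: status byte, key number, velocity."""
        return bytes([0x90 | (channel & 0x0F), key & 0x7F, velocity & 0x7F])

    def note_off(key, velocity=0, channel=0):
        """3-byte MIDI note-off message."""
        return bytes([0x80 | (channel & 0x0F), key & 0x7F, velocity & 0x7F])

    # A monophonic line is a stream of paired on/off messages; polyphony
    # overlaps several such streams, usually on separate channels.
    melody = [60, 62, 64, 65, 67]                 # key numbers (middle C = 60)
    stream = b"".join(note_on(k) + note_off(k) for k in melody)
    print(" ".join("%02x" % b for b in stream))

    transposed = [k + 2 for k in melody]          # transposition is just arithmetic on key numbers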

Many developments have been added to MIDI 1.0 since 1983, and the reader is referred to [Rums94] for a more complete treatment.

6.1.1 MIDI and Transposition

Many researchers have realized the importance of a standard note representation. If a system can be developed that will translate from sound to MIDI, then anything MIDI-esque can be done: the MIDI code can be played through any MIDI-capable keyboard, and it can be transposed, edited and displayed. The fact that MIDI is based on note onsets and offsets suggests to some researchers that transcription research should concentrate on the beginnings and endings of notes. Between the note boundaries, within the notes themselves, very little of interest is happening, and nothing is happening that a transcription system is trying to preserve, unless dynamics are of concern.

What MIDI doesn't store is information about the key signature, the time signature, measure placement and other information that is on a musical score. This information is easily inferred by an educated human listener, but computers still have problems. James Moorer's initial two-part transcription system assumed a C major key signature, and placed accidentals wherever notes fell outside the C major scale. Most score generation systems today do just that: they assume a user-defined key and time signature, and place notes on the score according to these definitions.
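
A minimal sketch of that strategy, under the assumption that the key is fixed at C major: pitch classes on the C major scale are spelled plainly, and everything else is flagged as an accidental. The sharp-only spelling and the note-name table are simplifying assumptions made for illustration.

    # Spell MIDI key numbers against an assumed C major key signature.
    NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    C_MAJOR = {0, 2, 4, 5, 7, 9, 11}              # pitch classes needing no accidental

    def spell(key):
        pc, octave = key % 12, key // 12 - 1
        mark = "" if pc in C_MAJOR else " (accidental)"
        return "%s%d%s" % (NAMES[pc], octave, mark)

    for k in [60, 61, 62, 66, 71]:
        print(k, "->", spell(k))                  # 61 and 66 are flagged as accidentals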

The importance of correctly representing the key and time signatures is shown in [Long94], where two transcriptions of the same piece are presented side by side. An experienced musician has no problem reading the correct score, but has difficulty recognizing the incorrect one, because the melody is disguised by misrepresenting its rhythm and tonality.

6.2 Key Signature

The key signature, appearing at the beginning of a piece of music, indicates which notes will be flat or sharp throughout the piece. Once the notes are identified, one can determine which notes are constantly sharp or flat, and then assign these to the key signature. Key changes in the middle of the piece are difficult for a computer to judge because most algorithms look at the piece as a whole. Localized statistics could solve this problem, but current systems are still not completely accurate.
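
One simple way to make "which notes are constantly sharp or flat" operational is to compare the piece's pitch-class counts against the scale of every major key and keep the best match. The sketch below is an illustrative template-matching approach, not an algorithm from the literature surveyed here; a real system would also need minor keys, enharmonic spelling, and the localized statistics mentioned above for key changes.

    from collections import Counter

    MAJOR_DEGREES = [0, 2, 4, 5, 7, 9, 11]        # scale degrees of any major key
    KEY_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

    def estimate_major_key(midi_notes):
        """Return the major key whose scale covers the most note occurrences."""
        counts = Counter(n % 12 for n in midi_notes)
        def coverage(tonic):
            return sum(counts[(tonic + d) % 12] for d in MAJOR_DEGREES)
        best = max(range(12), key=coverage)
        return KEY_NAMES[best] + " major"

    # A fragment drawn from the D major scale (two sharps: F# and C#).
    fragment = [62, 64, 66, 67, 69, 71, 73, 74, 73, 71, 69, 67, 66, 64, 62]
    print(estimate_major_key(fragment))           # prints "D major"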

6.3 Time Signature

Deciding where bar lines go in a piece and how many beats are in each bar is a much more difficult problem. There are an infinite number of ways of representing the rhythmical structure of a single piece of music. An example given in [Long94] suggests that a sequence of six evenly spaced notes could be interpreted as a single bar of 6/8 time, three pairs, two triplets, a full bar followed by a half bar of 4/4 time, or even between beats. Longuet-Higgins suggests the use of musical grammars to solve this problem, which will be described in Section 8.
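
Because the rhythm alone is so ambiguous, current systems sidestep the problem by asking the user for the meter; once the time signature is supplied, bar-line placement reduces to accumulating note durations. The sketch below shows that reduction; the function name and the choice of representing durations in beats are assumptions for illustration, and notes that cross a bar line (which would have to be split into tied notes) are ignored.

    def insert_barlines(durations, beats_per_bar):
        """Group note durations (in beats) into bars of a user-supplied meter.

        Assumes the meter is given and that no note crosses a bar line,
        which is exactly the simplification described above.
        """
        bars, current, filled = [], [], 0
        for d in durations:
            current.append(d)
            filled += d
            if filled >= beats_per_bar:
                bars.append(current)
                current, filled = [], 0
        if current:
            bars.append(current)                  # trailing incomplete bar
        return bars

    six_even_notes = [1, 1, 1, 1, 1, 1]           # the ambiguous example from [Long94]
    print(insert_barlines(six_even_notes, beats_per_bar=6))   # one bar of six
    print(insert_barlines(six_even_notes, beats_per_bar=3))   # two bars of three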


Part II

Related Topics

7 Time-frequency Analysis

Most of the pitch detection and pitch tracking techniques discussed in Section 4 rely on methods of frequency analysis that have been around for a long time. Fourier techniques, pitch detectors and cepstrum analysis, for example, all look at frequency as one scale, separate from time. A frequency spectrum is valid for the full time-frame being considered, and if the windowing is not done well, spectral information "leaks" into the neighboring frames. The only way to get completely accurate spectral information is to take the Fourier transform (or your favorite spectral method) of the entire signal, and then all local information about the time signal is lost. Similarly, when looking at the time waveform, one is aware of exactly what is happening at each instant, but no information is available about the frequency components.

An uncertainty principle is at work here. The more one knows about the frequency of a signal, the less that frequency can be localized in time. The options so far have been complete frequency or complete time, using the entire signal or some small window of the signal. Is it possible to look at frequency and time together? Investigating frequency components at a more localized time without the need for windowing would increase the accuracy of the spectral methods and allow more specific processing.
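
The usual compromise is the windowed (short-time) Fourier transform: overlapping windows are taken from the signal and a spectrum is computed for each, so that the window length trades frequency resolution against time resolution. A minimal NumPy sketch follows; the Hann window, window size, hop size and test signal are arbitrary choices for illustration.

    import numpy as np

    def stft_magnitude(signal, window_size=1024, hop=256):
        """Magnitude spectrogram from a sliding Hann window.

        A longer window sharpens the frequency estimate but smears
        events in time; a shorter window does the opposite.
        """
        window = np.hanning(window_size)
        frames = []
        for start in range(0, len(signal) - window_size + 1, hop):
            frame = signal[start:start + window_size] * window
            frames.append(np.abs(np.fft.rfft(frame)))
        return np.array(frames)                   # shape: (num_frames, num_bins)

    fs = 8000
    t = np.arange(fs) / fs                        # one second of signal
    # 440 Hz for the first half second, 660 Hz for the second half.
    test = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t))

    spec = stft_magnitude(test)
    peaks = spec.argmax(axis=1)                   # strongest bin in each frame
    print(peaks[:3], peaks[-3:])                  # low bins early, higher bins late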

7.1 Wavelets

The Fourier representation of a signal, and in fact any spectrum-type representation, uses sinusoids to break down the signal. This is why spectral representations are limited to the frequency domain and cannot be localized in time: the sinusoids used to break down the signal are valid across the entire time axis. If the basis functions were localized in time, the resulting decomposition would contain both time information and frequency information.

The wavelet is a signal that is localized in both time and frequency. Because of the uncertainty-type relation that holds between time and frequency, the localization cannot be absolute, but in both the time domain and the frequency domain, a wavelet decays to zero away from its center time or frequency. For a mathematical treatment of wavelets and the wavelet transform, the reader is referred to [Daub90] and [Daub92].

The wavelet transform consists of decomposing the signal into a sum of wavelets of different scales. It has three dimensions: location in time of the wavelet, scale of the wavelet (location in frequency) and amplitude. The wavelet transform allows a time-frequency representation of the signal being decomposed, which means that information about the time location is available without windowing. Another way to look at it is that windowing is built into the algorithm.
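
A bare-bones version of that decomposition can be written directly: build a windowed complex sinusoid for each scale and correlate it with the signal at every time position, giving a coefficient indexed by time, scale and amplitude. The Gaussian-windowed (Morlet-like) wavelet shape, the scale grid and the brute-force convolution below are illustrative assumptions; practical transforms are computed far more efficiently (see [Daub92]).

    import numpy as np

    def morlet(freq, fs, cycles=6.0):
        """A Gaussian-windowed complex sinusoid centred on `freq` Hz."""
        sigma = cycles / (2 * np.pi * freq)       # time spread, in seconds
        t = np.arange(-4 * sigma, 4 * sigma, 1.0 / fs)
        return np.exp(2j * np.pi * freq * t) * np.exp(-t ** 2 / (2 * sigma ** 2))

    def wavelet_magnitudes(signal, freqs, fs):
        """|coefficients| indexed by scale (row) and time (column)."""
        rows = []
        for f in freqs:
            w = morlet(f, fs)
            rows.append(np.abs(np.convolve(signal, w, mode="same")))
        return np.array(rows)

    fs = 4000
    t = np.arange(fs) / fs
    # 220 Hz for the first half second, 330 Hz for the second half.
    test = np.where(t < 0.5, np.sin(2 * np.pi * 220 * t), np.sin(2 * np.pi * 330 * t))

    coeffs = wavelet_magnitudes(test, [220.0, 277.2, 330.0], fs)   # A3, C#4, E4
    print(coeffs[:, 1000].round(1))               # early: largest value in the 220 Hz row
    print(coeffs[:, 3000].round(1))               # late:  largest value in the 330 Hz row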

Researchers have speculated that wavelets could be designed to resemble musical notes. They have a specific frequency and a specific location in time, as well as an amplitude envelope that characterizes the wavelet. If a system could be developed to model musical notes as wavelets, then a wavelet transform would be a transcription of the musical piece. A musical score is a time-frequency representation of the music. Time is represented by the forward progression through the score from left to right, and frequency is represented by the location of the note on the score.

Mladen Wickerhauser contributed an article about audio signal compression using wavelets [Wick92] in a wavelet applications book. This work does not deal directly with music applications; however, it does include a treatment of the mathematics involved. Transcription of music can be considered lossy compression, in that the musical score representation can be used to construct an audio signal that is a recognizable approximation of the original audio file (i.e., without the interpretation or errors generated during performance). The wavelet transform has also been applied as a pre-processor for sound systems, to clean up the sound and remove noise from the signal [SoWK95].

7.2 Pielemeier and Wakefield

William Pielemeier and Gregory Wakefield presented a work in 1996 [PiWa96] discussing a high-resolution time-frequency representation. They argue that windowed Fourier transforms, while producing reliable estimates of frequency, often provide less than what is required for musical analysis. Calculation of the attack of a note requires very accurate, short-time information about the waveform, and this information is lost when a windowed Fourier transform produces averaged information for each window. They present a system called the modal distribution, which they show to decrease the time averaging caused by windowing. For a full treatment, please see [PiWa96].

8 Musical Grammars

It has been theorized that music is a natural language like any other, and that the set of rules describing it fits somewhere in the Chomsky hierarchy of grammars. The questions are where in the hierarchy does it fit, and what does the grammar look like? Is a grammar for 12-semitone, octaval western music different from a grammar for pentatonic Oriental music, or decametric East-Indian music? Within western music, are there different grammars for classical and modern music? Top 40 and Western? Can an opera be translated to a ballad as easily (or with as much difficulty) as German can be translated to French?

8.1 Lerdahl and Jackendoff

In 1983, Fred Lerdahl, a composer, and Ray Jackendoff, a linguist, published a book that was the result of work aimed at a challenge issued in 1973. The book is called "A Generative Theory of Tonal Music", and the challenge was one presented by Leonard Bernstein. He advocated the search for a "musical grammar" after being inspired by Chomskian-type grammars for natural language. Several other authors responded to the challenge, including Irving Singer and David Epstein, who formed a faculty seminar on Music, Linguistics and Ethics at MIT in 1974.

The book begins by presenting a detailed introduction to the concept of musical grammar, from the point of view of linguistics and artistic interpretation of music. Rhythmic grouping is discussed in the first few chapters, and tonal grammar is discussed in the last few chapters. The intent is not to present a complete grammar of all western music, but to suggest a thorough starting point for further investigations. The differences between a linguistic grammar and a musical grammar are presented in detail, and an interesting point is made that a musical grammar can have grammatical rules and preferential rules, where a number of grammatically correct structures are ranked in preference. The difference between a masterpiece and an uninteresting etude is the adherence to preferential rules. Both pieces are "grammatically" correct, but one somehow sounds better.

8.1.1 Rhythm

In terms of rhythmic structure, Lerdahl and Jackendoff begin by discussing the concept of a grouping hierarchy. It seems that music is grouped into motives, themes, phrases, periods and the like, each being bigger than and encompassing one or more of the previous groups. So a period can consist of a number of complete phrases, each being composed of a number of complete themes, and so on. This kind of grouping is more psychologically correct than sorting the piece by repetition and similarity. While one can identify similar passages in a piece quite easily, the natural grouping is hierarchical.

Where accents fall in a piece is another important observation which aids in the perception of the musical intent. Accents tend to oscillate in a self-similar strong-weak-strong-weak pattern. There is also a definite connection between the accents and the groups: larger groups encompassing many subgroups begin with very strong accents, and smaller groups begin with smaller accents.

The full set of well-formedness rules and preferential rules for the rhythm of a musical passage is presented in Appendix A. There are two sets of well-formedness and preferential rules for the rhythm of a piece: grouping rules and metrical structure rules.

8.1.2 Reductions

The rules quoted in Appendix A provide structure for the rhythm of a musical passage; they are presented to give the flavor of the grammar developed by Lerdahl and Jackendoff. To parse the tonality of a passage, the concept of reduction is needed. Much discussion and motivation stems from the reduction hypothesis, presented in [LeJa83] as:

    The listener attempts to organize all the pitch-events into a single coherent structure, such that they are heard in a hierarchy of relative importance.

A set of rules for well-formedness and preference is presented for various aspects of reduction, including time-span reduction and prolongational reduction, but in addition to the rules, a tree structure is used to investigate the reduction of a piece. Analogies can be drawn to the tree structures used in the analysis of grammatical sentences; however, they are not the same. The intermediate forms at different levels of the trees, if translated, would form grammatically correct musical pieces. The aim of reduction is to strip away, bit by bit, the flourishes and transitions that make a piece interesting until a single pitch-event remains. This single fundamental pitch-event is usually the first or last event in the group, but this is not necessary for the reduction to be valid. As with linguistic reductions, the goal is first to ensure that a sentence (or passage) is grammatically correct, but more importantly, to discover the linguistic (or musical) properties associated with each word (or pitch-event). Who is the subject and who is the object of "Talk"? Is this particular C chord being used as a suspension or a resolution? It is these questions that a grammar of music tries to answer, rather than "Is this piece of music grammatically correct in our system?"

8.2 Longuet-Higgins

In a paper discussing artificial intelligence [Long94], Christopher Longuet-Higgins presented a generative grammar for metrical rhythms. His work convinces us that there is a close link between musical grammars and the transcription of music. If one has a grammar of music and one knows that the piece being transcribed is within the musical genre that this grammar describes, then rules can be used to resolve ambiguities in the transcription, just as grammar rules are used to resolve ambiguities in speech recognition. He calls his grammar rules "realization rules"; they are reproduced here for a 4/4-type rhythm.

    (1/2)-unit  → (1/2)-note  or (1/2)-rest  or 2 × (1/4)-units
    (1/4)-unit  → (1/4)-note  or (1/4)-rest  or 2 × (1/8)-units
    (1/8)-unit  → (1/8)-note  or (1/8)-rest  or 2 × (1/16)-units
    (1/16)-unit → (1/16)-note or (1/16)-rest

Different rules would be needed for 3/4-type rhythms, for example, and to allow for dotted notes and other anomalies. The general idea, however, of repeatedly breaking down the rhythm into smaller segments until an individual note or rest is encountered is insightful and simple.
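
The recursive flavor of these rules is easy to reproduce in code. The sketch below expands a (1/2)-unit by the realization rules quoted above, choosing at random among the allowed productions; the output format is invented for this example, and dotted notes and other anomalies are ignored, as in the rules themselves.

    import random
    from fractions import Fraction

    def realize(unit=Fraction(1, 2), smallest=Fraction(1, 16)):
        """Expand a metrical unit by the realization rules: a unit becomes
        a note, a rest, or two units of the next smaller level."""
        options = ["note", "rest"] if unit == smallest else ["note", "rest", "split"]
        choice = random.choice(options)
        if choice == "split":
            half = unit / 2
            return realize(half, smallest) + realize(half, smallest)
        return ["%s-%s" % (unit, choice)]

    random.seed(0)                                # repeatable example
    print(" ".join(realize()))                    # e.g. "1/4-note 1/8-rest 1/8-note"

Run in reverse, the same productions could be used as parsing rules, which is the role Longuet-Higgins envisions for them in transcription.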

He also discussed tonality, but did not present a generative theory of tonality to replace or augment Lerdahl and Jackendoff's. Instead, he discussed the resolution of musical ambiguity even when the notes are known. He compares the resolution of a chord sequence to the resolution of a Necker cube, a two-dimensional image that looks three-dimensional, as seen in Figure 2. It is difficult for an observer to be sure which side of the cube is facing out, just as it is difficult to be sure of the nature of a chord without a tonal context. He insists that a tonal grammar is essential for resolving such ambiguities.

8.3 The Well-Tempered Computer

In 1994, at the same conference where [Long94] was presented, Mark Steedman presented a paper with insights into the psychological method by which people listen to and understand music, and these insights move toward a computational model of human music perception [Stee94]. A "musical pitch space" is presented, adapted from an earlier work by Longuet-Higgins.

[Figure 2: A Necker cube. Figure not reproduced.]

Later in the paper Steedman presents a section entitled "Towards a grammar of melodic tonality", where he draws on the work of Lerdahl and Jackendoff. Improvements are made that simplify the general theory. In some cases, claims Steedman, repeated notes can be considered as a single note of the cumulative duration, with the same psychological effect and the same "grammatical" rules holding. In a similar case, scale progressions can be treated as single note jumps. The phrase he uses is a rather non-committal "seems more or less equivalent to". A good example of the difficulties involved in this concept is the fourth movement of Beethoven's Choral Symphony, No. 9 [Beet25]. In this piece, there is a passage which contains what a listener would expect to be two quarter notes, and in fact this is often heard, but the score has a single half note. Similar examples can be made of chromatic runs in the same movement. Part of the reason that the two quarter notes are expected and often heard is that the theme is presented elsewhere in the piece with two quarter notes instead of the half note.

This is the way that the passage is written,

    [musical example omitted: the passage as notated]

but when played, it can sound like this,

    [musical example omitted: one way the passage can be heard]

or like this,

    [musical example omitted: another way the passage can be heard]

depending on the interpretation of the listener. Since different versions of the score can be heard in the same passage, it is tempting to say that the two versions are grammatically equivalent, and that the brain cannot tell one from the other. However, it is more accurate to say that the brain is being fooled by the instrumentalist. We are not confused, saying "I don't know if it is one way or the other"; rather, we are sure that it is one way and not the other, but we are not in agreement with our colleagues as to which way it is.

The substitutions suggested by Steedman can be made in some cases, but not in all cases, and the psychological similarity does not seem to be universal. It is important to discover how reliable these substitutions are before inserting them into a general theory of musical tonality.

8.4 Discussions

There seems to be such a similarity between music understanding and language understanding that solutions in one field can be used analogically to solve problems in the other. What needs to be investigated is exactly how close these two fields really are. The similarities are numerous. The human mind receives information in the form of air pressure on the eardrums, and converts that information into something intelligible, be it a linguistic sentence or a musical phrase. Humans are capable of taking a written representation (text or score) and converting it into the appropriate sound waves. There are rules that language follows, and there seem to be rules that music follows: it is clear that music can be unintelligible to us, the best example being a random jumble of notes.

Identification of music that sounds good and music that doesn't is learned through example, just as language is. A human brought up in the western tradition of music is likely not to understand why a particular East Indian piece is especially heart-wrenching. On the other hand, music does not convey semantics. There is no rational meaning presented with music as there is with language. There are no nouns or verbs, no subject or object in music. There is, however, emotional meaning that is conveyed. There are specific rules that music follows, and there is an internal mental representation against which a listener compares any new piece of music.

9 Conclusions

Since Piszczalski's landmark work in 1986, many aspects of Computer Music Analysis have changed considerably, while little progress has been made in other areas. Transcription of monophonic music was solved before 1986, and since then improvements to algorithms have been made, but no really revolutionary leaps. Jimmy Kapadia's M.Sc. thesis, defended in 1995, contained little more than Piszczalski's ideas from 1986 [Kapa95]. Hidden Markov models have been applied to pitch tracking, new methods in pitch perception have been implemented, and cognition and perception have been applied to the task of score generation. Polyphonic music recognition, however, remains a daunting task for researchers. Small subproblems have been solved, and insight has been gained from computer vision and auditory scene analysis, but the problem remains open.

Much work remains to be done in the field of key signature and time signature recognition, and a connection needs to be drawn between the independent research in musical grammars and music transcription, before a complete working system that even begins to model the human system is created.

The research needed to solve the music understanding problem seems to be distributed throughout many other areas. Linguistics, psychoacoustics and image processing have much to teach us about music. Perhaps there are more areas of research that are also worth investigating.

10 References

[Beet25] Beethoven, Ludwig van. Symphony No. 9 in D Minor, Op. 125, Edition Eulenburg. London: Ernst Eulenburg Ltd., 1925. First performed: Vienna, 1824.

[BrCo94] Brown, Guy J. and Cooke, Martin. "Perceptual Grouping of Musical Sounds: A Computational Model." J. New Music Research, Vol. 23, 1994, pp 107-132.

[BrCo95] Brown, Guy J. and Cooke, Martin. "Temporal Synchronization in a Neural Oscillator Model of Primitive Auditory Scene Analysis." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 40-47.

[BrPu92] Brown, Judith C. and Puckette, Miller S. "Fundamental Frequency Tracking in the Log Frequency Domain Based on Pattern Recognition." J. Acoust. Soc. Am., Vol. 92, No. 4, October 1992, p 2428.

[Breg90] Bregman, Albert S. Auditory Scene Analysis. Cambridge: MIT Press, 1990.

[ChSM93] Chazan, Dan, Stettiner, Yoram and Malah, David. "Optimal Multi-Pitch Estimation Using the EM Algorithm for Co-Channel Speech Separation." IEEE-ICASSP 1993, Vol. II, pp 728-731.

[Daub90] Daubechies, Ingrid. "The Wavelet Transform, Time-Frequency Localization and Signal Analysis." IEEE Trans. Information Theory, Vol. 36, No. 5, 1990, pp 961-1005.

[Daub92] Daubechies, Ingrid. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, 1992.

[DeGR93] Depalle, Ph., García, G. and Rodet, X. "Tracking of Partials for Additive Sound Synthesis Using Hidden Markov Models." IEEE-ICASSP 1993, Vol. I, pp 225-228.

[DoNa94] Dorken, Erkan and Nawab, S. Hamid. "Improved Musical Pitch Tracking Using Principal Decomposition Analysis." IEEE-ICASSP 1994, Vol. II, pp 217-220.

[DoRo91] Doval, Boris and Rodet, Xavier. "Estimation of Fundamental Frequency of Musical Sound Signals." IEEE-ICASSP 1991, pp 3657-3660.

[DoRo93] Doval, Boris and Rodet, Xavier. "Fundamental Frequency Estimation and Tracking Using Maximum Likelihood Harmonic Matching and HMMs." IEEE-ICASSP 1993, Vol. I, pp 221-224.

[GoWo92] Gonzalez, R. and Woods, R. Digital Image Processing. Don Mills: Addison Wesley, 1992.

[GrBl95] Grabke, Jorn and Blauert, Jens. "Cocktail-Party Processors Based on Binaural Models." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 105-110.

[KaHe95] Kapadia, Jimmy H. and Hemdal, John F. "Automatic Recognition of Musical Notes." J. Acoust. Soc. Am., Vol. 98, No. 5, November 1995, p 2957.

[Kapa95] Kapadia, Jimmy H. Automatic Recognition of Musical Notes. M.Sc. Thesis, University of Toledo, August 1995.

[Kata96] Katayose, Haruhiro. "Automatic Music Transcription." Denshi Joho Tsushin Gakkai Shi, Vol. 79, No. 3, 1996, pp 287-289.

[Kuhn90] Kuhn, William B. "A Real-Time Pitch Recognition Algorithm for Music Applications." Computer Music Journal, Vol. 14, No. 3, Fall 1990, pp 60-71.

[Lane90] Lane, John E. "Pitch Detection Using a Tunable IIR Filter." Computer Music Journal, Vol. 14, No. 3, Fall 1990, pp 46-57.

[LeJa83] Lerdahl, Fred and Jackendoff, Ray. A Generative Theory of Tonal Music. Cambridge: MIT Press, 1983.

[Long94] Longuet-Higgins, H. Christopher. "Artificial Intelligence and Music Cognition." Phil. Trans. R. Soc. Lond. A, Vol. 343, 1994, pp 103-113.

[Mcge89] McGee, W. F. "Real-Time Acoustic Analysis of Polyphonic Music." Proceedings ICMC 1989, pp 199-202.

[MeCh91] Meillier, Jean-Louis and Chaigne, Antoine. "AR Modeling of Musical Transients." IEEE-ICASSP 1991, pp 3649-3652.

[MeOM95] Meddis, Ray and O'Mard, Lowel. "Psychophysically Faithful Methods for Extracting Pitch." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 19-25.

[Moor77] Moorer, James A. "On the Transcription of Musical Sound by Computer." Computer Music Journal, November 1977, pp 32-38.

[Moor84] Moorer, James A. "Algorithm Design for Real-Time Audio Signal Processing." IEEE-ICASSP 1984, pp 12.B.3.1-12.B.3.4.

[NGIO95] Nakatani, T., Goto, M., Ito, T. and Okuno, H. "Multi-Agent Based Binaural Sound Stream Segregation." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 84-90.

[Oven88] Ovans, Russell. An Object-Oriented Constraint Satisfaction System Applied to Music Composition. M.Sc. Thesis, Simon Fraser University, 1988.

[PiGa79] Piszczalski, Martin and Galler, Bernard. "Predicting Musical Pitch from Component Frequency Ratios." J. Acoust. Soc. Am., Vol. 66, No. 3, September 1979, pp 710-720.

[Pisz77] Piszczalski, Martin. "Automatic Music Transcription." Computer Music Journal, November 1977, pp 24-31.

[Pisz86] Piszczalski, Martin. A Computational Model for Music Transcription. Ph.D. Thesis, University of Stanford, 1986.

[PiWa96] Pielemeier, William J. and Wakefield, Gregory H. "A High-Resolution Time-Frequency Representation for Musical Instrument Signals." J. Acoust. Soc. Am., Vol. 99, No. 4, April 1996, pp 2383-2396.

[QuEn94] Quiros, Francisco J. and Enríquez, Pablo F-C. "Real-Time, Loose-Harmonic Matching Fundamental Frequency Estimation for Musical Signals." IEEE-ICASSP 1994, Vol. II, pp 221-224.

[Rich90] Richard, Dominique M. "Godel Tune: Formal Models in Music Recognition Systems." Proceedings ICMC 1990, pp 338-340.

[Road85] Roads, Curtis. "Research in Music and Artificial Intelligence." ACM Computing Surveys, Vol. 17, No. 5, June 1985.

[Rowe93] Rowe, Robert. Interactive Music Systems. Cambridge: MIT Press, 1993.

[Rums94] Rumsey, Francis. MIDI Systems and Control. Toronto: Focal Press, 1994.

[SaJe89] Sano, Hajime and Jenkins, B. Keith. "A Neural Network Model for Pitch Perception." Computer Music Journal, Vol. 13, No. 3, Fall 1989, pp 41-48.

[Sche95] Scheirer, Eric. "Using Musical Knowledge to Extract Expressive Performance Information from Audio Recordings." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 153-160.

[Smit94] Smith, Leslie S. "Sound Segmentation Using Onsets and Offsets." J. New Music Research, Vol. 23, 1994, pp 11-23.

[SoWK95] Solbach, L., Wohrmann, R. and Kliewer, J. "The Complex-Valued Wavelet Transform as a Pre-processor for Auditory Scene Analysis." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 118-124.

[Stee94] Steedman, Mark. "The Well-Tempered Computer." Phil. Trans. R. Soc. Lond. A, Vol. 343, 1994, pp 115-131.

[Tang88] Tanguiane, Andranick. "An Algorithm for Recognition of Chords." Proceedings ICMC 1988, pp 199-210.

[Tang93] Tanguiane, Andranick. Artificial Perception and Music Recognition. Lecture Notes in Artificial Intelligence 746, Germany: Springer-Verlag, 1993.

[Tang95] Tanguiane, Andranick. "Towards Axiomatization of Music Perception." J. New Music Research, Vol. 24, 1995, pp 247-281.

[Wang95] Wang, DeLiang. "Stream Segregation Based on Oscillatory Correlation." IJCAI Workshop on Computational Auditory Scene Analysis, Montreal, Quebec, August 1995, pp 32-39.

[Wick92] Wickerhauser, Mladen V. "Acoustic Signal Compression with Wavelet Packets." In Wavelets - A Tutorial in Theory and Applications, C. K. Chui (ed.), Academic Press, 1992, pp 679-700.

Part III

Appendices

A Musical Grammar Rules from [LeJa83]

Grouping Well-Formedness Rules:

GWFR1: Any contiguous sequence of pitch-events, drum beats, or the like can constitute a group, and only contiguous sequences can constitute a group.

GWFR2: A piece constitutes a group.

GWFR3: A group may contain smaller groups.

GWFR4: If a group G1 contains part of a group G2, it must contain all of G2.

GWFR5: If a group G1 contains a smaller group G2, then G1 must be exhaustively partitioned into smaller groups.

Grouping Preference Rules:

GPR1: Avoid analyses with very small groups - the smaller, the less preferable.

GPR2 (Proximity): Consider a sequence of four notes n1 n2 n3 n4. All else being equal, the transition n2-n3 may be heard as a group boundary if

a. (Slur/Rest) the interval of time from the end of n2 to the beginning of n3 is greater than that from the end of n1 to the beginning of n2 and that from the end of n3 to the beginning of n4, or if

b. (Attack-Point) the interval of time between the attack points of n2 and n3 is greater than that between the attack points of n1 and n2 and that between the attack points of n3 and n4.


GPR3 (Change): Consider a sequence of four notes n1 n2 n3 n4. All else being equal, the transition n2-n3 may be heard as a group boundary if

a. (Register): the transition n2-n3 involves a greater intervallic distance than both n1-n2 and n3-n4, or if

b. (Dynamics): the transition n2-n3 involves a change in dynamics and n1-n2 and n3-n4 do not, or if

c. (Articulation): the transition n2-n3 involves a change in articulation and n1-n2 and n3-n4 do not, or if

d. (Length): n2 and n3 are of different lengths and both pairs n1, n2 and n3, n4 do not differ in length.

(One might add further cases to deal with such things as change in timbre or instrumentation.)

GPR4 (Intensification): Where the effects picked out by GPRs 2 and 3 are relatively more pronounced, a larger-level boundary may be placed.

GPR5 (Symmetry): Prefer grouping analyses that most closely approach the ideal subdivision of groups into two parts of equal length.

GPR6 (Parallelism): Where two or more segments of the music can be construed as parallel, they preferably form parallel parts of groups.

GPR7 (Time-Span and Prolongational Stability): Prefer a grouping structure that results in more stable time-span and/or prolongational reductions.

Metrical Well-Formedness Rules:

MWFR1: Every attack point must be associated with a beat at the smallest metrical level at that point in the piece.

MWFR2: Every beat at a given level must also be a beat at all smaller levels present at that point in the piece.

MWFR3: At each metrical level, strong beats are spaced either two or three beats apart.

MWFR4: The tactus and immediately larger metrical levels must consist of beats equally spaced throughout the piece. At subtactus metrical levels, weak beats must be equally spaced between the surrounding strong beats.

Metrical Preference Rules:

MPR1 (Parallelism): Where two or more groups can be construed as parallel, they preferably receive parallel metrical structure.

MPR2 (Strong Beat Early): Weakly prefer a metrical structure in which the strongest beat in a group appears early in the group.

MPR3 (Event): Prefer a metrical structure in which beats of level Li that coincide with the inception of pitch-events are strong beats of Li.

MPR4 (Stress): Prefer a metrical structure in which beats of level Li that are stressed are strong beats of Li.

MPR5 (Length): Prefer a metrical structure in which a relatively strong beat occurs at the inception of either

a. a relatively long pitch-event,

b. a relatively long duration of a dynamic,

c. a relatively long slur,

d. a relatively long pattern of articulation,

e. a relatively long duration of a pitch in the relevant levels of the time-span reduction, or

f. a relatively long duration of a harmony in the relevant levels of the time-span reduction (harmonic rhythm).


MPR6 (Bass): Prefer a metrically stable bass.

MPR7 (Cadence): Strongly prefer a metrical structure in which cadences are metrically stable; that is, strongly avoid violations of local preference rules within cadences.

MPR8 (Suspension): Strongly prefer a metrical structure in which a suspension is on a stronger beat than its resolution.

MPR9 (Time-Span Interaction): Prefer a metrical analysis that minimizes conflict in the time-span reduction.

MPR10 (Binary Regularity): Prefer metrical structures in which at each level every other beat is strong.
