
Bionic Beatbox Voice Processor (BBVP) Project Report
Source: ece.uvic.ca/~elec499/2004b/group09/docs/BBVPrept.doc

Table of Contents

1. Introduction
1.1. Scope
2. Background
2.1. Beatboxing
2.2. Tabla Bols
2.3. Computers and Music
2.4. The BBVP Software
3. Operator Interface
3.1. Control
3.2. Training
3.3. Sound Mappings
4. Audio Processing
4.1. Data Acquisition
4.2. Signal Preparation
4.3. Noise Reduction
4.4. Burst Detection
4.5. Voice Recognition Engine
5. Project Extensions
5.1. Additional Effects/Features
6. Conclusions
7. Recommendations
7.1. Better Tabla Bol Recognition
8. Literature Review
9. References


1. Introduction

The human voice is the oldest instrument. The Bionic Beatbox Voice Processor is the newest evolution of this innate instrument. Even without language, the tones and rhythms of the vocal tract create pleasing, immersive, radiating waves of sound. Just as music progressed from purely acoustic origins to electronic amplification and synthesis, a new level of musical expression can now be reached with computer audio. Computers have found their way into recording studios and performance settings, creating and embellishing radical, and traditional, musical sounds. This project taps the power of a computer to extend and enhance the potential of everyone's resident instrument, the human voice.

Beatboxing is the art of creating percussive sounds with the human voice. The spoken rhythms can back any genre of music or be an expression all their own. The Bionic Beatbox Voice Processor (BBVP) allows musicians to capitalize on their breath control and vocal skills to generate unique new music. Digital audio affords artists limitless possibilities to experiment with and manipulate their sound. The BBVP is not a full music production environment, but it gives users an interface to play with and modify vocal gestures.

The BBVP is an excellent tool for producers and musicians. Without being a top 'performance' beatboxer, a user can intuitively create high-quality drum loops. Current electronic interfaces for generating drumbeats are tedious: drum machines and audio production software do not deliver a natural platform for expressing a beat. Each distinct sound must be entered individually, and the visual layout fails to provide a sense of the overall sound, often resulting in a trial-and-error approach to creating the rhythm. Now, in a studio or on a home computer, the BBVP software can transpose a beat directly from mind to machine.

With only a microphone and the BBVP software, users can beatbox a percussion track and receive a studio-quality drum loop in return. The ability to simply voice the beat into a computer presents a quick, natural interface for musicians. But the Bionic Beatbox software is not limited to generating drum hits; it allows mapping of the vocal sounds to any user-definable sound. The software is an exciting tool, helping to create unlimited rhythms with custom effects and accompaniments.

1.1. Scope

This project involved the design of sound recognition software. The recognition scheme and accompanying interface were both implemented in MatLab. This report documents the design, development, and operation of the voice processing software. While some background is provided, it is assumed the reader has moderate experience with digital audio.


The report describes the complete progression of the Bionic Beatbox Voice Processor this spring. There are many extensions and opportunities to build on this design project; however, this report focuses on the current foundation. We begin by highlighting operation of the user interface, introducing where the key elements of the voice analysis occur. After presenting the features of the software, we detail the internal operation and implementation techniques used in the MatLab code. Critical design decisions are addressed as each key aspect is encountered, along with the rationale behind each decision and how it is reflected in the current implementation. The testing and results included offer objective measures of the software's performance. From these we conclude with an evaluation of the current software and recommend directions for future development.

2. Background

Digital audio is pervasive in our modern culture. The reader is assumed to understand the basic concept of digitized sound. The technical digital signal processing (DSP) concepts are explained in more detail in the Audio Processing section. Beatboxing itself, however, is still an emerging art form. The sounds created by a gifted beatboxer are musical expression in their own right. The BBVP builds on the origins of beatboxing to enhance this artistic expression.

2.1. Beatboxing

Beatboxing originated as an urban art form. The hip-hop culture of the early 1980s could seldom afford beat machines, samplers, or sound synthesizers [REF]. Without machine-supplied beats to rap over, a new instrument was created: the mouth. Beatboxing became the art of vocal percussion, making drum sounds and beats using the lips, tongue, mouth, throat, and voice.

The BBVP recognition engine is keyed to identifying vocal kick, snare, and high-hat sounds. While every beatboxer has a different style of delivery, these sounds are fundamental to most drum loops. Initially, the BBVP recognition algorithm was a general-purpose sound matching scheme. However, by focusing on features more relevant to human beatbox sounds, recognition accuracy was improved.

Traditionally, all beatbox techniques include snare drum, high-hat, and kick (or bass) drum sounds. The original scope of the recognition was limited to these short-duration, 'bursty' sounds. Advanced vocal percussion often includes additional noises such as simulated turntable scratches or a simultaneous 'hum-along' with the beatboxer's own rhythm. By first concentrating on the simpler sounds, we were quickly able to develop a practical platform for testing and experimentation. Also, with a solid recognition algorithm as a foundation, expanding the BBVP to identify longer drum rolls, flams, or other more abstract sounds becomes a simple extension.


2.2. Tabla Bols

The creation of tabla beats from tabla bols is also explored as an application of the BBVP. The tabla is a traditional Indian percussion instrument consisting of two drums: the bayan and the dayan. The bayan is a larger, bowl-shaped drum with a metal body, used to produce bass tones. The dayan is the smaller wooden drum, which sounds a variety of higher-pitched tones. Each strike on the tabla is defined as a spoken bol, or word. The main tabla bols can be classified as bayan bols (Dha, Dhin, Ge, Ke) and dayan bols (Na, Tin, Tit). In tabla playing it is as important to voice the individual bols as it is to strike them; a player will often voice the rhythm he or she is about to play, or voice it while playing. The BBVP can help a player improve their bol phrasing and give them more of an idea of how the played bol should sound.

2.3. Computers and Music

Digital audio is a numerical representation of sound. The intensity of acoustic waves is transduced into electrical impulses, whose amplitudes can be recorded digitally. The reverse process, converting the numerical amplitude values back into their time-varying electrical signal and sound wave representations, recreates the acoustic energy.

Computers are instructed, operated, and fed with numbers. The digital representation of sound allows the exceptional numerical-manipulation abilities of a computer to create, copy, and enhance any audio. Most modern genres of music are either crafted entirely electronically, or include some portions of computer music. Extensive software tools and applications exist to allow people creative expression with these musical digits.

The important parameters of digitized music are the sampling rate (fs) and the number of bits used to encode the value of each sample. The sampling rate indicates how many times per second the amplitude of a sound is recorded; the more values recorded, the closer this discrete representation comes to the original sound. The number of bits per sample governs the finite number of values the amplitude can take, so more bits per sample means finer amplitude variations can be stored. Taking 16-bit samples at 44100 Hz is considered sufficient to recreate the analogue sound beyond the capabilities of the human auditory system.
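The relationships above can be checked with simple arithmetic. The BBVP itself is MatLab code; the following Python sketch just illustrates the two parameters (16 bits, 44100 Hz) discussed in the report.

```python
# Digital-audio parameters used by the BBVP (values from the report).
fs = 44100          # sampling rate in Hz
bits = 16           # bits per sample

# Number of distinct amplitude values a 16-bit sample can take.
levels = 2 ** bits

# Time covered by a single sample, in milliseconds.
sample_period_ms = 1000.0 / fs

print(levels)                       # 65536 distinct amplitude levels
print(round(sample_period_ms, 5))   # about 0.02268 ms per sample
```

Doubling the bit depth squares the number of amplitude levels, which is why 16-bit audio sounds so much smoother than 8-bit audio.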

Just as the members of a band or orchestra assemble to create music, electronic music is assembled from components. Computer programs like Reason, Recycle, and Pro Tools allow creation, recording, assembly, and sequencing of audio components. The components can be digital recordings of live audio, other sound clips, imported audio samples, or sounds 'invented' inside a CPU. One essential component is always the drum line.

The drumbeat drives the tempo and rhythmic feel of a song. Professional-quality music production software usually includes a large database of drumbeat samples. Alternatively, many applications have an interface to program drumbeats by indicating the time each percussion sound occurs. Drumbeats, or any sound, can always be imported into the production environment. The samples are looped and arranged as another component of the music. The BBVP is a vocal interface to record and export drum loops for electronic music production.

2.4. The BBVP Software

As a music production tool, the BBVP is a natural interface for creating drum loops. The BBVP was created in the MatLab programming environment. Our sound recognition, the heart of the BBVP, is a series of mathematical operations on numerical audio data, ideal for implementation in MatLab. Once the vocal sounds are identified, the output drumbeat has practically determined itself.

Developing the voice recognition in MatLab provided a rich toolset for processing the audio signals. MatLab gives explicit control over the data manipulation while still allowing conceptual commands like 'filter()' and 'plot()'. Using the MatLab environment permitted us to concentrate on the algorithms and data analysis, simplifying ancillary issues such as acquiring, formatting, and playing back the input and output.

The BBVP software has a graphical user interface (GUI) to direct the processing of voice to beat. The GUI gives the user control over the BBVP voice recognition parameters and input/output operations.

3. Operator Interface

The BBVP acquires and processes digital audio data, in the form of a 'beatbox', through a MatLab-based GUI. The user can sing a beat into a microphone and use the software to transform the voiced beat into a professional-quality drum loop, ready for export into a sampler or any audio recording environment. The voiced beat is chopped up and compared to a user-specific training set. Each vocal burst is classified, and the appropriate real drum sound is transplanted into the loop. The user can map any vocal sound to any WAV file sample.

MatLab's GUIDE, a visual GUI-building tool, was used to develop the BBVP's GUI. The GUI is split into three sections: Control, Mapping, and Training. In the figure below, the status indicator window reads 'Ready'.


Figure 1: BBVP GUI

3.1. Control

This is the top left section of the GUI in Figure 1. These buttons and sliders are for recording, processing, and playback of the voiced input.

3.1.1. Record and Stop

When 'Record' is clicked, the software starts acquiring the audio input. The digital input stream arrives from whichever sound card is currently selected for microphone recording in the Windows Sounds and Multimedia settings page.

A possible addition would allow the user to select the audio source within the BBVP. The sample rate of data acquisition is currently fixed in the software at 44100 Hz, which ensures that no audible content is aliased. However, the program will run faster if a lower sample rate is chosen. Because the input is primarily human voice, the sample rate could be lowered to the Nyquist rate of 8 kHz, assuming the highest frequency of interest in the voice is 4 kHz. The sample rate was kept at 44100 Hz so that the software can be extended beyond human voice, perhaps to animal sounds or other noises. The software ends data acquisition when the user clicks the 'Stop' button; the data is then ready for processing.


3.1.2. Process Beat

The 'Process Beat' button triggers the transformation of the voiced input into a real drum loop. The complex set of events begins with scanning the full time-domain input signal to find the start and end points of each beatbox sound burst. These points are used later to decide where to place the drum samples.

The algorithm used to detect the bursts scans through the one-dimensional array representing the digitized microphone input. This can be complicated if background noise or erroneous spikes in the input create the appearance of extra artifacts.

The time-domain signal is first de-noised, then scanned. The scan ignores low-amplitude bursts that could be mistaken for voiced beats. The thresholds for marking the beginning and end of the bursts can be adjusted for variations in the operating environment and in the user's beatbox loudness. These levels are controlled with the 'Sensitivity' sliders before each processing run.

3.1.2.1. Sensitivity Sliders

These sliders increase and decrease the burst detection thresholds. The thresholds determine the voltage value (sound wave amplitude) at which a burst is detected. If the user is in a particularly noisy environment, they may want to decrease the sensitivity so that background noise artifacts are not confused with beatbox bursts. On the other hand, a user may want to increase the sensitivity to pick up lower-amplitude bursts such as high hats or other subtle vocal gestures. The beatbox can be re-processed at different levels of sensitivity until the desired vocal bursts are detected.

3.1.2.2. Quantize

Quantization capabilities are under development. If a vocal beat is unintentionally off-tempo, quantization would allow automatic correction in the output beat. The output tempo would also be controlled by the beats-per-minute (BPM) slider at the top of Figure 1. Currently the 'Quantize' checkbox has no effect on input processing.

3.1.2.3. Output Beat Generation

Once the bursts have been located, each burst is extracted from the array and sent to the voice recognition engine. From each burst, defining features are extracted. These features are then compared to the user's training set (see 'Training' below). Once the sounds have been identified, the BBVP consults the mappings (see 'Sound Mappings' below) to transplant the appropriate mapped sound for each burst into a new WAV file. The status window lets the user know when processing is complete.


3.1.3. Play, Loop, and the Dry/Wet Mix Slider

Now the loop is ready to be heard! The user has a dry/wet mix option for hearing the processed loop. If the slider is all the way dry when the 'Play' button is pressed, only the original voiced beatbox is heard. At the other end, the slider causes playback of only the transformed beat. Interesting variations on the drum loop can be obtained by moving the slider. Playback can also be looped indefinitely with the 'Loop' radio button.
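The dry/wet slider behaves as a linear crossfade between the two signals. A minimal Python sketch (the BBVP itself is MatLab; function and variable names here are illustrative), assuming equal-length sample arrays and a mix value between 0 and 1:

```python
def dry_wet_mix(dry, wet, mix):
    """Linearly crossfade two equal-length sample arrays.

    mix = 0.0 -> only the original (dry) voiced beatbox is heard;
    mix = 1.0 -> only the transformed (wet) drum loop is heard.
    """
    return [(1.0 - mix) * d + mix * w for d, w in zip(dry, wet)]

# Toy signals demonstrating the two slider extremes and an even blend:
dry = [1.0, 0.5, -0.5]
wet = [0.0, 1.0, 1.0]
print(dry_wet_mix(dry, wet, 0.0))   # [1.0, 0.5, -0.5]
print(dry_wet_mix(dry, wet, 1.0))   # [0.0, 1.0, 1.0]
print(dry_wet_mix(dry, wet, 0.5))   # [0.5, 0.75, 0.25]
```

Intermediate slider positions therefore layer the voiced beatbox under the generated drum loop, which is the 'interesting variation' mentioned above.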

3.1.4. Save Beat

The processed beat can be saved to a WAV file for later use in another application such as Pro Tools, or Recycle by Propellerhead. Recycle is a loop-making program that can split beats into individual slices for use in a hardware or software sampler. It also converts WAV file loops into the REX format used by Propellerhead's Reason software.


3.1.5. Play Click Track

To help the user stay on beat, or to create a beat with a precise tempo, a click track can be generated. Using the topmost slider (see Figure 1), the user can adjust the number of beats per minute (BPM). When the click track is played, the software outputs a short beep at the selected BPM rate. The click track continues as long as the toggle button is depressed. While recording, it is recommended that the click track be played through headphones to avoid interfering with the microphone input.
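The click spacing follows directly from the BPM value and the sample rate. A small Python sketch of the arithmetic (the function name is hypothetical, not taken from the BBVP code):

```python
def click_interval_samples(bpm, fs=44100):
    """Number of samples between metronome clicks at the given tempo."""
    seconds_per_beat = 60.0 / bpm
    return round(fs * seconds_per_beat)

# At 120 BPM a click lands every half second, i.e. every 22050 samples;
# at 60 BPM, every full second, i.e. every 44100 samples.
print(click_interval_samples(120))   # 22050
print(click_interval_samples(60))    # 44100
```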

3.2. Training

To identify sounds in the voiced beat, a template for each unique sound must be established by the software. The features of these sounds are used to train a neural net. The properly trained neural net will classify the user's sounds in an input beatbox. The training area (top right of Figure 1) is used to configure the types of sounds identifiable in each recorded beat.

The pull-down list selects which sound the user is entering into the recognition set. The number associated with each training sound corresponds to its number in the mapping section (see below). Upon clicking the 'Say 5 Repetitions' button, the software waits for five repetitions of the sound from the microphone. To be acceptable, the repetitions must be consistent with each other, as well as distinguishable from the other sounds in the template database. The repetitions ensure a consistent training sound from which to extract a set of averaged features for the feature database. A text message indicates whether the new sound is accepted.
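The report does not give the exact acceptance test for the five repetitions. One plausible scheme, sketched here in Python (the threshold and function name are illustrative assumptions, not the BBVP's actual code), is to average the repetition feature vectors into a template and reject the set when any single repetition strays too far from the mean:

```python
def average_features(reps, max_dev=0.5):
    """Average repetition feature vectors into one training template.

    Returns (template, accepted). The set is rejected when any single
    repetition's Euclidean distance from the mean exceeds max_dev
    (an illustrative threshold, not the report's actual test).
    """
    n, dim = len(reps), len(reps[0])
    mean = [sum(r[i] for r in reps) / n for i in range(dim)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    accepted = all(dist(r, mean) <= max_dev for r in reps)
    return mean, accepted

# Five consistent repetitions of a toy two-element feature vector:
reps = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [1.0, 0.05], [1.0, -0.05]]
template, ok = average_features(reps)
print(ok)   # True: all repetitions are close to the mean
```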

Once the beatbox sounds are recorded, the neural net must 'learn' the features of each sound. Clicking the 'Train Neural Net' button begins the iterative process. The 'Drum Kit' and 'Tabla' check boxes in the training section configure the type of features the net will train itself with. Currently only the drum kit feature recognition works satisfactorily.

The training continues until the neural network can differentiate inputs within an acceptable error threshold. The status window indicates when training is complete.

3.3. Sound Mappings

The mapping section (bottom left of Figure 1) configures the sounds inserted into the output beat. As the human beatbox sounds are identified, they are correlated with the 'Sound number' they were trained as. The mappings control which WAV file is heard in the output for each occurrence of a trained sound.

The BBVP comes with many digital drum samples. The large variety of pre-loaded bass, snare, high hat, and tabla sounds offer users many unique mapping combinations. Of course, a user can import their own WAV files for mapping.

If the user wants to preserve the feel of a particular drum 'kit', the mappings can be saved. Clicking 'Save Kit' records the current mappings. The sounds are referenced by filename, so moving the WAV files will break the mapping. Clicking 'Load Kit' restores a previously saved combination of percussion mappings.
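Since sounds are referenced by filename, a kit is essentially a sound-number to WAV-path dictionary. The report does not state the storage format the BBVP uses; the Python sketch below assumes a JSON file purely for illustration:

```python
import json

def save_kit(path, mappings):
    """Persist sound-number -> WAV filename mappings (format assumed)."""
    with open(path, "w") as f:
        json.dump(mappings, f)

def load_kit(path):
    """Restore a previously saved kit."""
    with open(path) as f:
        return json.load(f)

# Hypothetical sample paths; a saved kit breaks if these files move.
kit = {"1": "samples/kick01.wav", "2": "samples/snare03.wav"}
save_kit("kit.json", kit)
print(load_kit("kit.json") == kit)   # True
```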

4. Audio Processing

This section details the DSP concepts and algorithms of the BBVP.

4.1. Data Acquisition

Equipment and experimental setup:

- 1.8 GHz Athlon, 512 MB RAM
- Microphone and sound card

Topics: capturing analogue data, MatLab audio stream objects, sampling rate, and WAV audio.

4.2. Signal Preparation

The input signal is normalized and its DC offset is removed (so that there are no large FFT spikes at low frequency), followed by noise removal.
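The two preparation steps named above can be sketched as follows. This is a Python stand-in for the MatLab code, not the BBVP's actual implementation:

```python
def prepare_signal(x):
    """Remove the DC offset, then normalize to a peak amplitude of 1.0.

    Subtracting the mean prevents a large FFT spike at 0 Hz;
    normalizing makes the burst-detection thresholds comparable
    across recordings of different loudness.
    """
    mean = sum(x) / len(x)
    centred = [s - mean for s in x]
    peak = max(abs(s) for s in centred) or 1.0   # guard all-zero input
    return [s / peak for s in centred]

# A toy signal riding on a constant 1.0 DC offset:
sig = [0.5, 1.5, 0.5, 1.5]
print(prepare_signal(sig))   # [-1.0, 1.0, -1.0, 1.0]
```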

4.3. Noise Reduction

This stage suppresses background noise in the recorded signal before burst detection.


Unwanted bursts can be suppressed with spectral subtraction or wavelet de-noising; each approach has its own advantages and disadvantages.

4.4. Burst Detection

Even advanced de-noising algorithms can never completely eliminate unintentional or distorted input. Any 'real' environment has broad-spectrum background noise. To operate in variable conditions, the BBVP software must provide a high level of noise immunity.

Noise is a sound in itself (often labelled with a colour, depending on its frequency spectrum). The sound recognition engine could be configured to identify the background noise as a sound, and ignore it in the output. However, this would require the engine to process and distinguish the entire recorded vocal input. Due to the bursty nature of beatbox audio, it is more efficient to isolate the relevant artifacts and pass them to the recognition engine.

Isolating the vocal artifacts while ignoring random noise and erroneous spikes was the first challenge in developing the BBVP. Simply triggering off a voltage level above an absolute threshold causes either false detection of artifacts if the threshold is too low, or late detection of artifacts as each burst takes time to ramp up to the threshold. The BBVP provides sensitivity thresholds to adjust these levels, but in a more intelligent sense.

The algorithm used to detect the bursts scans through the one-dimensional array representing the digitized microphone input. Only the absolute magnitudes of the input are considered. A window of 512 samples scans this array; at a sampling rate of 44100 Hz, the window represents approximately 11.6 ms. The window advances in discrete steps along the time-domain signal, calculating the average of all absolute magnitudes currently within the window.

When the mean value of this window surpasses the ‘Start Sensitivity’ threshold, the beginning of a vocal sound is marked. This is where the mapped sound will be inserted after processing. The ‘End Sensitivity’ setting determines when a burst has subsided. It is important to capture the entire sound so the correct features can be passed to the recognition algorithm. The window size is fixed at run-time but could also be adjusted to deal with noisier signals. An increased window size will tend to ignore noisy low amplitude bursts that could be mistaken for voice beats.

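The windowed-mean scan described above can be sketched as follows. The actual BBVP code is MatLab; this Python version uses illustrative parameter names and a step equal to the window width:

```python
def detect_bursts(x, start_thresh, end_thresh, win=512, step=512):
    """Scan |x| with a sliding window of mean absolute amplitude.

    A burst starts when the window mean rises above start_thresh
    ('Start Sensitivity') and ends when it falls below end_thresh
    ('End Sensitivity'). Returns (start, end) sample-index pairs.
    """
    bursts, in_burst, start = [], False, 0
    for i in range(0, max(len(x) - win, 0) + 1, step):
        mean_abs = sum(abs(s) for s in x[i:i + win]) / win
        if not in_burst and mean_abs > start_thresh:
            in_burst, start = True, i
        elif in_burst and mean_abs < end_thresh:
            bursts.append((start, i + win))
            in_burst = False
    if in_burst:                 # burst still running at end of input
        bursts.append((start, len(x)))
    return bursts

# Silence, one loud burst, then silence again (three 512-sample windows):
sig = [0.0] * 512 + [0.5] * 512 + [0.0] * 512
print(detect_bursts(sig, start_thresh=0.1, end_thresh=0.05))  # [(512, 1536)]
```

Averaging over the window is what gives the scheme its noise immunity: a single erroneous spike barely moves the 512-sample mean, while a genuine vocal burst raises it well above the threshold.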

Figure 2 shows a time-domain plot of a simple bass drum / high hat / snare drum / high hat / snare drum beatbox. The plot actually shows the absolute value of the time-domain signal:


Figure 2: Time-Domain Plot of a Simple Beatbox (absolute magnitudes).

In Figure 3 the detected bursts have been removed, showing that the algorithm found all the beats:

Figure 3: Simple Beatbox with bursts removed.

A real drum resonates when physically struck. The decay of this resonance is dependent on the type of drum and the strength of the strike. The beatbox sounds we are analysing are typically on the order of milliseconds. While the absolute start and end of each burst are marked, this does not mean the output sounds are confined to the burst duration. For a realistic effect, each output sound is triggered at the start of an artifact and played until completion, or until a repeat strike initiates the drum sound again. Identifying the end of a burst ensures the complete vocal sound is analysed by the Voice Recognition Engine.
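The loop-assembly behaviour just described can be sketched as additive mixing of the mapped WAV samples at each burst's start index. This Python sketch is a simplification of the MatLab code: it lets every sample ring to completion and omits the retrigger cut-off of a still-ringing identical drum.

```python
def render_loop(length, hits, samples):
    """Additively mix mapped drum samples into an output buffer.

    hits: list of (start_index, sound_name) pairs from burst detection;
    samples: dict of sound_name -> list of WAV sample values.
    Each sample plays to completion (or until the buffer ends).
    """
    out = [0.0] * length
    for start, name in hits:
        for j, s in enumerate(samples[name]):
            if start + j < length:
                out[start + j] += s
    return out

# Toy kit: a two-sample 'kick' and a one-sample 'hat'.
samples = {"kick": [1.0, 0.5], "hat": [0.2]}
loop = render_loop(6, [(0, "kick"), (1, "hat"), (4, "kick")], samples)
print(loop)   # [1.0, 0.7, 0.0, 0.0, 1.0, 0.5]
```

Note how the hat at index 1 is mixed on top of the kick's still-decaying tail, matching the 'played until completion' behaviour above.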

4.5. Voice Recognition Engine

The Voice Recognition Engine (VRE) lies at the heart of the BBVP. Its main purpose is to extract distinguishing features from the audio and compare these features to the known training set in order to find a match. A feature is a number or array representing a defining property of a piece of audio. The VRE is split into two main parts: feature extraction and pattern matching. All feature extraction is performed on an entire audio burst; that is, the audio bursts are not segmented into frames, a common method in audio feature extraction. Since the audio bursts dealt with in this report are short, typically less than 300 samples (about 7 ms), frame segmentation is not required. However, longer voiced sounds with varying pitch and timbre would need to be segmented for accurate feature extraction.


Initially, a rather crude algorithm involving 256 digital band-pass filters was used to match the input vocal sounds to the training set. A similar algorithm was then tested using the Fast Fourier Transform of the vocal burst. Due to their poor and slow performance, a new pattern-matching algorithm using neural networks, along with a whole new feature set, was implemented.

4.5.1. Multi-Band Pass Method (MBP)

This method was the initial attempt at a voice recognition algorithm for the BBVP.

4.5.1.1. Motivation

Originally, only three basic voiced sounds were considered: bass drum, snare drum, and high hat. The bass drum sound is voiced by pressing the lips together and releasing a quick burst of air through them. A snare drum is imitated by pushing the top of the tongue against the middle of the upper palate, and the high hat is voiced by clenching the teeth, pushing the tongue against the inside of them, and releasing a quick burst of breath. These three sounds are quite distinguishable in the frequency domain. Figures 4, 5, and 6 display the Fast Fourier Transforms (FFT) of each sound.

Figure 4: FFT of Voiced Bass Drum Sound

Figure 5: FFT of Voiced Snare Drum Sound


Figure 6: FFT of Voiced High Hat Sound

Each of the sounds peaks at a significantly different frequency. Based on these observations, the initial MBP recognition algorithm was implemented.

4.5.1.2. The Algorithm

The MBP algorithm is designed to classify vocal beatbox bursts based entirely on their frequency content. The feature used to classify the audio input is a one-dimensional array of length 256. Each detected audio burst is filtered through 256 overlapped 4th-order Butterworth band-pass filters of approximately 157 Hz bandwidth each. Figure 7 shows the amplitude and phase response of one of these filters, centred around 4.8 kHz. The Butterworth design was chosen for its easy implementation and suitable characteristics: it introduces minimal phase distortion and provides a smooth roll-off.


Figure 7: Amplitude and Phase response of 4th order Butterworth bandpass filter

The filters were overlapped to ensure that important parts of the signal were not lost at intermediate frequencies. The average amplitude is taken from each filtering event and placed into the appropriate bin (1 to 256). For instance, a band-pass filter with cut-off frequencies of 20-177 Hz first filters the audio burst; the resultant time-domain signal is averaged, and the result is placed into element one of the feature array. After 256 filtering events, the VRE has a completed feature array of length 256. This array is then used to find the nearest match among the feature arrays in the training database. The feature arrays in the training set are plotted in Figure 8. The three voiced sounds, bass drum, high hat, and snare drum, are separable with respect to their feature arrays. The arrays produce similar results


Figure 8: Features of Beatbox Sounds

A simple nearest-neighbour calculation is performed to match the feature array of the input vocal audio burst against the previously derived training feature arrays, using the Euclidean distance:

d(X, X') = sqrt( sum_{i=1..256} (X_i - X'_i)^2 )

where X represents the current sound burst's feature array and X' represents a user-trained feature array; the training array with the smallest distance d is taken as the match.
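A minimal Python sketch of this nearest-neighbour match (the labels and toy feature arrays are hypothetical):

```python
import numpy as np

def nearest_match(x, training):
    """Return the label of the training feature array closest to x
    by Euclidean distance."""
    return min(training, key=lambda k: np.linalg.norm(x - training[k]))

# Toy 3-element "feature arrays" standing in for the length-256 ones.
training = {
    'bass':  np.array([1.0, 0.2, 0.1]),
    'snare': np.array([0.3, 1.0, 0.4]),
    'hihat': np.array([0.1, 0.3, 1.0]),
}
print(nearest_match(np.array([0.9, 0.25, 0.1]), training))  # bass
```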

4.5.1.3. Testing and Conclusions

The multi-band-pass algorithm was put to the test. The three main vocal sounds, plus two pitch deviations of each sound, were voiced 10 times each and the resultant matches were observed. A fourth mid-frequency 'ooh' sound was included in the training set to try to throw off the algorithm. Table 1 shows the results of the experiment.


Table 1: Drum Set BeatBox test results for MBP method

Vocal Sound               Proposed WAV file   Times matched (Bass / Snare / High Hat / Ohh)   Result
Bass Drum                 Bass Drum           10 / 0 / 0 / 0                                  100%
Snare Drum                Snare Drum           0 / 8 / 0 / 2                                   80%
High Hat                  High Hat             0 / 0 / 10 / 0                                 100%
Lower pitch Bass Drum     Bass Drum           10 / 0 / 0 / 0                                  100%
Lower pitch Snare Drum    Snare Drum           5 / 4 / 0 / 1                                   40%
Lower pitch High Hat      High Hat             2 / 3 / 3 / 2                                   30%
Higher pitch Bass Drum    Bass Drum            6 / 2 / 0 / 2                                   60%
Higher pitch Snare Drum   Snare Drum           0 / 1 / 9 / 0                                   10%
Higher pitch High Hat     High Hat             0 / 0 / 10 / 0                                 100%

The algorithm performed well for beatboxed sounds around the training pitch; however, the MBP mismatched vocal beatbox bursts at different pitches. The worst result was 10% recognition of higher-pitch snare drums, 90% of which were matched to high hats. The bass drum performed perfectly at lower pitches and the high hat performed perfectly at higher pitches. This result is due to the frequency positioning of bass drums and high hats: there is nothing lower in frequency than a bass drum, so lowering the pitch still returned a bass drum, and there is nothing higher than a high hat, so raising the pitch still returned a high hat. The 'ooh' sound threw off the snare the most because it sits mainly in the snare's frequency range.

This algorithm works well for the user for whom it has been trained (at the same pitch). However, a different user voicing at different pitches would receive poor results; an algorithm less dependent on spectral shape is needed. Furthermore, the MBP method is very slow: it took about seven seconds to recognize and process a single beatbox burst, most of which is spent creating the 256 band-pass filters used to obtain the feature array. After implementing this algorithm, it was realized that the MBP's feature array is similar to a low-resolution version of a Fast Fourier Transform (FFT), so the same experiment was repeated with a feature array obtained from an FFT.

4.5.2. FFT Method

The difference between this method and the MBP is that a larger array, i.e. a higher-resolution feature, is used: the size of the feature array equals the length (number of samples) of the sound burst. The main motivation for trying this method was increased speed and potentially better recognition accuracy. Table 2 shows the results.
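A sketch of this feature in Python (the report's implementation is in MatLab; the toy sine burst below is purely for illustration):

```python
import numpy as np

def fft_feature(burst):
    """FFT-magnitude feature array, the same length as the burst,
    mirroring the higher-resolution replacement for the MBP filter
    bank (windowing details are assumptions)."""
    return np.abs(np.fft.fft(burst))

fs = 8000
t = np.arange(1024) / fs
burst = np.sin(2 * np.pi * 440 * t)   # toy 440 Hz "burst"
feat = fft_feature(burst)
print(len(feat))    # 1024, i.e. the number of samples in the burst
```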

Table 2: Drum Set BeatBox test results for FFT method

Vocal Sound               Proposed WAV file   Times matched (Bass / Snare / High Hat / Ohh)   Result
Bass Drum                 Bass Drum           10 / 0 / 0 / 0                                  100%
Snare Drum                Snare Drum           1 / 7 / 0 / 2                                   70%
High Hat                  High Hat             1 / 0 / 9 / 0                                   90%
Lower pitch Bass Drum     Bass Drum           10 / 0 / 0 / 0                                  100%
Lower pitch Snare Drum    Snare Drum           4 / 0 / 0 / 6                                    0%
Lower pitch High Hat      High Hat             4 / 0 / 6 / 0                                   60%
Higher pitch Bass Drum    Bass Drum           10 / 0 / 0 / 0                                  100%
Higher pitch Snare Drum   Snare Drum           0 / 5 / 5 / 0                                   50%
Higher pitch High Hat     High Hat             3 / 0 / 7 / 0                                   70%

The FFT method performed worse than the MBP method. The bass voicing at normal pitch worked perfectly and the high hats worked well; however, the snare voicing was recognized with only 70% accuracy, and the worst result was 0% accuracy for the lower-pitch snare drum. Overall, using an FFT for the feature array did not improve accuracy, but it did speed up processing: a single burst took approximately two seconds. Instead of trying to improve upon these two methods, it was decided that a new, more advanced method was required, one that studies a broader range of characteristics of the vocal bursts, not just spectral shape.

4.5.2.1. Attempted Tabla Bol Recognition

The previous two methods of recognition and pattern matching were further tested on tabla bols. The results were hopeless. A formal experiment was not performed; however, neither algorithm could distinguish between a 'Dha' and a 'Na', which is admittedly not an easy task due to their phonetic similarity. 'Ge' was detected correctly some of the time. Overall, the results showed that a more robust set of features needed to be researched, as well as a more knowledgeable method of pattern classification.

4.5.3. A New Feature Set

Ramp time, average power, zero crossing, peak frequency, and linear predictive coding (LPC) were explored as possible features for recognizing beatbox bursts and tabla bols. After extraction, each feature is normalized by dividing it by the maximum of that feature across the training set; this ensures that features with larger values do not carry more weight than smaller ones. Each feature is accompanied by a plot showing its value for six each of the bass drum, high hat, and snare drum sounds used in a particular training set, denoted by circles, asterisks, and stars respectively.
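The per-feature normalization can be sketched as follows (the numbers are made up for illustration; the report's code is MatLab):

```python
import numpy as np

# Normalize each feature by its maximum over the training set so that
# no feature dominates by sheer scale.
# Rows = training examples; columns = features (illustrative values).
train = np.array([[120.0, 0.8, 3000.0],
                  [ 90.0, 0.5, 4500.0],
                  [ 60.0, 0.9, 6000.0]])
maxima = train.max(axis=0)        # per-feature maxima
normalized = train / maxima       # every column now peaks at 1.0
print(normalized.max(axis=0))     # [1. 1. 1.]
```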


4.5.3.1. Ramp Time

The ramp time of a time-domain signal burst can be defined as the number of samples it takes to reach its maximum value from the start point. To find the ramp time, the envelope of the signal is first detected; this is achieved by low-pass filtering the sound burst with a second-order Butterworth digital filter. Figure 9 illustrates the ramp time of the envelope of a voiced high hat burst (http://silvertone.princeton.edu/~park/thesis/dartmouth/html/ch2-3.html).

Figure 9: Ramp time of the envelope of a voiced high hat time-domain signal

Here is example MatLab code that finds the ramp time, in samples, of the sound array 'vector'.

% ramptime.m
% Authors: Adam Tindale and Ajay Kapur
%
% Finds ramp time of infile
%
% ex. vector = wavread('sound.wav');
%     [ramptimesamples_1] = ramptime(vector);

function [result] = ramptime(vector)

window_size = 500;

% find envelope of signal
[b,a] = butter(2, .0018);               % create low pass filter
env = filter(b, a, abs(vector));        % envelope of the array
env(length(env):length(env)+window_size) = 0;  % prevent exceeding indices

% get length of vector
veclen = length(vector);                % store length of vector

% find y value of maximum
maxValue = max(env);

% search for x value of maximum
for i = 1:veclen
    if env(i) == maxValue
        ramptimesamples_1 = i;
        break;
    end
end

result = ramptimesamples_1;


Figure 10 shows how six of each of the vocal drum sounds are classified by ramp time: six bass drums (circles), six high hats (asterisks), and six snare drums (stars).

Figure 10: Bass Drums (circles), High Hats (asterisks), and Snare Drums (stars) classified by ramp time

Ramp time managed to classify four of the snare hits separately; however, the other three snare hits, along with the bass drums and high hats, are scattered between 0.5 and 1.

4.5.3.2. Root Mean Square (Average Power)

Extracted in the time domain, the root mean square (RMS), or average power, is a measure of the perceived loudness of an audio signal (http://vision.poly.edu:8080/paper/audio-mmsp.html#intro):

RMS = sqrt( (1/N) * sum_{n=1..N} x(n)^2 )

where x(n) is the time-domain signal and N is its number of samples.

The MatLab implementation is trivial:

% rmtfinder.m
%
% Finds the root mean square of a signal in the time domain:
% square each sample, average over the number of samples, take the root.
%
% Authors: Ajay Kapur

function [rmt] = rmtfinder(signal)

% find length of signal
N = length(signal);

sum = 0;
for i = 1:N
    sum = sum + (signal(i) * signal(i));
end

rmt = sqrt(sum / N);

Figure 11: Bass Drums (circles), High Hats (asterisks), and Snare Drums (stars) classified by RMS or average power

RMS proved more successful than ramp time as a classifier: average power clearly separated all six bass drums from the snares and high hats. Since voicing a bass drum releases the greatest amount of air pressure from the mouth, it is no surprise that bass drums have the greatest average power.


4.5.3.3. Zero Crossing

Zero crossing is a feature measured in the time domain that counts the number of times the signal burst crosses the time axis. The zero-crossing count is proportional to the frequency of the signal: the greater the number of zero crossings, the greater the frequency (http://profs.sci.univr.it/~dafx/Final-Papers/pdf/GouyonPachetDelerue.pdf). Figure 12 plots circles on a time-domain bass drum burst at some of the zero-crossing points.

Figure 12: Zero Crossings in a Bass Drum Sound

Here is the MatLab implementation of zero crossing:

function [out, index] = zcr(in, slope, level)
% [out, index] = zcr(in, slope, level)
%
% Zero crossing rate of input data (x-axis crossings)
%
% out:    number of zero crossings
% index:  indices into the matrix where zero-crossings are found
% in:     data
% slope:  (string, only first char is taken into account)
%         'positive': all zero-crossings with rising slopes are counted
%         'negative': all zero-crossings with falling slopes
%         'all'     : all zero-crossings (= default)
% level:  (scalar) crossings over y = level
%         (default: level = 0, i.e. zero-crossings)
%
% Armin Günter
% Rev. 1.0,  6.10.97
% Rev. 2.0, 19.12.97: input argument check, matrix support
% Rev. 3.0,   6.3.98: optional index output, optional slope selection,
%                     end points are no longer counted!

% catch errors:
error(nargchk(1, 3, nargin))
if (isempty(in))
    out = [];
    return
end
error(nargchk(0, 2, nargout))
if ~exist('level')
    level = 0;
end

temp = sign(in(:) - level(1));
% eliminate zeros; otherwise a -x,0,y axis crossing would be counted twice!
temp(temp == 0) = 1;
if exist('slope')
    switch slope(1)
        case 'p'
            index = find(diff(temp) > 0);
        case 'n'
            index = find(diff(temp) < 0);
        otherwise
            index = find(diff(temp));
    end
else
    index = find(diff(temp));
end
out = length(index);

Figure 13 shows how zero crossing classifies the three voiced sounds. High hats (asterisks) are well separated as the group with the greatest value, which makes sense because high hats are higher in frequency than snare drums and much higher than bass drums. However, the classification did not perfectly separate all high hats; one was classified with the snare drums. A better training set would yield better results.


Figure 13: Bass Drums (circles), High Hats (asterisks), and Snare Drums (stars) classified by zero crossing

4.5.3.4. Peak Frequency

The peak frequency feature simply records the frequency at which the FFT of the time-domain signal burst reaches its maximum amplitude. This feature was a candidate because voiced bass drums, snare drums, and high hats have distinguishable frequency peaks at respectively increasing frequencies; refer to the earlier FFT plots (figures ?, ?+1, and ?+2) of each of these sounds.
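A Python sketch of the peak-frequency feature (a toy sine signal stands in for a recorded burst; precision is limited by the FFT bin spacing):

```python
import numpy as np

def peak_frequency(burst, fs):
    """Frequency (Hz) of the largest FFT magnitude."""
    spec = np.abs(np.fft.rfft(burst))
    freqs = np.fft.rfftfreq(len(burst), d=1.0 / fs)
    return freqs[np.argmax(spec)]

fs = 8000
t = np.arange(2048) / fs
print(peak_frequency(np.sin(2 * np.pi * 1000 * t), fs))  # close to 1000.0
```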


Figure 14: Bass Drums (circles), High Hats (asterisks), and Snare Drums (stars) classified by peak frequency

According to Figure 14, peak frequency classified the three sounds similarly to the zero-crossing feature. It classified all but one of the high hats quite tightly near the maximum value of one. The bass drums were classified lower, mixed with some snare sounds. Overall, a successful feature.

4.5.3.5. Final Feature Selection for Drum Set

Based on the above analysis and further testing, three features were chosen to classify the voiced bass drum, snare drum, and high hat: zero crossing, ramp time, and peak frequency. As an improvement over the nearest-neighbour pattern matching used with the MBP and FFT based algorithms, neural networks were researched. The three features are the three inputs to the back-propagating neural network used to match the user's input drum sounds to the training set.

4.5.3.6. Linear Predictive Coding (LPC)

LPC filter coefficients are used as features to classify voiced tabla bols (http://www.cs.princeton.edu/ugprojects/projects/a/achatwan/senior/IW.pdf).


Through trial-and-error testing, the single-value features above were deemed useless for tabla bol recognition, so LPC was pursued instead.

4.5.3.6.1. LPC Background

Linear Predictive Coding is a powerful speech analysis technique for representing speech for low bit-rate transmission or storage. The LPC algorithm decomposes a speech signal into its formants and its residue (L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978). Because speech is always changing in time, it is chopped into 20-30 ms frames before analysis, and the formant and residue are calculated for each frame. The formant can be thought of as the vocal tract, and the residue represents the buzz created by the larynx. To reduce the amount of data transmitted across a phone line, only the formant information is sent; at the receiving end, a residue similar to the one left at the transmitting side excites the formant.

The formant is described by an all-pole (infinite impulse response) filter with a user-determined number of filter coefficients; more coefficients give a better representation of the formant. The filter coefficients are determined through the linear predictor equation below.

x_hat(n) = sum_{i=1..p} a_i x(n-i)
e(n) = x(n) - x_hat(n)

The a_i's represent the all-pole filter coefficients. x_hat(n) is the linear predictor, a prediction of the current signal value based on the p previous samples, and e(n) is the estimated error. The filter coefficients are determined adaptively by minimizing the mean square error between the predicted speech and the actual speech.

The residue obtained through LPC represents the excitation source required to recreate the speech signal once convolved with the formant. The residue usually takes the form of white noise for an unvoiced sound and a pulse train for a voiced sound.
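For illustration, the LPC analysis that the implementation below performs with MatLab's lpc command can be sketched in Python via the standard autocorrelation method (Levinson-Durbin recursion). This is a generic textbook implementation, not the BBVP code:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coeffs(x, p):
    """Order-p LPC via the Levinson-Durbin recursion on biased
    autocorrelation estimates. Returns [1, a_2, ..., a_(p+1)] for the
    all-pole error filter A(z), as MatLab's lpc() does."""
    n = len(x)
    # biased autocorrelation r[0..p]
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)]) / n
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                         # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a

# Recover the coefficients of a synthetic all-pole (AR) signal.
rng = np.random.default_rng(0)
x = lfilter([1.0], [1.0, -0.5, 0.3], rng.standard_normal(20000))
a = lpc_coeffs(x, 2)
print(np.round(a, 2))   # close to [1. -0.5 0.3]
```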

4.5.3.6.2. Tabla Bol BBVP Implementation

The filter coefficients obtained by the LPC system are used as features to classify tabla bols in the BBVP software. MatLab has a simple command, lpc, that computes a desired number of coefficients for any input real-numbered array. Each tabla bol burst was sent to this command, and LPC filter coefficients two through eight were used to classify the bols. The coefficients are normalized, so the first one, which is always one, is of no use in classification.

Features were extracted from six repetitions of four different tabla bols: Ge, Dha, Na, and Tin. Figures 15-18 show the value of each LPC coefficient for six repetitions of Ge, Dha, Na, and Tin respectively:


Figure 15: ‘Ge’ classification by LPC coefficients 2-8


Figure 16: ‘Dha’ classification by LPC coefficients 2-8


Figure 17: ‘Na’ classification by LPC coefficients 2-8


Figure 18: ‘Tin’ classification by LPC coefficients 2-8

The feature values for each set of sounds turned out to be quite similar. In particular, the 'Dha' and 'Na' LPC coefficients were very close, which is expected since Dha and Na are phonetically similar. Figure 19 overlays the LPC coefficients for six iterations of every sound, with circles, asterisks, stars, and arrows representing 'Ge', 'Dha', 'Na', and 'Tin' respectively.


Figure 19: LPC coefficients of all tabla bols overlaid. 'Ge' (circles), 'Dha' (asterisks), 'Na' (stars), 'Tin' (arrows)

There is not much class separation except for 'Tin'. Even though LPC coefficients two through eight did not separate the four tabla bols graphically, testing still went ahead using LPC as the feature set. As with the drum sounds, a neural network is used for pattern matching.

4.5.4. The Neural Network

Neural networks were chosen as the method of pattern matching for three main reasons. First, MatLab provides an excellent implementation of several types of neural networks in its Neural Network Toolbox. Second, a method more advanced than nearest neighbour was clearly needed for tabla bol recognition: tabla bol voicings are much closer to regular human speech, drawing on many consonant and vowel sounds (http://www.aspire.cs.uah.edu/textbook/voice001.html). Finally, a neural network can adaptively learn to classify a data set from initial examples, which suits the current framework of the BBVP, where the user trains the software before use.


4.5.4.1. Background

Neural networks process information in a way similar to the human brain. A network is composed of a large number of highly interconnected processing elements (neurons) working in parallel to solve a specific problem. Figure 20 shows a neuron.

Figure 20: Diagram of human neuron.

The dendrites receive input and the soma integrates all the inputs. If the inputs exceed a certain threshold, an impulse is sent down the axon to the outputs, which then act as inputs to other neurons (Mohamad H. Hassoun, Fundamentals of Artificial Neural Networks, The MIT Press, 1995).

Figure 21 below is a diagram of a modelled neuron. The vector x is the input to the synapses. These inputs are multiplied by the weight vector w, which is determined during the training phase. The weighted inputs are then summed to give v and passed through the activation function φ; the output y equals φ(v). This is summarized in the equation below.

Figure 21: Diagram of a mathematically modelled neuron.

v = sum_i w_i x_i,    y = φ(v)


There are a variety of commonly used activation functions; here are a few of them:

1. Linear, where y = v.

2. Step.

3. Piecewise linear.

4. Sigmoidal, where y = φ(v) for an S-shaped φ such as the logistic function φ(v) = 1/(1 + e^(-v)).

In general, hard-limiting activation functions such as the step and piecewise linear are good for pattern classification problems, while functions such as the sigmoidal are used for function approximation. A number of these modelled neurons joined together form a neural network. Networks typically consist of three layers: an input layer, a hidden layer, and an output layer; the input layer's outputs feed the hidden layer, which in turn feeds the output layer. Neural networks can further be classified into feed-forward and feedback networks. The outputs of feed-forward networks depend entirely on the current input, whereas the outputs of feedback networks depend on past inputs as well. Feed-forward networks are used the most, since they have no issues with feedback instability.

4.5.4.2. Training a Neural Network

Training is the process of adjusting the weight vector at each layer of the network. Initially the weight vector w(n) is chosen randomly. An input pattern is presented to the network, and the actual output y(n) is compared with the desired result d(n) to find the error:

e(n) = d(n) - y(n)

The values of w(n) are then updated:

w(n+1) = w(n) + η e(n) x(n)

where η is the step size or gain factor. One iteration of this process over the training data is known as an epoch.
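This update rule can be sketched for a single linear neuron in Python; the AND-gate data, learning rate, and epoch count are illustrative assumptions, not the BBVP configuration:

```python
import numpy as np

# LMS-style training of one linear neuron with the rule
# w(n+1) = w(n) + eta*e(n)*x(n), on toy AND-gate data.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0.0, 0.0, 0.0, 1.0])   # desired outputs (AND)
w = np.zeros(2)
b = 0.0
eta = 0.1                            # step size (gain factor)

for epoch in range(200):             # one pass over the data = 1 epoch
    for x, target in zip(X, d):
        y = w @ x + b                # current output
        e = target - y               # e(n) = d(n) - y(n)
        w += eta * e * x             # w(n+1) = w(n) + eta*e(n)*x(n)
        b += eta * e

print((X @ w + b > 0.5).astype(int))   # -> [0 0 0 1]
```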

4.5.4.3. BBVP Implementation

The Bionic Beat Box Voice Processor uses a triple-layer, feed-forward, back-propagating neural network. The activation function used is sigmoidal. For this application a hard-limiting function such as the step would be preferable, but there were complications creating a MatLab network with a hard-limiting threshold function, and more time is needed to research this issue. The MatLab code below sets up the neural net:

net = newff( minmax(trainingdata), [12, 10, typemax], {'logsig', 'logsig', 'logsig'});
net.trainParam.goal = 0.01;
% show training progress every 50 epochs
net.trainParam.show = 50;
% sets max number of times the complete data set may be used for training
net.trainParam.epochs = 6000;

The numbers 12, 10, and typemax tell the neural net how many neurons to use for layers one, two, and three respectively. Typemax equals the number of outputs, i.e. the number of voiced sounds (bass drum, snare, high hat).
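For illustration, a forward pass through a network of this shape (three logsig layers) can be sketched in Python; the random weights below are stand-ins for the values found during training:

```python
import numpy as np

def logsig(v):
    """Sigmoidal activation, as in MatLab's 'logsig'."""
    return 1.0 / (1.0 + np.exp(-v))

# Layer sizes: 3 features in -> 12 -> 10 -> typemax (= 3 sounds) out.
rng = np.random.default_rng(1)
sizes = [3, 12, 10, 3]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

def forward(x):
    for W, b in zip(weights, biases):
        x = logsig(W @ x + b)        # each layer: logsig(Wx + b)
    return x

out = forward(np.array([0.2, 0.8, 0.5]))   # normalized feature vector
print(out.shape)    # (3,) -- one score per sound
```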

4.5.5. Testing the New Voice Recognition Engine

With the new feature set and the improved pattern matching provided by neural networks, drum set beatbox and tabla bol voicings were tested.

4.5.5.1. Beat Box Results

The same voiced sounds and the same training set were used for this test as for the previous MBP and FFT algorithm tests. The results are recorded in Table 3.


Table 3: Drum Set BeatBox test results with neural nets and new feature set

Vocal Sound               Proposed WAV file   Times matched (Bass / Snare / High Hat / Ohh)   Result
Bass Drum                 Bass Drum           10 / 0 / 0 / 0                                  100%
Snare Drum                Snare Drum           0 / 10 / 0 / 0                                 100%
High Hat                  High Hat             0 / 0 / 10 / 0                                 100%
Lower pitch Bass Drum     Bass Drum            9 / 0 / 0 / 1                                   90%
Lower pitch Snare Drum    Snare Drum           0 / 10 / 0 / 0                                 100%
Lower pitch High Hat      High Hat             0 / 3 / 7 / 0                                   70%
Higher pitch Bass Drum    Bass Drum            8 / 0 / 0 / 2                                   80%
Higher pitch Snare Drum   Snare Drum           0 / 9 / 1 / 0                                   90%
Higher pitch High Hat     High Hat             0 / 0 / 10 / 0                                 100%

Overall the experiment was a success. The worst result was 70% recognition for the lower-pitch high hat, and all regular-pitch sounds were recognized perfectly; this test matched or beat the two older methods across the board. The detection of lower-pitched snare drums showed the greatest improvement: recognition rose from 0% (FFT) to 40% (MBP) to 100% (neural net). The neural network method also beat the MBP and FFT methods for burst processing speed, at a swift 1.5 seconds. It takes approximately 9 seconds to train the neural network, but training is performed only once per instance of the application.


4.5.5.2. Tabla Bol Results

Each of the tabla bols (ge, dha, na, tin) was voiced and processed with the BBVP. Initially each bol was voiced ten times and the output was observed; each bol was then repeated ten times at a lower pitch and ten times at a higher pitch. Table 4 summarizes the results.

Table 4: Tabla test results with neural nets and a new feature set

Vocal Sound          Proposed WAV file   Times matched (Ge / Dha / Na / Tin)   Result
'Ge'                 Ge                  3 / 2 / 3 / 1                          30%
'Dha'                Dha                 1 / 6 / 1 / 2                          60%
'Na'                 Na                  0 / 1 / 1 / 8                          10%
'Tin'                Tin                 6 / 0 / 0 / 4                          40%
Lower pitch 'Ge'     Ge                  2 / 2 / 0 / 6                          20%
Lower pitch 'Dha'    Dha                 0 / 0 / 3 / 7                           0%
Lower pitch 'Na'     Na                  1 / 1 / 3 / 6                          30%
Lower pitch 'Tin'    Tin                 4 / 0 / 3 / 3                          30%
Higher pitch 'Ge'    Ge                  4 / 0 / 3 / 3                          40%
Higher pitch 'Dha'   Dha                 1 / 1 / 6 / 2                          10%
Higher pitch 'Na'    Na                  0 / 3 / 3 / 4                          30%
Higher pitch 'Tin'   Tin                 0 / 4 / 3 / 3                          30%

The experiment proved unsuccessful. The best result was 60% recognition of dha at its original pitch; the worst was 0% for lower-pitch dha. The results do not show any consistent trend that identifies the cause of the poor performance, and further research is needed to correctly recognize tabla bols.

5. Project Extensions

Where does it go from here? Things still under development:

- more complex drum sounds (tablas)
- improved speed (the system is not yet real-time)
- a VST plugin / Max implementation, to increase integration with other audio production tools
- if not a plugin, an improved interface: a snazzy GUI (Java?) or a cleaned-up/prettied-up MatLab GUI

5.1. Additional Effects/Features

- melodic accompaniments (arpeggiator)
- rhythmic accompaniments: an automatic shaker, high-hat, or tambourine track
- video loops with audio loops

6. Conclusions

- MatLab rules
- success of recognition: summarize test results
- limitations of recognition: drums vs. tablas; fastest BPM detectable
- comments on usability / practicality; potential commercial utility

7. Recommendations

- continue development in MatLab
- refine the algorithms
- add the 'Additional Features' above
- flesh out the software; explore possibilities in MatLab before attempting Max/MSP
- when a professional-level tool is ready, make it available for general consumption

7.1. Better Tabla Bol Recognition

As far as pattern matching is concerned, neural networks can be explored further. A number of sub-networks could be used, one per sound: each tabla bol would have a dedicated neural network trained for that sound only. For ge, dha, na, and tin classification, four neural networks would be used, each with a single output (http://vision.poly.edu:8080/paper/audio-mmsp.html#intro). Each network would master a single sound, instead of a single network trying to learn all the sounds.

Feature extraction can be greatly improved; two possibilities are modified LPC with principal component analysis (PCA), and subband extraction. The current method of LPC extraction is not working well. It may be useful to find the LPC coefficients of just the steady-state region of the sound burst, the region after the attack and before the decay. A greater number of LPC coefficients, 20-30, could be found, and PCA could then reduce them to a lower dimensionality suitable for input into the neural network. PCA returns a lower-dimensional set of linearly independent variables; it can be described as taking the redundancy out of the data set.
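The suggested PCA reduction can be sketched in Python (the data and coefficient counts below are made up for illustration):

```python
import numpy as np

def pca_reduce(features, k):
    """Project feature rows onto their top-k principal components,
    via SVD of the mean-centred data."""
    centred = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T

# 12 bursts x 25 LPC coefficients of stand-in data, reduced to 4 inputs
rng = np.random.default_rng(0)
coeffs = rng.standard_normal((12, 25))
reduced = pca_reduce(coeffs, 4)
print(reduced.shape)   # (12, 4)
```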

LPC coefficients and other features such as ramp time and zero crossing can also be extracted from various sub-bands of the signal burst. This is achieved by first filtering the burst into sub-bands with band-pass filters; ratios of the feature values across sub-bands can then be used as a new set of features (http://vision.poly.edu:8080/paper/audio-mmsp.html#intro).

8. Literature Review?

- show relevant research
- prove we are not simply reinventing existing work

9. References

http://silvertone.princeton.edu/~park/thesis/dartmouth/html/ch2-3.html (features)

http://vision.poly.edu:8080/paper/audio-mmsp.html#intro (VRE introduction, RMS, and recommendations for better tabla bol recognition)

http://profs.sci.univr.it/~dafx/Final-Papers/pdf/GouyonPachetDelerue.pdf (zero crossing)

http://www.aspire.cs.uah.edu/textbook/voice001.html (neural net introduction)

http://www.cs.princeton.edu/ugprojects/projects/a/achatwan/senior/IW.pdf (LPC)

L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice-Hall (Signal Processing Series), 1978. (LPC background)

Mohamad H. Hassoun. Fundamentals of Artificial Neural Networks. The MIT Press, 1995. ISBN 0-262-08239-X. (neural nets background)

http://www.irishhiphop.f2s.com/articles/beatbox.htm (beatboxing)