Fresno State

Department of Electrical and Computer Engineering

Title: Speech Recognition Using FPGA

Senior Design Project Report

Student: Tyler Havner

Ismael Perez

Technical Advisor: Dr. Reza Raeisi

Dr. Daniel Bukofzer

Dr. Sean Fulop

FALL 2012

Comments

TABLE OF CONTENTS

Course Evaluation Rubric
Definition of Key Terms
1. Problem and its Setting
    1.1 Introduction
    1.2 General Statement of the Problem
    1.3 Objective Solution
    1.4 Scope of the Study
    1.5 Project Limitation
2. Background Theory
    2.1 Analog to Digital Converter (ADC)
    2.2 Frequency Spectrum
    2.3 Digital Filters
        2.3.1 Infinite Impulse Response (IIR) Filters
        2.3.2 Finite Impulse Response (FIR) Filters
    2.4 FPGA
    2.5 Summary
3. Monetary Costs of Project
4. Methodology
    4.1 Theoretical Concept
    4.2 Detail Algorithm and Design Approach
        4.2.1 Data Acquisition and the ADC
        4.2.2 Start of the Word Detection
        4.2.3 Frequency Analysis
        4.2.4 Fingerprint Generation
        4.2.5 Comparison Function
        4.2.6 Driving Outputs
        4.2.7 System Architecture
        4.2.8 Training the System
    4.3 Work Breakdown
5. Parts Ordering
6. Finding and Conclusion
    6.1 MatLab Findings
    6.2 Testing Results
    6.3 System Improvements
7. Conclusion
References
Appendix A: MatLab Code
Appendix B: Ordering Receipts
Appendix C: DE2 Board Code

ECE 186B Course Evaluation Rubric

1. How successfully you were able to convert your problem statement or project objectives to your own engineering domain, such as the digital domain, control domain, or microcontroller domain, in order to find an approach to a solution.
We were very successful in converting our project objectives to our own engineering domain. We created a MatLab model to observe the signal in the digital time domain as well as the frequency domain. We also created the equivalent model in the microprocessor domain to program the DE2 board in C.

2. How successfully you were able to determine the right engineering tools for the purpose of your project.
We were able to determine the right engineering tools through our years of familiarity with lab equipment and software. When it became clear that our project entailed a lot of digital signal processing (DSP), we quickly realized that MatLab would cut down design time by providing powerful DSP toolboxes to analyze each step of our design. The bench tools also allowed us to test all of our hardware.

3. The effectiveness of using the tools.
We were effective at using the engineering tools for our project. We had to do a lot of research into some of the tools we used, such as how to sample data from the microphone input channel in MatLab and how to program the DE2 board in C. Due to our sound foundation in those areas we were able to understand how to use those tools accurately and effectively.

4. Your experience on being able to develop a prototype and simulation of it.
We feel confident in our ability to use engineering tools to come up with the best solution within the given constraints. When developing a prototype, many unanticipated issues arise and the design needs to be adapted. We feel that we made good design decisions in order to provide a working prototype in a timely manner.

5. Overall correctness of your design.
Our design was correct according to the goals of the method we set out to test. There was some error associated with the casting of floating point accumulations into integer values, but it was minor. The design could have been improved with a higher filter order, as we had originally designed, but that would have required even longer computation times and more memory.

Definition of Key Terms

The following is a list of key terms along with their definitions from Wikipedia that will be needed in order to grasp the concept of our proposed system [6].

Analog Signal: any continuous signal for which the time varying feature (variable) of the signal is a representation of some other time varying quantity

Digital Signal: is a physical signal that is a representation of a sequence of discrete values (a quantized discrete-time signal)

Pulse-code modulation (PCM): A PCM stream is a digital representation of an analog signal, in which the magnitude of the analog signal is sampled regularly at uniform intervals, with each sample being quantized to the nearest value within a range of digital steps.

Frequency Spectrum: is a representation of a time-domain signal in the frequency domain. The frequency spectrum can be generated via a Fourier transform of the signal, and the resulting values are usually presented as amplitude and phase, both plotted versus frequency.

Low Pass Filter: is an electronic filter that passes low-frequency signals but attenuates (reduces the amplitude of) signals with frequencies higher than the cutoff frequency.

Band Pass Filter: is an electronic filter that passes frequencies within a certain range and rejects (attenuates) frequencies outside that range.

Digital Filter: is characterized by its transfer function, or equivalently, its difference equation.

Logarithmic Scale: is a scale of measurement using the logarithm of a physical quantity instead of the quantity itself. A simple example is a chart whose vertical axis has equally spaced increments that are labeled 1, 10, 100, 1000, instead of 1, 2, 3, 4. Each unit increase on the logarithmic scale thus represents an exponential increase in the underlying quantity for the given base (10, in this case).

Decibel: is a logarithmic unit that indicates the ratio of a physical quantity (usually power or intensity) relative to a specified or implied reference level. A ratio in decibels is ten times the logarithm to base 10 of the ratio of two power quantities.

Accumulator: is a register in which intermediate arithmetic and logic results are stored. Without a register like an accumulator, it would be necessary to write the result of each calculation (addition, multiplication, shift, etc.) to main memory.

Aliasing: an effect that causes different signals to become indistinguishable (or aliases of one another) when sampled.

FPGA: is an integrated circuit designed to be configured by the customer or designer after manufacturing—hence "field-programmable". The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC).

Chapter 1: Problem and Its Setting

1.1 Introduction

Speech recognition is expanding its reaches with modern technology. Calls made to large

companies heavily rely on voice recognition to efficiently route the calls to the proper

department. Luxury cars are incorporating systems so that the driver has an interactive

experience with the automobile. Smart phone programs like Siri are pushing the envelope in

artificial intelligence using speech recognition. The next logical progression is using speech

recognition around the home.

1.2 General Statement of Problem

According to the U.S. Census Bureau, 11 million Americans need personal assistance with everyday activities and over 3.3 million use a wheelchair [5]. For these Americans, simple tasks such as turning on a ceiling fan or opening a door become a chore. There is an obvious need for voice recognition in household devices to make for a hands-free environment.

1.3 Objective Solution

Our project will employ the programmable structure of a field-programmable gate array (FPGA) to turn ordinary household objects, like a door and a ceiling fan, into hands-free devices. For example, if the user needed to open a door they could simply say ‘open’ and the system would fully open the door. This system could alleviate some of the day-to-day struggles that physically disabled people go through in their home by allowing them to interact with household devices simply by using their voice. This could potentially allow them to become more independent in their home. With a successful implementation of our system the speaker will be able to open and close a door as well as turn a ceiling fan off and on by using four simple commands that will be discussed in detail in Chapter 4.

1.4 Scope of the Study

A broad engineering background is needed for this project. A strong understanding of signals and systems, with an emphasis on signal analysis via Fourier transform methods, will provide the basic foundation. The core of the project lies in digital signal processing (DSP), as well as in programming in a hardware description language (HDL) and C. The hardware portion of the project will require a background in electric motors and the electronics that drive them. The courses we have taken to prepare for this project are as follows:

Tyler:

Course Number    Course Description
ECE 71           Programming in C
ECE 121          Electromechanical Systems
ECE 124          Signals and Systems
ECE 134          Communications
ECE 138          Electronics II

Ismael:

Course Number    Course Description
ECE 107          Digital Signal Processing
ECE 124          Signals and Systems
ECE 176          Verilog Coding
ECE 178          Embedded Systems
CSCI 150         Software Engineering

1.5 Project Limitations

Speech processing is a very complex problem, and modern speech analysis is accomplished by using complex probabilistic characterizations of words and sentence structures known as Hidden Markov Models [3]. The goal of our project is to create a reliable and accurate system that does not rely on such complex models. The system is kept deliberately simple in order to lay a basis for future consumer products. One restriction on our system is speaker dependence. Creating a system that does not rely on a specific speaker is a very complex problem, so our system will have a primary speaker (the homeowner). Also, our system will use single-word recognition, not continuous speech recognition. The speaker will first have to train our system with several versions of the same word, yielding a “reference fingerprint”. The reference fingerprint is the set of values that results from averaging the three sets of values from the training words. Subsequent words can be recognized based upon how closely they match the saved reference fingerprint.

Chapter 2: Background Theory

In the following sections we will discuss the background theory relevant to our project: analog to digital conversion, frequency spectrum analysis, digital filters, and the FPGA. For each topic we explain how it relates to our project. We will use an analog to digital converter to convert the voltage representation of our spoken words into digital information. Once we have a digital representation of the spoken word, we need to extract its significant frequency components so that a decision can be made on whether it is in fact one of the correct words.

2.1 Analog to Digital Converter (ADC)

Everything in the real world is analog; this includes sound, light, and even temperature [2]. Computers are not able to handle analog information, so everything that needs to be manipulated by a computer has to be converted into digital form, that is, into strings of ones and zeros. To achieve this we need an analog to digital converter. For our purpose we will be converting sound into its digital version. To convert a waveform back from digital to analog, a digital to analog converter is used.

The two most important variables that determine how closely a digitally sampled waveform matches the original continuous-time waveform are the sampling rate and the bit resolution. The ADC takes discrete points on the waveform at a specified rate, the sampling frequency. One of the most popular methods for analog to digital conversion is called pulse code modulation (PCM). In PCM, the amplitude of the waveform (most commonly voltage) is quantized into discrete levels that have encoded binary representations. The number of quantization levels $L$ is directly related to the number of binary bits $n$, or bit resolution, used to represent each sample, as shown in Equation 2.1.1.

$$L = 2^{n} \qquad (2.1.1)$$

The size of the quantization levels is then based upon the amplitude bounds and the number of levels. Equation 2.1.2 shows this relationship, with $\Delta$ equal to the size of the quantization levels, $V_p$ the peak amplitude, and $L$ the number of quantization levels.

$$\Delta = \frac{2 V_p}{L} \qquad (2.1.2)$$

Figure 2.1.1 shows a PCM encoded waveform with 3-bit resolution.

Figure 2.1.1: 3-bit PCM Encoded Waveform

In Figure 2.1.1, the quantization levels are numbered on the left-hand side of the y-axis while the binary encoded representation of the levels is shown on the right-hand side of the y-axis. The x-axis shows the encoded sampled values of the waveform. The sampling frequency that we will be using is based on the Nyquist formula [1]. This formula says that in order to avoid aliasing of the reconstructed waveform, the sampling frequency should be at least twice the highest frequency component in the waveform. Equation 2.1.3 gives the Nyquist sampling theorem, with $f_s$ equal to the sampling rate and $f_{max}$ equal to the highest frequency component in the sampled waveform.

$$f_s \geq 2 f_{max} \qquad (2.1.3)$$
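As a rough numerical illustration of the relationships in Equations 2.1.1 through 2.1.3, the Python sketch below computes the number of quantization levels, the step size, and the minimum sampling rate. The function names, the 1 V peak amplitude, and the assumption that the input spans the range from the negative peak to the positive peak are illustrative choices, not values taken from the report.

```python
def quantization_levels(bits):
    """Number of PCM quantization levels for a given bit resolution (L = 2^n)."""
    return 2 ** bits


def step_size(peak_amplitude, bits):
    """Quantization step, assuming the input spans -peak_amplitude..+peak_amplitude."""
    return 2.0 * peak_amplitude / quantization_levels(bits)


def min_sampling_rate(highest_frequency_hz):
    """Nyquist criterion: sample at least twice the highest frequency present."""
    return 2.0 * highest_frequency_hz


if __name__ == "__main__":
    print(quantization_levels(3))     # 3-bit PCM, as in Figure 2.1.1
    print(step_size(1.0, 3))          # assumed 1 V peak amplitude
    print(min_sampling_rate(8000.0))  # assumed 8 kHz highest frequency
```

With these assumptions, a 3-bit converter has 8 levels, and content up to 8 kHz calls for a sampling rate of at least 16 kHz.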

2.2 Frequency Spectrum

Speech processing is explored by performing spectral analysis to characterize the time-varying properties of the signal [7]. In other words, speech processing requires a frequency domain representation of the signal to be analyzed. The Fourier transform, shown in Equation 2.2.1, does exactly that and transforms a time domain signal $x(t)$ into its equivalent frequency domain representation, $X(f)$.

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2 \pi f t}\, dt \qquad (2.2.1)$$

The Fourier transform often reveals characteristics of the signal that would not otherwise be readily apparent in the time domain. For example, the Fourier transform of the time-limited rectangle function becomes a sinc function of infinite extent in the frequency domain, as shown in Figure 2.2.1.

Figure 2.2.1: Fourier Transform of rect(t)

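In the case of a sampled digital waveform, $x[n]$, of $N$ samples, the discrete Fourier transform (DFT) is used. It is described by Equation 2.2.2.

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2 \pi k n / N}, \qquad k = 0, 1, \ldots, N-1 \qquad (2.2.2)$$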

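The power spectral density (PSD) of a waveform builds on the Fourier transform of the waveform and relates the energy of the signal to frequency [8]. This property is described by Rayleigh’s Theorem and is shown in Equation 2.2.3, where $E$ is the energy of the signal $x(t)$ and $X(f)$ is its Fourier transform.

$$E = \int_{-\infty}^{\infty} |x(t)|^2\, dt = \int_{-\infty}^{\infty} |X(f)|^2\, df \qquad (2.2.3)$$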

Using Rayleigh’s Theorem to obtain the PSD, the significant frequency components of the waveform become more apparent [8]. For a discrete-time waveform $x[n]$ of $N$ samples with DFT $X[k]$, Rayleigh’s Theorem becomes Equation 2.2.4.

$$\sum_{n=0}^{N-1} |x[n]|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |X[k]|^2 \qquad (2.2.4)$$
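The discrete form of Rayleigh’s Theorem can be checked numerically. The Python sketch below is an illustration only: it uses a naive N-point DFT written with the standard library (not the project's MatLab code) and an arbitrary test vector, and compares the energy computed in the time domain with the energy computed from the DFT coefficients.

```python
import cmath
import math


def dft(x):
    """Naive N-point discrete Fourier transform (O(N^2), for illustration only)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def energy_time(x):
    """Signal energy as the sum of squared sample magnitudes."""
    return sum(abs(v) ** 2 for v in x)


def energy_frequency(x):
    """Signal energy from the DFT coefficients, scaled by 1/N."""
    spectrum = dft(x)
    return sum(abs(v) ** 2 for v in spectrum) / len(x)


if __name__ == "__main__":
    samples = [1.0, -2.0, 3.5, 0.25, -1.0, 0.0, 2.0, 4.0]  # arbitrary test data
    print(energy_time(samples), energy_frequency(samples))
```

The two printed values agree to within floating-point rounding, which is the property the filter bank approach relies on when it measures power band by band.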

2.3 Digital Filters

There are two types of digital filters available for our purposes: infinite impulse response (IIR) filters and finite impulse response (FIR) filters. We studied the attributes of each and based our decision on them. The following discusses some of the major differences between the two.

2.3.1 IIR filters

IIR filters use feedback, so their output depends on past outputs as well as on present and past inputs. This makes them harder to design and control: if a pole lies outside the unit circle the filter is unstable, and fixed-point implementations can exhibit limit cycles. We want filters that are easy to control and that are guaranteed to be stable. IIR filters generally do not have a linear phase response, whereas FIR filters can be designed with exactly linear phase. Both the poles and the zeros of the transfer function shape the response of an IIR filter. For a comparable cutoff an IIR filter needs fewer coefficients than an FIR filter, and therefore less memory. Figure 2.3.1 shows an IIR filter with a non-linear phase.

Figure 2.3.1: IIR Filter Phase Plot

2.3.2 FIR filters

FIR filters can be designed to have exactly linear phase, so they behave predictably. Unlike IIR filters, FIR filters are always stable and do not exhibit limit cycles. They are stable because the output depends only on present and past values of the input, with no feedback. Another difference is that FIR filters have no analog history, whereas IIR filters are typically derived from analog prototypes. We wanted to use filters that were completely digital, so FIR filters were the logical choice. FIR filters require more multiplications and additions than a comparable IIR filter because they are of higher order. Delays are easier to implement in FIR filters, but FIR filters require more memory than IIR filters because they typically need more coefficients to achieve a sharp cutoff. The response of an FIR filter depends only on the zeros of its transfer function. Figure 2.3.2 shows an FIR filter with a linear phase graph.

Figure 2.3.2: FIR Phase Plot
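To make the statement that an FIR output depends only on present and past inputs concrete, the following Python sketch implements a direct-form FIR filter. It is a minimal illustration with made-up coefficients, not the filter design used in the project; samples before the start of the input are assumed to be zero.

```python
def fir_filter(coefficients, samples):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k], with x[n] = 0 for n < 0."""
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coefficients):
            if n - k >= 0:
                acc += h * samples[n - k]
        output.append(acc)
    return output


if __name__ == "__main__":
    taps = [0.25, 0.5, 0.25]            # symmetric taps give linear phase
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
    print(fir_filter(taps, impulse))    # the impulse response is the tap list
```

Feeding in a unit impulse returns the coefficients themselves followed by zeros, which shows why the impulse response is finite and why no feedback path exists that could make the filter unstable. Each output sample costs one multiplication and one addition per coefficient, which is where the computation and memory cost of a long FIR filter comes from.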

2.4 FPGA

An FPGA is an IC that contains an array of identical logic cells, also known as configurable logic blocks (CLBs), with programmable interconnections [9]. The user can program the functions realized by each logic cell and the connections between the cells. A typical CLB contains two or more function generators, often referred to as look-up tables or LUTs, programmable multiplexers, and D-CE flip-flops. The D-CE flip-flop is a normal D flip-flop with a clock enable (CE) input. As long as CE is asserted, the flip-flop acts like a regular D flip-flop; otherwise it holds its current state. Figure 2.4.1 shows a simplified version of a CLB.

Figure 2.4.1: Simplified Configurable Logic Block (CLB)

The CLB shown in Figure 2.4.1 contains two function generators, two flip-flops, and various

multiplexers for routing signals within the CLB. Each function generator has four inputs and can

implement any function of up to four variables. The function generators are implemented as

lookup tables (LUTs). A four input LUT is essentially a reprogrammable read-only memory

(ROM) with 16 1-bit words [9]. This ROM stores the truth table for the function being generated.

The array of CLBs is then surrounded by a ring of input-output (I/O) interface blocks. Figure

2.4.2 shows the layout of part of a typical FPGA.

Figure 2.4.2: Typical FPGA Layout
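The description of a four-input LUT as a small memory holding a truth table can be modeled in a few lines. The Python sketch below is only a software analogy with hypothetical function names; a real FPGA stores these 16 bits in configuration memory loaded by the vendor tools.

```python
def make_lut(function):
    """Build the 16-entry truth table of a 4-input Boolean function."""
    table = []
    for address in range(16):
        a = (address >> 3) & 1
        b = (address >> 2) & 1
        c = (address >> 1) & 1
        d = address & 1
        table.append(1 if function(a, b, c, d) else 0)
    return table


def lut_read(table, a, b, c, d):
    """Look up the stored output bit; the four inputs form the memory address."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


if __name__ == "__main__":
    # Example function chosen arbitrarily: (a AND b) XOR (c OR d)
    lut = make_lut(lambda a, b, c, d: (a and b) ^ (c or d))
    print(lut_read(lut, 1, 1, 0, 0))
```

Reprogramming the logic function amounts to rewriting the 16 stored bits, while the lookup itself never changes.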

The I/O blocks in Figure 2.4.2 connect the CLB signals directly to the IC pins [9]. Normally an FPGA contains other components such as memory blocks, clock generators, and tri-state buffers, as well as other useful digital components. The user-defined flexibility coupled with the large amount of memory on an FPGA makes it a great choice to handle the large amount of DSP that our project will entail. Unlike the custom application-specific integrated circuit (ASIC) approach, FPGA technology lets one start from a simple design and build upon it. An ASIC is built for one specific project, while an FPGA is reprogrammable and can be used for various applications. The FPGA is programmed through electrically programmable switches, similar to other programmable logic devices [10].

2.5 Summary

We briefly covered the theory that will be essential to complete our project. An understanding of the ADC is essential to gain an accurate representation of the incoming speech signal, knowledge of the frequency spectrum is needed to characterize the significant frequency components of the signal, and digital filtering is used to accomplish the spectral analysis. Knowledge of the inner workings of the FPGA is also vital to the implementation of our project. These techniques will be demonstrated in detail in our methodology.

Chapter 3: Monetary Costs of Project

The budget for our system will be fairly small (under $400) and the bulk of the cost will be due to the FPGA board (about $320). We will also need a microphone with a 3.5mm plug so that we can simply plug it into the input jack on the FPGA board. The other components will be used to drive our outputs. A DC motor will be used to drive the scale model fan propeller. The door will be opened using a servo motor. In order to control the motors using the header pins on the board we will need MOSFETs to switch power. The servo motor will also need a 555 timer circuit to create the control signal it requires for position control. Table 1 shows the list of the parts needed to implement our project along with their corresponding costs.

Table 1: Bill of Materials

Part Needed                  Manufacturer              Cost
FPGA Development Board       Altera                    $269.00
3.5mm Microphone             Logitech                  $6.80
Low Speed DC Motor           HobbyTech                 $2.95
Plastic Fan Propeller        HobbyTech                 $1.25
Servo Motor                  Futaba                    $15.85
LM 324N Quad Op-Amp          Texas Instruments         $0.53
(2) 1N4148 Diode             Fairchild Semiconductor   $0.20
NE555P 555 Timer             Texas Instruments         $0.39
(2) N-Channel MOSFET         Philips                   $1.04
Assorted Lab Resistors/Caps  -                         -

Subtotal                                               $297.77
Taxes                                                  $23.82
Shipping/Handling                                      $45.55
Total                                                  $367.14

Chapter 4: Methodology

Our project will activate a door to open or a fan to turn on by using voice recognition. A word

will be spoken into the microphone and once the word has been recognized and compared a

signal will be sent to either the door or the fan to perform the specified operation. Extensive

digital signal processing (DSP) will be used to process a word that is spoken into the

microphone. Use of digital filters will be necessary to accomplish the DSP.

4.1 Theoretical Concept

We need a simple method to obtain the significant frequency content of a speech signal. Spectrum analyzers are often used to obtain the frequency content of a waveform, but these devices are bulky, expensive, and far more sophisticated than our project needs. A simpler approach to reveal the frequency content of a speech signal is a band pass filter bank. The Fourier transform of a waveform can be thought of as a series of band pass filters with infinitesimally small bandwidths and center frequencies that increase in infinitesimally small steps, so that the output of each filter represents one point on the Fourier transform of the waveform [3]. This is an idealized system that is not realizable, but it does emphasize that a band pass filter bank can be used to expose the frequency spectrum of a waveform. Figure 4.1.1 shows a magnitude plot of a realistic bank of band pass filters [8].

Figure 4.1.1: Band Pass Filter Bank [8]

This array of filters will capture frequencies that fall within their respective bandwidth. Based

upon the outputs of each filter we can make inferences about the frequency content in that

frequency band. The PSD will give a better understanding of the significant frequency content in

the waveform. In order to gain the PSD we can use Rayleigh’s Theorem in Equation 2.2.3 and

take the time average of the energy to get the power in each filter band [8]. The block diagram of

such a system is shown in Figure 4.1.2.

Figure 4.1.2: Power Spectrum using a Filter Bank [8]

In Figure 4.1.2 the signal x(t) is routed through multiple band pass filters. Each filter’s response is the part of the signal lying in the frequency range of the filter. The output of each filter is the input of a squarer block that simply takes the square of the signal. The output signal from any squarer is the part of the instantaneous signal power of the original x(t) that lies in the passband of the band pass filter. The time averager then computes the time-averaged signal power. Each output response Px(fn) is a measure of the signal power of the original x(t) in a narrow band of frequencies centered at fn. Taken together, the Px values are an indication of the variation of the signal power with frequency, or the power spectrum.
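The squarer and time-averager stages of Figure 4.1.2 can be sketched as follows in Python. This is a simplified illustration that assumes the band pass filtering has already been done and that each filter output is available as a list of samples; the function names are hypothetical.

```python
import math


def average_power(filter_output):
    """Square each sample and time-average the result."""
    return sum(v * v for v in filter_output) / len(filter_output)


def power_spectrum(filter_outputs):
    """One average-power value per band pass filter output."""
    return [average_power(band) for band in filter_outputs]


if __name__ == "__main__":
    # A unit-amplitude tone over whole cycles stands in for one filter's output.
    tone = [math.sin(2.0 * math.pi * n / 16.0) for n in range(160)]
    silence = [0.0] * 160
    print(power_spectrum([tone, silence]))
```

For the unit-amplitude sinusoid the average power comes out at one half, and the silent band yields zero, so the list of band powers acts as a coarse picture of where the signal's energy lies.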

In the filter bank model in Figure 4.1.1 all the filters are linearly spaced, meaning they have the same bandwidth. This method wastes a lot of bandwidth because the human ear does not process all frequencies the same way [8]. Figure 4.1.3 shows the average human ear’s perception of the loudness of a constant-amplitude audio tone as a function of frequency [8].

Figure 4.1.3: Human Ear Perception of Loudness vs. Frequency [8]

Humans can only produce speech signals up to about 10 kHz [8]. From Figure 4.1.3 it is evident that the human ear has a nonlinear response to frequencies and is highly sensitive to frequency changes in the first 4 kHz, with a significant roll-off occurring thereafter. Therefore the filter bank model for speech analysis can be improved by logarithmically spacing the filters. Equations for the spacing of these filters will be discussed in further detail in Section 4.2.3.
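As a hedged illustration of what logarithmic spacing means, the Python sketch below places band edges so that each edge is a constant multiple of the previous one. The geometric rule and the function name are assumptions made for this example; the report's own spacing equations may differ. The 300 Hz to 7 kHz range and the count of five filters are the figures quoted in the frequency analysis section.

```python
def log_spaced_edges(f_low, f_high, bands):
    """Band edges with a constant ratio between neighbours (geometric spacing)."""
    ratio = (f_high / f_low) ** (1.0 / bands)
    return [f_low * ratio ** i for i in range(bands + 1)]


if __name__ == "__main__":
    edges = log_spaced_edges(300.0, 7000.0, 5)
    for low, high in zip(edges, edges[1:]):
        print(round(low), round(high), round(high - low))
```

Each successive band is wider than the one before it, so the narrow bands fall at the low frequencies where most of the speech energy is concentrated.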

4.2 Detail Algorithm and Design Approach

In this section of the chapter we discuss in detail the steps we have to take to complete our project successfully. At first we considered using the fast Fourier transform as our design approach, but we decided upon the simpler filter bank processing. We initially wanted to use 10 filters spanning about 10 kHz, but due to speed and memory limitations we opted for only 5 filters spanning about 8 kHz.

4.2.1 Data Acquisition and the ADC

We will be acquiring data by inputting an analog signal from a microphone connected to the mic-in port on the DE2 board. This port is connected to the 24-bit analog to digital converter (ADC) that is embedded on the board. The output of the ADC is the quantized form of the input waveform. This quantized waveform is a long string of ones and zeros in which every 24 bits represent one point of the waveform. We did not need such a high resolution for our project, so we decided to down convert the 24-bit samples to 12 bits. Keeping all 24 bits might cause the output of our filters to overflow. Figure 2.1.1 shows how a 3-bit ADC separates the values on the y-axis; our 12-bit samples will have values ranging from -2048 to 2047. Figure 4.2.1.1 shows an example of how to down convert from a 3-bit resolution to a 2-bit resolution.

Figure 4.2.1.1: Signed Down Conversion
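One common way to perform a signed down conversion is to discard the least significant bits with an arithmetic right shift, which preserves the sign. The Python sketch below assumes that approach; the report does not state the exact operation used on the DE2 board, and the function name is hypothetical.

```python
def down_convert(sample, from_bits, to_bits):
    """Reduce a signed sample's resolution by dropping its least significant bits."""
    # Python's >> on integers is an arithmetic shift, so negative values stay negative.
    return sample >> (from_bits - to_bits)


if __name__ == "__main__":
    # 3-bit to 2-bit, as in Figure 4.2.1.1: the range -4..3 maps onto -2..1.
    print([down_convert(v, 3, 2) for v in range(-4, 4)])
    # 24-bit extremes map onto the 12-bit range -2048..2047.
    print(down_convert(-8388608, 24, 12), down_convert(8388607, 24, 12))
```

The extreme 24-bit values land exactly on the ends of the 12-bit range quoted above, which leaves headroom in a wider accumulator when the filter products are summed.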

Since we are using the DE2 media computer system for our project, the default sampling rate is 48 kHz. For our design this is more than we need, so we down sample by saving only every third value of the sampled waveform. This reduces the sampling rate from 48 kHz to 16 kHz. We came across many projects stating that a 16 kHz sampling rate is ideal for voice recognition. The mic-in ADC on the board saves audio in 2-channel stereo, containing a right channel and a left channel. Figure 4.2.1.2 shows the audio port registers in which the left and right channel data are stored.

Figure 4.2.1.2: Audio port registers [11]

Since voice is only mono quality, we only need to retrieve one channel because the other channel will simply be a copy of the data. We could have used the data from either channel, so we decided to use only the left channel for our project.
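The two data reduction steps described in this section, keeping every third sample and keeping only the left channel, can be sketched in Python as shown below. The representation of stereo audio as a list of (left, right) pairs is an assumption made for the example and does not reflect the DE2 register interface.

```python
def keep_every_third(samples):
    """Down sample by a factor of three, e.g. from 48 kHz to 16 kHz."""
    return samples[::3]


def left_channel(stereo_frames):
    """Extract the left channel from a list of (left, right) sample pairs."""
    return [left for left, _right in stereo_frames]


if __name__ == "__main__":
    one_second = [(n, n) for n in range(48000)]  # dummy stereo data at 48 kHz
    mono = left_channel(one_second)
    print(len(keep_every_third(mono)))           # samples per second after reduction
```

Note that simply discarding samples does not remove content above the new 8 kHz limit, so anything above that frequency in the input would alias unless it is filtered out first.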

4.2.2 Start of Word Detection

A crucial step in recognizing speech is locating the beginning of the spoken word (if there is one). For our system the ADC will sample for 3 seconds after the button has been pressed. We used a windowed approach in which the difference between the absolute averages of two adjacent windows of n points each is compared to a predefined threshold. Once the threshold is surpassed, a pointer is set at the start of the previous window and the samples are saved into memory from this point onward for 8 K samples, or half a second at a 16 kHz sampling rate. The flow chart in Figure 4.2.2.1 shows the design approach for programming the start of word detection.

Figure 4.2.2.1: Flow Chart for Word Detection

Equation 4.2.2.1 shows how to calculate the absolute average of the first window, $W_1$, from the initial sample $a$ to the endpoint $b$ of the window in the vector of sound samples, $s$.

$$W_1 = \frac{1}{b-a} \sum_{i=a}^{b-1} |s[i]| \qquad (4.2.2.1)$$

The average of the second window, $W_2$, is computed from the sound samples starting at $b$ and ending at $c$, where the number of points in the window is equal to the difference of $b$ and $a$ or, equivalently, $c$ and $b$. The computation for the second window is shown in Equation 4.2.2.2.

$$W_2 = \frac{1}{c-b} \sum_{i=b}^{c-1} |s[i]| \qquad (4.2.2.2)$$

The difference between $W_1$ and $W_2$ is compared to the threshold value Th. If it is larger, then the spoken word is considered to start at $a$. If this is not the case, then the average of the oldest window ($W_1$) is discarded and replaced by $W_2$. The algorithm repeats until the word is detected or it reaches the end of the sound samples, in which case no word was detected. The value for the threshold was determined empirically using MatLab, as can be seen in Appendix A.
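The start of word search described above can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the code from the appendices: the function names, the window length, and the threshold in the example are assumptions, and the word start is taken to be the beginning of the earlier of the two windows.

```python
def window_abs_average(samples, start, end):
    """Mean absolute value of samples[start:end]."""
    return sum(abs(s) for s in samples[start:end]) / (end - start)


def find_word_start(samples, window, threshold):
    """Return the index where the word is taken to start, or None if not found."""
    if len(samples) < 2 * window:
        return None
    a = 0
    previous = window_abs_average(samples, a, a + window)
    while a + 2 * window <= len(samples):
        b = a + window
        current = window_abs_average(samples, b, b + window)
        if current - previous > threshold:
            return a                 # start of the previous (quieter) window
        previous = current           # discard the oldest window and slide on
        a = b
    return None


if __name__ == "__main__":
    quiet = [0] * 100
    loud = [1000, -1000] * 50
    print(find_word_start(quiet + loud, window=20, threshold=100))
```

Only one window average has to be recomputed per step because the newer average is carried forward, which keeps the per-sample cost low on an embedded processor.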

4.2.3 Frequency Analysis

Before we can pass the voice samples through the band pass filter bank we must first pass the

values through a pre-emphasis filter. Speech signals normally experience some spectral roll-off

of about 6-dB per octave [3]. This means that the amplitude is halved for each doubling of

frequency. This phenomenon occurs due to the radiation effects of the sound from the mouth [3].

As a result, the majority of the spectral energy is concentrated in the lower frequencies, which

results in an inaccurate estimation of the higher formants. However, the information in the high

frequencies is just as important in understanding the speech as the low frequencies. To reduce

this effect, the speech signal is filtered prior to the filter bank processing. The pre-emphasis filter

makes the outputs of the filters nearly uniform across the spectrum at the expense of lowering the

amplitudes slightly. Equation 4.2.3.1 shows how to calculate the output of the pre-emphasis filter, where $x[n]$ is the input sample and $y[n]$ is the filtered output.

$$y[n] = x[n] - \mu\, x[n-1] \qquad (4.2.3.1)$$

In Equation 4.2.3.1, $\mu$ is a coefficient most commonly in the range of 0.95 to 0.98 for speech applications. We opted for 0.97 for our design. The magnitude response of our pre-emphasis filter is shown in Figure 4.2.3.1.
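A first-order pre-emphasis filter of this kind is commonly written as the current sample minus a scaled copy of the previous sample. The sketch below assumes that form with the 0.97 coefficient and the 16 kHz sampling rate quoted in the text; the function names and the zero initial condition are assumptions for illustration only.

```python
import math


def pre_emphasis(samples, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1], assuming x[-1] = 0."""
    out = []
    previous = 0.0
    for x in samples:
        out.append(x - coeff * previous)
        previous = x
    return out


def magnitude_db(freq_hz, fs_hz=16000.0, coeff=0.97):
    """Magnitude of H(e^{jw}) = 1 - coeff * e^{-jw}, in decibels."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    real = 1.0 - coeff * math.cos(w)
    imag = coeff * math.sin(w)
    return 20.0 * math.log10(math.hypot(real, imag))


filtered = pre_emphasis([1.0, 1.0, 1.0])
gain_at_dc = magnitude_db(0.0)
gain_at_nyquist = magnitude_db(8000.0)
```

Under these assumptions a constant input is almost completely suppressed after the first sample, while content near half the sampling rate is boosted by a few decibels.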

Figure 4.2.3.1: Pre-emphasis Filter Response

From the magnitude plot it is apparent that this filter attenuates the lower frequencies while amplifying the higher frequencies to compensate for the -6 dB per octave roll-off. We originally wanted to use 10 filters to create a filter bank covering the frequencies from 200 Hz to about 10 kHz. This, however, proved to be too ambitious, and we had to remove half of our designed filters so that we were left with 5 filters spanning 300 Hz to about 7 kHz. The filters are logarithmically spaced because of the way the human voice behaves in the frequency domain. In general, most of the significant components of a spoken word are in the lower frequencies of the spectrum. This is why we need filters with smaller bandwidths in the lower part of the spectrum. Normally a human voice falls in a range of about 300 Hz to 14 kHz, but for the most part the significant frequencies for human voice range from 300 Hz to 2 kHz [4]. We

decided to use FIR filters since they have a linear phase without compromising the ability to approximate the ideal magnitude, unlike IIR filters. Unfortunately, they are computationally more expensive to implement, as they require more coefficients than the equivalent IIR filter [3]. After deciding what type of filters to use, we needed to calculate the bandwidth and center frequency of each filter. The main equations used for these calculations are shown below. Equations 4.2.3.2 and 4.2.3.3 were used to calculate the bandwidth of each filter; then, with the results obtained, Equation 4.2.3.4 was used to calculate the center frequency of each filter. In Equation 4.2.3.2, C equals the bandwidth of the first filter, for which we decided on 440 Hz. Then $b_i$ is the bandwidth of a given filter and Q is the total number of filters to be used, which will be 5. The α in Equation 4.2.3.3 represents the logarithmic growth factor, which typically falls between 1 and 2. The value for α was calculated to be 1.45, which would allow us to fit 5 filters into the 7 kHz range.

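$$b_1 = C \qquad (4.2.3.2)$$

$$b_i = \alpha\, b_{i-1}, \quad 2 \le i \le Q \qquad (4.2.3.3)$$

$$f_i = f_1 + \sum_{j=1}^{i-1} b_j + \frac{b_i - b_1}{2} \qquad (4.2.3.4)$$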
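The bandwidth recursion is easy to check numerically. The sketch below uses the values given in the text (a 440 Hz first bandwidth, a growth factor of 1.45, and 5 filters) and assumes the center frequencies are placed so that adjacent bands touch edge to edge. The first center frequency of 520 Hz is a hypothetical value chosen only so the example runs; it is not stated in the report.

```python
def filter_bank(first_bw_hz, growth, num_filters, first_center_hz):
    """Return (bandwidths, centers) for a logarithmically spaced filter bank.

    Bandwidths follow b_1 = C and b_i = growth * b_(i-1). Each center is
    placed so that its band starts where the previous band ends.
    """
    bandwidths = [first_bw_hz]
    for _ in range(1, num_filters):
        bandwidths.append(growth * bandwidths[-1])

    centers = []
    for i in range(num_filters):
        offset = sum(bandwidths[:i]) + (bandwidths[i] - bandwidths[0]) / 2.0
        centers.append(first_center_hz + offset)
    return bandwidths, centers


# first_center_hz=520.0 is an illustrative assumption, not a value from the report.
bandwidths, centers = filter_bank(440.0, 1.45, 5, first_center_hz=520.0)
for bw, fc in zip(bandwidths, centers):
    print(f"center {fc:8.1f} Hz   bandwidth {bw:7.1f} Hz")
```

Because every bandwidth is 1.45 times the previous one, the lowest bands stay narrow, which is the behavior the logarithmic spacing is meant to provide.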

We obtained the coefficients for our filters from MatLab using the fir1 function, which designs a Hamming-window based, linear-phase FIR filter. Another critical choice was the order of the filter, which determines its sharpness, or how effectively it passes only the frequencies within its band. We opted for 50th order filters, which give a good sharpness. The transfer function of the FIR filter is shown in Equation 4.2.3.5.

$$H(z) = \sum_{k=0}^{M} b_k z^{-k} \qquad (4.2.3.5)$$

The transfer function of an FIR filter only possesses a numerator, which corresponds to an all-zero filter. In this equation the b terms are the filter coefficients, $z^{-1}$ is the unit delay element, and M is the order of the filter, which in our case is 50. Equation 4.2.3.6 gives the difference equation for the output of the FIR filter.

$$y[n] = \sum_{k=0}^{M} b_k\, x[n-k] \qquad (4.2.3.6)$$

The direct form of the FIR filter structure is shown in Figure 4.2.3.2.

Figure 4.2.3.2: FIR Direct Form Structure

From Equation 4.2.3.6 and Figure 4.2.3.2 it is apparent that the output of the filter is obtained through the linear combination of the last M + 1 input samples weighted by the b coefficients.
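The direct form structure amounts to a delay line and a weighted sum. The following is a minimal sketch of that idea in Python, not the C routine that was run on the board; the function name and the short coefficient lists are placeholders, whereas the real filters would use 51 coefficients from fir1.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR filter: each output is a weighted sum of recent inputs."""
    delay_line = [0.0] * len(coeffs)      # holds x[n], x[n-1], ..., x[n-M]
    out = []
    for x in samples:
        # Shift the delay line by one sample and insert the newest input.
        delay_line = [x] + delay_line[:-1]
        out.append(sum(b * d for b, d in zip(coeffs, delay_line)))
    return out


# Feeding a unit impulse returns the coefficients themselves.
impulse_response = fir_filter([0.25, 0.5, 0.25], [1.0, 0.0, 0.0, 0.0])
# A two-tap moving average of a ramp.
smoothed = fir_filter([0.5, 0.5], [2.0, 4.0, 6.0])
```

Each output sample costs one multiply and one add per coefficient, which is why a serial software implementation of several 50th order filters becomes expensive.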

Figure 4.2.3.3 shows 5 ideal filters that were generated using MatLab. Appendix A contains the MatLab code that was used to obtain the graph in Figure 4.2.3.3. As you can see, these filters have cutoffs that are impossible to implement using real filters. This is because idealized band pass filters are rectangular functions in the frequency domain, which from Figure 2.2.1 become sinc functions of infinite extent in the time domain. The sinc function is non-causal and infinitely long, so these filters can only be approximated in the time domain. By observing Figure 4.2.3.3 you can see that there are 5 filters logarithmically distributed from about 900 Hz to about 6.5 kHz. By this we mean that the bandwidth of each filter is larger than that of the previous one by the growth factor alpha shown in Equation 4.2.3.3. BW denotes bandwidth in Figure 4.2.3.3. We had to limit the number of filters in our band pass filter bank due to speed and memory constraints.

Figure 4.2.3.3: Idealized Logarithmically Spaced BPF Bank

A more realistic version of the filters that we are going to use in our project is shown in Figure 4.2.3.4. The figure shows the 5 filters with a somewhat sharp cutoff. The MatLab code for graphing Figure 4.2.3.4 is shown in Appendix A.

Figure 4.2.3.4: Realizable FIR BPF Bank

Since the actual filters are not ideal like those in Figure 4.2.3.3, there needs to be some overlap so that there is no spectral loss between the filters. The overlap is not entirely desirable because a frequency “smearing” occurs where frequency content appears in neighboring filters due to the

overlap. This will not significantly affect the recognition because the training words saved to memory and the word to be recognized will be subjected to the same spectral smearing.

4.2.4 Fingerprint Generation

Once we have obtained all our points from the filters we need to calculate the energy for each. The energy is found by using Equation 2.2.4, which is the cumulative summation of the squared output of each filter. A good rule of thumb is to have window lengths between 10 and 30 milliseconds. Since we are sampling at 16 kHz, a 10 millisecond window length means that the energy accumulated over every 160 points becomes one data point in the fingerprint representation for that respective filter. This is essentially the energy windowed over a certain length. Since all of our keywords are short, we saved one half of a second of sound (8000 points) after the detection of the beginning of a word. Once this is accomplished we will have a 500 point representation (100 points for each of the 5 filters) of the energy in the band of each filter at discrete points in time throughout the spoken word. Figure 4.2.4.1 shows a general flow chart of what our system will go through to generate the fingerprint for each word. First the speech signal from the microphone will pass through the ADC, which will digitize the waveform. The output of the ADC will then pass through the pre-emphasis filter before reaching the filter bank. The output of each filter will then be squared to obtain the instantaneous power, which will be added to an accumulator to obtain the energy over 10 millisecond intervals.
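The square-and-accumulate step can be sketched as follows. This is an illustration assuming non-overlapping windows of 160 samples (10 ms at 16 kHz); the function name `band_energy` and the constant test signal are hypothetical.

```python
def band_energy(filtered, window_len=160):
    """Sum of squared filter outputs over consecutive, non-overlapping windows."""
    points = []
    for start in range(0, len(filtered) - window_len + 1, window_len):
        accumulator = 0.0
        for y in filtered[start:start + window_len]:
            accumulator += y * y          # instantaneous power
        points.append(accumulator)        # energy of one window
    return points


# Half a second of a constant unit signal stands in for one filter's output.
fingerprint_one_band = band_energy([1.0] * 8000)
```

With these assumptions, 8000 samples yield 50 energy points per filter. Reaching the 100 points per filter mentioned in the text would require either a shorter 80-sample window or windows that overlap by half, a detail the text does not specify.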

Figure 4.2.4.1: Flow Chart for Fingerprint Extraction

We have to make a reference fingerprint for every word that we need to store in memory. The

reference fingerprint is the average of the individual fingerprints for each training trial. For our

system the user will have to say the keyword three times for the system to gain three individual

fingerprints which will then be averaged together to create a reference fingerprint.
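Averaging the three training fingerprints is a point-by-point mean. A minimal sketch, assuming each fingerprint is stored as a flat list of energy values of equal length (the function name is hypothetical):

```python
def reference_fingerprint(trials):
    """Average several fingerprints of equal length, point by point."""
    length = len(trials[0])
    if any(len(trial) != length for trial in trials):
        raise ValueError("all fingerprints must have the same length")
    return [sum(trial[i] for trial in trials) / len(trials) for i in range(length)]


# Three short made-up fingerprints stand in for three training recordings.
reference = reference_fingerprint([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```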

4.2.5 Comparison Function

In the recognition mode the incoming fingerprint will be compared to each reference fingerprint, and the closest match will be recognized as the spoken word and displayed on the LCD. Thus we need a formula to calculate the difference between the reference fingerprint data points and the spoken word. The Euclidean formula, which is derived from the Pythagorean Theorem, gives the straight line distance between two vectors of n points. Equation 4.2.5.1 shows how to calculate the distance between two vectors p and q.

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (4.2.5.1)$$

From Equation 4.2.5.1 we can see that the distance from p to q is the square root of the sum of the squared differences between each point in p and the corresponding point in q.
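The comparison step can be sketched as a nearest-neighbor search over the stored references. The function names and the three-point fingerprints below are assumptions for illustration; the real fingerprints are far longer.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def closest_word(fingerprint, references):
    """Return (word, distance) for the reference nearest to the fingerprint."""
    best_word, best_distance = None, float("inf")
    for word, reference in references.items():
        distance = euclidean_distance(fingerprint, reference)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance


# Made-up reference fingerprints for two keywords.
references = {"GO": [1.0, 0.0, 0.0], "STOP": [0.0, 1.0, 0.0]}
word, distance = closest_word([0.9, 0.1, 0.0], references)
```

Because the distance is computed point by point, a fingerprint that is merely shifted in time can score poorly even when the word is the same.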

4.2.6 Driving Outputs

We have planned to use only two outputs. One will be the DC motor, which will power a fan, and

the other will be a servo motor that controls the opening of the door. The DC motor only has 2

terminals and simply requires a voltage across the terminals. The servo motor has 3 terminals.

Two of the terminals are connected to Vcc and ground while the last terminal is a position signal

that requires a pulse-width modulated (PWM) voltage. These outputs will be controlled using the

expansion header I/O ports on the DE2 board. The DE2 board provides two 40-pin expansion

headers that connect directly to 36 pins on the Cyclone II FPGA, and also provides DC +5V

(VCC5), DC +3.3V (VCC33), and two GND pins [11]. Each pin on the expansion headers is

connected to two diodes and a resistor that provide protection from high and low voltages.

Depending on which word was recognized the corresponding pins should be set to either output

or input. For example if the word “STOP” was recognized the pin controlling the fan should be

set to input so that no voltage is supplied by that pin. Figure 4.2.6.1 shows the related schematics

for one of the expansion headers (JP1).

Figure 4.2.6.1: Expansion Header I/O Ports [11]

Pins VCC5 and VCC33 are voltage regulated power supplies and can provide higher currents, but they are always high and cannot be controlled to switch the motor on or off. The other header I/O ports can be configured as digital outputs, but they are current limited. They can only provide current up to 8 mA [11]. This is not nearly enough current to drive the DC motor, so we will need a power circuit for it. A simple MOSFET can be used to control the movement

of DC motors or brushless stepper motors directly from computer logic [12]. As the motor load

is inductive, a simple flywheel diode is connected across the inductive load to dissipate any back

EMF generated by the motor when the MOSFET turns it off [12]. An additional silicon diode D1

can also be placed across the channel of a MOSFET switch when using inductive loads for

suppressing overvoltage switching transients and noise giving extra protection to the MOSFET

switch if required [12]. Resistor R2 is used as a pull-down resistor to help pull the output voltage

down to 0 V when the MOSFET is switched off [12]. We will use components from the lab to

build this circuit. Figure 4.2.6.2 shows the DC motor control circuit.

Figure 4.2.6.2: DC Motor Control Using MOSFET

Referring to Figures 4.2.6.1 and 4.2.6.2, the VCC5 pin from the board will be connected to Vdd

on the DC motor control circuit. One of the I/O pins such as I/O A0 will be connected to VIN and

the circuit will be grounded using the GND pin.

Similar to the DC motor, the servo motor will use the VCC5 pin and GND to power the motor

while an I/O pin will be used to control the motor. The servo motor relies on PWM to control its

position. The I/O pin will have to be programmed to generate a PWM signal. Generally the

minimum pulse width will be about 1 millisecond and the maximum pulse width will be 2

milliseconds with a period of 40 milliseconds but the period is not nearly as critical as the pulse

widths. Figure 4.2.6.3 shows the position of the servo motor with respect to the pulse width.

Figure 4.2.6.3: Servo Motor Position vs. Duty Cycle

In order to open the door 90 degrees from the neutral position we will want either a one or two millisecond pulse, depending upon which direction we want it to open. This would give us a full range of rotation of about 180 degrees. The simplest way to accomplish the PWM requirements for the control signal to the servo motor is to use a 555-timer circuit. The circuit in Figure 4.2.6.4 accomplishes the PWM requirements that we need to control the servo motor.

Figure 4.2.6.4: Servo Motor Control Circuit

Using Equation 4.2.6.1 to solve for the time-low of the waveform, we get a time-low of 40.54 milliseconds.

(4.2.6.1)

Using Equation 4.2.6.2 to solve for the minimum time-high of the waveform, which occurs when one of the two timing resistors is shorted by the N-Ch FET, produces a time-high of 1.039 milliseconds.

(4.2.6.2)

Using Equation 4.2.6.2 to solve for the maximum time-high of the waveform, in which the two timing resistors are both equal to 10 kΩ, we get a time-high of about 2.08 milliseconds, twice the minimum value. Thus this circuit will provide the necessary PWM requirement for the positioning of our servo.
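As a rough cross-check of these timing figures, the sketch below assumes the usual 555 timing relation t = 0.693·R·C. The component values are not given in the text: the 0.15 µF capacitor and the 390 kΩ resistor are back-calculated assumptions chosen because they reproduce the quoted 1.039 ms and 40.54 ms, and the two 10 kΩ resistors are assumed to be in series.

```python
TIMING_CONSTANT = 0.693  # approximately ln(2), as used in 555 timing formulas


def rc_time(resistance_ohms, capacitance_farads):
    """Duration of one 555 timing interval in seconds, t = 0.693 * R * C."""
    return TIMING_CONSTANT * resistance_ohms * capacitance_farads


# Assumed component values, back-calculated from the timings in the text.
C_TIMING = 0.15e-6       # farads (assumption)
R_LOW = 390e3            # ohms, sets the time-low (assumption)
R_HIGH_EACH = 10e3       # ohms, each of the two time-high resistors

time_low = rc_time(R_LOW, C_TIMING)
time_high_min = rc_time(R_HIGH_EACH, C_TIMING)       # one resistor shorted
time_high_max = rc_time(2 * R_HIGH_EACH, C_TIMING)   # both resistors in series

for label, seconds in [("time low", time_low),
                       ("time high (min)", time_high_min),
                       ("time high (max)", time_high_max)]:
    print(f"{label:16s} {seconds * 1e3:7.3f} ms")
```

Under these assumptions the pulse width can be switched between roughly 1 ms and 2 ms, which matches the two servo positions described above.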

4.2.7 System Architecture

For our system architecture the main component will be the FPGA board. Connected to the input of the FPGA board will be the microphone, and connected to the output will be the chip that controls the DC motor. The architecture of the FPGA board is too complicated to explain in full detail, but we will mention some of the components that we are going to be using. The core of an FPGA consists of the adaptive logic module (ALM). Figure 4.2.7.1 shows the structure of an ALM and its corresponding adders and registers.

Figure 4.2.7.1 Adaptive Logic Module of a typical FPGA [11]

The ALM is the key to the speed of the FPGA technology and to the efficiency of its architecture. An ALM can implement many functions because it has 8 inputs to its logic block. The ALM can also be separated into smaller LUTs. The components that are implemented on the FPGA board are the ADC, the audio in, and the memory storage, which includes SRAM, SDRAM, and FLASH. The FPGA board has a pre-configured system which we ended up using since it has the ADC configured.

We used the pre-configured media computer system that is available on the board. The original bit resolution of the media computer was 24-bit, but we needed the resolution to be 12-bit. This was accomplished by shifting every 24-bit value by 12 bits to the right. Figure 4.2.7.2 shows a top view of the DE2 board with the components labeled.
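The resolution reduction is a single shift per sample. A minimal sketch, assuming the samples are signed integers (the function name is hypothetical, and the board code itself was written in C):

```python
def to_12_bit(sample_24):
    """Keep the 12 most significant bits of a signed 24-bit sample."""
    # Python's >> is an arithmetic shift, so the sign of the sample is preserved.
    return sample_24 >> 12


largest = to_12_bit(0x7FFFFF)      # largest positive 24-bit value
smallest = to_12_bit(-0x800000)    # most negative 24-bit value
```

The shift discards the 12 least significant bits, so the full 24-bit range maps onto the signed 12-bit range of -2048 to 2047.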

Figure 4.2.7.2 DE2 FPGA Board [11]

Figure 4.2.7.2 above has labeled all of the components on the DE2 board that we will be using.

For example, some of these components include the SDRAM, Mic-in, LCD Module, and the

toggle switches. The SRAM and FLASH memory locations can also be seen on the figure.

4.2.8 Training the System

We are utilizing a total of 12 switches that are embedded on the DE2 FPGA board, from sw0 to sw11. Only one switch can be high at a time, except in recognizing mode; if an undesired switch is high, the LCD display will show an error message. Recognizing mode is active when both sw0 and sw1 are high at the time that the record button is pressed. Three switches are used for every word that is to be stored in memory. The reason for using three switches for each word is that we need to record each individual word three different times and then average the values of the recordings to obtain our reference fingerprint for that word. This procedure has to be done once for each word, which results in doing it four times because we have four different words. The average for each word will be saved at independent addresses in the SDRAM. We are saving the words in SDRAM because we think that the large quantity of values might cause the SRAM to overflow. SDRAM might be a bit slower than SRAM, but due to the processor speed the delay is not significant. Aside from the 12 switches we are also using two pushbuttons. Pushbutton key1 is used for recording and pushbutton key2 is used for playing back the previously recorded word. We decided to have a playback function so that we can listen to the recording and make sure that it was a fair enough sample of the word. Whenever we have to record we must push key1, and the appropriate target address is selected depending on which switch is set high.

When we are training our system to store the fingerprint of each word we will have to run the word three times and then take the average banded energy to acquire our fingerprint. The user will select which word to train by using the sliding switches on the board. Once all the words have been trained the system will be able to run in the recognition mode. For every word we will have 100 points for each of the 5 filters. Once we run this three times we will have the sum of the energy at each of the 500 points; we will then divide each sum by 3 to give us the average energy, which will be our reference fingerprint. After we have our reference fingerprint and we speak a word into the microphone, we will use the distance formula mentioned before to measure the difference between the spoken word and the words stored in memory. If the spoken word matches one of the stored words within a distance threshold, then the board should output a signal corresponding to the command that the word represents. However, if the spoken word

does not match any of the stored words the board should output some kind of message notifying

that the word has no match.

Table 4.2.1 shows the words that we will be using as our inputs and the corresponding outputs

for when the words are recognized. When there is no word match the message “WORD NOT

RECOGNIZED” will be displayed on the LCD display that is embedded on the DE2 board.

Table 4.2.1 Inputs and Outputs

Word Match    Output
GO            Turn fan on
STOP          Turn fan off
OPEN          Open door
CLOSE         Close door
NO MATCH      “WORD NOT RECOGNIZED”
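The mapping in Table 4.2.1 can be expressed as a simple lookup combined with the distance threshold. In the sketch below the names `COMMANDS`, `output_for`, and `max_distance` are hypothetical, and the threshold values in the usage lines are made up for illustration.

```python
COMMANDS = {
    "GO": "Turn fan on",
    "STOP": "Turn fan off",
    "OPEN": "Open door",
    "CLOSE": "Close door",
}
NO_MATCH_MESSAGE = "WORD NOT RECOGNIZED"


def output_for(word, distance, max_distance):
    """Return the action for a matched word, or the no-match message."""
    if word in COMMANDS and distance <= max_distance:
        return COMMANDS[word]
    return NO_MATCH_MESSAGE


accepted = output_for("GO", distance=0.4, max_distance=1.0)
rejected = output_for("GO", distance=2.5, max_distance=1.0)
```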

A system diagram is shown in Figure 4.2.8.1, where the first image represents the microphone, which is connected to the audio in port on the FPGA board. The second image, labeled FPGA board, represents the whole FPGA board, which contains the ADC and the audio in port and is where we implemented our filtering design. You can see the SRAM and SDRAM modules on the DE2 board system diagram. These modules are controlled by the SRAM and SDRAM controllers respectively, which can also be seen in the diagram. The 16x2 LCD display is controlled by the LCD port. We use the LCD display to show the messages for our program.

Figure 4.2.8.1: System Diagram [11]

At first we attempted to build our own system to use on the FPGA board using the Quartus II

program. We wanted to build our own system so that we could choose which components from

the system diagram to use for our project. We ended up giving up on building our own system

because we were getting many errors that we could not fix when running the system. After we

started using the media computer system that we found on the Altera website we started with

testing the switches, pushbuttons, and the LCD display. We successfully tested the ports that we

planned to use in our project before actually running our program on the DE2 board. The Altera

monitor program was very useful when we needed to find which memory location to use for our

values. Using the memory tab on the monitor program we were able to see where the buffer

stored all the values for the spoken words. The values on the buffer were the ones that we needed

to store somewhere in memory for future use.

4.3 Work Breakdown

Table 4.3.1 shows the division of work for our project. Each task is listed with its start and completion dates along with the team member who participated in accomplishing that task. It is followed by the Gantt chart for our project.

Table 4.3.1: Division of Work

Task                                  Start Date  End Date    Team Member
MatLab Implementation                 8/6/2012    11/17/2012  Tyler/Ismael
Data Acquisition Using Mic            9/10/2012   9/21/2012   Tyler
Using 'Analog Input' Function         9/10/2012   9/21/2012   Tyler
Establish Sampling Variables          9/17/2012   9/21/2012   Tyler
User Interface for Template Storage   9/20/2012   9/29/2012   Tyler
Prompts to Select Word                9/20/2012   9/26/2012   Tyler
Saves Word Template Storage           9/25/2012   9/29/2012   Tyler
Quantization Function                 10/1/2012   10/11/2012  Tyler
Test Bit Resolutions                  10/1/2012   10/11/2012  Tyler/Ismael
Word Detection                        10/10/2012  10/20/2012  Tyler
Window Averaging Function             10/10/2012  10/20/2012  Tyler
Threshold Calculations                10/10/2012  10/20/2012  Tyler/Ismael
DSP                                   8/6/2012    11/17/2012  Tyler/Ismael
Pre-emphasis Filter                   11/12/2012  11/17/2012  Tyler/Ismael
FIR Filters                           8/6/2012    8/29/2012   Tyler/Ismael
Cutoff Frequencies                    8/6/2012    8/29/2012   Tyler/Ismael
Filter Coefficients                   8/27/2012   8/29/2012   Tyler/Ismael
Downsampling                          10/10/2012  10/13/2012  Tyler/Ismael
Resolution Downconversion             10/15/2012  10/19/2012  Tyler/Ismael
DE2 Board Implementation              8/27/2012   12/1/2012
Data Acquisition Using Mic In         8/27/2012   9/25/2012   Ismael
User Interface (switches, buttons)    8/27/2012   9/15/2012   Ismael
Wolfson CODEC                         9/14/2012   9/25/2012   Ismael
Memory Allocation                     9/4/2012    10/3/2012   Ismael
DSP                                   8/27/2012   11/21/2012  Tyler/Ismael
Pre-emphasis Filter                   11/16/2012  11/21/2012  Tyler/Ismael
Downsampling                          9/25/2012   10/2/2012   Ismael
Resolution Downconversion             10/10/2012  10/16/2012  Ismael
Band pass Filter Bank                 8/27/2012   11/6/2012   Tyler/Ismael
Filter Coefficients                   8/27/2012   8/29/2012   Tyler/Ismael
Create FIR Function in C              9/24/2012   11/2/2012   Tyler/Ismael
Compare to MatLab Outputs             11/2/2012   11/6/2012   Tyler
Fingerprint Generation                11/5/2012   11/20/2012  Tyler/Ismael
Accumulation of Sampled Data          11/5/2012   11/10/2012  Tyler/Ismael
Average of Multiple Trials            11/12/2012  11/20/2012  Tyler/Ismael
Comparison Function                   10/12/2012  12/1/2012   Tyler/Ismael
Euclidean Distance Function           10/12/2012  10/20/2012  Tyler/Ismael
Word Matching Function                10/22/2012  10/26/2012  Ismael
Threshold Calculation                 11/20/2012  12/1/2012   Ismael
Signal to Outputs                     11/21/2012  11/23/2012  Ismael
Outputs                               10/15/2012  11/6/2012   Tyler/Ismael
DC Motor                              10/15/2012  11/6/2012   Tyler
Configure GPIO Pins                   10/15/2012  11/6/2012   Ismael
Opamp Buffer                          11/1/2012   11/6/2012   Tyler
High Side MOSFET Switch               11/1/2012   11/6/2012   Tyler
Servo Motor                           10/15/2012  11/6/2012   Tyler/Ismael
555 Timer for PWM Signal              10/22/2012  10/30/2012  Tyler
Configure GPIO Pins                   10/15/2012  10/25/2012  Ismael
High Side MOSFET Switch               11/1/2012   11/6/2012   Tyler
Project Testing / Refinement          11/26/2012  12/7/2012   Tyler/Ismael
Debugging & Refinement                11/26/2012  12/7/2012   Tyler/Ismael

Chapter 5: Parts Ordering

Table 5.1 shows the dates ordered and arrival dates of each of the parts we needed for our

project. Receipts for all the parts ordered are shown in Appendix B.

Table 5.1: Shipping Status

Part Needed              Date Ordered     Date Received
Altera DE2 FPGA Board    May 5th, 2012    May 10th, 2012
Logitech Microphone      May 8th, 2012    May 20th, 2012
Low Speed DC Motor       May 5th, 2012    May 15th, 2012
MOSFETs and Op-Amps      May 5th, 2012    May 17th, 2012
Servo Motor              May 7th, 2012    May 11th, 2012

Chapter 6: Results and Findings

This chapter will be focused on the outcome of our project. The first section will cover the

extensive analysis of our system and the keywords in MatLab and the second section will

proceed with the system testing results of our keywords on the Altera DE2 FPGA board. Lastly,

we will discuss some of our shortcomings of the project and our proposed improvements to

alleviate those issues.

6.1 MatLab Findings

In order to add some clarity to our project we created a test bench in MatLab that modeled our system. If we had simply started coding our project onto the board we would have essentially been blind, not knowing what to expect as far as frequency content, threshold values, and timing values. Using MatLab we were able to save word templates just as we would when training our board. This allowed us to modify code and view the changes to the reference fingerprints. Consistency was a big issue at first because we were simply providing new sound samples through the microphone each time, and it became apparent that a word is never said exactly the same way twice. This is what makes speech recognition so difficult. We needed some consistency, and saving three sample templates of each word was a start. Figure 6.1.1 is the time-domain plot of the word ‘Go’, which will be used to start the DC motor.

Figure 6.1.1: ‘Go’ in Time-Domain

From the plot of “Go” it is evident that the word begins with the humming of the ‘g’ sound, followed by the

much more powerful ‘o’ vowel. The output of the filters for the word ‘Go’ shown in Figure 6.1.2

reveals that significant content in the signal occurs in the fifth filter (spanning about 4.8 kHz to 7

kHz) at the beginning of the word. This is consistent with the high frequency consonant ‘g’.

Also, significant content in the signal occurs in the second filter (spanning about 1.2 kHz to 2

kHz) at about 250 ms into the sample. This lower frequency is consistent with the ‘o’ vowel.

Figure 6.1.2: Output of Filters vs. Time for ‘Go’

Figure 6.1.3 is the time-domain plot of the word ‘Stop’, which will be used to turn off the DC motor. The high frequency hissing ‘s’ sound is clearly at the start of the word, followed by a soft ‘t’ and then the ‘op’ phoneme.

Figure 6.1.3: ‘Stop’ in Time-Domain

The output of the filters for the word ‘Stop’ shown in Figure 6.1.4 reveals that significant content

in the signal occurs in the fifth filter (spanning about 4.8 kHz to 7 kHz) at the beginning of the

word. This is consistent with the high frequency ‘s’ consonant. Also, the ‘t-o-p’ sound is present in an appreciable amount across all of the filters at about 270 milliseconds.

Figure 6.1.4: Output of Filters vs. Time for ‘Stop’

Figure 6.1.5 is the time-domain plot of the word ‘Open’ which will be used to initiate the servo

motor to signify the opening of a door. Interestingly enough the same ‘op’ phoneme is present as

it was at the end of the keyword ‘stop’. The word then ends with the low frequency consonant

‘n’ sound.

Figure 6.1.5: ‘Open’ in Time-Domain

The output of the filters for the word ‘Open’ is shown in Figure 6.1.6. The beginning ‘op’ sound

is present across all of the filters in an appreciable amount just as it was in the word ‘Stop’. The

‘n’ sound does not even appear on our filter outputs. This is most likely because ‘n’ is one of the lowest-frequency sounds and is being attenuated by the pre-emphasis filter.

Figure 6.1.6: Output of Filters vs. Time for ‘Open’

Figure 6.1.7 is the time-domain plot of the word ‘Close’ which will be used to return the servo

motor to its original position, signifying the closing of a door. There is little empty space within the word ‘Close’, unlike the other keywords, in which the different sounds were clearly visible.

Figure 6.1.7: ‘Close’ in Time-Domain

The output of the filters for the word ‘Close’ is shown in Figure 6.1.8. The word begins with very low amplitude outputs from the first two filters. That is then followed by significant content in the last three filters.

Figure 6.1.8: Output of Filters vs. Time for ‘Close’

6.2 Testing Results

When testing our system on the DE2 Altera board we ran three rounds of ten trials for each word

by each of us. The results of the voice recognition testing are shown in Table 6.2.1.

Table 6.2.1: Results of Test

From these tests we can infer that our system does not treat all speakers equally. Tyler finished with an average recognition rate of 62.5%, while Ismael finished with an average recognition rate of only 59.25%. Ismael’s voice is noticeably deeper, so we suspect this was a contributing factor to the lower recognition rate. The word with the highest accuracy was ‘Stop’, at an average rate of 73.5%, while the word with the lowest accuracy was ‘Open’, at an average rate of 53.5%. In one of the rounds with ‘Stop’, Tyler achieved a recognition rate of 90% (9 out of 10), but there were rates as low as 40% (4 out of 10) for ‘Open’. While this accuracy is not practical for a real-world application, we are very pleased with these results given that we did not have any sophisticated pattern-matching algorithm. When testing, we could usually tell whether the word was going to be correctly identified just by how loudly we spoke and by how consistently we matched the speed at which we had trained the word. We tested this theory briefly with our smart phones by recording an input word that yielded a correct output. When we played back the recorded word, the recognition rates were repeatably in the 80% range, so the system was consistent given a consistent input.
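The dependence on loudness and timing noted above can be made concrete with a small sketch. The C code below is not the comparison function used on the board; it is an illustration, under our own assumptions, of comparing two single-band fingerprints after scaling each to unit total energy and then trying a few window shifts. The names fp_normalize, fp_distance_at_lag and fp_best_distance, and the use of double-precision energies, are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Scale a fingerprint of non-negative energies so that it sums to 1.0,
 * which removes the overall loudness. An all-zero fingerprint is left as is. */
void fp_normalize(double *fp, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += fp[i];
    if (total > 0.0)
        for (size_t i = 0; i < n; ++i)
            fp[i] /= total;
}

/* Mean squared difference between ref[i] and test[i + lag], taken over the
 * windows where both indices are valid. Returns DBL_MAX if nothing overlaps. */
double fp_distance_at_lag(const double *ref, const double *test,
                          size_t n, int lag)
{
    double sum = 0.0;
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        long j = (long)i + lag;
        if (j < 0 || j >= (long)n)
            continue;
        double d = ref[i] - test[j];
        sum += d * d;
        ++count;
    }
    return count ? sum / (double)count : DBL_MAX;
}

/* Smallest distance found over all shifts in [-max_lag, +max_lag]. */
double fp_best_distance(const double *ref, const double *test,
                        size_t n, int max_lag)
{
    assert(n > 0 && max_lag >= 0);
    double best = DBL_MAX;
    for (int lag = -max_lag; lag <= max_lag; ++lag) {
        double d = fp_distance_at_lag(ref, test, n, lag);
        if (d < best)
            best = d;
    }
    return best;
}
```

For a test fingerprint that has the same shape as the reference but is twice as loud and one window late, the distance at zero shift is large, while the best distance over shifts of up to two windows is zero. The shift range should stay small compared with the fingerprint length, since large shifts leave few overlapping windows to compare.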

6.3 System Improvements

Our system could become a practical solution given some improvement. The first area that needs to be addressed is the processing speed. In its current state the board takes about one minute and 20 seconds to complete all the necessary computation and make a decision. Implementing the filters in hardware as a parallel structure would greatly cut down on the latency that the serial software structure causes. Also, we could convert to IIR filters so that the equivalent filter order would not need to be so high, yielding fewer computations. In addition, using a fixed-point representation would be a great improvement over the floating-point math currently used by our filters. A method that allows a variable length for the input sounds, instead of the fixed length we currently have in place, would drastically improve performance on very short or very long words. Another area that has much room for improvement is our comparison function. Our system relies on accumulated energy over windowed bands but does not incorporate any type of pattern matching or linear regression technique in which the best match is found. The comparison function that we have in place now is very susceptible to shifts in the timing of the words, which put the peaks and troughs of the fingerprints out of place and introduce a lot of error. Lastly, a normalization technique would help to dampen the variability due to the loudness of the spoken word.
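As a rough illustration of the fixed-point suggestion, the sketch below filters 16-bit samples with Q15 coefficients using only integer arithmetic. It is a minimal sketch rather than our implementation: the function names q15_from_float and fir_q15 are hypothetical, samples are assumed to fit in 16 bits, and inputs before the first sample are taken as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a coefficient in [-1, 1) to Q15 (value * 2^15), rounding to the
 * nearest integer and saturating at the 16-bit limits. */
int16_t q15_from_float(float c)
{
    float scaled = c * 32768.0f;
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;
    if (scaled > 32767.0f)
        return INT16_MAX;
    if (scaled < -32768.0f)
        return INT16_MIN;
    return (int16_t)scaled;
}

/* Direct-form FIR filter: y[i] = sum over k of h[k] * x[i - k].
 * Products are accumulated in 64 bits, then rounded back to 16 bits. */
void fir_q15(const int16_t *h, size_t taps,
             const int16_t *x, int16_t *y, size_t n)
{
    assert(taps > 0);
    for (size_t i = 0; i < n; ++i) {
        int64_t acc = 0;
        for (size_t k = 0; k < taps && k <= i; ++k)
            acc += (int32_t)h[k] * (int32_t)x[i - k];
        /* Remove the 2^15 coefficient scaling, rounding half away from zero. */
        acc = (acc >= 0 ? acc + 16384 : acc - 16384) / 32768;
        if (acc > INT16_MAX)
            acc = INT16_MAX;
        if (acc < INT16_MIN)
            acc = INT16_MIN;
        y[i] = (int16_t)acc;
    }
}
```

Coefficients such as those in B1 would be converted once with q15_from_float (for example, -0.001085 becomes -36), after which the inner loop needs only integer multiplies and adds.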

Chapter 7: Conclusion

After applying the background theory, performing analysis with a MATLAB prototype, and implementing a prototype on the Altera DE2 board, it is evident that a speech recognition system can indeed be implemented using FPGA technology. We achieved all of our proposed goals and objectives in the time allotted for our project. Improvements to our current system are needed in order to make it practical for the consumer. We set out to create a simple solution to speech recognition, and our results were modest. Speech is a complex problem, and as is often the case, complex problems require complex solutions in order to achieve accurate results. Regardless, we have learned a great deal about the density of speech and would like to further our interest in the subject by continuing to improve upon our system.




APPENDIX A

MATLAB Code

Code 1: Idealized Band Pass Filter Plots

%% Idealized Band Pass Filter Bank
alpha = 1.45;   % logarithmic growth coefficient of filters
% Bandwidths of each of the 10 filters
b = [100 145 210.25 304.863 442.05 640.97 929.41 1347.65 1954.09 2833.43];
% Center frequencies of each BPF
fc = [250 372.5 550.13 807.68 1181.14 1722.65 2507.84 3646.37 5297.24 7691];
f = 0:.5:10000;   % Frequency range
%-------------------------------------------------------------------------%
% Idealized Magnitude Responses
f1 = heaviside(f-(fc(1)-b(1)/2)) - heaviside(f-(fc(1)+b(1)/2));
f2 = heaviside(f-(fc(2)-b(2)/2)) - heaviside(f-(fc(2)+b(2)/2));
f3 = heaviside(f-(fc(3)-b(3)/2)) - heaviside(f-(fc(3)+b(3)/2));
f4 = heaviside(f-(fc(4)-b(4)/2)) - heaviside(f-(fc(4)+b(4)/2));
f5 = heaviside(f-(fc(5)-b(5)/2)) - heaviside(f-(fc(5)+b(5)/2));
f6 = heaviside(f-(fc(6)-b(6)/2)) - heaviside(f-(fc(6)+b(6)/2));
f7 = heaviside(f-(fc(7)-b(7)/2)) - heaviside(f-(fc(7)+b(7)/2));
f8 = heaviside(f-(fc(8)-b(8)/2)) - heaviside(f-(fc(8)+b(8)/2));
f9 = heaviside(f-(fc(9)-b(9)/2)) - heaviside(f-(fc(9)+b(9)/2));
f10 = heaviside(f-(fc(10)-b(10)/2)) - heaviside(f-(fc(10)+b(10)/2));
%-------------------------------------------------------------------------%
plot(f,f1,f,f2,f,f3,f,f4,f,f5,f,f6,f,f7,f,f8,f,f9,f,f10);
axis([0 9500 0 2]);
xlabel('Frequency (Hertz)');
ylabel('Magnitude');

Code 2: FIR Filter Bank

%% FIR BandPass Filter Bank
%--------------------------------------------------------------------------
n = 5;          % Number of filters
alpha = 1.45;   % logarithmic growth coefficient of filters
% Bandwidths of each of the 5 filters
%==================================================================
b = [442.05 640.97 929.41 1347.65 1954.09];
% Center frequencies of each BPF
fc = [1181.14 1722.65 2507.84 3646.37 5297.24 7450];
%--------------------------------------------------------------------------
% Calculate -3dB cutoff frequencies using center freq & bandwidth
%==================================================================
fcut = zeros(1,n+1);        % Preallocate cutoff freq vector
fcut(1) = fc(1) - b(1)/2;   % Solve for cutoff freqs using center
for i = 2:(n+1)             % freqs and bandwidths
    fcut(i) = fcut(i-1)+b(i-1);
end
%--------------------------------------------------------------------------
% Normalize frequencies for coefficients calculation
%==================================================================
Nq = 8000;          % Nyquist frequency of 8kHz
fs = 2*Nq;          % sampling frequency of 16kHz
fn_c = fc/Nq;       % normalize center freq's
fn_ct = fcut/Nq;    % normalize 3-dB cutoff freq's
%--------------------------------------------------------------------------
% Calc filter coefficients for Nth-order FIR bandpass filters
%--------------------------------------------------------------------------
x = 500;                % Number of points for plots
N = 50;                 % Order of filter
B = zeros(6,(N+1));     % Preallocate coefficients array
%--------------------------------------------------------------------------
%1st Filter
%===========
B1 = fir1(N,[fn_ct(1) fn_c(2)]);    % Calc coeff's for 1st filter
[H, F1] = freqz(B1,1,x,fs);         % Frequency response of the filter
H1 = abs(H);                        % Absolute value for mag resp
M1 = 20*log10(H1);                  % Mag resp in dB
%--------------------------------------------------------------------------
%2nd Filter
%===========
B2 = fir1(N,[fn_ct(2) fn_c(3)]);    % Calc coeff's for 2nd filter
[H, F2] = freqz(B2,1,x,fs);         % Frequency response of the filter
H2 = abs(H);                        % Absolute value for mag resp
M2 = 20*log10(H2);                  % Mag resp in dB
%--------------------------------------------------------------------------
%3rd Filter
%===========
B3 = fir1(N,[fn_ct(3) fn_c(4)]);    % Calc coeff's for 3rd filter
[H, F3] = freqz(B3,1,x,fs);         % Frequency response of the filter
H3 = abs(H);                        % Absolute value for mag resp
M3 = 20*log10(H3);                  % Mag resp in dB
%--------------------------------------------------------------------------
%4th Filter
%===========
B4 = fir1(N,[fn_ct(4) fn_c(5)]);    % Calc coeff's for 4th filter
[H, F4] = freqz(B4,1,x,fs);         % Frequency response of the filter
H4 = abs(H);                        % Absolute value for mag resp
M4 = 20*log10(H4);                  % Mag resp in dB
%--------------------------------------------------------------------------
%5th Filter
%===========
B5 = fir1(N,[fn_ct(5) fn_c(6)]);    % Calc coeff's for 5th filter
[H, F5] = freqz(B5,1,x,fs);         % Frequency response of the filter
H5 = abs(H);                        % Absolute value for mag resp
M5 = 20*log10(H5);                  % Mag resp in dB
%--------------------------------------------------------------------------
for j = 1:length(B1)
    B(1,j) = B1(1,j);
    B(2,j) = B2(1,j);
    B(3,j) = B3(1,j);
    B(4,j) = B4(1,j);
    B(5,j) = B5(1,j);
end
% Plot all filters on the same plot
figure;
hold on
plot(F1,M1)
plot(F2,M2)
plot(F3,M3)
plot(F4,M4)
plot(F5,M5)
hold off
title('FIR Filter Bank');   % Label graph
xlabel('Frequency (Hertz)');
ylabel('Magnitude (dB)');
ylim([-200 50])
%--------------------------------------------------------------------------
% Save coefficients
save('FIR_Co_16khz','B1','B2','B3','B4','B5','B','N');
fileID1 = fopen('FIR_Coeff','w');
for i = 1:5
    for j = 1:N+1
        if j == N+1
            fprintf(fileID1,'%f \n \n',B(i,j));
        else
            fprintf(fileID1,'%f,',B(i,j));
        end
        if (mod((j),10)) == 0
            fprintf(fileID1,'\n');
        end
    end
end
fclose(fileID1);

Code 3: Pre-emphasis Filter

% Pre-emphasis Filter
Fs = 16000;
N = 500;                    % Number of frequency points (assumed; N is not defined in this listing)
B = [1 -0.97];              % Pre-emphasis coefficients
[H,F] = freqz(B,1,N,Fs);
H = abs(H);                 % Absolute value for mag resp
M = 20*log10(H);            % Mag resp in dB
plot(F,M);
%freqz([1 -0.97],1,);
title('Pre-emphasis Filter Response');   % Label graph
xlabel('Frequency (Hertz)');
ylabel('Magnitude (dB)');
grid on;

Code 4: FFT Analysis

clear all
close all
close all hidden
clc
%-------------------------------------------------------------------------
% Variables
Fs = 48000;         % Sampling frequency (in hertz)
Ts = 2;             % Sampling time (in sec)
l = 1;              % Word size (in sec)
ds = 2;             % Downsampling conversion factor
trials = 3;         % Number of word templates
bRes = 12;          % Bit resolution
Q = 2/(2^(bRes));   % Quantization step size
w = 100;            % Window length
thres = 0.005;      % Threshold of word recognition
prompts1 = 'Press ENTER and begin saying training word. \n';
prompts2 = 'Press ENTER and repeat the training word. \n';
prompts3 = 'Press ENTER and say training word one last time. \n';
word = zeros(1,l*Fs*Ts/ds);     % Preallocate word
X2fft = zeros(1, Fs/(2*ds));    % Preallocate accumulated PSD
names = {'OPEN','CLOSE','GO','STOP'};   % Stored template variable names
labels = {'Open','Close','GO','STOP'};  % Plot title for each keyword
%-------------------------------------------------------------------------
% Prompt User Which FFT to Create
%=========================================================================
disp('Voice Recognition Training.');
disp('[1] Open');
disp('[2] Close');
disp('[3] GO');
disp('[4] STOP');
keywd = int32(input('Please select which PSD to display \n'));
%-------------------------------------------------------------------------%
% Detect Beginning of the Word
%=========================================================================
for k = 1:trials
    % Load the k-th stored template of the selected word (e.g. OPEN1, OPEN2)
    %----------------------------------------------------------
    S = load(sprintf('%s%d', names{keywd}, k));
    qmic = S.(names{keywd});
    ptr = 1;                                % Initialize pointer.
    ave1 = mean(abs(qmic(ptr:ptr+w)));      % Initialization of average windows.
    ave2 = ave1;
    % Go through the sound until the difference between the average of two
    % adjacent windows is significant.
    check = 1;
    detected = 1;                           % Cleared if no word is found
    while check
        if abs(ave1-ave2) > thres
            check = 0;
        end
        if (ptr + 2*w > Ts*Fs/ds)
            check = 0;
            disp '[!] Error: No Word Detected.';
            detected = 0;
        end
        if check
            ptr = ptr + w;
            ave2 = ave1;
            ave1 = mean(abs(qmic(ptr:ptr+w)));
        end
    end
    if detected
        word = qmic(ptr:((ptr-1)+ l*Fs/ds));    % Store the detected word
        %-----------------------------------------------------------------%
        % Perform DSP
        %=================================================================
        Xfft = abs(fft(word));                  % Find FFT of data
        X2fft = Xfft(1:end/2).^2 + X2fft;       % Square FFT data to get PSD
    end
end
X2fft = X2fft/trials;       % Average the PSDs
mag = 10*log10(X2fft);      % Convert to dB (X2fft is already a power quantity)
if detected
    figure;
    plot(mag);
    title(sprintf('PSD of Waveform "%s"', labels{keywd}));   % Label graph
    xlabel('Frequency (Hertz)');
    ylabel('Magnitude (dB)');
end

Code 5: Saving Word Templates

%% Speech Storage and Waveform Testing
%--------------------------------------------------------------------------
% This m-file prompts the user to select a 'training' word. The training
% word will be stated and stored 3 times. This allows us to have
% consistent testing since words can never be repeated exactly the same.
% This way we can perform FFT's and have a knowledge of what our filter
% output should look like.
%--------------------------------------------------------------------------
% Clear old graphs and command history
clear all
close all
close all hidden
clc
%--------------------------------------------------------------------------
% Variables
Fs = 48000;         % Sampling frequency (in hertz)
Ts = 3;             % Sampling time (in sec)
l = 1;              % Length (sec) of stored word after detection
ds = 2;             % Downsampling conversion factor
bRes = 12;          % Bit resolution
Q = 2/2^(bRes);     % Quantization step size
trials = 1;         % Number of recording trials
sampL = Ts*Fs/ds;   % Number of samples in recording
OPEN = zeros(trials,sampL);     % Preallocate words
CLOSE = zeros(trials,sampL);
GO = zeros(trials,sampL);
STOP = zeros(trials,sampL);
FAN = zeros(trials,sampL);
prompts1 = 'Press ENTER and begin saying training word. \n';
prompts2 = 'Press ENTER and repeat the training word. \n';
prompts3 = 'Press ENTER and say training word one last time. \n';
%--------------------------------------------------------------------------
% Sample Mic, Decimate, & Quantize
%=========================================================================
% Set up Mic Input
AI = analoginput('winsound');
addchannel(AI, 1);
set(AI, 'SampleRate', Fs);
set(AI, 'SamplesPerTrigger', Ts*Fs);
disp('Voice Recognition Training.');
disp('[1] Open');
disp('[2] Close');
disp('[3] GO');
disp('[4] STOP');
disp('[5] FAN');
keywd = int32(input('Please select one of the above keywords to train. \n'));
%-------------------------------------------------------------------------
start(AI);                  % Start the acquisition
mic = getdata(AI);          % Retrieve all the data
dmic = decimate(mic,ds);    % Downsample the data
[x, qmic] = quantiz(dmic, -1:Q:1-Q, -1:Q:1);    % Quantize the sound
% Store, plot, and save only the selected word so that the templates of
% the other words are not overwritten with zeros.
if keywd == 1
    OPEN(1:length(qmic)) = qmic;
    plot(OPEN)
    save('OPEN1', 'OPEN');
end
if keywd == 2
    CLOSE(1:length(qmic)) = qmic;
    plot(CLOSE)
    save('CLOSE1', 'CLOSE');
end
if keywd == 3
    GO(1:length(qmic)) = qmic;
    plot(GO)
    save('GO1', 'GO');
end
if keywd == 4
    STOP(1:length(qmic)) = qmic;
    plot(STOP)
    save('STOP1', 'STOP');
end
if keywd == 5
    FAN(1:length(qmic)) = qmic;
    plot(FAN)
    save('FAN1', 'FAN');
end

Code 6: FIR Filter Testing

%% FIR Filter Testing
%--------------------------------------------------------------------------
% Clear old graphs and command history
clear all
close all
close all hidden
clc
%--------------------------------------------------------------------------
% Load FIR Filter Coefficients
load('FIR_Co_24khz');   % Coefficients and number of taps (N)
%-------------------------------------------------------------------------
% Variables
Fs = 48000;         % Sampling frequency (in hertz)
Ts = 3;             % Sampling time (in sec)
l = 2;              % Length (sec) of stored word after detection
ds = 2;             % Downsampling conversion factor
trials = 3;
bRes = 12;          % Bit resolution
rec = l*Fs/ds;      % Number of samples in recording
n = 5;              % Number of FIR filters
Q = 2/(2^(bRes));   % Quantization step size
w = 100;            % Window length
thres = 0.0005;     % Threshold of word recognition
pts = 100;          % Number of points for each filter fingerprint
win = floor((rec-(N+1))/(pts)+0.5);     % Size of each window
word = zeros(1,Fs/ds);  % Preallocate word
FP1 = zeros(n,pts);     % Preallocate fingerprint arrays
FP2 = zeros(n,pts);
FP3 = zeros(n,pts);
names = {'OPEN','CLOSE','GO','STOP','FAN'};     % Stored template variable names
%-------------------------------------------------------------------------
% Prompt User Which Fingerprint to Generate
%=========================================================================
disp('Voice Recognition Training.');
disp('[1] Open');
disp('[2] Close');
disp('[3] GO');
disp('[4] STOP');
disp('[5] FAN');
keywd = int32(input('Please Select a Reference Fingerprint to Generate\n'));
%-------------------------------------------------------------------------%
check = 1;
failed = 0;             % Set if no word is detected
% Load the first stored template of the selected word (e.g. OPEN1)
S = load([names{keywd} '1']);
qmic = S.(names{keywd});
%-------------------------------------------------------------------------%
% Detect Beginning of the Word
%=========================================================================
ptr = 1;                                % Initialize pointer.
ave1 = mean(abs(qmic(ptr:ptr+w)));      % Initialization of average windows.
ave2 = ave1;
% Go through the sound until the difference between the average of two
% adjacent windows is significant.
while check
    if abs(ave1-ave2) > thres
        check = 0;
    end
    if (ptr + 2*w > Ts*Fs/ds)
        check = 0;
        disp '[!] Error: No Word Detected.';
        failed = 1;
    end
    if check
        ptr = ptr + w;
        ave2 = ave1;
        ave1 = mean(abs(qmic(ptr:ptr+w)));
    end
end
if ~failed
    word = qmic(ptr:((ptr-1)+ l*Fs/ds));    % Store the detected word
    %---------------------------------------------------------------------%
    % FIR Filtering & Fingerprint Generation
    %=====================================================================
    % Apply Preemphasis Filter to Word
    % Note: Eliminates the -6dB per octave decay of the spectral energy
    s = zeros(1,rec);
    for j = 2:rec
        s(j) = word(j) - 0.97*word(j - 1);
    end
    out1 = (filter(B1, 1, s)).^2;
    out2 = (filter(B2, 1, s)).^2;
    out3 = (filter(B3, 1, s)).^2;
    out4 = (filter(B4, 1, s)).^2;
    out5 = (filter(B5, 1, s)).^2;
    %out6 = (filter(B6, 1, s)).^2;
    % Display the reference fingerprint (squared output of each filter).
    figure('Name','Reference Fingerprint','NumberTitle','off');
    subplot(n,1,1); plot(out1);
    subplot(n,1,2); plot(out2);
    subplot(n,1,3); plot(out3); ylabel('Amplitude');
    subplot(n,1,4); plot(out4);
    subplot(n,1,5); plot(out5); xlabel('Sample');
end
%------------------------------------------------------------------------%


APPENDIX B

Ordering Receipts


APPENDIX C

Source Code

Globals.c

#include "globals.h"

/* global variables */
volatile int record, play, buffer_index;    // audio variables
volatile int left_buffer[BUF_SIZE];         // audio buffer
volatile int right_buffer[BUF_SIZE];        // audio buffer
volatile char byte1, byte2, byte3;          // PS/2 variables
volatile int timeout;                       // used to synchronize with the timer

Media Interrupt.c

#include "nios2_ctrl_reg_macros.h"

/* these globals are written by interrupt service routines; we have to declare
 * these as volatile to avoid the compiler caching their values in registers */
extern volatile char byte1, byte2, byte3;       /* modified by PS/2 interrupt service routine */
extern volatile int record, play, buffer_index; // used for audio
extern volatile int timeout;                    // used to synchronize with the timer

/* function prototypes */
void LCD_cursor( int, int );
void LCD_text( char * );
void LCD_cursor_off( void );
void VGA_text (int, int, char *);
void VGA_box (int, int, int, int, short);
void HEX_PS2(char, char, char);

/* Start audio saving on SRAM address 08040000 */

/********************************************************************************
 * This program demonstrates use of the media ports in the DE2 Media Computer
 *
 * It performs the following:
 *  1. records audio for about 10 seconds when an interrupt is generated by
 *     pressing KEY[1]. LEDG[0] is lit while recording. Audio recording is
 *     controlled by using interrupts
 *  2. plays the recorded audio when an interrupt is generated by pressing
 *     KEY[2]. LEDG[1] is lit while playing. Audio playback is controlled by
 *     using interrupts
 *  3. Draws a blue box on the VGA display, and places a text string inside
 *     the box. Also, moves the word ALTERA around the display, "bouncing" off
 *     the blue box and screen edges
 *  4. Shows a text message on the LCD display
 *  5. Displays the last three bytes of data received from the PS/2 port
 *     on the HEX displays on the DE2 board. The PS/2 port is handled using
 *     interrupts
 *  6. The speed of scrolling the LCD display and of refreshing the VGA screen
 *     are controlled by interrupts from the interval timer
 ********************************************************************************/
int main(void)
{
    /* Declare volatile pointers to I/O registers (volatile means that IO load
       and store instructions will be used to access these pointer locations,
       instead of regular memory loads and stores) */
    volatile int * interval_timer_ptr = (int *) 0x10002000; // interval timer base address
    volatile int * KEY_ptr = (int *) 0x10000050;            // pushbutton KEY address
    volatile int * audio_ptr = (int *) 0x10003040;          // audio port address
    volatile int * PS2_ptr = (int *) 0x10000100;            // PS/2 port address
    volatile int * pin_ptr = (int *) 0x10000064;            // header pins address

    /* initialize some variables */
    byte1 = 0; byte2 = 0; byte3 = 0;        // used to hold PS/2 data
    record = 0; play = 0; buffer_index = 0; // used for audio record/playback
    timeout = 0;                            // synchronize with the timer

    /* these variables are used for a blue box and a "bouncing" ALTERA on the VGA screen */
    int ALT_x1; int ALT_x2; int ALT_y;
    int ALT_inc_x; int ALT_inc_y;
    int blue_x1; int blue_y1; int blue_x2; int blue_y2;
    int screen_x; int screen_y; int char_buffer_x; int char_buffer_y;
    short color;

    /* set the interval timer period for scrolling the HEX displays */
    int counter = 0x960000;     // 1/(50 MHz) x (0x960000) ~= 200 msec
    *(interval_timer_ptr + 0x2) = (counter & 0xFFFF);
    *(interval_timer_ptr + 0x3) = (counter >> 16) & 0xFFFF;

    /* start interval timer, enable its interrupts */
    *(interval_timer_ptr + 1) = 0x7;    // STOP = 0, START = 1, CONT = 1, ITO = 1

    *(KEY_ptr + 2) = 0xE;       /* write to the pushbutton interrupt mask register, and
                                 * set 3 mask bits to 1 (bit 0 is Nios II reset) */

    *(PS2_ptr) = 0xFF;          /* reset */
    *(PS2_ptr + 1) = 0x1;       /* write to the PS/2 Control register to enable interrupts */

    NIOS2_WRITE_IENABLE( 0xC3 );    /* set interrupt mask bits for levels 0 (interval
                                     * timer), 1 (pushbuttons), 6 (audio), and 7 (PS/2) */

    NIOS2_WRITE_STATUS( 1 );        // enable Nios II interrupts

    /* create messages to be displayed on the VGA and LCD displays */
    char text_top_LCD[60] = "Audio Record \0";
    char text_top_VGA[20] = "Altera DE2\0";
    char text_bottom_VGA[20] = "Media Computer\0";
    char text_ALTERA[10] = "ALTERA\0";
    char text_erase[10] = "      \0";   // six spaces, enough to erase "ALTERA"

    /* output text message to the LCD */
    LCD_cursor (0,0);           // set LCD cursor location to top row
    LCD_text (text_top_LCD);
    LCD_cursor_off ();          // turn off the LCD cursor
    *(pin_ptr) = 0xffffffff;

    /* the following variables give the size of the pixel buffer */
    screen_x = 319; screen_y = 239;
    color = 0x1863;             // a dark grey color
    VGA_box (0, 0, screen_x, screen_y, color);  // fill the screen with grey

    // draw a medium-blue box around the above text, based on the character buffer coordinates
    blue_x1 = 28; blue_x2 = 52; blue_y1 = 26; blue_y2 = 34;
    // character coords * 4 since characters are 4 x 4 pixel buffer coords (8 x 8 VGA coords)
    color = 0x187F;             // a medium blue color
    VGA_box (blue_x1 * 4, blue_y1 * 4, blue_x2 * 4, blue_y2 * 4, color);

    /* output text message in the middle of the VGA monitor */
    VGA_text (blue_x1 + 5, blue_y1 + 3, text_top_VGA);
    VGA_text (blue_x1 + 5, blue_y1 + 4, text_bottom_VGA);

    char_buffer_x = 79; char_buffer_y = 59;
    ALT_x1 = 0; ALT_x2 = 5 /* ALTERA = 6 chars */; ALT_y = 0;
    ALT_inc_x = 1; ALT_inc_y = 1;
    VGA_text (ALT_x1, ALT_y, text_ALTERA);

    while (1)
    {
        while (!timeout)
            ;   // wait to synchronize with timer

        /* move the ALTERA text around on the VGA screen */
        VGA_text (ALT_x1, ALT_y, text_erase);   // erase
        ALT_x1 += ALT_inc_x;
        ALT_x2 += ALT_inc_x;
        ALT_y += ALT_inc_y;

        if ( (ALT_y == char_buffer_y) || (ALT_y == 0) )
            ALT_inc_y = -(ALT_inc_y);
        if ( (ALT_x2 == char_buffer_x) || (ALT_x1 == 0) )
            ALT_inc_x = -(ALT_inc_x);

        if ( (ALT_y >= blue_y1 - 1) && (ALT_y <= blue_y2 + 1) )
        {
            if ( ((ALT_x1 >= blue_x1 - 1) && (ALT_x1 <= blue_x2 + 1)) ||
                 ((ALT_x2 >= blue_x1 - 1) && (ALT_x2 <= blue_x2 + 1)) )
            {
                if ( (ALT_y == (blue_y1 - 1)) || (ALT_y == (blue_y2 + 1)) )
                    ALT_inc_y = -(ALT_inc_y);
                else
                    ALT_inc_x = -(ALT_inc_x);
            }
        }
        VGA_text (ALT_x1, ALT_y, text_ALTERA);

        /* display PS/2 data (from interrupt service routine) on HEX displays */
        HEX_PS2 (byte1, byte2, byte3);
        timeout = 0;
    }
}

/****************************************************************************************
 * Subroutine to move the LCD cursor
 ****************************************************************************************/
void LCD_cursor(int x, int y)
{
    volatile char * LCD_display_ptr = (char *) 0x10003050;  // 16x2 character display
    char instruction;

    instruction = x;
    if (y != 0) instruction |= 0x40;    // set bit 6 for bottom row
    instruction |= 0x80;                // need to set bit 7 to set the cursor location
    *(LCD_display_ptr) = instruction;   // write to the LCD instruction register
}

/****************************************************************************************
 * Subroutine to send a string of text to the LCD
 ****************************************************************************************/
void LCD_text(char * text_ptr)
{
    volatile char * LCD_display_ptr = (char *) 0x10003050;  // 16x2 character display

    while ( *(text_ptr) )
    {
        *(LCD_display_ptr + 1) = *(text_ptr);   // write to the LCD data register
        ++text_ptr;
    }
}

/****************************************************************************************
 * Subroutine to turn off the LCD cursor
 ****************************************************************************************/
void LCD_cursor_off(void)
{
    volatile char * LCD_display_ptr = (char *) 0x10003050;  // 16x2 character display
    *(LCD_display_ptr) = 0x0C;                              // turn off the LCD cursor
}

/****************************************************************************************
 * Subroutine to send a string of text to the VGA monitor
 ****************************************************************************************/
void VGA_text(int x, int y, char * text_ptr)
{
    int offset;
    volatile char * character_buffer = (char *) 0x09000000; // VGA character buffer

    /* assume that the text string fits on one line */
    offset = (y << 7) + x;
    while ( *(text_ptr) )
    {
        *(character_buffer + offset) = *(text_ptr);     // write to the character buffer
        ++text_ptr;
        ++offset;
    }
}

/****************************************************************************************
 * Draw a filled rectangle on the VGA monitor
 ****************************************************************************************/
void VGA_box(int x1, int y1, int x2, int y2, short pixel_color)
{
    int offset, row, col;
    volatile short * pixel_buffer = (short *) 0x08000000;   // VGA pixel buffer

    /* assume that the box coordinates are valid */
    for (row = y1; row <= y2; row++)
    {
        col = x1;
        while (col <= x2)
        {
            offset = (row << 9) + col;
            *(pixel_buffer + offset) = pixel_color;     // compute halfword address, set pixel
            ++col;
        }
    }
}

/****************************************************************************************
 * Subroutine to show a string of HEX data on the HEX displays
 ****************************************************************************************/
void HEX_PS2(char b1, char b2, char b3)
{
    volatile int * HEX3_HEX0_ptr = (int *) 0x10000020;
    volatile int * HEX7_HEX4_ptr = (int *) 0x10000030;

    /* SEVEN_SEGMENT_DECODE_TABLE gives the on/off settings for all segments in
     * a single 7-seg display in the DE2 Media Computer, for the hex digits 0 - F */
    unsigned char seven_seg_decode_table[] = {
        0x3F, 0x06, 0x5B, 0x4F, 0x66, 0x6D, 0x7D, 0x07,
        0x7F, 0x67, 0x77, 0x7C, 0x39, 0x5E, 0x79, 0x71 };
    unsigned char hex_segs[] = { 0, 0, 0, 0, 0, 0, 0, 0 };
    unsigned int shift_buffer, nibble;
    unsigned char code;
    int i;

    shift_buffer = (b1 << 16) | (b2 << 8) | b3;
    for ( i = 0; i < 6; ++i )
    {
        nibble = shift_buffer & 0x0000000F;     // character is in rightmost nibble
        code = seven_seg_decode_table[nibble];
        hex_segs[i] = code;
        shift_buffer = shift_buffer >> 4;
    }
    /* drive the hex displays */
    *(HEX3_HEX0_ptr) = *(int *) (hex_segs);
    *(HEX7_HEX4_ptr) = *(int *) (hex_segs+4);
}

Audio.c (Main)

#include "globals.h" #include <stdio.h> #include <math.h> /* globals used for audio record/playback */ extern volatile int record, play, buffer_index; extern volatile int left_buffer[]; extern volatile int right_buffer[]; void Euclidean_Dist(int i, int f_num, int *w, int *x, int *y); /* Function Prototype */ void PreEmphasis(int p, int *z); /* Function Prototype */ void averaging(int *a, int *b, int *c, int *d); int best_match(void); // function prototype void FIR_Filter(int trial, int samp_length, float B[], int *samp, int *out); volatile int * d1; volatile int * d2; volatile int * d3; volatile int * d4; volatile int * d5; int taps = 50; int which_word; int trial; long int dist[5][2]; long int *d = &dist[0][0]; float B1[] = {-0.001085,-0.000904,-0.000504,-0.000093,-0.000067,-0.000898,-0.002737,-0.004944,-0.005919,-0.003580,0.003497,0.014749,0.026864,0.034309,0.031310,0.014672,-0.013812, -0.046694,-0.072710,-0.080748,-0.064499,-0.025774,0.025145,0.072614,0.101135,0.101135,0.072614,0.025145,-0.025774,-0.064499,-0.080748,-0.072710,-0.046694,-0.013812, 0.014672,0.031310,0.034309,0.026864,0.014749,0.003497,-0.003580,-0.005919,-0.004944,-0.002737,-0.000898,-0.000067,-0.000093,-0.000504,-0.000904,-0.001085}; float B2[] = {-0.001713,-0.001456,0.000013,0.002355,0.004189,0.003689,0.000365,-0.003649,-0.004955,-0.002540,-0.000016,-0.002727,-0.010978,-0.016151,-0.006105,0.022034,0.052881, 0.059114,0.022610,-0.045258,-0.103633,-0.107849,-0.044031,0.055007,0.128266,0.128266,0.055007,-0.044031,-0.107849,-0.103633,-0.045258,0.022610,0.059114,0.052881,

Page 65: Frenso State

Speech Recognition Using FPGA

60

0.022034,-0.006105,-0.016151,-0.010978,-0.002727,-0.000016,-0.002540,-0.004955,-0.003649,0.000365,0.003689,0.004189,0.002355,0.000013,-0.001456,-0.001713}; float B3[] = {-0.001269,0.000886,0.001925,0.000690,-0.000448,0.000788,0.000779,-0.004915,-0.009409,-0.000526,0.015642,0.015917,-0.003276,-0.014281,-0.004246,-0.002627,-0.025212, -0.023980,0.041228,0.100548,0.039971,-0.111136,-0.165201,-0.020003,0.168760,0.168760,-0.020003,-0.165201,-0.111136,0.039971,0.100548,0.041228,-0.023980,-0.025212, -0.002627,-0.004246,-0.014281,-0.003276,0.015917,0.015642,-0.000526,-0.009409,-0.004915,0.000779,0.000788,-0.000448,0.000690,0.001925,0.000886,-0.001269}; float B4[] = { 0.001002,-0.001963,-0.000796,0.001198,-0.000086,0.002717,0.001160,-0.009047,-0.000978,0.010347,-0.000102,0.002654,-0.001798,-0.027391,0.008830,0.041886,-0.012494, -0.017569,-0.006177,-0.053593,0.060377,0.143398,-0.137512,-0.202763,0.198274,0.198274,-0.202763,-0.137512,0.143398,0.060377,-0.053593,-0.006177,-0.017569,-0.012494, 0.041886,0.008830,-0.027391,-0.001798,0.002654,-0.000102,0.010347,-0.000978,-0.009047,0.001160,0.002717,-0.000086,0.001198,-0.000796,-0.001963,0.001002}; float B5[] = {-0.000335,0.000150,-0.001840,0.003022,-0.001237,-0.000785,-0.002136,0.006948,-0.004037,-0.004442,0.002871,0.009414,-0.008703,-0.012707,0.022384,-0.000190,-0.012552, -0.026198,0.070681,-0.045170,-0.007926,-0.049572,0.228825,-0.324275,0.157310,0.157310,-0.324275,0.228825,-0.049572,-0.007926,-0.045170,0.070681,-0.026198,-0.012552, -0.000190,0.022384,-0.012707,-0.008703,0.009414,0.002871,-0.004442,-0.004037,0.006948,-0.002136,-0.000785,-0.001237,0.003022,-0.001840,0.000150,-0.000335}; /*************************************************************************************** * Pushbutton - Interrupt Service Routine * * This routine checks which KEY has been pressed. If it is KEY1 or KEY2, it writes this * value to the global variable key_pressed. 
If it is KEY3 then it loads the SW switch * values and stores in the variable pattern ****************************************************************************************/ void audio_ISR( void ) { volatile int * SW_ptr = (int *) 0x10000040; // SW slider switches base address volatile int * pin_ptr = (int *) 0x10000064; // expansion pins base address volatile int * red_LED_ptr = (int *) 0x10000000; // red LED address volatile int * audio_ptr = (int *) 0x10003040; // audio port address volatile int * green_LED_ptr = (int *) 0x10000010; // green LED address volatile int * initial = (int *) 0x130000; // Starting address for saving data volatile int * temp_saving; volatile int * l_start_saving; volatile int * signal; volatile int * temp; volatile int * temp2; volatile int * temp3;


volatile int * starting; volatile int * recognize; volatile int * wordG1,* wordG2,* wordG3,* wordG4; volatile int * wordS1,* wordS2,* wordS3,* wordS4; volatile int * wordO1,* wordO2,* wordO3,* wordO4; volatile int * wordC1,* wordC2,* wordC3,* wordC4; volatile int * check; volatile int * P1,* P2,* P3,* P4,* P5,* P6,*P7,* P8,* P9,* P10,* P11,* P12,* P13,* P14,* P15,* P16,* P17,* P18,* P19,* P20; temp2 = 0x0804027c; temp = 0x08040278; // starting of word temp3 = 0x08040280; // distance between two words check = 0x08040270; // pre-emphasis filter check d1 = 0x8040290; // difference of filter 1 d2 = d1+1; // difference of filter 2 d3 = d2+1; // difference of filter 3 d4 = d3+1; //difference of filter 4 d5 = d4+1; //difference of filter 5 signal = 0x3df0; // starting of buffer int SW_value; signed long int sum; signed long int sum2; int P_in; int i; int k; int m; int n; int Rmode; int dist; int distance; int dif1; int dif2; int dif3; int dif4; int matches; int fifospace, leftdata, rightdata; SW_value = *(SW_ptr); if (*(audio_ptr) & 0x100) // check bit RI of the Control register { int shift; int shift2; m = 0; n = 0; Rmode = 0; matches = 0; P_in = 8000; P1 = 0x250000;


P2 = P1+P_in; P3 = P2+P_in; P4 = P3+P_in; P5 = P4+P_in; P6 = P5+P_in; P7 = P6+P_in; P8 = P7+P_in; P9 = P8+P_in; P10 = P9+P_in; P11 = P10+P_in; P12 = P11+P_in; P13 = P12+P_in; P14 = P13+P_in; P15 = P14+P_in; P16 = P15+P_in; P17 = P16+P_in; P18 = P17+P_in; P19 = P18+P_in; P20 = P19+P_in; wordG4 = 0x15ee00; wordS4 = 0x19d600; wordO4 = 0x1dbe00; wordC4 = 0x21a600; *(pin_ptr) = 0xffffffff; if (buffer_index == 0) temp_saving = 0x130000; // starting address of saving words l_start_saving = temp_saving; if (SW_value == 0x1) { l_start_saving = temp_saving; *(red_LED_ptr) = 0x1; char text_top_LCD[60] = "Rec GO word \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); wordG1 = 0x130000; // save a pointer for first word starting address which_word = 1; trial = 1; } else if(SW_value == 2) { temp_saving = temp_saving + 16000; //fa00 l_start_saving = temp_saving; *(red_LED_ptr) = 0x2; char text_top_LCD[60] = "Rec GO word \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); wordG2 = 0x13fa00; which_word = 1;


            trial = 2;
        }
        else if(SW_value == 4)
        {
            temp_saving = temp_saving + 32000; //1f400
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x4;
            char text_top_LCD[60] = "Rec GO word \0";
            LCD_cursor (0,0); // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordG3 = 0x14f400;
            which_word = 1;
            trial = 3;
        }
        // address of average result temp_saving+48000 ;2ee00
        else if (SW_value == 0x8)
        {
            temp_saving = temp_saving + 64000; //3e800
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x8; // mirror the selected switch on the red LEDs
            char text_top_LCD[60] = "Rec STOP word \0";
            LCD_cursor (0,0); // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordS1 = 0x16e800;
            which_word = 2;
            trial = 1;
        }
        else if(SW_value == 0x10)
        {
            temp_saving = temp_saving + 80000; //4e200
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x10;
            char text_top_LCD[60] = "Rec STOP word \0";
            LCD_cursor (0,0); // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            wordS2 = 0x17e200;
            which_word = 2;
            trial = 2;
        }
        else if(SW_value == 0x20)
        {
            temp_saving = temp_saving + 96000; //5dc00
            l_start_saving = temp_saving;
            *(red_LED_ptr) = 0x20;
            char text_top_LCD[60] = "Rec STOP word \0";


LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); wordS3 = 0x18dc00; which_word = 2; trial = 3; } // address of average result temp_saving+112000 ;6d600 else if (SW_value == 0x40) { temp_saving = temp_saving + 128000; //7d000 l_start_saving = temp_saving; *(red_LED_ptr) = 0x40; char text_top_LCD[60] = "Rec OPEN word \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); wordO1 = 0x1ad000; } else if(SW_value == 0x80) { temp_saving = temp_saving + 144000; //8ca00 l_start_saving = temp_saving; *(red_LED_ptr) = 0x80; char text_top_LCD[60] = "Rec OPEN word \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); wordO2 = 0x1bca00; } else if(SW_value == 0x100) { temp_saving = temp_saving + 160000; //9c400 l_start_saving = temp_saving; *(red_LED_ptr) = 0x100; char text_top_LCD[60] = "Rec OPEN word \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); wordO3 = 0x1cc400; } // address of average result temp_saving+176000 ; abe00 else if (SW_value == 0x200) { temp_saving = temp_saving + 192000; //bb800 l_start_saving = temp_saving; *(red_LED_ptr) = 0x200;


char text_top_LCD[60] = "Rec CLOSE word \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); wordC1 = 0x1eb800; } else if(SW_value == 0x400) { temp_saving = temp_saving + 208000; //cb200 l_start_saving = temp_saving; *(red_LED_ptr) = 0x400; char text_top_LCD[60] = "Rec CLOSE word \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); wordC2 = 0x1fb200; } else if(SW_value == 0x800) { temp_saving = temp_saving + 224000; //dac00 + 130000 = 20ac00 l_start_saving = temp_saving; *(red_LED_ptr) = 0x800; char text_top_LCD[60] = "Rec CLOSE word \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); wordC3 = 0x20ac00; } // address of average result temp_saving+240000 ; ea600 else if(SW_value == 0x3) // recognizing mode { l_start_saving = 0x240000; temp_saving = 0x240000; *(red_LED_ptr) = 0x3; char text_top_LCD[60] = "Speak Now \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); recognize = 0x240000; //save a pointer for starting address in recognizing mode which_word = 5; } else {


temp_saving = temp_saving + 256000; //fa000 + 130000 = 22a000 l_start_saving = temp_saving; *(red_LED_ptr) = SW_value; char text_top_LCD[60] = "SWITCH ERROR \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); } fifospace = *(audio_ptr + 1); // read the audio port fifospace register // store data until the the audio-in FIFO is empty or the buffer is full while ( (fifospace & 0x000000FF) && (buffer_index < BUF_SIZE) ) { left_buffer[buffer_index] = *(audio_ptr + 2); right_buffer[buffer_index] = *(audio_ptr + 3); ++buffer_index; if (buffer_index == BUF_SIZE) { // done recording record = 0; *(green_LED_ptr) = 0x0; // turn off LEDG *(audio_ptr) = 0x0; // turn off interrupts *(red_LED_ptr) = 0x0; // turn off red led buffer_index = 0; sum = 0; i = 0; sum2 = 0; // start address 0x2120 // ending address 0x126f40 while (i < 960) { for(k=0;k<100;k++) { sum += abs(*signal); signal++; signal++; } sum = sum/100; if(sum>sum2) {


            *(temp2) = sum; // save average into temp2
            sum2 = sum;
            *(temp) = signal-200; // save starting address into temp
            starting = *(temp); // starting points to beginning of word
        }
        if(sum2 > 21050000)
        {
            char text_top_LCD[60] = "Word Detected \0";
            LCD_cursor (0,0); // set LCD cursor location to top row
            LCD_text (text_top_LCD);
            LCD_cursor_off ();
            i = 961; // leave the search loop
        }
        i++;
    }
    if (sum2 < 21050000) // if word not detected display LCD message
    {
        char text_top_LCD[60] = "No Word Spoken \0";
        LCD_cursor (0,0); // set LCD cursor location to top row
        LCD_text (text_top_LCD);
        LCD_cursor_off ();
    }
    buffer_index = *(temp); // starting address of word is held by temp
    *(check) = *(starting+1) >> 10;
    while (m < P_in) // save 1 second of word into memory for future use
    {
        // word is downsampled: every third sample is kept
        shift = *(starting);
        shift2 = shift >> 10; // shift value by 10 to the right
        *(starting) = shift2;
        *(l_start_saving) = *(starting);
        l_start_saving++;
        starting++;
        starting++;
        starting++;
        m++;
    }
    PreEmphasis(P_in,temp_saving); // subtracts 0.95 times the previous sample from each sample
    if (which_word == 1)


{ char text_top_LCD[60] = "Processing \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); FIR_Filter(trial, P_in, B1, temp_saving, P1); FIR_Filter(trial, P_in, B2, temp_saving, P2); FIR_Filter(trial, P_in, B3, temp_saving, P3); FIR_Filter(trial, P_in, B4, temp_saving, P4); FIR_Filter(trial, P_in, B5, temp_saving, P5); } if (which_word == 2) { char text_top_LCD[60] = "Processing \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); FIR_Filter(trial, P_in, B1, temp_saving, P6); FIR_Filter(trial, P_in, B2, temp_saving, P7); FIR_Filter(trial, P_in, B3, temp_saving, P8); FIR_Filter(trial, P_in, B4, temp_saving, P9); FIR_Filter(trial, P_in, B5, temp_saving, P10); } if (which_word == 5) { char text_top_LCD[60] = "Processing \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); FIR_Filter(trial, P_in, B1, temp_saving, P11); FIR_Filter(trial, P_in, B2, temp_saving, P12); FIR_Filter(trial, P_in, B3, temp_saving, P13); FIR_Filter(trial, P_in, B4, temp_saving, P14); FIR_Filter(trial, P_in, B5, temp_saving, P15); } char text_top_LCD[60] = "Ready \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); //averaging(wordG1, wordG2, wordG3, wordG4); //averaging(wordS1, wordS2, wordS3, wordS4); //averaging(wordO1, wordO2, wordO3, wordO4); //averaging(wordC1, wordC2, wordC3, wordC4); if(SW_value == 0x3) // time domain distance of each value


{ Euclidean_Dist(P_in, 1, P1, P6, P11); //*(temp3) = distance; //dif1 = distance; Euclidean_Dist(P_in, 2, P2, P7, P12); //dif2 = distance; Euclidean_Dist(P_in, 3, P3, P8, P13); //dif3 = distance; Euclidean_Dist(P_in, 4, P4, P9, P14); //dif4 = distance; Euclidean_Dist(P_in, 5, P5, P10, P15); matches = best_match(); if (matches == 0) { char text_top_LCD[60] = "Detected Fan \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); *(pin_ptr) = 0xfffffffe; } if (matches == 1) { char text_top_LCD[60] = "Detected Stop \0"; LCD_cursor (0,0); // set LCD cursor location to top row LCD_text (text_top_LCD); LCD_cursor_off (); *(pin_ptr) = 0xffffffff; } } } fifospace = *(audio_ptr + 1); // read the audio port fifospace register } } if (*(audio_ptr) & 0x200) // check bit WI of the Control register { if (buffer_index == 0) *(green_LED_ptr) = 0x2; // turn on LEDG_1


        fifospace = *(audio_ptr + 1); // read the audio port fifospace register
        // output data until the buffer is empty or the audio-out FIFO is full
        while ( (fifospace & 0x00FF0000) && (buffer_index < BUF_SIZE) )
        {
            *(audio_ptr + 2) = left_buffer[buffer_index];
            *(audio_ptr + 3) = right_buffer[buffer_index];
            ++buffer_index;
            if (buffer_index == BUF_SIZE)
            {
                // done playback
                play = 0;
                *(red_LED_ptr) = 0x0;
                *(green_LED_ptr) = 0x0; // turn off LEDG
                *(audio_ptr) = 0x0; // turn off interrupts
                char text_top_LCD[60] = "Done Playback \0";
                LCD_cursor (0,0); // set LCD cursor location to top row
                LCD_text (text_top_LCD);
                LCD_cursor_off ();
            }
            fifospace = *(audio_ptr + 1); // read the audio port fifospace register
        }
    }
    return;
}
/**************************************************************/
/* difference equation for two words */
void Euclidean_Dist(int i,int f_num, int *w, int *x, int *y)
{
    /* This function calculates the mean squared difference between the
       test array y and each of the two reference arrays w and x, all of
       length i. The results are stored in dist[f_num-1][0] (w against y)
       and dist[f_num-1][1] (x against y). */
    int j = 0;
    int *x2;
    int *y2;
    int *w2;
    long int temp_d1 = 0;
    long int temp_d2 = 0;
    x2 = x;
    y2 = y;
    w2 = w;
    //------------------------------------------------------
    // Loop to find cumulative squared difference, then divide by i
    while(j < i) // visit all i samples, since the sum is divided by i
    {
        temp_d1 += (*(w) - *(y))*(*(w) - *(y));
        temp_d2 += (*(x) - *(y))*(*(x) - *(y));
        w++;
        x++;
        y++;
        j++;


    }
    x = x2; // restore the original pointers
    y = y2;
    w = w2;
    //------------------------------------------------------
    dist[f_num-1][0] = abs(temp_d1)/(i);
    dist[f_num-1][1] = abs(temp_d2)/(i);
    return;
}
/**************************************************************/
/**************************************************************/
void PreEmphasis(int p, int *z)
{
    /* This function applies a pre-emphasis filter,
       y[n] = x[n] - 0.95*x[n-1], in place to the p samples at z.
       The filter compensates for the -6 dB per octave decay of the
       spectral energy. */
    int u = 0;
    float diff;
    int in_diff;
    int *save;
    int value = *z;
    //------------------------------------------------------
    save = z; // remember the start of the array
    while(u < p)
    {
        diff = *z*0.95; // 0.95 times the current unfiltered sample
        *z = value; // store the filtered sample
        z++;
        in_diff = (int)diff;
        if (u < p - 1) // do not read past the last sample
            value = *z - in_diff;
        u++;
    }
    z = save;
    return;
}
/**************************************************************/
/**************************************************************/
void averaging(int *a, int *b, int *c, int *d)
{
    /* Stores the sample-by-sample average of a, b and c in d. */
    int summation;
    int avg;
    int i;
    int *a2;
    int *b2;
    int *c2;
    int *d2;
    a2 = a;
    b2 = b;
    c2 = c;
    d2 = d;
    i = 0;
    while (i<8000)
    {


        summation = *(a)+*(b)+*(c);
        avg = summation/3;
        *(d) = avg;
        a++;
        b++;
        c++;
        d++;
        i++;
    }
    a = a2;
    b = b2;
    c = c2;
    d = d2;
    return;
}
/**************************************************************/
/**************************************************************/
int best_match(void)
{
    long int match1 = 0;
    long int match2 = 0;
    int match = 0;
    /******
    match1 = dist[0][0]+dist[1][0]+dist[2][0]+dist[3][0]+dist[4][0];
    match2 = dist[0][1]+dist[1][1]+dist[2][1]+dist[3][1]+dist[4][1];
    if (match1 < match2)
        match = 0;
    else
        match = 1;
    ******/
    if(dist[0][0] < dist[0][1])
    {
        *(d1) = dist[0][0]-dist[0][1];
        match1++;
    }
    else
        match2++;
    if(dist[1][0] < dist[1][1])
    {
        *(d2) = dist[1][0]-dist[1][1];
        match1++;
    }
    else
        match2++;
    if(dist[2][0] < dist[2][1])
    {
        *(d3) = dist[2][0]-dist[2][1];
        match1++;


    }
    else
        match2++;
    if(dist[3][0] < dist[3][1])
    {
        *(d4) = dist[3][0]-dist[3][1];
        match1++;
    }
    else
        match2++;
    if(dist[4][0] < dist[4][1])
    {
        *(d5) = dist[4][0]-dist[4][1];
        match1++;
    }
    else
        match2++;
    /* match1 and match2 count the filters in which the first or the second
       reference word is closer, so the word with more votes wins */
    if (match1 > match2)
        match = 0;
    else
        match = 1;
    return match;
}
/**************************************************************/
/**************************************************************/
void FIR_Filter(int trial, int samp_length, float B[], int *samp, int *out)
{
    /* This function filters the samples pointed to by 'samp', squares each
       filtered value, and stores the result in the location pointed to by
       'out'.
       'trial': 1 stores the squared output, 2 adds it to the value already
                in 'out', 3 adds it and divides the total by 3
       'samp_length': number of samples pointed to by 'samp'
       'B[]': coefficient array for the filter
       'samp': pointer to integer samples
       'out': pointer to output storage */
    /**************************************************************/
    int *save_in;
    int *save_out;
    save_in = samp;
    save_out = out;
    int k = 0;
    int inc = 0;
    float val = 0;
    float y = 0;
    float f_out = 0;
    /* first 'taps' outputs: use only the samples that are available so far */
    while(inc < taps)
    {
        while(k < (inc+1))


        {
            val = (float)*(samp-k);
            //printf("%f \n",val);
            y += val*B[k];
            k++;
        }
        //printf("%f \n", y);
        if (trial == 1)
        {
            y = ((y*y)+0.5);
        }
        if (trial == 2)
        {
            f_out = (float)*out;
            y = ((y*y)+0.5);
            y+= f_out;
        }
        if (trial == 3)
        {
            f_out = (float)*out;
            y = ((y*y)+0.5);
            y = (y + f_out)/3;
        }
        *out = (int)y;
        //printf("\t %i \n",*out);
        samp++;
        inc++;
        out++;
        y =0;
        k = 0;
    }
    //printf("%i \t %f \t %f \n",inc,*samp,*out);
    //printf("%i \t %i \n",inc,samp_length);
    while(inc < samp_length)
    {
        while(k < taps)
        {
            val = (float)*(samp-k);
            y += val*B[k];
            //printf("%i \t %f \t %f \t",k ,B[k], *(samp-k));
            k++;
        }
        if (trial == 1)
        {
            y = ((y*y)+0.5);
        }
        if (trial == 2)
        {
            f_out = (float)*out;
            y = ((y*y)+0.5);
            y+= f_out;
        }
        if (trial == 3)
        {
            f_out = (float)*out;
            y = ((y*y)+0.5);
            y = (y + f_out)/3;
        }
        *out = (int)y;


        //printf("%i %3.8f \n",inc,*out);
        samp++; // Inc pointers & counters
        inc++;
        out++;
        y =0;
        k = 0;
    }
    samp = save_in; // Return pointers to 0th element
    out = save_out;
    return;
}
/*****************************************************************/

Pushbutton.c

extern volatile int buffer_index;
/***************************************************************************************
 * Pushbutton - Interrupt Service Routine
 *
 * This routine checks which KEY has been pressed. If it is KEY1, it clears the
 * audio-in FIFO and enables audio-in interrupts (start recording). If it is KEY2,
 * it clears the audio-out FIFO and enables audio-out interrupts (start playback).
 ****************************************************************************************/
void pushbutton_ISR( void )
{
    volatile int * KEY_ptr = (int *) 0x10000050; // pushbuttons base address
    volatile int * audio_ptr = (int *) 0x10003040; // audio port address
    volatile int * green_LED_ptr = (int *) 0x10000010; // green LED address
    int KEY_value;
    KEY_value = *(KEY_ptr + 3); // read the pushbutton interrupt register
    *(KEY_ptr + 3) = 0; // Clear the interrupt
    if (KEY_value == 0x2) // check KEY1
    {
        *(green_LED_ptr) = 0x2; // turn on LEDG[1]
        // reset the buffer index to record
        buffer_index = 0;
        // clear audio-in FIFO
        *(audio_ptr) = 0x4;
        // turn off clear, and enable audio-in interrupts
        *(audio_ptr) = 0x1;
    }
    else if (KEY_value == 0x4) // check KEY2
    {
        *(green_LED_ptr) = 0x4; // turn on LEDG[2]
        // reset buffer index for playback
        buffer_index = 0;
        // clear audio-out FIFO


        *(audio_ptr) = 0x8;
        // turn off clear, and enable audio-out interrupts
        *(audio_ptr) = 0x2;
    }
    /****else if (KEY_value == 0x8) // check KEY3
    {
        *(green_LED_ptr) = 0x8; // turn on LEDG[3]
        // reset buffer index to record
        buffer_index = 0;
        // clear audio-in FIFO
        *(audio_ptr) = 0x4;
        // turn off clear, and enable audio-in interrupts
        *(audio_ptr) = 0x3;
    }
    ****/
    return;
}
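The pushbutton and audio routines write and test the audio control register with bare constants (0x1, 0x2, 0x4, 0x8, 0x100, 0x200). As a minimal sketch, the same accesses can be written with named masks. The mask and function names below are hypothetical and are inferred only from the comments in the listings, and the register is passed in as a pointer so the logic can be tried on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the bit values used in the listings. */
enum {
    AUDIO_RE = 0x001, /* enable audio-in (read) interrupts   */
    AUDIO_WE = 0x002, /* enable audio-out (write) interrupts */
    AUDIO_CR = 0x004, /* clear the audio-in FIFO             */
    AUDIO_CW = 0x008, /* clear the audio-out FIFO            */
    AUDIO_RI = 0x100, /* read interrupt pending              */
    AUDIO_WI = 0x200  /* write interrupt pending             */
};

/* KEY1 path: clear the audio-in FIFO, then release the clear bit
   and enable audio-in interrupts. */
void audio_start_record(volatile uint32_t *ctrl)
{
    assert(ctrl != 0);
    *ctrl = AUDIO_CR;
    *ctrl = AUDIO_RE;
}

/* KEY2 path: clear the audio-out FIFO, then release the clear bit
   and enable audio-out interrupts. */
void audio_start_playback(volatile uint32_t *ctrl)
{
    assert(ctrl != 0);
    *ctrl = AUDIO_CW;
    *ctrl = AUDIO_WE;
}

/* Returns 1 when the control register value reports a pending
   audio-in interrupt, 0 otherwise. */
int audio_read_pending(uint32_t ctrl_value)
{
    return (ctrl_value & AUDIO_RI) != 0;
}
```

On the board the pointer would be the audio port address used in the listings (0x10003040); on a host, an ordinary variable can stand in for the register.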

Interval Timer ISR.c

extern volatile int timeout;
/*****************************************************************************
 * Interval timer interrupt service routine
 *
 * Clears the timer interrupt and sets the global variable timeout
 *
 ******************************************************************************/
void interval_timer_ISR( )
{
    volatile int * interval_timer_ptr = (int *) 0x10002000;
    volatile char * LCD_display_ptr = (char *) 0x10003050; // 16x2 character display
    *(interval_timer_ptr) = 0; // clear the interrupt
    timeout = 1; // set global variable
    /* shift the LCD display to the left */
    //*(LCD_display_ptr) = 0x18; // instruction = shift left
    return;
}
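The recognition path in the audio listing (pre-emphasis, FIR filtering with squaring, a mean squared difference per filter band, and a vote across bands) can be tried on a host machine without the board. The following is a minimal sketch under stated assumptions, not the project code: the function names are hypothetical, arrays replace the fixed memory addresses, and the vote is assumed to pick the reference word that is closer in the majority of bands.

```c
#include <assert.h>
#include <stddef.h>

/* In-place pre-emphasis: y[n] = x[n] - a*x[n-1]; the first sample is kept.
   The listing uses a = 0.95. */
void pre_emphasis(int *x, size_t n, float a)
{
    size_t i;
    int prev;
    if (n == 0)
        return;
    prev = x[0];
    for (i = 1; i < n; i++) {
        int cur = x[i];
        x[i] = cur - (int)(a * (float)prev);
        prev = cur;
    }
}

/* Causal FIR filter, y[n] = sum over k of b[k]*x[n-k], using only the
   samples that exist (k <= n). Each output is squared and rounded,
   as in the "trial == 1" branch of the listing. */
void fir_square(const int *x, size_t n, const float *b, size_t taps, int *out)
{
    size_t i, k;
    for (i = 0; i < n; i++) {
        float y = 0.0f;
        for (k = 0; k < taps && k <= i; k++)
            y += (float)x[i - k] * b[k];
        out[i] = (int)(y * y + 0.5f);
    }
}

/* Mean squared difference between two arrays of length n. */
long mean_sq_dist(const int *a, const int *b, size_t n)
{
    long acc = 0;
    size_t i;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        long d = (long)a[i] - (long)b[i];
        acc += d * d;
    }
    return acc / (long)n;
}

/* Majority vote over the bands: returns 0 when reference 0 is closer
   in more than half of the bands, otherwise 1. */
int band_vote(const long *d0, const long *d1, size_t bands)
{
    size_t wins0 = 0, i;
    for (i = 0; i < bands; i++)
        if (d0[i] < d1[i])
            wins0++;
    return (wins0 * 2 > bands) ? 0 : 1;
}
```

With five bands, as in the project, a caller would filter the spoken word and both reference words with each of the five coefficient sets, call mean_sq_dist once per band and reference, and pass the two resulting arrays to band_vote.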