Feasibility study of a voice identification approach using
Fourier-Transform based spectral matching
xxxx1 and xxxx1 (authors' information has been removed), submitted by 60430381
Mathematical Foundations of Data Analysis I, WS 2018/19,
Mathematical Institute, University of Cologne
Speech recognition, and especially voice identification, is a strong research field nowadays
and will potentially grow even more in the near future. Spectral analysis is a fundamental
operation in speech recognition. This project presents an approach based on the discrete Fourier
transform to analyze spoken words in the frequency spectrum for speaker recognition. Using
several mathematical methods related to the Fourier transform, audio signals will
be evaluated not only in the time- but also in the frequency-domain. The final results will
show that using the fast Fourier transform to extract features from the audio input
is a numerically more efficient method for voice identification. In addition, we
examine how noise influences the operations and in which way it is possible to remove this
noise for an acceptable reconstruction of a template signal.
I. INTRODUCTION
Fourier analysis forms the foundation for signal pro-
cessing. Applications can be found, e.g., in data
sampling, data compression or imaging. One of the
most fascinating and current fields of interest is audio
signal processing. With Fourier-based data analysis it is
possible to compare speech patterns for the identification
of a template speaker. We aim to compare audio files
from different people to a template audio file. Using
Fourier-based spectral comparison in MATLAB, we
should then be able to identify and match the speaker.
Fourier analysis makes it fairly easy to denoise audio
files and filter out unusable and undesirable data. The
necessity and importance of voice and speech analysis is
obvious: the ability to identify a person by their voice
and to recognize exactly what is said. The use in public
surveillance and smart speakers are just some of many
aspects that emphasize the rising relevance of voice
identification and speech recognition.
One's voice is the most important tool for human
communication. From a technological point of view,
the study of voices and sounds hence offers many
opportunities. With the rapid and constant tech-
nological development in today's world, it is essential
to understand how voice identification in general, but
also specific topics like speech recognition, work. With
this project, we want to analyze the basic mathematical
foundation of voice identification without stepping too
deep into the subjects of artificial intelligence and speech
recognition. Our main goal is to show the idea behind
the advanced technological tools we have access to
nowadays, and to attempt a simple voice identification.
The focus of our work is on the principles of audio signal
processing, including spectral comparison, the filtering
of sound, denoising methods and the computational
implementation of some Fourier transformations.
In a world guided by technology and progress, speech
recognition is a fundamental research area, yet to be
completely understood. The basic concepts of speech
analysis go a long way back, beginning already in the
late 18th century. We, however, are mainly interested in
the current process of speech recognition. The benefits
that come from the understanding of how sounds and
speech are produced and perceived are immense. Audio
signal processing has led to the development of speech
synthesizers as well as automatic speech recognition
systems, such as Siri, Alexa or Google Home, with the
aim to simplify our everyday life. Nevertheless, this
field is still not fully explored. The automatic speech
recognition systems, for example, are often susceptible
to errors. Unexpected variability in acoustics can lead
to poor performance and irritation. Therefore, this
research area is still at the beginning of its development
and undoubtedly has huge potential for the future.
II. MATHEMATICAL METHODS
The basis for signal-processing applications is formed
by the discrete Fourier transform (DFT). The discrete
Fourier transform is essential for finding the spectrum of
a finite-duration signal. In addition, it is essential
to discuss the more efficient computational algorithm,
the so-called fast Fourier transform (FFT). For
our considerations, numerical efficiency is important
for a stable implementation. Therefore, we mainly want
to focus on the use of the FFT.
A. Discrete Fourier Transform
In general, the discrete Fourier transform (DFT) of a
finite duration sequence f(n), 0 ≤ n ≤ N − 1, denoted
by $\hat{f}$ or $\mathcal{F}f$, is calculated by the following formula
$$\hat{f}(\omega) = \sum_{n=0}^{N-1} f(n)\, e^{-\frac{i2\pi\omega n}{N}}.$$
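The sum above can be evaluated directly in code. The project itself works in MATLAB; the following is only a minimal NumPy sketch (the function name `dft` is ours, for illustration), which evaluates one inner sum per output frequency:

```python
import numpy as np

def dft(f):
    """Directly evaluate f_hat(w) = sum_{n=0}^{N-1} f(n) e^{-i 2 pi w n / N}."""
    N = len(f)
    n = np.arange(N)
    # One inner sum per output frequency w: O(N^2) operations in total.
    return np.array([np.sum(f * np.exp(-2j * np.pi * w * n / N))
                     for w in range(N)])

# The direct sum agrees with a library FFT up to rounding error.
signal = np.array([1.0, 2.0, 0.0, -1.0])
assert np.allclose(dft(signal), np.fft.fft(signal))
```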
Remark 1. There also exists a similar method of
calculating the n-point DFT of a vector x. Let n ≥ 2
be any given integer. The n-point discrete Fourier
transform (DFT) is the n × n matrix Fn of complex
numbers, defined by
$$F_n = \left[(z_n)^{jk}\right]_{0 \le j,k \le n-1} =
\begin{bmatrix}
1 & z_n^0 & \cdots & z_n^0 \\
1 & z_n^1 & \cdots & z_n^{n-1} \\
\vdots & \vdots & & \vdots \\
1 & z_n^{n-1} & \cdots & z_n^{(n-1)(n-1)}
\end{bmatrix} =
\begin{bmatrix}
1 & 1 & \cdots & 1 \\
1 & z_n^1 & \cdots & z_n^{n-1} \\
\vdots & \vdots & & \vdots \\
1 & z_n^{n-1} & \cdots & z_n^{(n-1)(n-1)}
\end{bmatrix},$$
where $z_n = e^{-\frac{i2\pi}{n}}$. For a vector $x \in \mathbb{C}^n$ the DFT is given
by $\hat{x} = F_n x$ [2].
For completeness, we also want to mention the inverse
discrete Fourier transform (IDFT) of a given sequence
f̂(ω), 0 ≤ ω ≤ N − 1, which is defined by
$$f(n) = \frac{1}{N} \sum_{\omega=0}^{N-1} \hat{f}(\omega)\, e^{\frac{i2\pi\omega n}{N}}.$$
The inverse of a Fourier transformation can be used to
reconstruct a signal from its frequency components.
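As a sanity check, applying the inverse transform to a transformed signal should recover the original samples. Again a NumPy sketch only (the project code is in MATLAB; `idft` is our illustrative name):

```python
import numpy as np

def idft(f_hat):
    """f(n) = (1/N) sum_{w=0}^{N-1} f_hat(w) e^{+i 2 pi w n / N}."""
    N = len(f_hat)
    w = np.arange(N)
    return np.array([np.sum(f_hat * np.exp(2j * np.pi * w * n / N))
                     for n in range(N)]) / N

# Round trip: transform, then invert; imaginary parts vanish up to rounding.
signal = np.array([0.5, -1.0, 2.0, 3.0])
reconstructed = idft(np.fft.fft(signal))
assert np.allclose(reconstructed.real, signal)
assert np.allclose(reconstructed.imag, 0.0)
```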
Remark 2. Analogous to the DFT, one can derive the
n-point version of the inverse discrete Fourier transform.
For a given sequence f̂(ω), 0 ≤ ω ≤ N − 1, its IDFT is
then defined by
$$\tilde{F}_n = \frac{1}{n} F_n^{*},$$
where $F_n^{*}$ is the adjoint of $F_n$ [2].
In general, the DFT can be taken of a sequence of
complex numbers. In our case, for simplicity, we only
considered real-valued numbers.
The main purpose of measuring the spectrum of a sig-
nal is to learn about the covered frequency range and
the number of frequencies measured. One has to con-
sider that a real-life speech signal must have a certain
duration to be sufficient. Furthermore, the per-
formance of spectral analysis by DFT needs information
about the duration of the DFT, the time between succes-
sive applications of the DFT and on the sampling rate
of the discretized signal [1]. For a useful evaluation, it
is necessary to choose an appropriate window to cover
possible speech parameter changes. The parameter N is
determined by the preferred spectrum resolution.
Remark 3. The sampling of an analog signal in
the time-domain results in a periodic function in the
frequency-domain. For the DFT both f(n) and f̂(ω)
are periodic with period N . Therefore, the IDFT of the
signal will also be N -periodic.
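The $N$-periodicity from Remark 3 can be verified numerically: shifting $\omega$ by $N$ multiplies every summand by $e^{-i2\pi n} = 1$. A small NumPy check (illustrative only, not part of the project code):

```python
import numpy as np

N = 8
rng = np.random.default_rng(1)
f = rng.normal(size=N)
n = np.arange(N)

def f_hat(w):
    # DFT sum evaluated at an arbitrary frequency index w
    return np.sum(f * np.exp(-2j * np.pi * w * n / N))

# f_hat is N-periodic: exp(-i 2 pi (w+N) n / N) = exp(-i 2 pi w n / N) * exp(-i 2 pi n).
for w in range(N):
    assert np.isclose(f_hat(w), f_hat(w + N))
```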
From a numerical point of view, the DFT certainly has
serious deficiencies regarding its implementation. The
computation of the DFT takes up to $N^2$
operations, which means that for large data sets a com-
plexity of $O(N^2)$ occurs. Hence, the use of the DFT is
highly inefficient. Due to that, we want to
compare the DFT to the fast Fourier transform (FFT).
The FFT is supposed to be a faster version of the DFT
and should save computational time.
B. Fast Fourier Transform
With the FFT algorithm, the computational time
reduces to $O(N \log_2 N)$ in the worst case. This is done
by separating the calculations of even and odd indices:
When the index is even, the terms $n$ and $n + \frac{N}{2}$
can be grouped. One gets
$$\hat{f}(2\omega) = \sum_{n=0}^{\frac{N}{2}-1} \left( f(n) + f\!\left(n + \frac{N}{2}\right) \right) e^{-\frac{i2\pi n\omega}{N/2}}.$$
For odd indices, we hence have
$$\hat{f}(2\omega + 1) = \sum_{n=0}^{\frac{N}{2}-1} e^{-\frac{i2\pi n}{N}} \left( f(n) - f\!\left(n + \frac{N}{2}\right) \right) e^{-\frac{i2\pi n\omega}{N/2}}.$$
Similarly, for DFT matrices, one can derive a full
factorization. For $n = 2^m$, where $m \ge 1$ is an integer, we
get the following FFT scheme:
$$F_n = F_{2^m} = G_0^m G_1^m \cdots G_{m-1}^m \tilde{P}_{2^m},$$
where $\tilde{P}_{2^m}$ is a permutation matrix and $G_k^m$ is defined by
$$G_k^m = \operatorname{diag}\{E_{2^{m-k}}, \ldots, E_{2^{m-k}}\}$$
with
$$E_{2n} = \begin{bmatrix} I_n & D_n \\ I_n & -D_n \end{bmatrix}.$$
Here, $D_n$ is defined as
$$D_n = \operatorname{diag}\left\{1, e^{-\frac{i\pi}{n}}, \ldots, e^{-\frac{i(n-1)\pi}{n}}\right\}$$
and $I_n$ is the $n \times n$ identity matrix [2]. This full matrix
factorization formula for $F_n$ with $n = 2^m$ decreases the computa-
tional complexity of applying $F_{2^m}$.
A signal of size $N$ is then calculated with two discrete
Fourier transforms of size $N/2$ plus $O(N)$ operations. A
similar argument can also be applied to the number
of additions required. All in all, $O(N \log_2 N)$ time is
needed for the calculation of the FFT.
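The even/odd grouping above translates directly into a recursive radix-2 (decimation-in-frequency) scheme. A compact NumPy sketch, assuming the input length is a power of two (the function name `fft_dif` is ours, not from the project code):

```python
import numpy as np

def fft_dif(f):
    """Radix-2 decimation-in-frequency FFT; len(f) must be a power of two."""
    f = np.asarray(f, dtype=complex)
    N = len(f)
    if N == 1:
        return f
    half = N // 2
    twiddle = np.exp(-2j * np.pi * np.arange(half) / N)
    # Even outputs come from the sum, odd outputs from the twiddled difference.
    even = fft_dif(f[:half] + f[half:])               # f_hat(2w)
    odd = fft_dif((f[:half] - f[half:]) * twiddle)    # f_hat(2w + 1)
    out = np.empty(N, dtype=complex)
    out[0::2] = even
    out[1::2] = odd
    return out

# Matches the library FFT on a power-of-two-length signal.
x = np.array([1.0, -2.0, 3.0, 0.5, 0.0, 1.5, -1.0, 2.0])
assert np.allclose(fft_dif(x), np.fft.fft(x))
```

Each recursion level costs $O(N)$ additions and multiplications, and there are $\log_2 N$ levels, matching the stated complexity.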
In comparison to the two methods mentioned above, we
also want to take a closer look at a real-valued version
of the DFT: the discrete Cosine transform (DCT). The
idea behind the DCT is to avoid a dimension extension
in case of the transformation of a real signal into a
complex signal.
C. Discrete Cosine Transform
Recall the definition of the DFT. The discrete Cosine
transform (DCT) of a sequence f(n) of N points is de-
fined by the equation
$$DCT(\omega) = \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} f(n) \cos\!\left(\frac{(2n+1)\omega\pi}{2N}\right),$$
with $\omega = 1, 2, \ldots, N-1$. For $\omega = 0$, the factor $\sqrt{2/N}$ changes to $\sqrt{1/N}$ [3].
Similarly, we also want to give the definition of the
inverse discrete Cosine transform (IDCT).
We have the same assumptions as in the definition for
the DCT. Then,
$$IDCT(n) = \sum_{\omega=0}^{N-1} a(\omega)\, DCT(\omega) \cos\!\left(\frac{(2n+1)\omega\pi}{2N}\right),$$
$n = 0, 1, \ldots, N-1$, is called the inverse discrete
Cosine transform (IDCT), where $a(\omega) = \sqrt{1/N}$ for $\omega = 0$ and
$a(\omega) = \sqrt{2/N}$ for $\omega \neq 0$ [3].
Alternatively, one can also define the DCT and the IDCT
via a matrix formulation.
Remark 4. For each $n \ge 2$, the unitary matrix
$C_n = [c_0, \ldots, c_{n-1}] \in \mathbb{R}^{n \times n}$, with
$$c_k = \left[\frac{1}{\sqrt{n}},\; \sqrt{\frac{2}{n}}\cos\frac{(k+\frac{1}{2})\pi}{n},\; \ldots,\; \sqrt{\frac{2}{n}}\cos\frac{(n-1)(k+\frac{1}{2})\pi}{n}\right]^T,$$
is called the $n$-point discrete Cosine transform (DCT),
and its transpose $C_n^T$ is called the $n$-point inverse discrete
Cosine transform (IDCT) [2].
The normalization factors $\frac{1}{\sqrt{n}}$ and $\sqrt{\frac{2}{n}}$ turn the matrix $C_n$ into an
orthogonal matrix. Therefore, the inverse DCT is given
by the transpose of $C_n$.
With the DCT, it is possible to consider audio signals as
sparse signals in the frequency-domain. For further con-
siderations we refer to [3]. The most interesting aspect
of the DCT is that the DCT speech signal representation
is able to compress input data into as few coefficients
as possible. Coefficients with relatively small amplitudes
can be discarded without any misrepresentation of infor-
mation in the reconstructed signal.
Remark 5. The computational time of the DCT is
comparable to that of the DFT. There are also algorithms
running in $O(n \log n)$ time, but these are rarely used in practice [3].
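Both the orthogonality from Remark 4 and the compression idea can be illustrated in a few lines. The sketch below builds an orthogonal DCT-II matrix in NumPy (a standard convention; the helper name `dct_matrix` is ours, not project code):

```python
import numpy as np

def dct_matrix(n):
    """Orthogonal n x n DCT-II matrix; row k samples cos(k (i + 1/2) pi / n)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(k * (i + 0.5) * np.pi / n)
    C[0, :] = 1.0 / np.sqrt(n)   # the k = 0 row carries the sqrt(1/n) factor
    return C

n = 64
C = dct_matrix(n)
# Orthogonality: the inverse DCT is simply the transpose.
assert np.allclose(C @ C.T, np.eye(n))

# Compression idea: a signal built from two basis vectors has only two
# nonzero DCT coefficients; dropping the (numerically zero) rest loses nothing.
f = 2.0 * C[3] + 1.0 * C[7]
coeffs = C @ f
coeffs[np.abs(coeffs) < 1e-8] = 0.0
assert np.count_nonzero(coeffs) == 2
assert np.allclose(C.T @ coeffs, f)
```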
D. Realization and Application
Our voice identification system focuses on the determi-
nation of which signal pattern belongs to a prespecified
speaker. Therefore, we recorded voices from different
people saying the words “Data Analysis”. To be capable
of distinguishing between the template voice and the
rest of the recordings, we will analyze our data regarding
their time-domain and frequency-domain representation.
After this analysis it should be fairly easy to allocate
the signals. We also want to compare the patterns of
the same person recording different words, e.g. “Frohe
Weihnachten”, or a longer sentence. As a first step,
the samples, including male and female voices, will be
read into MATLAB.
In Part I of this project, our objective is to identify if
a certain voice belongs to the predetermined person.
To analyze and compare these samples, we will first
transform them with the above-mentioned methods to
obtain the transformed signals in the frequency-domain.
These spectra will then be compared with the master
spectrum. With our application, it should then be pos-
sible to differentiate between speakers and to determine
whether the same words were spoken.
In Part II, we focus on removing unwanted noise. For this
purpose, a new signal is created by mixing the original
voice with a noise signal. By applying a simple filter, fol-
lowed by an inverse transformation, it should be possible
to reconstruct the template.
III. NUMERICAL STUDY AND EVALUATION
A. Input Data
The foundation of our project, as already mentioned,
is built by audio recordings. A total of nine audio
samples have been recorded. For further information
and details on each sample, see the Appendix.
For a closer look at the audio samples, they were
plotted in the time-domain first.
For simplicity, the four most important and relevant
samples are shown in Figure 1. Graph (a) is the signal
of a female voice saying "Data Analysis". (b) displays
the audio signal of the template male voice saying "Frohe
Weihnachten". The next one, graph (c), will be our
master audio signal. It displays the template male voice
saying the words "Data Analysis". Finally, the last graph (d)
shows the same signal as (c) but with background
noise. All original input signals are plotted with time in
seconds on the x-axis and amplitude on the y-axis.
Figure 1: Four different audio signals in the time-domain

By comparing the signals from (c) and (d), one
can easily see the difference. In the mixing process,
we have to make sure that both have the same length.
For mixing purposes, we used simple splines. One can
change the factor noiseIntensity, which determines the
strength or intensity of the selected noise. In our sample
we used noiseIntensity = 1.
% mixing audio sample S1 with noise sample S9
noiseIntensity = 1.0;
[S10, Fs10] = mixingSamples(S1, Fs1, S9, Fs9, noiseIntensity, Fs1);
The foundation and basis for our further analysis is set.
In the next step, we will transform these signals into
spectra in the frequency-domain.
B. Fast Fourier Transformation of the Audio
Signals
It is practical to transform the signal with the fast
Fourier transform. We used the following code in
MATLAB to call and perform the different kinds of
Fourier transformation on our input signals.
[Y, f, fSPE] = computeTransformation(sample, samplingRate, typeOfTransformation);
One can change the type of transformation to 'FFT',
'DFT' or 'DCT'. We used the built-in MATLAB function for the
FFT and our own source code for the DFT and DCT. The
sampling rate in our project is always the same, since the
same recorder has been used. The output Y is our trans-
formed data. Because Y is a two-sided spectrum and has
complex values, we apply calculateSpectrum to obtain
the one-sided spectrum with real values.
function spe = calculateSpectrum(Y)
    n = length(Y);
    amplitude = abs(Y) / n;                      % amplitude
    amplitude_oneSide = amplitude(1:floor(n/2)); % two-sided to one-sided
    amplitude_oneSide(2:end-1) = 2 * amplitude_oneSide(2:end-1);

    spe = amplitude_oneSide;
end
Important: As mentioned in Section II A, the DFT is
very slow for large data. In practice, this fact is
confirmed and can be observed from the code. It takes
considerably longer to perform our own DFT and DCT
compared to the built-in FFT. For further analysis,
we will only display FFT results, since the results from
the DFT and DCT are nearly identical. The DFT and DCT
code can be found in the Appendix.
The following plots demonstrate the results of the
transformed signals from Figure 1. Instead of viewing
the signals with regard to time, we can
now analyze them in the frequency-domain. The
corresponding graphs are plotted with frequency in Hz
on the x-axis and amplitude on the y-axis. The results
are given in Figure 2.
We want to evaluate the outcome of our applications.
Comparing the plots, one can easily extract the following
results:
1. The frequency spectrum of (a) is broadly spread.
The peak of this voice can be found around 1000 Hz,
which is considerably higher than the peak of the
voice in (c), which is around 500 Hz. Moreover, re-
garding (a), higher frequencies, especially above
4000 Hz, are also represented. Since the courses
of these two graphs are similar, it is conceivable
that the intonations are alike and the same words
were probably spoken, but with a different frequency
spectrum. Considering that both are saying the same
words, it is plausible to suggest that higher frequencies
are a sign of a higher pitched voice, i.e. from a female.
2. Graph (b) has a clearly different frequency spec-
trum compared to the other three samples. There
are plenty of peaks between 100 Hz and 400 Hz. The
maximum frequency can be found around 220 Hz.
Frequencies of 2000 Hz and higher are barely present.
The absence of higher frequencies points to a
male voice. Since the trend of this spectrum heav-
ily differs from our template (c), we can assume
that other words were spoken.
3. Finally, (d) is very similar to (c). The trends of
both graphs nearly correspond to each other. Also,
they have similar peaks and ranges of frequencies.
One can directly guess that the signals are from
the same person. Considering the results above, the
probability that the same words were said is ex-
tremely high. The effect of the additional noise can
be seen in the intervals [500, 1500] and [4000, 5000],
where the amplitudes in (d) are slightly higher.
In addition to these simple observations, we want to com-
pare these spectra with each other more quantitatively.
Therefore, we want to determine the correlation between
some spectra, i.e. the degree of similarity. Given two
transformed signals $Y_1$ and $Y_2$, the correlation of these
two is given by
$$c = \frac{\left| \langle Y_1, Y_2 \rangle \right|}{\|Y_1\| \cdot \|Y_2\|}.$$
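In code, this similarity measure is essentially a one-liner. A NumPy sketch (the function name `spectral_correlation` is ours; the project computes the same quantity in MATLAB):

```python
import numpy as np

def spectral_correlation(Y1, Y2):
    """c = |<Y1, Y2>| / (||Y1|| * ||Y2||): 1 for parallel spectra, 0 for orthogonal."""
    return np.abs(np.vdot(Y1, Y2)) / (np.linalg.norm(Y1) * np.linalg.norm(Y2))

# A spectrum correlates perfectly with any rescaling of itself...
Y = np.array([0.1, 0.9, 0.3, 0.0, 0.2])
assert np.isclose(spectral_correlation(Y, 3.0 * Y), 1.0)
# ...and not at all with an orthogonal spectrum.
assert np.isclose(spectral_correlation(np.array([1.0, 0.0]),
                                       np.array([0.0, 1.0])), 0.0)
```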
The basic idea of this correlation is to determine the
orthogonality between the two signals. In Table I, the
correlations between several spectra are shown. In addi-
tion to the previous four samples (a)-(d), we have added
the following samples:
• (e): template male voice saying a longer sentence
• (f): another male voice saying "Data Analysis"
      a       b       c       d       e       f
a     1       0.4262  0.4135  0.5710  0.4598  0.4599
b     0.4262  1       0.6265  0.6861  0.7368  0.5117
c     0.4135  0.6265  1       0.9455  0.7176  0.3555
d     0.5710  0.6861  0.9455  1       0.7434  0.4914
e     0.4598  0.7368  0.7176  0.7434  1       0.4652
f     0.4599  0.5117  0.3555  0.4914  0.4652  1

Table I: Comparing different spectra using simple correlation
The correlation results show exactly what we anticipated.
Figure 2: Audio signals in the frequency-domain using FFT

One can see that samples (b), (d) and (e) have most simi-
larity with the template sample (c), which is not surpris-
ing. Another fact is that (c) and (d) have a correlation of
0.9455, which is very high. Even with additional noise,
we can easily identify the same person. We also notice
that it is irrelevant what is said or how long a person
speaks, the correlation with samples of the same person
is always considerably high. From these results, we see
that the frequency spectrum delivers a lot of important
information, i.e. helping us
• to identify if it is the same person speaking,
• to differentiate between female and male voices
and/or
• to recognize if the same words were spoken.
Comparing signals (c) and (d), we notice that the noise
makes the correlation with other signals higher. In prac-
tice, we often encounter audio samples that are not clean.
In order to remove this unwanted noise and to obtain a
clean voice, we need a process of denoising, e.g. filtering.
C. Filtering
There are many ways to denoise a signal. The most com-
mon techniques are filters which cut off specific frequency
ranges or remove frequencies with low amplitudes. Fre-
quently used filters are
• Low-pass filters,
• High-pass filters and
• Band-pass filters.
We go back to samples (c) and (d), see Figure 2. In
this project, cutting off frequency ranges or removing
amplitudes do not provide the desired result. We have
to consider another customized filter for this problem,
since both spectra are very similar.
The frequency spectrum of the original, the mixed and
the filtered signal are displayed in Figure 3. Our filter is
based on the intention of removing parts and the influ-
ence of noise from the mixed signal as much as possible.
Therefore, we consider two frequency spectra $Y_{mixed}$ and $Y_{noise}$.

Figure 3: Overlay of frequency spectra

To be able to compare these two, we have to nor-
malize them first. Then we can calculate the difference
in their amplitude:
$$\mathrm{amp\_diff} = Y_{mixed,norm} - Y_{noise,norm}.$$
The filter is constructed as simply as follows, keeping the
frequencies where amp_diff > 0:
function filter = designDeNoisingFilter(fSample, fNoise)
    fSample_amp = abs(fSample)/length(fSample);
    fNoise_amp = abs(fNoise)/length(fNoise);

    amp_diff = fSample_amp/sum(fSample_amp) - fNoise_amp/sum(fNoise_amp);
    filter = amp_diff > 0;
end
To apply this filter, we have to multiply it with the trans-
formed mixed signal. After adjusting it to the one-sided
spectrum, we obtain the filtered frequency spectrum in
Figure 3. We notice that the filtered spectrum comes
very close to the original signal. The effect of this filter
can be seen clearly, especially in the frequency intervals
[500, 1500] and [4000, 5000].
filter_FFT = designDeNoisingFilter(S10_FFT, S9_FFT);
S10_FFT_filtered = S10_FFT.*filter_FFT;
S10_FFT_filtered_SPE = calculateSpectrum(S10_FFT_filtered);
S10_FFT_filtered_Inversed = real(computeInverseTransformation(S10_FFT_filtered, 'FFT'));
As shown in Figure 3, the filtered spectrum is mostly
lower than the original spectrum regarding the ampli-
tude. It is clear that we used a very basic and simple
filter, so there is the possibility of over-filtering. The
frequency spectrum alone is not conclusive enough. It
is very important to see the impact on the samples in
the time-domain. For that purpose, we performed an
inverse transformation, depending on the previously
used method.
Figure 4: Overlay of signal (c) and (d) plus filtered version of signal (d)
In Figure 4, the inversed signals from Figure 3 are dis-
played. We notice that there is still some unwanted noise
in the filtered signal. But still, the filter does work and
clearly improved the signal. To support this with numbers,
we can calculate the correlation between the filtered and
the original signal (c). As a result, we have a correlation
of 0.9866, an improvement of 4.3% over the 0.9455 obtained
previously, see Table I.
IV. DISCUSSION
A. Conclusion
In this project, we have successfully used DFT, FFT
and DCT to obtain the frequency spectra of several
audio samples. With these we have built the foundation
of spectral analysis. In order to compare and identify
these samples, we studied the similarity and correlation
of each spectrum. As a result, we can now identify
different voices and match these to a template speaker.
Clearly, we have to be aware of the fact that this is
only a small sample size. Obviously, we have just used a
very simple indicator of similarity. There are definitely
more complex and better indicators for a complete
spectral analysis, which would go beyond the scope
of this project. Nevertheless, we have demonstrated
the basic idea and have noticed that one's frequency
spectrum is so unique in itself that a distinction is fairly
easy to make.
Furthermore, we have also looked into denoising
and have created our own filter. Our filter does improve
the similarity to the original template. The fact that
a little noise still remains indicates that the filter could
be improved in accuracy and precision.
In addition to all that, we performed different (in-
verse) Fourier transformations to reconstruct our input
signals. We can confirm that the results are almost
equal, but we also realize the huge difference in
computational complexity. Since we only considered
a very small set of data, it would be interesting to
see how big the actual difference between the methods
is, especially thinking of complex speech recognition
systems.
B. Outlook
The next step of voice identification is a speaker
recognition system (SRS). The goal of an SRS is to present
an (almost) exact scheme to distinguish the individual
properties of each speaker. The basis for an SRS is
the similarities in individual speech elements. Speaker
recognition requires precise analysis of the speech signal.
It has to be clarified whether the representation carries
suitable signal features or not. Moreover, artificial
intelligence, e.g. deep learning, plays a significant role
in training speech patterns in order to improve their
accuracy.
For further implementation one can derive the so-
called short-time Fourier transform, also known
as the windowed Fourier transform. The short-time
Fourier transform is localized in time and frequency.
This means that it provides information about the
point in time at which a certain frequency occurs.
As a possible comparison, it is convenient to con-
sider wavelet transforms. They are supposed to tell
what frequencies are present and, in contrast to normal
Fourier transforms, where (or at what scale) they occur.
Wavelets might be better in areas like compression or
improved filtering. It would be interesting to see if there
are any advantages that come up by using wavelet
transforms and, if so, which.
multi-resolution property of Wavelet transforms, the
input speech signal is decomposed into various frequency
channels. The major challenge for creating a Wavelet
based speech recognition system is the choice of suitable
wavelets for speech signals and also selecting the right
feature vectors from the wavelet coefficients. A possible
implementation could be the evaluation of the function-
ality of a wavelet based feature extraction system and
its performance on word recognition problems.
ACKNOWLEDGEMENT
I would like to show my gratitude to xxxxxxxx for his engagement. Without his help, this project could not have been realized. His significant support, especially regarding his implementation and study with respect to the DFT and DCT, made this project possible in the first place. Only with his full commitment and dedication did this project become complete. I am already looking forward to working together on the next project.
[1] Ben Gold, Nelson Morgan, and Dan Ellis. Speech and Audio Signal Processing, 2nd Edition. John Wiley & Sons, Inc., New Jersey, 2011.
[2] Boqiang Huang. Mathematical Foundations of Data Analysis I. Wintersemester 2018/19, University of Cologne.
[3] R. G. Moreno-Alvarado and Mauricio Martinez-Garcia. DCT-compressive Sampling of Frequency-sparse Audio Signals. Proceedings of the World Congress on Engineering 2011 Vol II, WCE 2011, July 6-8, 2011, London, U.K.
APPENDIX
Main Script: Project.m
%% Step 0: Initializing and loading audio samples or files
%
clear;

projectPath = uigetdir();
if isdir(projectPath)
    cd(projectPath);
    if exist('Input.mat','file') == 2
        load('Input.mat');
    else % read audio files
        [S1,Fs1] = audioread('Ichkurz.m4a');
        [S2,Fs2] = audioread('Ichhochkurz.m4a');
        [S3,Fs3] = audioread('Mamahochkurz.m4a');
        [S4,Fs4] = audioread('Papa.m4a');
        [S5,Fs5] = audioread('Ich2.m4a');
        [S6,Fs6] = audioread('Ich3lang.m4a');
        [S7,Fs7] = audioread('IchRauschen.m4a');
        [S8,Fs8] = audioread('IchRauschen2.m4a');
        [S9,Fs9] = audioread('Rauschen.m4a');

        save('Input.mat');
    end
end
%% Step 1: Performing DFT/DCT for each audio sample and calculating one-sided frequency spectrum
%
if exist('Output_step1.mat', 'file') == 2
    load('Output_step1.mat');
end

numberOfSignals = 9;
typeOfTransformation = 'FFT';
for i = 1:numberOfSignals
    eval(['sample = S' num2str(i) ';']);
    eval(['samplingRate = Fs' num2str(i) ';']);

    [Y, f, fSPE] = computeTransformation(sample, samplingRate, typeOfTransformation); %#ok<ASGLU>

    eval(['S' num2str(i) '_' typeOfTransformation ' = Y;']);
    eval(['S' num2str(i) '_' typeOfTransformation '_freq = f;']);
    eval(['S' num2str(i) '_' typeOfTransformation '_SPE = fSPE;']);
end

clearvars i numberOfSignals sample samplingRate Y f fSPE;
save('Output_step1.mat');

%% Step 2: De-noising a mixed audio sample
%
clear;
if exist('Output_step2.mat','file') == 2
    load('Output_step2.mat'); % file names are case-sensitive on some systems
else
    load('Output_step1.mat');
end

% mixing audio sample S1 with noise sample S9
noiseIntensity = 1.0;
[S10, Fs10] = mixingSamples(S1, Fs1, S9, Fs9, noiseIntensity, Fs1);

% performing DFT/DCT for mixed sample S10 and calculating its frequency spectrum
typeOfTransformation = 'FFT';
[Y, f, fSPE] = computeTransformation(S10, Fs10, typeOfTransformation);
eval(['S10_' typeOfTransformation ' = Y;']);
eval(['S10_' typeOfTransformation '_freq = f;']);
eval(['S10_' typeOfTransformation '_SPE = fSPE;']);

% designing and applying filter in frequency space
if strcmp(typeOfTransformation, 'DFT')
    filter_DFT = designDeNoisingFilter(S10_DFT, S9_DFT);
    S10_DFT_filtered = S10_DFT.*filter_DFT;
    S10_DFT_filtered_SPE = calculateSpectrum(S10_DFT_filtered);
    S10_DFT_filtered_Inversed = real(computeInverseTransformation(S10_DFT_filtered, 'DFT'));
elseif strcmp(typeOfTransformation, 'DCT')
    filter_DCT = designDeNoisingFilter(S10_DCT, S9_DCT);
    S10_DCT_filtered = S10_DCT.*filter_DCT;
    S10_DCT_filtered_SPE = calculateSpectrum(S10_DCT_filtered);
    S10_DCT_filtered_Inversed = computeInverseTransformation(S10_DCT_filtered, 'DCT');
else
    filter_FFT = designDeNoisingFilter(S10_FFT, S9_FFT);
    S10_FFT_filtered = S10_FFT.*filter_FFT;
    S10_FFT_filtered_SPE = calculateSpectrum(S10_FFT_filtered);
    S10_FFT_filtered_Inversed = real(computeInverseTransformation(S10_FFT_filtered, 'FFT'));
end

clearvars Y f fSPE;
save('Output_step2.mat');

%% Plotting: selected samples
%
%load('Output_step2.mat');
samples = { S3, S5, S1, S10 };
samplingRates = { Fs3, Fs5, Fs1, Fs10 };
titles = {'(a)', '(b)', '(c)', '(d)'};
h = plotSamplesInSubplots(samples, samplingRates, titles);
pause;

print(h,'-dpng', '-r600', '-noui', fullfile(projectPath, 'samples_amplitude.png'));
close(h);

%% Plotting: frequency spectra for selected samples
%
%load('Output_step2.mat');
typeOfTransformation = 'FFT';
titles = {'(a)', '(b)', '(c)', '(d)'};
if strcmp(typeOfTransformation, 'DFT')
    freqs = { S3_DFT_freq, S5_DFT_freq, S1_DFT_freq, S10_DFT_freq };
    freqSPE = { S3_DFT_SPE, S5_DFT_SPE, S1_DFT_SPE, S10_DFT_SPE };
    xlimits = [0 5000];
    ylimits = [0 0.1];
elseif strcmp(typeOfTransformation, 'DCT')
    freqs = { S3_DCT_freq, S5_DCT_freq, S1_DCT_freq, S10_DCT_freq };
    freqSPE = { S3_DCT_SPE, S5_DCT_SPE, S1_DCT_SPE, S10_DCT_SPE };
    xlimits = [0 6000];
    ylimits = [0 0.0004];
else
    freqs = { S3_FFT_freq, S5_FFT_freq, S1_FFT_freq, S10_FFT_freq };
    freqSPE = { S3_FFT_SPE, S5_FFT_SPE, S1_FFT_SPE, S10_FFT_SPE };
    xlimits = [0 5000];
    ylimits = [0 0.1];
end
h = plotSpectraInSubplots(freqs, freqSPE, titles, xlimits, ylimits);
pause;

file = fullfile(projectPath, ['samples_freqSpectrum_' typeOfTransformation '.png']);
print(h,'-dpng', '-r600', '-noui', file);
close(h);

%% Plotting: comparing spectra of mixed with filtered and original samples
%
%load('Output_step2.mat');
typeOfTransformation = 'FFT';
titles = {'mixed sample', 'filtered sample', 'original sample'};
if strcmp(typeOfTransformation, 'DFT')
    freqSPE = { S10_DFT_SPE, S10_DFT_filtered_SPE, S1_DFT_SPE };
    freqs = S10_DFT_freq;
elseif strcmp(typeOfTransformation, 'DCT')
    freqSPE = { S10_DCT_SPE, S10_DCT_filtered_SPE, S1_DCT_SPE };
    freqs = S10_DCT_freq;
else
    freqSPE = { S10_FFT_SPE, S10_FFT_filtered_SPE, S1_FFT_SPE };
    freqs = S10_FFT_freq;
end
h = compareSpectra(freqSPE, freqs, titles, typeOfTransformation);
pause;

file = fullfile(projectPath, ['comparison_filteredFreqSpectrum_' typeOfTransformation '.png']);
print(h,'-dpng', '-r600', '-noui', file);
close(h);

%% Plotting: comparing mixed sample with filtered and original samples
%
%load('Output_step2.mat');
typeOfTransformation = 'FFT';
titles = {'mixed sample', 'filtered sample', 'original sample'};
samplingRates = {Fs10, Fs10, Fs1};
if strcmp(typeOfTransformation, 'DFT')
    samples = {S10, S10_DFT_filtered_Inversed, S1};
elseif strcmp(typeOfTransformation, 'DCT')
    samples = {S10, S10_DCT_filtered_Inversed, S1};
else
    samples = {S10, S10_FFT_filtered_Inversed, S1};
end
h = compareSamples(samples, samplingRates, titles, typeOfTransformation);
pause;

file = fullfile(projectPath, ['comparison_filteredSample_' typeOfTransformation '.png']);
print(h,'-dpng', '-r600', '-noui', file);
close(h);

%%
Implementation of DFT, DCT, FFT: computeTransformation.m
function [Y, f, fSPE] = computeTransformation(X, Fs, typeOfTransform)
%
m = length(X);         % sample length
n = pow2(nextpow2(m)); % length used for transformation

X = cat(1, reshape(X, m, 1), zeros(n-m, 1)); % zero-pad to length n
Y = zeros(n, 1);

if strcmp(typeOfTransform, 'DFT') % perform DFT
    for k = 0:n-1
        Y(k+1) = exp(-2*pi*1i*(0:n-1)/n*k)*X;
    end
elseif strcmp(typeOfTransform, 'DCT') % perform DCT (DCT-II)
    for k = 0:n-1
        Y(k+1) = cos(pi/n*(0.5:n)*k)*X;
    end
    Y = [Y(1)/sqrt(n); Y(2:end)/sqrt(n/2)]; % orthonormal scaling for DCT
else
    Y = fft(X,n); % perform built-in FFT
end

f = Fs/n *(0:(n-1)/2)';      % one-sided frequency axis
fSPE = calculateSpectrum(Y); % frequency spectrum
%
end
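For intuition on why the project favors the FFT: the explicit DFT loop above costs O(n²) operations, while the FFT costs O(n log n), yet both produce identical coefficients. A quick cross-check of that equivalence (a Python/NumPy stand-in for the MATLAB loop, not part of the project code):

```python
import numpy as np

def naive_dft(x):
    """Direct O(n^2) DFT, mirroring the MATLAB loop
    Y(k+1) = exp(-2*pi*1i*(0:n-1)/n*k)*X."""
    n = len(x)
    Y = np.zeros(n, dtype=complex)
    for k in range(n):
        Y[k] = np.exp(-2j * np.pi * np.arange(n) * k / n) @ x
    return Y

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
# the direct sum and the FFT agree to floating-point precision
same = np.allclose(naive_dft(x), np.fft.fft(x))
```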
Implementation of IDFT, IDCT, IFFT: computeInverseTransformation.m
function X = computeInverseTransformation(Y, typeOfTransform)
%
n = length(Y); % length used for transformation
X = zeros(n, 1);

if strcmp(typeOfTransform, 'DFT') % perform inverse DFT
    for k = 0:n-1
        X(k+1) = 1/n*exp(2*pi*1i*(0:n-1)/n*k)*Y;
    end
elseif strcmp(typeOfTransform, 'DCT') % perform inverse DCT (DCT-III)
    Y = [Y(1)/sqrt(n); Y(2:end)/sqrt(n/2)]; % same orthonormal scaling as the forward DCT
    for k = 0:n-1
        % transposed cosine basis: the integer index is now the summation index
        X(k+1) = cos(pi/n*(k+0.5)*(0:n-1))*Y;
    end
else
    X = ifft(Y,n); % perform built-in inverse FFT
end
%
end
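The DCT branch relies on the DCT-II and DCT-III being mutual inverses under the orthonormal scaling used above. A small round-trip check of exactly that pair (a Python/NumPy sketch mirroring the MATLAB formulas; the function names are ours):

```python
import numpy as np

def dct2(x):
    """Orthonormal DCT-II, as in computeTransformation.m."""
    n = len(x)
    k = np.arange(n)[:, None]        # output (integer) index
    j = np.arange(n)[None, :] + 0.5  # input (half-integer) index
    Y = np.cos(np.pi / n * j * k) @ x
    s = np.full(n, np.sqrt(2.0 / n))
    s[0] = np.sqrt(1.0 / n)          # scaling [1/sqrt(n); 1/sqrt(n/2)]
    return s * Y

def idct2(Y):
    """Orthonormal DCT-III, the transpose (= inverse) of dct2."""
    n = len(Y)
    s = np.full(n, np.sqrt(2.0 / n))
    s[0] = np.sqrt(1.0 / n)
    j = np.arange(n)[:, None] + 0.5  # half-integer index now labels the output
    k = np.arange(n)[None, :]        # integer index is summed over
    return np.cos(np.pi / n * j * k) @ (s * Y)

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
roundtrip = idct2(dct2(x))  # should recover x exactly (up to rounding)
```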
Mixing two Samples: mixingSamples.m
function [x, sr] = mixingSamples(x1, sr1, x2, sr2, ratio, sr)
%
m1 = length(x1);
m2 = length(x2);

if sr > min(sr1, sr2)
    sr = min(sr1, sr2); % never resample above the lower input rate
end

t1 = 0:1/sr1:((m1-1)/sr1);
t2 = 0:1/sr2:((m2-1)/sr2);
t = 0:1/sr:min(t1(end), t2(end)); % common duration avoids spline extrapolation

x = spline(t1, x1, t') + spline(t2, x2, t') * ratio;
end
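The same resample-and-add logic can be written compactly in Python/NumPy. Here linear interpolation via `np.interp` stands in for MATLAB's `spline`, and `mix_samples` is our illustrative name:

```python
import numpy as np

def mix_samples(x1, sr1, x2, sr2, ratio, sr):
    """Resample both signals onto a common time grid and add them,
    weighting the second signal by `ratio` (cf. mixingSamples.m)."""
    sr = min(sr, sr1, sr2)                    # never exceed the lower input rate
    t1 = np.arange(len(x1)) / sr1
    t2 = np.arange(len(x2)) / sr2
    dur = min(t1[-1], t2[-1])                 # overlap only, no extrapolation
    t = np.arange(0, dur + 0.5 / sr, 1 / sr)  # common grid at rate sr
    return np.interp(t, t1, x1) + ratio * np.interp(t, t2, x2), sr

# mixing two constant signals at the same rate just adds the scaled levels
mixed, out_sr = mix_samples(np.ones(100), 1000, np.ones(100), 1000, 0.5, 1000)
```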
Self-designed Filter: designDeNoisingFilter.m
function filter = designDeNoisingFilter(fSample, fNoise)
%
fSample_amp = abs(fSample)/length(fSample);
fNoise_amp = abs(fNoise)/length(fNoise);

% binary mask: keep only the bins where the normalized amplitude spectrum
% of the noisy sample exceeds that of the noise reference
amp_diff = fSample_amp/sum(fSample_amp) - fNoise_amp/sum(fNoise_amp);
filter = amp_diff > 0;
%
end
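The filter is thus a binary spectral mask. Its effect can be demonstrated on a toy signal (a Python/NumPy sketch of the same idea, not the project's MATLAB code; the 50 Hz tone, noise level, and sampling rate are arbitrary example choices):

```python
import numpy as np

def design_denoising_filter(f_sample, f_noise):
    """Binary spectral mask, as in designDeNoisingFilter.m: keep a frequency
    bin only where the normalized amplitude spectrum of the noisy sample
    exceeds that of the noise reference."""
    a = np.abs(f_sample) / len(f_sample)
    b = np.abs(f_noise) / len(f_noise)
    return (a / a.sum() - b / b.sum()) > 0

# toy setup: a 50 Hz tone buried in white noise, 1 kHz sampling rate
fs = n = 1000
t = np.arange(n) / fs
rng = np.random.default_rng(1)
tone = np.sin(2 * np.pi * 50 * t)
noise = 0.5 * rng.standard_normal(n)
mixed = tone + noise

mask = design_denoising_filter(np.fft.fft(mixed), np.fft.fft(noise))
clean = np.real(np.fft.ifft(np.fft.fft(mixed) * mask))
```

Because the tone concentrates its energy in two bins while the noise spreads evenly across the spectrum, the mask keeps essentially only the tone bins, and `clean` lies far closer to the original tone than `mixed` does.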