Chair of Real-Time Computer Systems
Department of Electrical and Computer Engineering
Technical University of Munich
Requirements and Partitioning of
Otoacoustic Emission Measurement
Algorithms
Anforderungen und Aufteilung von
Messalgorithmen für Otoakustische
Emissionen
Requisitos y Particionado de Algoritmos
para la Medida de Emisiones Otoacústicas
Master’s Thesis
Supervised by Prof. Dr. sc. Samarjit Chakraborty
Chair of Real-Time Computer Systems
Department of Electrical and Computer Engineering
Technical University of Munich
Advisor Nils Heitmann
Author Rodrigo Hernangomez Herrero
Submitted on September 22, 2017
This thesis was typeset using the XeTeX
typesetting system developed by Jonathan
Kew.
Declaration of Authorship
I, Rodrigo Hernangomez Herrero, declare that this thesis titled “Requirements and Partition-
ing of Otoacoustic Emission Measurement Algorithms” and the work presented in it are my
own unaided work, and that I have acknowledged all direct or indirect sources as references.
This thesis was not previously presented to another examination board and has not been
published.
Signed:
Date:
Abstract
Otoacoustic Emissions (OAEs) are a technique for the objective diagnosis of hearing impairment.
Their field of application extends primarily to cases where the patient cannot actively cooperate
in the clinical intervention, such as hearing screening of neonates.
Because of the high cost of professional equipment, smartphones arise as a newer, cheaper
tool to perform such tests. The objective of this Master's Thesis is to analyze the computational
requirements of an embedded OAE screening system given a set of medical specifications.
This device communicates with a smartphone either via USB or through a wireless
protocol, which adds the possibility of partitioning the algorithm between both systems.
The analysis involves measuring power consumption and real-time performance of the
embedded system for different settings and implementation variants. Models can be built
from the experimental results, so that the profiled parameters can be linked to performance.
These models may in turn be used to find the set of hardware and software parameters
that fulfills the application requirements in an optimal way.
ETSIT
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN
Requisitos y Particionado de
Algoritmos para la Medida de
Emisiones Otoacústicas
Chair of Real-Time Computer Systems
Department of Electrical and Computer Engineering
Technical University of Munich
Tutor Nils Heitmann
Autor Rodrigo Hernangomez Herrero
Munich, 27 de septiembre de 2017
Resumen
Se conoce como emisiones otoacústicas (OAE por sus siglas en inglés) a una serie de técnicas
para el diagnóstico objetivo de la discapacidad auditiva de una persona. El ámbito de
aplicación de éstas abarca primordialmente aquellas situaciones en las que el paciente no
puede participar activamente en la evaluación clínica, como sucede en el caso de la revisión
médica de neonatos.
El elevado coste de los equipos clínicos profesionales ha propiciado la aparición de los
smartphones como una nueva herramienta que puede desempeñar esta labor de forma más
asequible. En este contexto, el objetivo que este Trabajo Fin de Máster persigue es el análisis
de los requisitos computacionales de un sistema empotrado para la evaluación de OAE bajo
un conjunto de especificaciones médicas. Tal dispositivo deberá comunicarse vía USB o
inalámbricamente con un smartphone, lo que añade la perspectiva de particionar el algoritmo
entre los dos sistemas.
El enfoque escogido para llevar a cabo el análisis comprende la medida del consumo energético
y el rendimiento en tiempo real del sistema empotrado para diferentes ajustes y variantes de
implementación. A raíz de los resultados experimentales se puede construir un modelo que
relacione los parámetros examinados con el rendimiento del sistema. A su vez, esto puede
ser usado para hallar el mejor conjunto de parámetros hardware y software que cumplan los
requisitos de la aplicación de forma óptima.
Contents
List of Figures X
List of Acronyms XI
1. Introduction 1
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2. Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3. Document Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2. Background 7
2.1. Engineering Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2. OAE Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3. Platform Architecture 18
3.1. Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2. Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4. Experiments 28
4.1. Physical setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2. DPOAE profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3. Impact of clock frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4. Averaging schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5. FFT and Goertzel algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6. Audio codec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5. Case scenarios 47
5.1. Global model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2. Systematic setting of parameters . . . . . . . . . . . . . . . . . . . . . . . 48
6. Conclusions 56
6.1. Future development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
A. Appendices 58
A.1. DPOAE with USB. Current and CPU Load Profile . . . . . . . . . . . . . 58
A.2. DPOAE with USB. Energy and Partitions Profile . . . . . . . . . . . . . . 61
A.3. Averaging Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.4. FFT and Goertzel Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
List of Figures
1.1. Global estimates on prevalence of hearing loss . . . . . . . . . . . . . . . . 2
1.2. Prevalence of Disabling Hearing Loss vs. GNI per capita . . . . . . . . . . 2
2.1. TEOAE recording . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2. DPOAE recording . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3. OAE system block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1. Picture of the whole hardware platform . . . . . . . . . . . . . . . . . . . 20
3.2. Host vs. Device DPOAE detection . . . . . . . . . . . . . . . . . . . . . . 25
3.3. DPOAE Partition Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1. Schematic diagram of experiment configuration . . . . . . . . . . . . . . . 30
4.2. DPOAE profiling capture . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3. USB test. Sampling frequency performance . . . . . . . . . . . . . . . . . 32
4.4. USB test. Buffer length performance . . . . . . . . . . . . . . . . . . . . . 33
4.5. USB test. Sample size performance . . . . . . . . . . . . . . . . . . . . . . 34
4.6. USB test. Partition performance . . . . . . . . . . . . . . . . . . . . . . . 35
4.7. Test without USB. fclk = 48 MHz . . . . . . . . . . . . . . . . . . . . . . . 35
4.8. Impact of clock frequency on current consumption for averaging partition 38
4.9. Goertzel and FFT time performance . . . . . . . . . . . . . . . . . . . . . 45
5.1. BLE consumption current . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
List of Acronyms
ABR Auditory Brainstem Response.
ADC Analog-to-digital Converter.
AR Artifact Rejection.
BLE Bluetooth Low Energy.
CDC Communication Device Class.
CMA Cumulative Moving Average.
CPU Central Processing Unit.
DAC Digital-to-analog Converter.
dB Decibel.
dBFS Decibels relative to Full Scale.
DFT Discrete Fourier Transform.
DMA Direct Memory Access.
DPOAE Distortion Product Otoacoustic Emission.
DSP Digital Signal Processor.
FFT Fast Fourier Transform.
FPU Floating Point Unit.
GNI Gross National Income.
I2C Inter-Integrated Circuit.
I2S Inter-IC Sound.
MCU Microcontroller Unit.
MIMD Multiple Instruction Multiple Data.
OAE Otoacoustic Emission.
OHC Outer Hair Cells.
PLL Phase-Locked Loop.
RAM Random Access Memory.
RISC Reduced Instruction Set Computer.
SIMD Single Instruction Multiple Data.
SISD Single Instruction Single Data.
SNR Signal-to-noise Ratio.
SPL Sound Pressure Level.
TEOAE Transient Evoked Otoacoustic Emission.
USART Universal Synchronous/Asynchronous Receiver/Transmitter.
USB Universal Serial Bus.
WHO World Health Organization.
“La patria es la familia y los amigos. Heimat heißt Familie und Freunde.
Homeland is your family and your friends.”
A mi familia, lejos y cerca. A los ausentes, que siempre llevo conmigo. A
los presentes, que siempre me faltan. A todo aquello que permanece firme
cuando lo demás se tambalea. Al inmerecido orgullo que me profesan, pues
a sus enseñanzas me debo enteramente.
A mis raíces, mi pueblo y los amigos que siempre esperarán a los madrileños
en verano.
Al colegio San Viator, pilar desde donde me apoyo para lograr mis metas. A
CAL y a la Parroquia Virgen de la Fuensanta, a mis hermanos de comunidad
con los que fe y amistad se funden en un abrazo hasta volverse indistinguibles.
A Dios, y a cada trocito de Dios que hay en las personas con las que me
encuentro, en la vida que me rodea y en los objetivos que persigo.
A la Universidad Politécnica de Madrid, y en especial al grupo de investigación
B105. Sois alegría, pasión y talento sin complejos.
To all the amazing international people that I have met in Munich. It is so
unfair that our paths diverged so soon, when we were just starting to realize
what a wonderful person we had in front of us.
Meinen lieben Willis im Wohnheim und sonstigen deutschen Freunden. Man
redet von Muttersprache, weil man sie in der Familie lernt. Deutsch ist ja
überhaupt nicht meine Muttersprache, aber diese Idee berechtigt mich irgendwie
zu sagen, dass Ihr fast wie eine Familie für mich hier seid. Ihr habt mir
ganz viele Worte beigebracht, darunter steht aber „Integration“ als meine
Lieblingsvokabel.
Der Technischen Universität München. Da konnte ich spüren, wie glücklich
ich bin, an so einer ausgezeichneten Universität studieren zu können,
wo Unterricht, Forschung und Studenten die Priorität sind. Dem Lehrstuhl
für Realzeit-Computersysteme, und allen seinen Mitarbeitern und Studenten.
Danke für die Gelegenheit, mit Euch arbeiten und von Euch bei der
Masterarbeit unterstützt werden zu können.
1. Introduction
“Alcanza la excelencia y compártela.”
“Achieve excellence and share it.”
Saint Ignatius of Loyola
In this first chapter, a presentation of this thesis’ topic is provided. The motivation
for the research on the topic will be discussed, as well as the intended outcomes and
the contribution to the scientific and global community. At the end of the chapter
a brief explanation of this document’s organization can be found.
Motivation
According to the World Health Organization (WHO), there are around 360 million
people in the world with disabling hearing loss, amounting to 5.3% of the
world's population. The prevalence of this condition is unequally distributed across
the globe: while it affects 4.9% of male adults in Western Europe, North America,
Oceania and Pacific Asia, this number doubles in South Asia, as Figure 1.1 shows.
In fact, 9% of male adults and 8.8% of female adults suffer from hearing loss in
countries such as India, Afghanistan, Pakistan or Bangladesh, making this region
the most affected in the world [1].
Deafness also has a strong impact among infants and children. 32 million people
between 0 and 14 years old are estimated to be partially or totally deaf worldwide,
which represents 1.7% of the overall child population and 9% of the whole
affected population. Again, South Asia is the region where this problem strikes
hardest, with a prevalence for this age group of around 2.4%. As a matter
of fact, there is a correlation between the average Gross National Income (GNI) per
capita of a region and its prevalence of disabling hearing loss, both for children and
for adults (see Figure 1.2).
[Figure 1.1 (pie chart): share of the affected population by region. South Asia: 27%, East Asia: 22%, High-income: 11%, Asia Pacific: 10%, Central/East Europe and Central Asia: 9%, Sub-Saharan Africa: 9%, Latin America & Caribbean: 9%, Middle East and North Africa: 3%.]
Figure 1.1.: Global estimates on prevalence of hearing loss. Data source: [1]
[Figure 1.2, panel (a): prevalence of disabling hearing loss for children up to 14 years old vs. average GNI per capita (thousands of US dollars). High income: 0.5%, Central/East Europe and Central Asia: 1.6%, Sub-Saharan Africa: 1.9%, Middle East and North Africa: 0.92%, South Asia: 2.4%, Asia Pacific: 2%, Latin America and Caribbean: 1.6%, East Asia: 1.3%; fitted trend y = 0.0266x^-0.334.]
(a) Prevalence in children
[Panel (b): prevalence of disabling hearing loss for adults over 65 years old vs. average GNI per capita (thousands of US dollars). High income: 18%, Central/East Europe and Central Asia: 36%, Sub-Saharan Africa: 44%, Middle East and North Africa: 26%, South Asia: 48%, Asia Pacific: 43.5%, Latin America and Caribbean: 39%, East Asia: 34%; fitted trend y = 0.5212x^-0.208.]
(b) Prevalence in adults older than 65 years
Figure 1.2.: Prevalence of Disabling Hearing Loss vs. GNI per capita. Image source: [1]
While hearing loss is a severe handicap for people of all ages and conditions, newborns
and children suffer its consequences most critically. Hearing is, together with
vision, the most essential human sense, and it takes an active part in child
development regardless of region and culture. The most notable aspect of this
is language acquisition, which is typically accomplished through speech. In cases
where hearing impairment hinders or even prevents the acquisition of oral language
skills, early intervention is vital to mitigate the adverse effects that may follow,
including academic dropout or even social exclusion.
In fact, several studies point out a significant correlation between the age of
enrollment in such intervention programs and the degree of language development
gained through them. In particular, [2] concludes that children who are enrolled
by 11 months of age or earlier exhibit a degree of vocabulary and verbal skills at
the age of 5 that approximates that of children without hearing disabilities, no
matter the extent of their impairing condition, whereas children enrolled later score
lower on such metrics. This serves as evidence of the importance of early
identification and its impact on the child's developmental success.
In order to achieve correct early identification, a potentially deaf child must be
provided with the right diagnostic tools. As discussed above, hearing loss should
be detected at an early stage, where the condition can only be determined with a
medical approach through screening and diagnostic tests. In such a context, a
concise explanation is required of the different types of hearing loss tests and
which of them best suit the target patient.
Subjective and Objective Hearing Testing
Over the years, medical doctors have come up with different techniques to assess
whether a patient has hearing difficulties. These techniques are most usefully
classified according to the degree of involvement of the patient in the test itself,
leading to the categories of subjective and objective hearing testing.
Subjective tests require the patient to react to some kind of stimulus in an active
way. Pure-tone audiometry is the most common example of this category: the
subject is presented with a set of audio tones played to their ears and must
acknowledge having perceived each tone. Nevertheless, there are other subjective
tests with diagnostic significance, such as speech testing or reflex audiometry.
In the latter, some kind of reflexive behavior is sought in response to short, loud,
narrow-band sounds, avoiding the need for an agreed code to confirm a successful
stimulus perception.
On the other hand, objective tests are performed without cooperation from the
patient. Instead, they rely on physiological characteristics of the hearing system
to detect hearing loss. Both Otoacoustic Emissions (OAEs) and Auditory Brainstem
Responses (ABRs) are examples of this group. In OAEs, a probe fits into the ear
and emits certain sounds to stimulate the inner ear. This generates an acoustic
response that is recorded and analyzed by the probe to screen for hearing
impairment. As for ABRs, brain electrical activity is measured in the presence of
sound stimulation through electrodes attached to the subject's head.
While some subjective approaches, namely reflex audiometry, are suitable for newborn
and child screening, objective tests are usually preferred. Of those, ABRs
present the best performance, as they are capable of determining the hearing
threshold. OAEs only assess the state of the inner ear, and a failure to detect such
emissions may have causes other than hearing impairment, such as a noisy
environment. The drawback of ABRs is the setup of the test, which involves the
use of electrodes. In this regard, OAEs are much less invasive and faster, which
is the reason why they are typically used in newborn screening at hospitals.
OAEs' less costly setup also makes them an interesting choice for less developed
areas, as they require less equipment, although the overall cost of such clinical
devices may still be out of reach in several territories. Taking into account the
discussion at the beginning of the chapter and the numbers in Figure 1.2, it seems
sensible that if deafness is regarded as a global problem, then the stress should
be put on the most affected regions. These happen to be rather underdeveloped,
underprivileged areas as well, which adds more value to the different OAE techniques
as influential actors in the worldwide fight for the integration of hearing-impaired
people.
Hardware requirements for OAE screening devices are minimal, yet the characteristics
of their different components must be outstanding, which can drive the price
of commercial devices to around US$3,000 [3]. It would therefore be desirable
to find a way to bring this cost down in order to spread the use of this
screening technique beyond hospitals into health care centers, both in developed and
underdeveloped countries.
Objectives
For all the above reasons, the major objective of this work is to bring down the
cost of OAE screening as a meaningful step toward the worldwide spread of early
identification of disabling hearing loss and early intervention.
There are different approaches by which this can be achieved. The most classical one
would be to start from a commercial OAE device and cut down functionality and
performance in the hope that a cheaper, yet still clinically suitable, device
could come out of it.
The generalized use of smartphones, nevertheless, might provide a powerful tool that
takes this idea even further. Assuming the availability of a smartphone or tablet
(from now on, both will be referred to with the generic term host) in most health care
centers and hospitals, it is possible to take advantage of its existing hardware to
shorten the bill of materials of the desired device. This idea has been widely
exploited over the past years, with plenty of functionalities increasingly migrated
to such devices in the form of apps. It is important to remark here that,
according to the Ericsson Mobility Report of June 2017, the number of
smartphone subscriptions in 2016 was 3.9 billion, and it is expected to reach 6.8
billion in 2020, which would roughly represent 88% of the world's population at
that point [4].
However, the host on its own is not prepared to perform such a test. As will be
explained in the following sections, an ear probe with a certain number of loudspeakers
and microphones is required to perform OAE screening and diagnosis. This
demands the design of an external device (from now on, simply device) that helps
the host with its task of ear stimulation, response recording and emission detection.
Such a design leads to a set of interesting questions:
• What is the best technology to connect device and host? More specifically,
should both elements be wired together or connected through a wireless
link?
• OAE tests comprise different processing stages. Which of them should be
performed by the device, and which of them should be executed in the host?
• What are the hardware requirements of the device? If hardware requirements
are fixed, what are the capabilities of the device?
These three questions are interdependent. Hardware constraints such as Random
Access Memory (RAM), clock frequency or the sampling rate of analog-to-digital
conversion have an impact on software development, which in turn affects power
consumption. The same goes for host communication and the distribution of the
algorithm between host and device. By profiling the code through all possible
options, insight into the issue can be gained and the best options can be selected.
This will help to achieve maximum efficiency and optimal use of resources while
complying with all application requirements.
Assumptions
In order to elaborate a systematic and scientific discourse that helps to answer these
questions, some aspects of the study will be delimited and some assumptions will be
made:
• OAEs will be the basis for the tests. In particular, the focus will be set on a
subcategory of them called “Distortion Product Otoacoustic Emissions”. More
information about OAEs and their subtypes can be found in the next chapter.
• Among all the hardware components that can be found inside an OAE device,
the microcontroller will be the major subject of study. This includes all the
typical parameters associated with the choice of a microcontroller (architecture,
clock rate, memories), as well as the software structure that resides in it. Other
important elements, including the audio codec, the microphone or the loudspeakers,
will not undergo such research and will only be briefly addressed.
• It would be important to look into the clinical performance of the device
to make an overall assessment. Nevertheless, due to the lack of patients for
a meaningful clinical study and the scope limitation to the microcontroller,
this aspect cannot be successfully evaluated. Therefore, this report will deal
with those technical aspects strictly related to electrical and electronic
engineering, which is in any case the expected field of study.
Document Organization
The discussion in this document is structured into a series of blocks. The first of
them is an introduction that gives an overview of the problem of hearing impairment
and the benefits of OAE procedures to fight against it. It leads to the work of this
thesis to find the best settings for a smartphone-driven device to accomplish OAE
screening.
Before starting with the actual body of work, the background chapter provides basic
knowledge in the areas of embedded systems, signal processing and audiology to
understand the overall discussion within the document. The platform architecture
for tests’ hardware and software framework are also described prior to moving on to
the experiment section. Here all the profiling tests that have been carried out are
detailed, as well as the lessons that can be learned from them.
Finally, the information gained during the experiments is applied through some case
scenarios, which leads to conclusions that answer the initial questions. Possible
future work for this thesis is addressed as well.
2. Background
“El que lee mucho y anda mucho, ve
mucho y sabe mucho.”
“He who reads much and walks much,
sees much and knows much.”
Miguel de Cervantes, Don Quixote
The aim of this chapter is to provide familiarity with some ideas that are vital for
understanding the thesis. Although an electrical engineering background is a
prerequisite to understand the whole work, some notions from this field are explained
here for convenience. Medical concepts around OAEs are discussed in more depth,
as they fall outside the usual background of similar works.
Engineering Background
This whole thesis revolves around the development of a medical embedded device
from a technical standpoint. Thus, a certain degree of familiarity with the underlying
technology of such a device is desirable to fully understand all the explained concepts.
In the case under study, a Microcontroller Unit (MCU), or simply microcontroller, is
used to perform some of the processing. Accordingly, some explanation about signal
processing and about microcontrollers is provided.
Signal Processing
The MCU uses mathematical operations to extract the desired features out of the
recorded signals. As will be discussed later, the fundamental information to be
extracted is the frequency spectrum of the signal, which can be obtained through
the Fourier Transform.
The Fourier Transform is a linear transformation that decomposes a signal into its
complex frequency components. As this transformation is defined for continuous
signals, it is not directly applicable on a digital platform like an MCU. The Fourier
Transform's discrete version, the Discrete Fourier Transform (DFT), is used instead.
Actually, what is used in this and in most contexts is the DFT's most efficient
implementation, the Fast Fourier Transform (FFT). The particularity of this algorithm
is its speedup over the straightforward one: while the complexity of a naive
implementation of the DFT is quadratic (O(n²)), the FFT's complexity is quasilinear
(O(n log n)) [5].
The DFT takes N sampled values (these samples are generally complex-valued, although
for many applications the input data is real-valued) and returns N complex values.
These values display the frequency spectrum in the interval [0, fs], fs being the
sampling rate. Thus, the DFT represents a signal's spectrum at a certain frequency
resolution, that is, adjacent samples describe frequency components that differ by
some ∆f. This ∆f is related to N and fs as follows:

∆f = fs / N    (2.1)
There is yet another complementary method to extract frequency components, named
the Goertzel algorithm. In this variant, a single DFT term is calculated through a
digital filter. Consequently, if M terms from a signal of length N must be calculated,
the process is repeated for M different filters, leading to a complexity of O(MN).
Although this is asymptotically less efficient than the FFT, for a small number of
terms M it is indeed faster [5].
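As a concrete illustration, the sketch below (written in Python for readability; an MCU implementation would use C with fixed- or floating-point types) compares a naively computed DFT bin with the same bin obtained through the Goertzel recurrence. The tone frequency, sampling rate and buffer length are arbitrary example values, not settings from the actual device.

```python
import cmath
import math

def dft_bin(x, k):
    """Naive DFT: the k-th frequency bin of x, computed directly (O(N) per bin)."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))

def goertzel_bin(x, k):
    """The same k-th bin via the Goertzel second-order recurrence."""
    N = len(x)
    w = 2.0 * math.pi * k / N
    coeff = 2.0 * math.cos(w)
    s1 = s2 = 0.0
    for sample in x:                  # N real multiply-accumulate iterations
        s = sample + coeff * s1 - s2
        s2, s1 = s1, s
    s = coeff * s1 - s2               # one final zero-input iteration
    return s - cmath.exp(-1j * w) * s1

# Example: a 1 kHz tone sampled at fs = 8 kHz with N = 64 samples.
# The frequency resolution is Δf = fs/N = 125 Hz, so the tone falls
# exactly on bin k = f0/Δf = 8.
fs, N, f0 = 8000.0, 64, 1000.0
x = [math.sin(2.0 * math.pi * f0 * n / fs) for n in range(N)]
k = round(f0 * N / fs)
print(abs(dft_bin(x, k)), abs(goertzel_bin(x, k)))    # both ≈ N/2 = 32.0
```

For a handful of bins, as in DPOAE detection where only a few fixed frequencies matter, the O(MN) cost of Goertzel undercuts a full O(N log N) FFT.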
In many cases, the interest in these complex values lies only in their magnitude. The
absolute value is calculated from them and is often transformed into a logarithmic
scale, whose unit is called the Decibel (dB).
For audio signals, this logarithmic scale is usually referenced to the value
p0 = 20 µPa, which is considered the threshold of human hearing. The resulting
quantity receives the name Sound Pressure Level (SPL). Humans can perceive
sounds between 0 and 120 dB [6].
In the digital domain, numbers have a finite range within which overflow is avoided. In
such a context, a maximum-amplitude signal exists, which is normally used as a
reference value. dB units become in this case Decibels relative to Full Scale (dBFS),
whose maximum value is 0, as a signal's amplitude cannot be greater than this reference.
A last interesting signal theory concept is the Signal-to-noise Ratio (SNR). It is again
a logarithmic value, but expressed as the difference between a dB value representing
the signal level and another dB value representing the noise level. If this value is
converted back into linear units, it becomes the quotient between both quantities.
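These level conversions can be sketched in a few lines; the 16-bit full-scale value and the example amplitudes below are illustrative, not taken from the actual device.

```python
import math

FULL_SCALE = 32768.0          # full-scale amplitude of a signed 16-bit sample

def dbfs(amplitude):
    """Level in Decibels relative to Full Scale; 0 dBFS at full scale."""
    return 20.0 * math.log10(amplitude / FULL_SCALE)

def snr_db(signal_level_db, noise_level_db):
    """SNR as the difference of two levels already expressed in dB."""
    return signal_level_db - noise_level_db

signal_db = dbfs(16384.0)     # half of full scale: about -6.02 dBFS
noise_db = dbfs(164.0)        # a much weaker noise floor
snr = snr_db(signal_db, noise_db)
# Converted back to linear units, the SNR becomes a plain amplitude quotient:
ratio = 10.0 ** (snr / 20.0)  # ≈ 16384/164 ≈ 100
print(round(signal_db, 2), round(snr, 1))
```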
Important Aspects on Microprocessors
Architectures
Each MCU has an underlying architecture that defines its capabilities and limitations.
Here a remark must be made about the difference between the words
microcontroller and microprocessor. In a general sense, a microcontroller is a
microprocessor with several input and output interfaces to communicate with peripherals,
which can even be integrated into it. Thus, in the context of embedded systems the
former term is often preferred over microprocessor, as this is the common scenario. In the
scope of this work both terms will be used interchangeably.
In terms of parallelism, different architectures can be identified. Simpler processors
with only one core fall into the category Single Instruction Single Data (SISD),
while multiprocessors are normally built on a Multiple Instruction Multiple Data
(MIMD) architecture. In the first variety, a single core fulfills a task by operating
with a unique set of instructions on the same data. When more than one core is
available, the task is decomposed into multiple sets of instructions that are used
by the different cores to process data in a distributed way [7].
There are, however, specialized microprocessors with specific additional features.
Some of them are known as Digital Signal Processors (DSPs), and they are commonly
found in multimedia applications where heavy signal processing is required. DSPs
have an instruction set with some special operations such as filtering or multiply-
accumulate, and they often exhibit a Single Instruction Multiple Data (SIMD)
architecture. This enables them to use special instructions that perform the same
operation (e.g. an addition) on large vectors of data [7][8].
Another crucial facet of microcontrollers is fixed-point vs. floating-point arithmetic.
While fixed-point treats real data types roughly as integers, floating-point defines a
significand and an exponent, which sacrifices some precision in exchange for a much
wider dynamic range.
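The difference can be illustrated with a toy Q15 fixed-point format (15 fractional bits, a common choice on 16-bit DSP data paths); this is a generic sketch, not the representation used by any particular MCU in this work.

```python
Q = 15                            # Q15: reals in [-1, 1) scaled by 2**15

def to_q15(x):
    """Quantize a real number in [-1, 1) to a Q15 integer."""
    return int(round(x * (1 << Q)))

def q15_mul(a, b):
    """Product of two Q15 values is Q30; shift back to Q15. Precision is lost
    in the discarded low bits, and results outside [-1, 1) cannot be held."""
    return (a * b) >> Q

a, b = to_q15(0.5), to_q15(0.25)
product = q15_mul(a, b) / (1 << Q)
print(product)                    # 0.125, exactly representable in Q15
```

A float, by contrast, keeps a roughly constant number of significant digits over an enormous range of magnitudes, at the cost of non-uniform absolute precision.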
Power and Energy Consumption
One of the most important aspects of the MCU's performance is energy consumption.
Especially in a wireless scenario, where the OAE device must be operated on
batteries, a power analysis is vital to determine its autonomy and its final cost.
As can be found in the classical literature on the topic, dynamic power scales with
voltage and frequency [7]:

Power_dynamic ∝ (1/2) · V² · f    (2.2)

Frequency, however, is taken out of the equation for energy:
Energy_dynamic ∝ V²    (2.3)
According to these formulas, reducing the clock frequency will reduce power for a
certain task, but it will not have an impact on the overall energy consumption.
Nevertheless, the minimum required voltage does scale with frequency. This
means that an optimal pair of frequency and voltage values can be set, such that a
digital system fulfills its functional criteria while consuming as little as possible.
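Equations 2.2 and 2.3 can be made concrete with a small numeric sketch; the voltage and frequency operating points below are hypothetical, chosen only to show that lowering f alone leaves the task energy unchanged, while lowering V along with it does not.

```python
def dynamic_power(v, f):
    """Relative dynamic power, Power ∝ (1/2)·V²·f (Eq. 2.2); arbitrary units."""
    return 0.5 * v * v * f

def task_energy(v, f, cycles):
    """Energy for a fixed workload: time = cycles/f, so f cancels out and
    Energy ∝ V² (Eq. 2.3)."""
    return dynamic_power(v, f) * (cycles / f)

CYCLES = 1_000_000                            # hypothetical task length in cycles
e_48mhz_3v = task_energy(3.0, 48e6, CYCLES)
e_24mhz_3v = task_energy(3.0, 24e6, CYCLES)   # half the power, twice the time
e_24mhz_2v = task_energy(2.0, 24e6, CYCLES)   # lower voltage is what saves energy
print(e_48mhz_3v == e_24mhz_3v, e_24mhz_2v < e_48mhz_3v)   # True True
```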
Another interesting issue within energy consumption is what some sources call the
communication-computation trade-off [9]. Mostly used for wireless sensor networks,
this term refers to a trade-off between processing data inside a wireless node and
transmitting that data. As processing reduces the data size, which in turn implies
lower bandwidth requirements for wireless communication, the goal in the cited
sources is to find an optimal spot where total energy consumption is minimized.
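A minimal model of this trade-off can be written down directly; the per-unit energy costs below are purely illustrative placeholders, not measured values from the platform.

```python
# Illustrative per-unit energy costs; real values must come from profiling.
E_CPU_PER_SAMPLE = 0.2e-6     # J to process one sample on the device
E_TX_PER_BYTE = 4.0e-6        # J to transmit one byte over the radio

def total_energy(samples_processed, bytes_transmitted):
    """Device-side energy: local computation plus radio transmission."""
    return (samples_processed * E_CPU_PER_SAMPLE
            + bytes_transmitted * E_TX_PER_BYTE)

# Option A: stream 4096 raw 16-bit samples (8192 bytes), no local processing.
raw = total_energy(0, 8192)
# Option B: reduce the block on the device to a few spectral magnitudes.
reduced = total_energy(4096, 16)
print(reduced < raw)          # True under these illustrative costs
```

Which side of the trade-off wins depends entirely on the ratio of the two costs, which is precisely what the profiling experiments in this work set out to measure.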
Software Paradigms
In multimedia and signal processing applications, embedded software usually has to
cope with some challenging requirements like latency or throughput. Such issues are
often handled with a real-time programming approach.
For instance, when a continuous stream of sampled data must be processed, the processing time per sample must be shorter than the sampling period so that the whole process can run in real time. This can be accomplished by processing sample by sample, but it is sometimes more efficient, or even necessary, to gather a set of samples into a buffer and process them together. In that case the buffer processing time must be smaller than the time needed to acquire one complete buffer.
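A minimal feasibility check of this constraint, with hypothetical rates and cycle counts:

```python
# Real-time feasibility: buffer processing time must be shorter than
# buffer acquisition time. All numbers are hypothetical.
fs = 48_000              # sampling rate [Hz]
buffer_len = 1024        # samples per buffer
cycles_per_sample = 300  # processing cost per sample
cpu_freq = 48e6          # CPU clock [Hz]

t_acquire = buffer_len / fs                            # ~21.3 ms
t_process = buffer_len * cycles_per_sample / cpu_freq  # 6.4 ms

assert t_process < t_acquire  # the real-time deadline is met
```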
If samples are continuously being acquired or generated, double buffering is normally implemented. Two buffers are used as an interface between processing and input/output: at any instant, one of them is being processed while the other interfaces with the outside world. When all samples of the I/O buffer have been transferred, the two buffers swap roles. In this way, the two tasks (processing and transmission) never interfere with each other [8].
In order to accelerate sample transfers, Direct Memory Access (DMA) may be used. DMA moves data between memory and peripheral registers without intervention of the Central Processing Unit (CPU), which greatly simplifies software [8]. As a result, if the processing rate is higher than the sampling rate (which is mandatory for real-time applications), the CPU may remain idle for some time before the next buffer is ready to be processed, while the DMA keeps moving samples. MCUs are
normally equipped with low energy modes that deactivate modules and clocks that
are not needed in order to save energy.
In this specific case, the main clock and the CPU can be deactivated once processing
is finished while the DMA module keeps performing the background task along with
other peripherals. DMA must inform the CPU when a new buffer is ready to be
processed, which is typically accomplished through the use of interrupts.
If this scheme is followed, CPU load can be defined as the proportion of time when
the CPU is active.
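Under this scheme, CPU load can be sketched as follows; the buffer and processing times are hypothetical and the DMA is only modeled implicitly:

```python
# Double buffering with CPU load as the fraction of time the CPU is
# active. Times are hypothetical; the DMA implicitly fills the other
# buffer during each period while the CPU processes, then sleeps.
t_buffer = 1024 / 48_000  # time the DMA needs to fill one buffer [s]
t_process = 0.0064        # time the CPU needs to process one buffer [s]

def cpu_load(n_buffers):
    """Simulate n_buffers ping-pong periods and return the CPU load."""
    active = 0.0
    for _ in range(n_buffers):
        active += t_process  # CPU processes, then sleeps until the swap
    return active / (n_buffers * t_buffer)

assert abs(cpu_load(100) - 0.3) < 1e-9  # CPU idle 70% of the time
```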
OAE Basics
OAEs are a trusted technique in audiology and hearing screening. From the historical
perspective, their discovery can be traced back to Dr. David Kemp’s contribution.
The British physicist was the first person to measure these emissions in 1978, and a few years later the first applications as a diagnostic tool emerged. Over the decades, OAEs have gained popularity in the clinical world as an infant hearing screening method, and a whole market has arisen with a vast range of devices with different
capabilities and functionalities [6].
From the anatomical and physiological point of view, the principle of OAEs lies in the cochlea. This is a coiled tubular cavity inside the human inner ear where sound transduction occurs. That is, this is the organ responsible for translating audible
acoustic waves into electrochemical signals that the brain can process. The actual
process in which this is accomplished is complex and is totally out of scope for this
work, but for the sake of comprehension hair cells and the basilar membrane must
be discussed.
The basilar membrane is a resonant structure within the cochlea, whose physical
characteristics vary along its length. As a result, the resonance frequency at any point on its surface depends on the longitudinal location. Hair cells are distributed on this
membrane and they are stimulated by vibration, which only occurs when a wave
with a certain frequency activates the particular membrane region where a group
of hair cells stand. In a way, the basilar membrane, together with the hair cells,
maps acoustic frequency into spatial location, which is the basis for the perception
of sound in mammals [6].
Having said that, hair cells can be grouped into Inner Hair Cells and Outer Hair Cells
(OHC). Inner Hair Cells are the actual acoustic-electrochemical transducers, while
OHC participate in a so-called “active mechanism” within the hearing process. When
an audible acoustic wave travels through the ear into the cochlea and reaches them,
OHC vibrate to generate a kind of “mechanical amplification”. Such amplification
is the source of OAEs, which in such context are described as a by-product of the
mechanical amplification. In other words, OHC inside the cochlea create acoustic
waves that serve as a feedback in the hearing process, and these can also be regarded
as acoustic responses that can be recorded and measured with a microphone.
One interesting aspect of the so-called cochlear amplification is its non-linearity.
Apart from the positive effect this has on human hearing’s dynamic range, it is a
feature that plays an important role in most types of Evoked OAEs.
Classification of OAEs
As already introduced, different types of OAEs exist. Firstly, a distinction can be drawn between Spontaneous and Evoked OAEs. The former refers to those recorded without the presence of any artificial stimulus. They are rarely used in clinical applications, as ear stimulation leads to greater amplitude levels, which ultimately makes detection easier [6].
Evoked OAEs are, consequently, the group most used in the medical world. These can also be divided into different subcategories according to the stimuli applied in each case, but in practice two of them stand out: Transient Evoked
Otoacoustic Emissions (TEOAEs) and Distortion Product Otoacoustic Emissions
(DPOAEs).
In TEOAEs, the ear is fed with transient clicks of very short duration. Because of their short temporal span, these clicks have a broadband frequency spectrum, which leads to the stimulation of the whole basilar membrane. This results in the
generation of OAEs for all the spectrum, which then can be recorded by the OAE
probe, as Figure 2.1 shows.
Figure 2.1.: TEOAE recording. The lighter gray plot in the lower left corner shows the evoked spectrum. Image source: [6]
The amplitude level of the emissions in TEOAE tests lies tens of decibels below that of the stimuli. Furthermore, stimuli and emissions overlap both in the frequency spectrum and in time, which may cause the stimuli to mask the emissions. To avoid this, certain protocols are employed in which the stimulus amplitude and polarity alternate so that averaging cancels out the stimulus contribution (at least theoretically). The OAEs are not affected by this cancellation, as their originating process is non-linear and so is the relationship between stimulus and emission.
To clarify this cancellation protocol, the following scheme can be considered: A
sequence of clicks is fed to the inner ear. Each period of this sequence comprises
four clicks. The three first clicks have a normalized amplitude of 1 and a positive
polarization, while the last click in the sequence presents negative polarization and
3 times the amplitude of the remaining clicks. When averaged, this sequence of four clicks predictably yields a null value. The same does not hold for the OAEs it produces, though. As the cochlea operates in its non-linear region during TEOAE tests, the emissions caused by each click exhibit similar amplitudes and thus survive the averaging.
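This cancellation can be verified numerically. In the sketch below, a tanh saturation stands in for the cochlear non-linearity; it is an arbitrary toy model, not the actual OHC response:

```python
# The four-click sequence: three clicks of amplitude +1 and one of
# amplitude -3 average to zero, while a toy non-linear response does not.
# tanh is an arbitrary stand-in for the OHC compression.
import math

clicks = [1.0, 1.0, 1.0, -3.0]
assert sum(clicks) == 0.0  # the linear stimulus cancels in the average

responses = [math.tanh(a) for a in clicks]  # toy cochlear non-linearity
assert abs(sum(responses)) > 0.5  # the emissions survive the averaging
```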
DPOAEs represent a contrast to TEOAEs. In this technique, two pure tones at
frequencies f1 and f2 are used as stimuli. Again thanks to cochlear amplification’s
non-linearity, OHC produce OAEs at frequencies that are integer linear combinations of the two fundamental frequencies (i.e. 2f1 − f2, 3f1 − 2f2, 2f2 − f1, 3f2 − 2f1, f2 − f1, etc.). Figure 2.2 presents a real DPOAE recording, in which both fundamental tones and DPOAEs are visible.
Figure 2.2.: DPOAE recording (DPOAE spectrum recorded from a healthy human ear; level in dB SPL vs. frequency in Hz). Lighter spikes correspond to stimulus frequencies, while dark ones represent the different distortion products. Image source: [6]
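As a quick sanity check, the distortion-product frequencies can be computed for the stimulus pair used later in Figure 3.2 (f1 = 2000 Hz, f2 = 2440 Hz):

```python
# Distortion products for f1 = 2000 Hz, f2 = 2440 Hz (the stimulus
# pair used in the measurement shown in Figure 3.2).
f1, f2 = 2000, 2440

dp = {
    "2f1-f2": 2 * f1 - f2,
    "3f1-2f2": 3 * f1 - 2 * f2,
    "2f2-f1": 2 * f2 - f1,
    "3f2-2f1": 3 * f2 - 2 * f1,
    "f2-f1": f2 - f1,
}
assert dp["2f1-f2"] == 1560  # the component evaluated by the device
assert dp["f2-f1"] == 440
```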
This approach has two implementation advantages compared to TEOAE. On the
one hand, now both stimuli and emissions are narrow-band signals which occupy
different regions of the frequency spectrum. This means that stimuli must no longer
be averaged out, as they can be told apart in frequency domain. On the other
hand, narrow-band stimuli are also easier to calibrate. As it will be explained in this
chapter, calibration is an important aspect of OAE testing.
As its main drawback, DPOAE does not test the whole frequency range of interest at once, whereas TEOAE does. This means that, in order to conduct
a thorough test of the basilar membrane, multiple DPOAE tests with different f1
and f2 must be performed.
Another particularity of this modality is that it requires two loudspeakers, each playing one of the two stimulus tones. The reason is that loudspeakers always exhibit a certain degree of non-linearity as well.
Consequently, if both pure tones were digitally mixed and output through the same
loudspeaker, the loudspeaker itself would produce intermodulation tones at DPOAE
frequencies, which would mask real emissions.
In any case, the implementation simplicity of DPOAE is the reason why it was
chosen as the OAE screening modality to begin with and why this thesis deals
almost entirely with this specific method.
Implementation of OAE Procedures
Now that some insight into OAE screening has been provided, the actual implementation of OAE procedures can be discussed. This normally consists of a preliminary calibration phase followed by the actual test phase.
Calibration
In order to make tests clinically meaningful, the system must undergo a calibration process. This takes care of two fundamental aspects:
1. That recorded audio can be correctly linked to a physical magnitude.
2. That stimulus parameters (more specifically the Sound Pressure Level, SPL) can be precisely determined.
The first one is achieved through microphone calibration, while for the second one
in-ear calibration is needed.
Microphone calibration is the procedure to extract the relationship between the
numerical values obtained through the microphone after digitalization and the actual
magnitude they represent. This relationship is normally frequency-dependent and independent of the particularities of a test. The main factors that influence it are the analog circuitry, including the microphone itself and the signal conditioning stages, and some codec parameters such as gain or sample length. In Figure 2.3, this process characterizes the signal path between points “B” and “C”.
Figure 2.3.: OAE system block diagram (ear, microphone and speaker(s) connected through the audio codec to the microcontroller, with reference points “A”, “B” and “C” along the signal path)
Thanks to its independence from individual tests, microphone calibration is only required once per OAE device. The actual methodology may vary and is not of particular interest for this study, although a common element is the use of a sound level meter. The important remark here is that the calibration outcome
is a table of fixed values that represent the response of microphone and codec at
a certain frequency. Even in cases where the codec parameters of the device may
change, this normally has a deterministic impact on such values, which can be simply
recalculated and/or interpolated accordingly.
The situation for in-ear calibration looks quite different. This one is performed after
the probe has been inserted into the ear and its goal is to guarantee a certain SPL
at the eardrum. Probe insertion has a vital influence on the relation between output
SPL of the loudspeaker and SPL at the eardrum, and it can be generally asserted
that this relation is different each time the probe is introduced into the ear. It is also a frequency-dependent relation, so the process yields tables representing frequency spectra. Stimulus signals are later modified according to these tables, which implies more processing for broadband signals than for pure tones. Thus, the path from “A” to “B” in Figure 2.3 becomes calibrated.
During in-ear calibration, it is assumed that the SPL recorded by the microphone
equals the SPL at the eardrum. Although this is not strictly true, it is a fair approximation for clinical purposes; more refined procedures are described in the literature [6].
OAE detection
Once the environment is calibrated, OAE testing is ready to start. Although there are differences depending on the chosen OAE modality, the following general scheme is valid for all of them:
1. Play the stimulus through the loudspeaker(s) (in the case of Evoked OAEs).
2. Record response into a buffer while stimulus is playing.
3. If a transformation were applied to this single buffer, the noise level of this
single recording (also called noise floor) would be typically too high and it
would mask actual emissions. In order to solve this, several buffers are recorded
and averaged into a single buffer, so that the noise floor goes down while
the emissions persist. This averaging can be performed either in time or in
frequency domain.
4. Apply transformation (typically FFT) to the averaged buffer. If the buffer is
already frequency-averaged this step is unnecessary.
5. If performing a diagnosis test, present the frequency spectrum of the response.
In the case of a screening test with Pass/Fail result, analyze frequency coef-
ficients to obtain a useful metric. Such metrics may also be calculated for
diagnosis tests as a clinical help. A typical example of such metrics is SNR
calculation of OAEs. To obtain the SNR, the noise floor SPL is subtracted
from the SPL of the emission in dB. This implies calculating the noise level, which involves defining a frequency region to be considered noise and averaging over it.
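The SNR metric of step 5 can be sketched as follows; the SPL values are hypothetical, and averaging in the power domain is one reasonable choice for defining the noise floor:

```python
# SNR metric sketch: noise floor as the power-domain average of noise
# bins, SNR as emission SPL minus noise floor (all in dB; the values
# below are hypothetical).
import math

def spl_average(spl_bins_db):
    """Average SPL bins in the power domain, then return dB."""
    powers = [10 ** (s / 10) for s in spl_bins_db]
    return 10 * math.log10(sum(powers) / len(powers))

emission_spl = 2.0                         # dB SPL at 2f1 - f2
noise_bins = [-15.0, -13.0, -16.0, -14.0]  # dB SPL in the noise region
snr = emission_spl - spl_average(noise_bins)
assert snr > 6  # e.g. a Pass criterion could require SNR above 6 dB
```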
Artifact Rejection
In OAE testing it is assumed that the noise floor will decrease through buffer averaging, but it may actually increase. This happens when the incoming recorded buffer is contaminated with an unusually high noise level, which may be caused by ambient noise (e.g. opening a door) or by physiological noise, either voluntary (e.g. swallowing) or involuntary (e.g. pulse).
In order to detect these bad samples (also called artifacts) and minimize their neg-
ative impact, some rules may be applied. In [6], the following Artifact Rejection
(AR) techniques are described:
I Noise filtering: The recorded frequency range may be larger than needed.
Consequently, by filtering the signal, high or low frequency noise may be elim-
inated. This is somewhat taken for granted in digital signal processing, as
it is a compulsory step in digitalization. Furthermore, this is also implicitly
implemented within FFT.
I Large magnitude AR: The signals to be detected are small. With this premise, an incoming buffer may be flagged as an artifact if its SPL at a frequency of interest is unusually large.
I Repeatability AR: In this approach, the incoming buffer is split into two halves and the mean of each half is computed and compared. If the difference between the means is too high, it can be concluded that the data contains some kind of local low-frequency noise, which matches the nature of the noise OAE tests deal with.
Once artifacts have been detected, what to do with them is algorithm-dependent. The simplest approach is to discard any buffer that exceeds a predefined threshold. In some cases it may also be beneficial to include such buffers in the average after weighting them according to their score in the AR method.
Stop rules
The duration of the test is yet to be defined. In order not to prolong it more than
necessary, some stop rules may be applied. The following ones are typically used:
1. Stop the test when a given number of recorded buffers has been reached. This number may count either all recorded buffers or only valid buffers (that is, those surviving AR). In the first case, this is equivalent to a fixed test time.
2. Stop the test when the noise floor has dropped to a certain level.
3. Stop the test when the OAE SNR is above a certain threshold. This naturally means that in cases where this SNR cannot be achieved, one of the other criteria above must be applied.
Platform Architecture
“Caminante, no hay camino, se hace
camino al andar.”
“Wanderer, there is no path, the path
is made by walking.”
Antonio Machado, Campos de Castilla
This chapter consists of a description of the framework that has been implemented
to profile OAE algorithms. The underlying hardware has an undeniable impact on
them, so an attempt to gain familiarity with it follows in the next section. Software
itself is also described, not only in its algorithmic form related to the ultimate
application of medical diagnosis, but also in low-level detail, both for device and
host. As the experiments will show, these low-level details may have a considerable influence on the results.
Hardware
In order to have a valid framework to implement OAE algorithms, different hardware components were selected and put together: specifically a microcontroller, an audio codec and an OAE probe.
Microcontroller
It was decided to work with an MCU belonging to the ARM Cortex-M family. ARM is
one of the most popular processor architectures worldwide, and its Cortex-M family,
entirely composed of 32-bit Reduced Instruction Set Computer (RISC) machines, is
present in a multitude of embedded systems [10].
The chosen microcontroller was Silicon Labs’ EFM32TM Wonder Gecko, which has
a Cortex-M4F single core. This core is one of the most powerful ones in the family,
and its major differences with Cortex-M3 are the inclusion of DSP instructions and
the presence of a hardware single-precision Floating Point Unit (FPU).
This processor is featured on the EFM32TM Wonder Gecko STK-3800 starter kit, also manufactured by Silicon Labs. The specific MCU model on this board is the EFM32WG990F256, which has 256 KB of Flash and 32 KB of RAM.
The Wonder Gecko is also equipped with a wide variety of interfaces and peripheral
units that can be accessed through the board and its pin headers. The most relevant ones within this thesis' scope are the DMA, the Universal Synchronous/Asynchronous
Receiver/Transmitter (USART) with support of different communication protocols
(specifically, it will be used for the Inter-IC Sound (I2S) communication), the Inter-
Integrated Circuit (I2C) bus and the Universal Serial Bus (USB). The last one can
make use of the assembled Micro-USB connector.
The board also provides different options to power the MCU, selectable by an electrical switch. There are namely three options: battery, Micro-USB or Mini-USB (this last connector is used for debugging purposes). This fact will gain importance when discussing power measurements.
Audio Codec
In order to transform digital data into acoustic data and vice versa, an audio codec
was used. The choice was the low power stereo audio codec SGTL5000 from NXP
Semiconductors. The relevant features of this component are the following:
I Stereo audio input and output.
I Integrated headphone amplifier.
I Integrated microphone amplifier.
I I2S data interface.
I I2C control interface.
I Integrated programmable Phase-Locked Loop (PLL) to manage sampling fre-
quency.
The codec also offers a wide range of audio processing capabilities, which are not of interest for this application.
A board shield with an assembled SGTL5000 was used to access the codec. This
shield was designed by PJRC for its Teensy microcontroller development system,
and it was pinned to a perfboard to wire it properly and to make it easily pluggable
to the Wonder Gecko starter board and to the OAE probe.
Figure 3.1.: Picture of the whole hardware platform (OAE probe, perfboard, starter kit with pin header, microcontroller, audio codec, Micro-USB and Mini-USB connectors)
OAE Probe
The last hardware element of the system is an OAE probe provided by the company
Path Medical. This piece is composed by two headphone speakers and an electret
microphone, having each one of these components an isolated duct inside a tube
that is inserted into the ear. While tests are running this tube is coated with a foam
or silicone pluggable seal to isolate the inner ear from external noise.
The probe is connected to the perfboard through a 14-pin connector, although only
7 pins are used: the positive and negative terminals of the first loudspeaker (2), the positive and negative terminals of the second loudspeaker (2), and the bias voltage, ground and output terminal of the microphone (3).
Thanks to the presence of two loudspeakers, this probe is suitable for DPOAE.
Software
Software is in charge of detecting OAEs using the hardware platform just described. In a distributed paradigm like this, where the algorithm is divided between device and host, both sides have to be considered.
Device
As a bare-metal microcontroller-based system, the implemented OAE device is programmed in C. The tasks to be fulfilled are the following:
I Set the audio codec with the right parameters and transfer audio data to and
from it.
I Set the host-communication interface. In this study this is accomplished
through USB, although this should be replaced with a wireless interface to
be determined by the results of this work.
I Wait for a command from host and serve it accordingly.
I If the command requests an OAE test, execute it according to the partitioning scheme introduced in section 3.2.3.
In this section the first two low-level features will be explained.
Codec operation
The chosen audio codec requires an external clock signal to feed its PLL. The MCU uses its Timer 0 to generate a 12 MHz clock for this purpose, sourcing it from the high-frequency peripheral clock.
Once a clock is provided, the different codec parameters can be set through I2C
commands. In this communication, the MCU always acts as a master and the codec
as a slave. The most important parameters are sampling rate, sample length and all
the different volumes and gains for the microphone, the headphones, the Analog-to-
digital Converter (ADC) and the Digital-to-analog Converter (DAC).
Among all these gains, the only one that varies at run time is the DAC's. The reason is that stimuli have a variable dBFS level depending on in-ear calibration. In order to generate stimulus signals with the proper level,
they must be attenuated, which can be done either by software (multiplying by
an attenuation factor) or by the codec (setting the proper DAC attenuation). The
scheme that brings the best results is a hybrid between both: calculated attenuation
is first approximated through DAC settings, which in this case has a resolution of
0.5 dB. The remaining level difference is then achieved by a multiplying factor in
the code.
The remaining interaction with the codec to be explained is data transmission. As mentioned in 3.1.2, the I2S protocol is used for this. Here the roles are the opposite of those in I2C: the MCU is the slave and the codec acts as master. In stereo mode, two
words are sent bidirectionally at the sampling rate, and each one takes 16 bit cycles
for 16-bit length or 32 for 24-bit length. The same slots are preserved in mono mode,
so the bit rate is always:
bit rate = 2× b× fs [bps] (3.4)
where b represents bit cycles per word and fs the sampling rate.
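For example, Equation 3.4 gives the following rates at a typical audio sampling frequency (48 kHz is an assumed example value here):

```python
# I2S bit rate per Equation 3.4: two word slots per sampling period,
# each of b bit cycles, regardless of mono or stereo operation.
def i2s_bit_rate(b, fs):
    return 2 * b * fs  # [bps]

assert i2s_bit_rate(b=16, fs=48_000) == 1_536_000  # 16-bit samples
assert i2s_bit_rate(b=32, fs=48_000) == 3_072_000  # 24-bit samples in 32-bit slots
```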
I2S data transmission is handled by the DMA, which also has to be programmed. Four channels have to be set up, two for each data direction. However, the right reception channel is always muted, because there is only one incoming source into the MCU (that is, there is only one microphone). Each channel is managed in a ping-pong
way with a corresponding callback function when one of the two buffers has been
fully transferred. Because the maximum DMA transfer size is smaller than the ap-
plication’s requirements, ping-pong operation has to be implemented at two different
levels:
I Callback level: DMA is programmed with a lower transfer size than the
buffer size. When the specified data amount has been transferred, only a por-
tion of a complete buffer has been transmitted. The callback takes care of
updating ping and pong pointers, so that the transmission continues seam-
lessly.
I Buffer level: Callback operation will eventually reach the end of a ping-
pong signal buffer. At this moment, DMA ping-pong pointers will just start
addressing the other signal buffer.
The DMA sets a flag variable when it has completed a full buffer. At this point the CPU will typically be idle and will wake up from a low energy mode. Because of
the ongoing I2S operation, the only possible low energy mode in the Wonder Gecko
is Sleep Mode (Energy Mode 1), which is the lowest power consuming level that
allows synchronous peripheral communication [11].
Two-level ping-pong operation causes extra wake-ups into Run Mode (Energy Mode
0) while waiting for buffer completion, which is a sub-optimal yet unavoidable
method. Not all DMA transfers exhibit such behavior, though. Transfers can be
classified into these three different categories:
1. Muted buffers: This is the case of the right recording channel already described, or of the inactive playback channel during mono operation. Here, both source and destination memory addresses are static. One of them corresponds to the USART Tx/Rx register, while the other points to a null value. Buffer size is set
to the maximum capable value to minimize wake-ups, and when this chan-
nel generates an interrupt the callback function only activates the mechanism
again. Despite being a dummy DMA operation, it is compulsory in order to
preserve frame synchronization in I2S.
2. Static buffers: These correspond to stimulus buffers, which always contain the same periodically played data. Because of this, ping-pong only operates at callback level here. As for the other DMA settings, the destination address is always USART's Tx register, while the source address is provided by callback updates and incremented accordingly during DMA operation.
3. Dynamic buffers: Recorded signals account for this group, where ping-pong operation at both levels becomes essential. Here DMA's source address is USART's Rx register and the destination address is managed by callbacks and internally incremented.
Apart from target addresses, callbacks and memory increments, an important DMA setting is the data size, which determines the number of bytes a single DMA transfer consists of. This seemingly fine detail raises a significant concern on the Wonder Gecko.
The root of this problem is endianness: I2S protocol is big-endian and Wonder
Gecko’s memory organization is little-endian. Byte swap can be activated within
the USART to circumvent the problem, but USART’s input and output registers
can only hold up to a halfword (2 bytes) at a time. This solves the problem for 16-bit samples but not for 24-bit ones, where a sample must be split into two consecutive accesses to the 2-byte USART buffer. This compels the program to reverse bytes manually in 24-bit mode, adding a processing overhead.
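The required reversal can be sketched as follows; the halfword layout is a simplified model of the situation just described, not a register-accurate reproduction of the USART's behavior:

```python
# Simplified model of the 24-bit problem: each 2-byte USART access is
# byte-swapped in hardware, but a 24-bit (32-bit slot) sample spans two
# accesses, so software must still swap the two halfwords. The layout
# below is an illustration only.
def fix_24bit(halfwords):
    """Reassemble samples by swapping each pair of received halfwords."""
    return [(lo << 16) | hi
            for hi, lo in zip(halfwords[0::2], halfwords[1::2])]

# A 32-bit slot carrying the left-aligned sample 0x123456 arrives as
# two halfwords in the wrong order:
assert fix_24bit([0x5600, 0x1234]) == [0x12345600]
```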
USB interface
In order to implement the USB communication with the device, a project example
provided by Silicon Labs was modified to adapt it to the needs of the application.
Thus, USB descriptors are configured to implement a USB Communication Device
Class (CDC), which is one of the simplest USB classes to transmit data.
USB data transfers are accomplished through small, non-blocking, interrupt-driven transfers, but two blocking functions were implemented to hide the complexity of the interrupt accesses and to allow bigger transfer sizes in transmission and reception, respectively. Inside these functions, the non-blocking counterparts are called iteratively and the device waits in Sleep Mode until each transfer completes.
Host
For this study, a desktop computer was used as a host for the sake of simplicity and
to ease profiling procedures. Thus, different Python applications and scripts may be
used to access the device.
Firstly, a Python application developed by TUM’s Chair of Real-Time Computer
Systems was used to test functionality. In order to set the application up to interact
with the device, a specific-interface had to be added to it and a simple protocol
was built among the two systems. The particularity (and also the limitation) of this
application is that it takes charge of all the computation. The only tasks left to the device are storing stimulus buffers, playback, recording and data transmission to the host. In this way, this application can evaluate low-level peripheral management
but not algorithm partition.
That is why a second set of Python scripts was written from scratch to emulate real host behavior. These are the following:
I config-oae.py : It sends a command to set OAE parameters up, namely sam-
pling rate, sample size, buffer length and thresholds for artifact rejection.
I calib-oae.py : It launches an in-ear calibration process. A chirp signal (that
is, a sine wave of time-dependent frequency) is used as broadband signal, and
a certain number of buffers are averaged before transforming into frequency
domain to extract the frequency response. This process is repeated for each
one of the two loudspeakers, as spatial diversity may lead to different acoustic
transmission characteristics.
I partitioned-dpoae.py : Launches a DPOAE test, where the processing stages
implemented in the device are selectable. Other selectable parameters are f1,
stimulus SPLs or the number of buffers to be recorded.
Both calib-oae.py and partitioned-dpoae.py are programmed as if they were part
of an actual host environment, which means that they do not demand more data
from the device than needed in a real application. Nevertheless, a debug mode was
implemented where every recorded sample is also transmitted. In this way, and
by testing each different partition, it was possible to check that the C algorithm
inside the device yields the same results as the Python algorithm on the computer. Figure 3.2 illustrates this.
Algorithm Partitioning
As already outlined in 2.2.2, OAE detection requires several processing steps, also referred to as processing stages. For the tested implementation, DPOAE was chosen, and a RAM-saving approach was taken in order to work with large buffer lengths. The designed scheme is outlined in Figure 3.3 and described as follows:
1. Artifact Rejection: The incoming recorded buffer of fixed-point samples undergoes AR tests to check whether it is valid. The data is only processed further if it scores under a threshold both for large magnitude AR and for repeatability AR; otherwise it is discarded. Both tests are calculated in floating point with sample-by-sample conversion to keep RAM occupation low.
Figure 3.2.: Host vs. device DPOAE detection (averaged window; SPL in dB vs. frequency in Hz). Red data corresponds to the raw recorded data processed by the Python application. Dashed black data represents the frequency components computed by the device, which overlay the values computed on the host. The two spikes at f1 = 2000 Hz and f2 = 2440 Hz are the stimuli's spectra, both displaying an SPL of 60 dB. The blue dot is the component for the OAE at 2f1 − f2, and the blue line indicates the noise level. The obtained SNR lies around 16 dB.
Figure 3.3.: DPOAE partition scheme (sample buffer → artifact rejection → averaging → frequency components extraction → SNR computation; raw data, valid data, averaged data, frequency components and the SNR are the possible data to transmit)
I Large magnitude AR: The Goertzel algorithm is used to extract the SPL at 2f1 − f2, and this value is calibrated according to the microphone calibration table before being compared with the threshold.
I Repeatability AR: The buffer containing N samples is split into two
halves, which are summed separately (samples 1 to N/2 and N/2 + 1 to N).
At the same time, the maximum absolute value over the whole buffer is
found. The final value to be compared against the threshold
is the following:
score_Rep.AR = |Σ_{n=0}^{N/2−1} x_n − Σ_{n=N/2}^{N−1} x_n| / max_n |x_n|   (3.5)
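As an illustration, the repeatability score of Equation 3.5 can be sketched in a few lines of Python; this is a host-side reference sketch, not the device's fixed-point C implementation:

```python
def repeatability_ar_score(x):
    """Repeatability AR score (Equation 3.5): absolute difference between
    the sums of the two buffer halves, normalized by the buffer's peak value."""
    half = len(x) // 2
    diff = abs(sum(x[:half]) - sum(x[half:]))
    peak = max(abs(s) for s in x)
    return diff / peak
```

Two identical half-buffers score 0, whereas a buffer whose energy is concentrated in one half scores high and would thus be rejected.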
2. Averaging: If the buffer is valid, it is averaged in the time domain into a floating-point
average buffer. Again, this calculation is performed through sample-by-sample
floating-point conversion. A Cumulative Moving Average (CMA) is used,
so that only one floating-point buffer is needed and this buffer always
represents a valid average. The CMA update is calculated as follows:
CMA_{n+1} = CMA_n + (x_{n+1} − CMA_n)/(n + 1)   (3.6)
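A minimal sketch of this update rule (Equation 3.6), applied here to scalars for clarity; on the device it runs element-wise over the average buffer:

```python
def cma_update(cma, x_new, n):
    """Update the average of n samples (cma) with sample x_new,
    returning the average of n + 1 samples (Equation 3.6)."""
    return cma + (x_new - cma) / (n + 1)

avg = 0.0
for n, x in enumerate([2.0, 4.0, 6.0]):
    avg = cma_update(avg, x, n)
# avg now equals the mean of the three samples
```

Note that the intermediate value of avg is a valid average at every step, which is exactly the property the implementation relies on.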
3. Extraction of frequency coefficients: A set of frequency components is
calculated through the Goertzel algorithm to obtain their uncalibrated power values
on a linear scale. The number of frequency bins to be evaluated depends on
multiple factors, among which the following can be remarked:
I Sample-domain vs. frequency-domain: One or more frequencies
of interest must be chosen as “signal values”, and a further set of
bins must account for “noise samples”. The first approach might be to
take a fixed number of bins left and right of the frequencies of interest,
which is easy, predictable and convenient. However, it has more physical
significance to define this region in terms of frequency, which makes the
number of bins depend on the sampling rate and buffer length, as
evidenced in Equation 2.1. If this frequency bound is also related to the
stimulus frequencies (e.g. (f2 − f1)/2), then f1 and f2 also play a role.
The region where the extracted frequency components lie will be further
referred to as the “observation window”.
I Clearance region: A pure tone does not generally yield a perfectly sharp
DFT. This means that some frequency bins surrounding the tone may
be influenced by it and follow a slope, giving them a greater
value than they would have without the tone's presence. This phenomenon
is called spectral leakage and is the reason why a certain number
of frequency bins left and right of the OAE may be discarded, establishing
a clearance region.
The total number of frequency components to extract is 2 × (halfwidth − clearance) + 1. In the tested implementation, clearance is 0 and halfwidth is
2, so 5 components have to be extracted.
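The per-bin power extraction can be sketched as follows — a plain Python reference of the textbook Goertzel recurrence, not the device's C code:

```python
import math

def goertzel_power(x, k):
    """Return |X[k]|^2, the uncalibrated power of DFT bin k of buffer x,
    computed with the Goertzel recurrence (one bin per call)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * k / len(x))
    s1 = s2 = 0.0                       # recurrence state s[n-1], s[n-2]
    for sample in x:
        s1, s2 = sample + coeff * s1 - s2, s1
    # Squared magnitude from the last two state variables
    return s1 * s1 + s2 * s2 - coeff * s1 * s2
```

A pure tone at an integer bin k yields (len(x)/2)² while off-bin power stays near zero; the observation window is obtained by calling this once for each of the 2 × (halfwidth − clearance) + 1 bins.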
4. SNR computation: In the final step, the SNR of the OAE at 2f1 − f2 is
calculated. Consequently, the frequency components calculated in the previous
step consist of a single signal value and a set of noise values. In this step:
a) frequency components are computed in dB and calibrated.
b) noise values are arithmetically averaged in dB.
c) noise SPL is subtracted from signal SPL to obtain the SNR.
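These three sub-steps amount to only a few lines; a sketch with the calibration offsets omitted and the component values assumed to be already in dB SPL:

```python
def snr_db(components_db, signal_index):
    """Step 4: SNR of the OAE. components_db holds calibrated levels in dB;
    one entry is the signal, the rest form the noise floor (averaged in dB)."""
    noise = [c for i, c in enumerate(components_db) if i != signal_index]
    return components_db[signal_index] - sum(noise) / len(noise)
```

With the 5 components of the tested implementation — e.g. levels of [−5, −4, 12, −6, −5] dB and the OAE in the middle — the noise floor is −5 dB and the resulting SNR 17 dB.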
Partitioning this algorithm means drawing a vertical line between two stages in
Figure 3.3, which will be referred to as the partitioning border. Stages left of the
partitioning border are processed on the device, and those right of it on the computer. The
5 black dots in the figure represent the five possible spots (or partitioning spots) for
the partitioning border, and an arrow coming from each of them points out the data
to be sent, which decreases in volume as the border shifts to the right.
For instance, if no stages are performed on the device, throughput equals sampling
rate times sample size. Nevertheless, some of the incoming buffers may be discarded,
and Artifact Rejection can thus spare their transmission if it is executed on the
device. Extracting frequency components will bring throughput even lower, as only
a portion of the spectrum (i.e. only a subset of the values from the whole FFT) is
needed to calculate the SNR. And if the SNR is also computed on-device, then only
a floating point value out of a whole buffer is sent.
If the partitioning border is placed at one of the last three partitioning spots, it has to be
decided how often data is sent, which also determines how often the last two stages are
executed. This adds another degree of freedom, which will be further referred to as the
“OAE extraction rate” or simply extraction rate. Throughput is then divided by the
extraction rate for these three spots. The extraction rate equals 4 in the tests, which
means that data is sent every four buffers.
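As an illustration, the transmitted data rate for each partitioning spot can be estimated from the test parameters. This is a sketch under assumptions: the accept_ratio argument (fraction of buffers passing AR) and the byte sizes are illustrative, not measured values:

```python
def throughput_Bps(spot, fs=48000, sample_bytes=2, buf_len=1024,
                   accept_ratio=1.0, n_components=5, extraction_rate=4):
    """Rough transmitted throughput in bytes/s for partitioning spots
    0 (raw data), 1 (after AR), 2 (after averaging),
    3 (after frequency extraction) and 4 (standalone, SNR only)."""
    float_bytes = 4                      # processed data is sent as floats
    buffers_per_s = fs / buf_len
    if spot == 0:
        return fs * sample_bytes
    if spot == 1:                        # invalid buffers are never sent
        return fs * sample_bytes * accept_ratio
    if spot == 2:                        # one averaged buffer per extraction
        return buf_len * float_bytes * buffers_per_s / extraction_rate
    if spot == 3:                        # only a few frequency components
        return n_components * float_bytes * buffers_per_s / extraction_rate
    return float_bytes * buffers_per_s / extraction_rate  # a single SNR value
```

With these defaults, throughput falls from 96 kB/s for raw data to under 50 B/s for the standalone partition.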
4. Experiments
“La inspiración existe, pero tiene que
encontrarte trabajando.”
“Inspiration exists, but it has to find
you working.”
Pablo Picasso
This chapter explains the method used to profile the performance of DPOAE tests
on the device. By analyzing the results, secondary profiling tests are designed to
optimize time efficiency and energy consumption.
As a result of the experiments, the behavior of the platform is characterized, and
the extracted information can be used to construct a model that predicts the impact
of additional implementations.
Physical setup
The goal of this work was to analyze the software's impact on the following aspects of
the microprocessor:
I Memory (specifically RAM) occupation.
I CPU load.
I Energy consumption.
The first item does not need exhaustive profiling, as the bulk of RAM occupation
can be calculated a priori. As for time and energy performance, however, physical
measurements are required.
The procedure to achieve them involved two lines of action:
1. Current measurement: Energy can be extracted from current measurements,
as Equation 4.7 indicates:

Energy = ∫ Power · dt = ∫ V · I · dt   (4.7)
If the MCU voltage is constant, which is a reasonable approximation, then computing
energy basically amounts to integrating the measured current over time. In discretized
measurements, a summation over the current sequence i_n can approximate the
integration, so that:

Energy ≈ V · Σ_n i_n · Δt = V · Σ_n i_n / fs   (4.8)
where fs is the measuring sampling rate.
2. Timestamping: In order to link the measured current values to a code section
and to calculate the elapsed time in it, the use of timestamps becomes essential.
A way of implementing this is to use one or several additional digital lines that
signal whenever the code enters a different section of interest.
For the Wonder Gecko, current was measured by placing a 1.5 Ω shunt resistor between
the battery and micro-USB throws of the power switch mentioned in Section 3.1.1.
In this way, if the MCU is powered via micro-USB and the power switch is in the battery
position while no battery is connected, all the MCU current flows through the shunt
resistor. As the voltage that is wired to the audio codec perfboard comes from the
debugging mini-USB, the voltage drop at the resistor is proportional only to the MCU's
current. Figure 4.1 represents this situation graphically.
As for timestamping, a single digital output was used. This signal, referred to as the digital
toggle, has two functions: triggering and timestamping. Triggering is managed by
an initial falling edge, which occurs at the beginning of each profiled program that
is loaded onto the MCU. Any following change of digital level from 1 to 0 or vice
versa (i.e., any following toggle) is interpreted as a timestamp, meaning
that the time when it happens can be stored and tied to a specific point in the code,
according to a pattern that is known beforehand.
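The discretized energy approximation of Equation 4.8 is essentially a one-liner; the supply voltage and measurement rate below are illustrative values, not the exact setup figures:

```python
def energy_joules(current_samples, v_mcu, fs_meas):
    """Equation 4.8: energy as supply voltage times the sum of the measured
    current samples, divided by the measuring sampling rate fs."""
    return v_mcu * sum(current_samples) / fs_meas

# 1 s of a constant 10 mA at an assumed 3.3 V, sampled at 100 kHz -> 33 mJ
e = energy_joules([0.010] * 100_000, v_mcu=3.3, fs_meas=100_000)
```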
National Instruments' I/O data acquisition card PXIe-6363 was used to retrieve all
the measurements from this setup. Specifically, a differential analog input channel
was configured to measure the shunt resistor's voltage drop, and the digital toggle was
connected to a digital input channel and to a trigger channel. As already
discussed, the latter connection is meant to ensure synchronism between the digital
and the analog signal.
Data acquisition is handled by different Python scripts that operate the National
Instruments PXI measurement system. The basic behavior is the following: digital
and analog input channels are configured, and then the measurements are started
on the digital channel. The analog start is triggered by a falling edge of the digital line
Figure 4.1.: Schematic diagram of experiment configuration
to create a temporal reference between the analog and digital channel. Once the
tests are finished, data acquisition is stopped in all channels and analog values are
aligned to the first falling edge in the digital measurement. Then the timestamps
are used to match the measured current values with the profiled code.
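This matching step can be sketched as follows; the section names in the pattern are placeholders for the actual profiled code sections:

```python
def label_intervals(edge_times, pattern):
    """Attach section labels to the intervals between consecutive toggle
    timestamps, cycling through a pattern known beforehand.
    Returns (label, start_time, end_time) tuples."""
    return [(pattern[i % len(pattern)], edge_times[i], edge_times[i + 1])
            for i in range(len(edge_times) - 1)]
```

For example, toggle edges at 0.0, 0.002, 0.010 and 0.012 s with the pattern ["processing", "idle"] label the first and third intervals as processing and the second as idle time.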
If the test involves USB communication with the device, different OAE commands
can be dynamically launched to modify test parameters and keep track of progress.
If USB is not enabled on the device, the expected number of toggles can be calculated
beforehand, and data can then be acquired periodically while counting the measured
toggles to estimate progress.
In any case, toggles help to pack several tests with different parameters into a single
execution and to later label the portions of the measurement corresponding to
individual tests and test fragments.
DPOAE profiling
The parameters used to perform the tests, as presented in Chapters 2 and 3, are the
following:
I Audio codec sampling frequencies [Hz]: 16000, 24000, 32000, 48000.
I Audio codec sample size [bits]: 16, 24.
I OAE buffer length [samples]: 512, 1024, 2048.
I Partitioning scheme: All five possible, as outlined in 3.2.3.
I Host communication: USB.
I Number of recorded buffers: 20.
I OAE extraction rate: 4.
I Number of extracted frequency components: 5.
16-bit samples are stored as 16-bit signed integers, while 24-bit samples are stored
as 32-bit signed integers to keep computations manageable. Because of this greater
size, the combination of a 2048-sample buffer length and a 24-bit sample size is avoided
to prevent running out of RAM.
Apart from this restriction, all possible parameter combinations are carried out, for a
total of 100 tests. The profiled code has been compiled using gcc's optimization
level O2. Experiments with unoptimized code were also conducted; however, they
shall not be discussed in this document, as they led to much lower performance
and would thus not be a valid option in a real scenario.
First test: USB link
The first set of tests that were profiled included USB transmission of the results.
They will be considered the reference implementation for the rest of the document.
After taking the measurements, with the help of the digital toggle all measured current
values are classified as belonging either to a processing stage, to the idle time before
a new buffer recording is completed, or to irrelevant inter-test data. Figure 4.2 shows
what a labeled measurement looks like.
Figure 4.2.: DPOAE profiling capture. This frame corresponds to a frequency extraction partition. Seven processing periods can be observed in the figure. In two of them the frequency components are actually extracted, which is plotted in blue. In the rest, only AR and averaging are performed. It is possible to confirm visually that the extraction rate is 4. Pink data corresponds to idle time between processing periods. Each peak during idle time indicates a CPU wakeup to update the DMA at a callback level.
The classified measurement is analyzed as follows: for each test, the current is
averaged separately for processing periods (yielding iA) and for idle time (yielding
iI). The overall mean current (i) of the test is also calculated by averaging both
processing and idle periods together. Additionally, the CPU load (denoted as τ) is
estimated as the fraction of time the CPU remains active over the total test time.
These four parameters are not independent of each other, as the overall mean
current can be calculated according to Equation 4.9.
i = τ · iA + (1− τ) · iI (4.9)
Once these features have been computed for all tests, individual tests' features can
be averaged together according to a certain common parameter, e.g. sampling
frequency. In this way, it can be observed how this specific parameter affects
performance. For the sake of conciseness, only a small selection of the extracted results
will be discussed in the body of the document; the rest can be looked up in
Appendices A.1 and A.2.
Figure 4.3.: USB test. Sampling frequency performance. (a) Artifact rejection; (b) Standalone.
Figure 4.3 shows these averaged features, leaving sampling frequency as a free
parameter and choosing two different partitions. The following important conclusions
can be extracted from the plot:
I As intuition dictates, CPU load increases proportionally to sampling frequency.
I Idle current is independent of both sampling frequency and partitioning, which
is also expected.
I Active current is also independent of sampling frequency, but it differs across
partitions. A possible explanation is the different mix of instructions used in
each processing stage.
If buffer length is analyzed in the same way, the main difference to be found is that
CPU load now decreases slightly, as Figure 4.4 points out. To explain this, the
relation between the two quantities can be analyzed using the parameters r, ts and c.
ts = 1/fs is the sampling period, i.e. the time a sample takes to be acquired.
r is the time the MCU needs to process one sample, and c is the static amount of
time spent while processing a buffer, which is independent of the buffer size.
These three parameters relate the buffer size N to the CPU load, as can
be seen in Equation 4.10.
τ = (N · r + c)/(N · ts) = (r + c/N) · (1/ts) = (r + c/N) · fs   (4.10)
When N is increased, the CPU load approaches r/ts asymptotically. This formula
also describes the linear behavior with sampling frequency for a fixed or averaged buffer
length.
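Equation 4.10 can be explored numerically; the default values of r and c below are illustrative, not measured ones:

```python
def cpu_load(n, fs, r=1e-6, c=1e-4):
    """Equation 4.10: CPU load for buffer length n and sampling rate fs,
    with per-sample processing time r and static per-buffer time c."""
    return (r + c / n) * fs

# Load decreases with buffer length towards the asymptote r * fs,
# while remaining proportional to the sampling rate.
```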
Figure 4.4.: USB test. Buffer length performance. (a) Artifact rejection; (b) Standalone.
Analysis of sample size yields conclusions analogous to those for the parameters
already discussed. Regarding the results in Figure 4.5, it can be noted that CPU load is approximately
7% higher for 24 bits in the first two partitioning schemes. The reason is
that in these cases the sent data is integer, and thus occupies double the size in
24 bits (24-bit samples are handled as 32-bit signed integers). From averaging on,
samples are converted to floating point and the algorithm presents no further
differences with regard to sample size. Byte reversal in 24 bits, which is carried out
during the idle time, is responsible for the higher idle current at this sample size.
As a last step in this version's profiling, the processing stages are examined separately
for every test, and both CPU load and energy consumption are computed. In this
case, all processing periods from a test are discarded except those at positions that
are multiples of the extraction rate. In that way, it is ensured that for the Frequency
extraction and SNR computation stages only OAE extraction periods are averaged.
Referring back to Figure 4.2, only the second and sixth periods perform OAE extraction.
Figure 4.5.: USB test. Sample size performance. (a) Averaged current; (b) CPU load; (c) Active current; (d) Idle current.
Figures 4.6a and 4.6b depict averages of these energy and CPU load analyses. The
most important remark to be made is the predominance of the idle periods
in the total energy consumption. This is neither the usual nor the desired scenario for a
real-time embedded application, so idle consumption is an issue that requires
further insight.
Second test: No USB
A first attempt to decrease idle consumption is to shut off USB communication.
This is coherent with the final goal of implementing a wireless device, as in that
scenario no USB protocol would be present.
The new averaged energy consumption can be seen in Figure 4.7a. For this test, the
“Tx” stage is no longer present. Accordingly, the bar labeled as “No processing”
is composed only of idle periods. Energy values are approximately half of those from
the previous section. Idle current is also reduced to half its value, from ∼ 15 mA
to ∼ 8 mA, as Figure 4.7b indicates.
Figure 4.6.: USB test. Partition performance. (a) Energy distribution; (b) CPU load.
As already mentioned, the DMA and I2S peripherals keep working in idle
mode and make use of the high-frequency peripheral clock, which limits the
available low-energy modes of the microcontroller to Sleep Mode. Both modules are
probably the largest contributors to these 8 mA of idle consumption.
Figure 4.7.: Test without USB. fclk = 48 MHz. (a) Energy distribution of partitions; (b) Idle current.
Impact of clock frequency
Disabling USB brought idle consumption down, but for many configurations most
energy is still spent during idle time.
By looking at measurements such as the one in Figure 4.2, the first aspect that may be
discussed is the CPU wakeups within idle periods due to the DMA implementation. While
optimizing the DMA would indeed reduce idle consumption, the narrow temporal width
of these bursts indicates that the potential gain would not be crucial even in the best case.
Apart from this, there are not many options left from the point of view of software
optimization. Nevertheless, as the CPU load is low enough in all considered cases,
the clock frequency of the MCU can be reduced in order to observe any improvement in
consumption.
Before the experiments take place, a model can be built to analyze the impact of
clock frequency reduction.
The dependency of dynamic power on frequency was already addressed in Chapter 2, and
it can be linked to current:

Power_dynamic = v · i ⇒ i = Power_dynamic / v ∝ f   (4.11)
As voltage is constant, current can also be related to frequency through a linear
model for active (iA) and idle (iI) components:
iA = qA · f + IA (4.12)
iI = qI · f + II (4.13)
Here qA and qI represent the consumed coulombs per clock cycle for both active and
idle modes, while IA and II are the static currents that do not depend on frequency.
Looking at the CPU load τ, it can be assumed to be inversely proportional to the clock
frequency:

τ = fτ / f   (4.14)

The range of τ is [0, 1]: τ → 0 for f → ∞ and τ = 1 for f = fτ, so f ∈ [fτ, ∞). fτ
can be regarded as the full-load frequency.
The overall consumption current is the average of iA and iI , weighted by τ :
i = τ · iA + (1− τ) · iI = (iA − iI) · τ + iI (4.15)
Now, substituting 4.12, 4.13 and 4.14 into 4.15, the total current can be expressed in
terms of frequency:

i(f) = (qA · f + IA − qI · f − II) · (fτ/f) + qI · f + II =
     = qI · f + fτ · (IA − II)/f + fτ · (qA − qI) + II   (4.16)
And by differentiating, the optimal operating frequency where consumption is
minimized can be found:

∂i(fopt)/∂f = qI − fτ · (IA − II)/fopt² = 0
qI = fτ · (IA − II)/fopt²
fopt = √(fτ · (IA − II)/qI)   (4.17)

imin = i(fopt) = 2 · √(fτ · qI · (IA − II)) + fτ · (qA − qI) + II   (4.18)
This holds only if fopt ≥ fτ. If fopt lies outside the function's domain, then f′opt = fτ and
consequently:
i′min = i (fτ ) = qA · fτ + II (4.19)
On the platform, measurements were taken at fmax/2 and fmax/3 without USB
running, and the obtained active and idle currents and CPU load were compared
with those at fmax, which had already been measured for the repetition test without USB.
While iI is reasonably constant across all tests at a given f, CPU load and iA can
differ greatly. Table 4.1 shows the results, choosing iA from the averaging partition
and τ as a rough average of that same partition.
Table 4.1.: Measured values for different clock frequencies. Chosen partition: Averaging

                    iA        iI       τ
fmax   = 48 MHz     18.5 mA   7.8 mA   5.6%
fmax/2 = 24 MHz     11.1 mA   5.6 mA   11.0%
fmax/3 = 16 MHz     8.3 mA    4.5 mA   16.0%
Using linear regression, the following model parameters were calculated:
qA = 316.3 pC
qI = 100.5 pC
IA = 3.4 mA
II = 3.0 mA
fτ = 2.5 MHz
These lead to a minimum consumption current of 4.1 mA for fopt = 2.9 MHz.
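This fit can be reproduced from the values in Table 4.1 with a few lines of Python (plain least squares, no external libraries):

```python
def linfit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

f = [48e6, 24e6, 16e6]                             # clock frequencies [Hz]
qA, IA = linfit(f, [18.5e-3, 11.1e-3, 8.3e-3])     # active current model
qI, II = linfit(f, [7.8e-3, 5.6e-3, 4.5e-3])       # idle current model
f_tau = 2.5e6                                      # full-load frequency
f_opt = (f_tau * (IA - II) / qI) ** 0.5            # Equation 4.17
i_min = (2 * (f_tau * qI * (IA - II)) ** 0.5
         + f_tau * (qA - qI) + II)                 # Equation 4.18
```

Running this reproduces qA ≈ 316.3 pC, qI ≈ 100.5 pC, fopt ≈ 2.9 MHz and imin ≈ 4.1 mA.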
Figure 4.8.: Impact of clock frequency on current consumption for the averaging partition. The green dotted line shows a linear approximation of the total current, ignoring the inverse term with f in Equation 4.16. Large dots represent experimental data.
The difference between IA and II is smaller than 0.5 mA. This explains why fopt is so
close to fτ and why the inverse term is almost negligible, as Figure 4.8 indicates. In
fact, if static currents for both modes were equal (IA = II = IS), Equations 4.16, 4.17
and 4.19 would be transformed as follows:
i(f) = qI · f + fτ · (IS − IS)/f + fτ · (qA − qI) + IS = qI · f + fτ · (qA − qI) + IS   (4.20)

fopt = √(fτ · (IS − IS)/qI) = 0; thus f′opt = fτ   (4.21)

i′min = i(fτ) = qA · fτ + IS   (4.22)
The inverse term would disappear, and i(f) would therefore become linear. The
different terms in Equation 4.20 can be explained in this way:
I qI · f : the only term related to frequency; it represents the idle dynamic current
and depends only on how fast iI grows with f.
I fτ · (qA − qI): the difference (qA − qI) represents the coulombs per clock cycle
that the CPU alone consumes. It is multiplied by the full-load frequency, so
it accounts for the current that the CPU would draw if it were permanently
working. Its independence of f conveys that a CPU task cannot
consume less by reducing the frequency, as the changes in CPU load and active current
cancel out.
I IS : the static current.
Whether to consider the nonlinear term in Equation 4.16 depends on the obtained
IA for a specific set of test parameters. It can be expected that, for this particular
platform, IA will not be much larger than II, as CPU activation should increase
dynamic consumption much more heavily than static consumption.
In any case, the overall conclusion valid for all cases is that the idle dynamic
current increases with clock frequency at a rate of qI = 100.5 pC per cycle, and that
II = 3.0 mA.
Averaging schemes
Averaging takes up to 6% of the energy consumption in the reference implementation.
Consequently, it is worth trying to optimize this stage of the processing
chain with a different approach.
For instance, CMA was the chosen algorithm because it was assumed that
the averaged buffer should represent a physically meaningful average of the recordings
at every buffer iteration. Nevertheless, two facts advise against this
normalization:
1. The averaged buffer is only further processed every nth buffer, where n equals
the extraction rate. In this sense, normalization needs to be performed only one out
of every n times.
2. SNR expresses a ratio between magnitudes. If these magnitudes are scaled by
the same factor, then the scaling has no impact on the ratio. Scaling is
only needed when absolute physical values are required (e.g. if the algorithm
calculates the SPL at f1 and f2 to ensure that it adapts to the specified L1 and
L2 values).
For these reasons, normalization of summed data was not included in the analysis.
In spite of this, the conducted experiments did profile normalization, to confirm that
it is generally not a good strategy.
Apart from this, artifact rejection has so far only been considered in its
simplest form, as a way of deciding whether a buffer should be computed into the average
or discarded. As outlined in Section 2.2.2, the AR score can also act as an averaging
weighting factor rather than a discarding one, making the most of poor-quality
data.
Another dimension of averaging is the data type. Samples from the codec always come
in fixed-point format, whereas the actual implementation operates with floating-point
arithmetic. Floating point is more expensive in terms of computation and energy,
but easier to implement thanks to its extended dynamic range. Fixed point is more
efficient, but it involves extra concerns about over- and underflow in order to keep
computations correct.
In Table 4.2, a taxonomy of the different averaging schemes is outlined. One classification
aspect is whether they use AR to reject buffers or to weight them; the
other is whether the data is normalized at each iteration or not.
Table 4.2.: Averaging schemes

                  Artifact rejection   Artifact weighting
Non-normalized    Sum                  Weighted sum
Normalized        CMA                  Weighted CMA
For cumulative averaging, an averaged buffer (~a) is preserved between iterations,
whereas for summation the preserved one is a summed buffer (~s). In either case,
the contents of a sample buffer (~x) are added to the preserved buffer. This sample
buffer is no longer required after averaging, so weighting can be done in place.
For fixed point, the mentioned sample buffer is simply the buffer into which the DMA
moves the samples from the codec. For floating point, though, an intermediate floating-point
sample buffer must be allocated to perform the data-type conversion, requiring
extra memory.
However, this is only required if averaging is performed block-wise. If memory saving
is critical, samples can instead be converted individually.
For block processing, ARM provides a DSP library with different vectorized functions.
The ones used here are:
I vector value conversion: Converts values from a vector with a certain data type
into a vector with another data type. In this section, fixed-point to floating-point
conversion is used.
I vector scale: Multiplies vector values by a constant factor into a destination
vector. If division is required, the reciprocal of the dividing factor must be used
as the scale factor. It allows in-place computation, so that source and destination
are the same memory region.
I vector shift: Shifts the fixed-point values of a vector a certain number of bits
either right or left. It also allows in-place computation.
I vector add: Adds the values of two vectors into a destination vector.
I vector sub: Subtracts one vector’s values from the other vector’s values into a
destination vector.
While the documentation does not specify whether adds and subs can be computed in
place, experiments on the platform proved that it is at least possible on the
Wonder Gecko.
As a shorthand for the formulas of the averaging schemes, Equations 4.23
and 4.24 introduce the notation for buffer averaging at the nth iteration:
~a_n = (Σ_{i=1}^{n} ~x_i) / n = ~s_n / n   (4.23)

~a^W_n = (Σ_{i=1}^{n} ω_i · ~x_i) / (Σ_{i=1}^{n} ω_i) = ~s^W_n / W_n   (4.24)

where W_n is the total sum of all weights up to the nth iteration:

W_n = Σ_{i=1}^{n} ω_i = Σ_{i=1}^{n−1} ω_i + ω_n = W_{n−1} + ω_n   (4.25)
Provided that at the beginning of the nth buffer iteration the sample buffer (~x_n) and
either the summed (~s_{n−1}) or averaged (~a_{n−1}) buffer from the last iteration are
available, the update algorithms for the average variants are the following:
Sum: ~s_n = Σ_{i=1}^{n} ~x_i = ~s_{n−1} + ~x_n

Weighted sum: ~s^W_n = Σ_{i=1}^{n} ω_i · ~x_i = ~s^W_{n−1} + ω_n · ~x_n

CMA: ~a_n = (~s_{n−1} + ~x_n)/n = ((n − 1) · ~a_{n−1} + ~x_n)/n = ~a_{n−1} + (~x_n − ~a_{n−1})/n

Weighted CMA: ~a^W_n = ~s^W_{n−1}/W_n + (ω_n/W_n) · ~x_n = (W_{n−1}/W_n) · ~a^W_{n−1} + (ω_n/W_n) · ~x_n =
             = (1 − α_n) · ~a^W_{n−1} + α_n · ~x_n, with α_n = ω_n/W_n
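The four update rules translate directly into code; a plain Python sketch operating on lists, whereas the device versions work on fixed- or floating-point buffers:

```python
def sum_update(s, x):
    """Sum: s_n = s_{n-1} + x_n."""
    return [si + xi for si, xi in zip(s, x)]

def weighted_sum_update(s, x, w):
    """Weighted sum: s_n = s_{n-1} + w_n * x_n."""
    return [si + w * xi for si, xi in zip(s, x)]

def cma_update(a, x, n):
    """CMA: a_n = a_{n-1} + (x_n - a_{n-1}) / n, for the nth buffer (n >= 1)."""
    return [ai + (xi - ai) / n for ai, xi in zip(a, x)]

def weighted_cma_update(a, x, w, w_prev_total):
    """Weighted CMA: a_n = (1 - alpha) * a_{n-1} + alpha * x_n,
    where alpha = w_n / W_n and W_n = W_{n-1} + w_n."""
    alpha = w / (w_prev_total + w)
    return [(1.0 - alpha) * ai + alpha * xi for ai, xi in zip(a, x)]
```

With all weights equal, the weighted variants reduce to their unweighted counterparts, which is a convenient sanity check.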
Fixed-point data types as specified by ARM's C libraries have 1.7, 1.15 and 1.31
formats. This means that the most significant bit is used as a sign and the remaining
bits represent the fractional part of a real number in the range [−1, 1). But
these data types are defined as signed integers of N bits, which lie in the range
[−2^{N−1}, 2^{N−1} − 1]. Consequently, the floating-point conversion in the DSP library
hides an implicit extra scaling, as described by Equation 4.26. The principles that
led to discarding average normalization are also valid here, but this scaling is
already integrated in the block processing functions, so it can only be spared in
sample-by-sample processing.

x_float = x_fixed / b = x_fixed / 2^{N−1}   (4.26)
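As a minimal sketch, the conversion with its implicit scaling looks as follows in Python:

```python
def q_to_float(x_fixed, n_bits):
    """Equation 4.26: interpret an n-bit two's-complement integer as a
    1.(n-1) fixed-point value in [-1, 1)."""
    return x_fixed / (1 << (n_bits - 1))
```

For the 1.15 format (16-bit samples), −32768 maps to −1.0 and 16384 to 0.5, while the largest positive code stays just below 1.0.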
Table 4.3 summarizes the arithmetic operations needed for every averaging approach.
Table 4.3.: Averaging operations (Div. = Divisions, Mult. = Multiplications, Sub. = Subtrac-tions, Add. = Additions)
Div. Mult. Sub. Add.Sum 1
Weighted sum 1 1CMA 1 1 1
Weighted CMA 2 1
The operation columns are sorted by typical relative computational cost. That
is, divisions are typically the most expensive operation, followed by multiplications,
which are in turn more expensive than the rest. Taking this into account,
it can be noted that the averaging schemes along the rows are also roughly sorted
from expectedly cheap to expectedly expensive.
But this situation can be tweaked to optimize some averaging variants, as both
multiplications and divisions can be regarded as vector scalings. In this way, divisions
can be transformed into multiplications by the reciprocal, which should yield
better performance. The table is then transformed into Table 4.4.
Table 4.4.: Optimized averaging operations (Scl. = Scalings, Sub. = Subtractions, Add. = Additions)

               Scl.   Sub.   Add.
Sum            –      –      1
Weighted sum   1      –      1
CMA            1      1      1
Weighted CMA   2      –      1
The energy and time performance of these four functions were measured for
different buffer lengths in three different ways: in a straightforward implementation, in
an optimized approach according to Table 4.4, and using the mentioned DSP
functions. DSP functions can also be referred to as block processing functions, as opposed
to the other variants, which are executed sample-by-sample inside a loop and can
thus be referred to as in-loop variants.
All of these measurements were repeated for 32- and 16-bit fixed-point and for floating-point
data types.
Then, time and energy consumption for all variants were plotted as a function of
buffer length, and linear behavior was observed in almost all cases (see Appendix A.3).
By dividing the slopes of these plots by the slope of a reference averaging scheme,
the resulting values represent the proportion of time and energy
that each scheme takes in comparison with the reference one. These are called time
ratios and energy ratios.
In Section 4.2, the straightforward floating-point CMA was used as the averaging scheme,
so it will serve as the reference scheme. Time ratios for the optimized and
DSP implementations under this reference are summarized in Table 4.5.
The obtained results generally confirm the theoretical discussion. The first conclusion
to be drawn is that optimized software represents a major performance
improvement. By using the optimized implementation of CMA instead of the
straightforward one, computation time per sample drops to 41.9% of the
original runtime.
Table 4.5.: Time ratios for different averaging schemes

               Floating point      Fixed point 32 bits   Fixed point 16 bits
               Optimized   DSP     Optimized   DSP       Optimized   DSP
Weighted CMA   44.3%       75.7%   30.3%       87.9%     30.2%       47.1%
CMA            41.9%       93.2%   30.2%       62.3%     32.5%       36.7%
Weighted sum   41.9%       64.6%   28.0%       50.1%     25.6%       27.9%
Sum            37.2%       44.7%   25.6%       12.2%     25.6%       8.7%
Furthermore, implementations using block processing functions are generally outperformed by the optimized in-loop variants. As the DSP library presumably uses SIMD instructions inside its functions, the code generated for in-loop functions at the O2 optimization level presumably does as well, which would explain this apparent contradiction. Only for the summing scheme in fixed point is this tendency reversed.
As for the most efficient data type, 32-bit fixed point obtains the best scores for the
optimized version and 16-bit fixed point for DSP.
Normalization was also included in the experiments, both through arithmetic scaling and through shifting. For this purpose, two extra functions were profiled in which normalization is added to the summing stage. Results in Table 4.6 indicate that they are noticeably more expensive than summing alone. This cost is caused not only by the arithmetic operations but also by the fact that normalized data has to be written to a different buffer than summed data, which doubles the number of memory writes.
Table 4.6.: Time ratios for normalized averaging schemes

                  Floating point      Fixed point 32 bits   Fixed point 16 bits
                  Optimized   DSP     Optimized   DSP       Optimized   DSP
Sum and scaling   47.0%       57.0%   34.9%       50.0%     34.9%       27.9%
Sum and shift     –           –       34.9%       22.1%     39.6%       19.8%
FFT and Goertzel algorithm
In the reference implementation, all frequency extractions are performed through Goertzel's algorithm. In its original form, the algorithm consists of a first stage, where the signal is processed through a digital filter, and a final stage, where the complex DFT component is computed from the last two values of the filtered signal. However, if the complex value is not required, the same two values can be combined differently to obtain the squared magnitude. This is the approach taken in the implementation, as phase information is not needed.
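A compact sketch of this magnitude-only variant (a generic Goertzel formulation, assumed rather than taken from the thesis code):

```python
import math

def goertzel_power(x, k):
    """Squared DFT magnitude |X[k]|^2 via Goertzel's algorithm, skipping
    the complex final stage since phase information is not needed."""
    n_len = len(x)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n_len)
    s1 = s2 = 0.0
    for sample in x:                 # first stage: second-order IIR filter
        s0 = sample + coeff * s1 - s2
        s2, s1 = s1, s0
    # final stage: squared magnitude from the last two filter states
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

# Pure tone at bin k = 5 of a 64-sample buffer: |X[k]|^2 = (N/2)^2 = 1024
N, k = 64, 5
x = [math.cos(2.0 * math.pi * k * n / N) for n in range(N)]
assert abs(goertzel_power(x, k) - (N / 2) ** 2) < 1e-6
```

The final stage avoids the complex multiplication of the original formulation, which is exactly the simplification described above.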
The consulted literature recommends Goertzel's algorithm only for a small number of DFT terms, as computing the whole spectrum with it presents quadratic complexity [5]. Goertzel is said to be more efficient than the FFT if the number of extracted terms K satisfies:

K < log2(N) (4.27)
where N is the length of the signal buffer in samples.
To determine whether this also holds on the device, the actual implementation (also referred to as in-loop Goertzel) was compared with the real-FFT functions from ARM's DSP library. To ensure that both versions are compared under the same conditions, a final stage was added to the FFT to calculate the squared magnitude of the K computed complex values.
A third strategy was added to the test, where Goertzel is also performed with the
help of the DSP library, in a block processing approach. In particular, the first stage
filter is realized through a function that performs biquad filtering.
Similarly to the previous section, all tests were executed for 16-bit fixed point, 32-bit fixed point and floating point, for different buffer lengths.
Just as before, the results show that DSP functions do not accelerate Goertzel's algorithm; instead, they are around 100% slower. As for the FFT, it outperforms Goertzel for a lower number of terms than predicted by Equation 4.27; in fact, this quantity grows more slowly with N than the theory suggests. Regarding the in-loop Goertzel implementation, the FFT is preferable in floating point if K is greater than 7, and in 32-bit fixed point if K is greater than 14. For 16-bit fixed point, the FFT is already faster when only 5 terms are computed. The number of squared terms K does not seem to have a great impact on the overall FFT performance.
As for the most convenient data type to operate with, if the pure FFT is compared, then 16-bit fixed point is the best choice. Surprisingly, floating point scores better than 32-bit fixed point in both time and energy efficiency. Regarding Goertzel, there are practically no differences among data types for the straightforward implementation. Figure 4.9 gives an overview of the experiment's results, and Appendix A.4 provides a deeper analysis.
[Figure: four panels plotting duration [ms] against K (Goertzel DFT terms) for the FFT, the software Goertzel algorithm and the DSP Goertzel filter: (a) floating point, N = 512; (b) floating point, N = 2048; (c) 16-bit fixed point, N = 512; (d) 16-bit fixed point, N = 2048.]

Figure 4.9.: Goertzel and FFT time performance
A last remark can be made about the DSP implementations of filtering and FFT regarding memory: neither supports in-place computation, which demands a destination buffer of N samples in addition to the source buffer.
Audio codec
The audio codec in use was also profiled for sampling frequencies between 16 kHz and 48 kHz and for SPLs between 50 and 70 dB. The measured currents stayed within a range of 6.5 to 7.4 mA, which shows the low influence of the considered parameters on consumption.
Taking 7 mA as a representative value, power consumption equals 23.1 mW (at 3.3 V). This complies with the values provided in the datasheet for PLL use.
5 Case scenarios
“Yo soy yo y mi circunstancia, y si no
la salvo a ella no me salvo yo”
“I am I and my circumstance, and if I
don’t save it I don’t save myself”
José Ortega y Gasset
This chapter applies the obtained results to specific case scenarios, using wireless
consumption models as a guide for the behavior of a complete system.
Global model
Device characterization
Chapter 4 has drawn the following conclusions in relation to the parameter choices
for the studied DPOAE algorithm:
I Sampling rate: A higher sampling rate increases the CPU load linearly, which means a rise in consumption.
I Buffer length: Long buffers soften the impact of processing overhead at the expense of using more RAM.
I Averaging and DFT: The algorithms have been ranked by performance; the most efficient of the possible implementations should be used.
I Fixed vs. floating point: It may be advantageous to work with fixed point instead of floating point in some cases, even though not all implications of fixed point have been examined in detail.
I CPU frequency: It should be brought down to an optimal value. This will
typically lie close to the full CPU load frequency or even below, but in any
case the chosen frequency must respect the limit imposed by fτ .
I Partition stage: If the device takes over more data processing, data throughput
is reduced.
This last point indicates the potential reduction in wireless consumption through heavier on-device processing. Once all other parameter choices are fixed, a wireless consumption model helps determine the best partitioning point in the communication-computation trade-off.
Wireless consumption model
The chosen wireless technology must support the data throughput required by the partition stage while consuming as little as possible. The required throughput can be higher than 1 Mbit/s for raw audio transmission or practically negligible for a standalone version.
Because of this disparity, two technologies have been initially considered: standard
Bluetooth and Bluetooth Low Energy (BLE).
The characteristics of Bluetooth have been extracted from [12]. According to this source, Bluetooth's maximum data rate is 720 kbps, and it consumes 102.6 mW during transmission. As the reference supply voltage throughout this study has been 3.3 V, this power corresponds to a current of 31.1 mA.
The BLE protocol has been considered under the terms discussed in [13]. According to this source, BLE is capable of sending up to four notifications every connInterval seconds. A notification is a message in a high layer of the protocol stack that can carry up to 20 bytes of data payload, and connInterval is the time elapsed between two of these transmissions. Considering maximum sizes, BLE carries 80 bytes per connInterval. For the minimum value of connInterval (7.5 ms), BLE can theoretically achieve a throughput of 85.33 kbps in an error-free environment. Increasing connInterval decreases both throughput and consumption. An estimate of how the consumption current is affected by this parameter can be found in Figure 5.1.
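These figures can be verified with a small back-of-the-envelope calculation (a sketch assuming the best case of 80 payload bytes per connection event; the helper name is hypothetical):

```python
# Best-case BLE application throughput: 4 notifications x 20 B per connInterval.
def ble_throughput_kbps(conn_interval_s, payload_bytes=80):
    """Theoretical throughput in kbit/s for a given connInterval."""
    return payload_bytes * 8 / conn_interval_s / 1000.0

# Minimum connInterval of 7.5 ms gives the quoted maximum of 85.33 kbps
assert abs(ble_throughput_kbps(0.0075) - 85.33) < 0.01
```

Doubling connInterval halves the achievable throughput, which is the trade-off shown in Figure 5.1.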
Systematic setting of parameters
With the help of the built models and the gathered information, the following method can be used to guide the choice of system settings for DPOAE:
1. Fix sampling rate and buffer length for a given frequency resolution. The aim in
this first step is to minimize sampling frequency and maximize buffer length,
as this leads to better performance. Nevertheless, two restrictions must be
considered:
Figure 5.1.: BLE consumption current. Image source: [13]
a) Nyquist criterion: sampling rate must be greater than twice the highest
frequency of interest.
b) Available RAM: A longer buffer improves performance at the expense of RAM occupation. The minimal setup consists of two stimulus buffers, two ping-pong reception buffers and an average buffer. This leads to a minimum RAM occupation described by Equation 5.28:

5 × buffer length × sample size (5.28)
Extra memory costs are not considered here, but in-ear calibration tables, extracted frequency values and additional variables also take up RAM. Stimulus buffers can be shorter than the rest, but this reduces the achievable frequency resolution for f1 and f2.
2. Choose averaging and frequency extraction implementations, either for fixed
or floating point. Choice criteria here include microcontroller architecture,
available RAM and number of significant DFT terms.
3. Taking the OAE extraction rate into account, predict values of τ and iA at the
maximum clock frequency for each partition and find the optimal operating
frequency that minimizes consumption current. For simplicity, a linear model
with IA = II is assumed.
4. Choose the most suitable wireless technology and estimate its consumption for each partition, according to the required data throughput. Add it to the computation consumption and pick the lowest result.
From the method above, the required application specifications can be deduced:
I From the perspective of the medical application: frequency resolution, highest frequency of interest, span of the observation region and refresh rate. The last one can be defined as the rate at which new information must be presented on the host. Regarding the highest frequency of interest, the frequency range of stimulus signals in DPOAE normally spans from 2000 to 4000 Hz [6].
I The selected microcontroller imposes some restrictions through its architecture
and the size of its RAM.
To exemplify this decision-making process, two case scenarios will be presented. Both exhibit similar medical specifications for different hardware. In both cases the application is based on ARM's Cortex-M4.
First case scenario: High-performance hardware
For the first case scenario, the application will be executed on a Cortex-M4F (same as in the Wonder Gecko) with 64 KB RAM. The frequency resolution cannot be worse than 15 Hz, and the noise will always be averaged over a region around the OAE at 2f1 − f2 with a total frequency span of f2 − f1 (f2 = 1.22f1). The host must receive new data every 0.125 seconds. Buffers must be weighted according to a score obtained through large magnitude artifact rejection.
As the maximum frequency of interest is 4000 Hz, any sampling rate above 8000 Hz is valid. The lowest profiled sampling rate will be selected, namely 16000 Hz. This forces the buffer length to fulfill the condition:

buffer length > fs/∆f = 16000/15 ≈ 1067 samples

Consequently, the next power of 2 is chosen: 2048. The achieved frequency resolution is 16000/2048 = 7.8125 Hz.
No requirement is given for the number of sampling bits, so 16 bits will be chosen for now. This puts the basic memory consumption at 5 × 2048 × 2 B = 20 KB, according to Equation 5.28. Therefore, 44 KB of RAM are still left.
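The sizing steps above can be reproduced with a short sketch (values from the text; the variable names are assumptions):

```python
import math

fs, df_req = 16000, 15            # sampling rate [Hz], required resolution [Hz]
min_len = fs / df_req             # about 1066.7 -> at least 1067 samples
buf_len = 2 ** math.ceil(math.log2(min_len))   # round up to the next power of 2
df = fs / buf_len                 # achieved frequency resolution [Hz]

sample_size = 2                   # 16-bit samples
ram_bytes = 5 * buf_len * sample_size          # Equation 5.28

assert buf_len == 2048
assert abs(df - 7.8125) < 1e-12
assert ram_bytes == 20 * 1024     # 20 KB of the available 64 KB
```
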
The observation region is (f2 − f1) = 0.22f1, which in the worst case (f1 = 4000 Hz) equals 880 Hz, or 880/7.8125 ≈ 113 DFT terms (this number must be odd so that the region is symmetric around the OAE).
A new buffer is acquired every 2048/16000 = 0.128 seconds, which forces the extraction rate to equal one in order to refresh every 0.128 seconds. This is slightly above the requirement, but as the refresh rate is not a critical parameter, it can still be considered valid.
As for algorithm choices, there are multiple suitable alternatives. As the core is a Cortex-M4F and there is enough RAM, it is sensible to work in floating point. Thus, averaging will be performed through a floating point implementation of the weighted sum algorithm. For this averaging scheme, the in-loop variant is preferred over the use of DSP functions. The high number of extracted frequency components also speaks in favor of using the FFT, again in a floating point version. In addition, Repeatability AR is not needed, because weighting is based only on Large magnitude AR.
These algorithm choices demand more RAM than originally estimated. Averaged samples are now single precision floating point, so the averaged buffer doubles in size, while an extra floating point buffer of length 2048 is needed for the FFT. RAM occupation becomes:

4 × buffer size × 2 + 2 × buffer size × 4 = 16 × buffer size = 16 × 2048 B = 32 KB

Half of the RAM space is still free for the stack, calibration tables and other variables.
In order to predict τ and iA for all partitions, some reference values are needed for the current consumption and duration of the different processing stages. These are obtained from the standalone version for the already selected sampling rate, buffer length and data type, in the test without USB at maximum clock frequency.
Experimental data shows that current values do not change substantially between
different implementations of the same algorithms, but speed does. For this reason,
the measured durations for all stages are recalculated according to the experimental
results:
I Because only one value is needed, it is still advisable to use Goertzel for Large magnitude AR.
I According to Table 4.5, the selected averaging algorithm reduces time to 41.9%.
I According to experimental data, the FFT is 1.4 times slower than Goertzel in floating point for the original 5 extracted DFT terms. On the other hand, if Goertzel were used for 113 terms, it would be roughly 113/5 = 22.6 times slower. It will be assumed that the overall FFT speed for N = 2048 is not greatly affected by squaring 113 terms instead of 5.
I SNR computation also becomes slower as a result of the increase in DFT terms. Thus, the original time must be multiplied by 113/5 = 22.6.
Table 5.1 summarizes the current istage and duration tstage of the different processing stages, and also indicates the new duration t′stage that results from the software changes. As discussed, performance decreases in Frequency extraction and SNR computation.
Table 5.1.: Summary of stage performance for first case scenario at full clock frequency

                       istage    tstage   t′stage
Large AR               19.1 mA   0.9 ms   0.9 ms
Averaging              16.0 mA   1.8 ms   0.8 ms
Frequency extraction   21.8 mA   2.9 ms   4.0 ms
SNR computation        17.2 mA   0.6 ms   12.8 ms

Now, iA and τ can be computed for the different partitions. τ is calculated as the
sum of involved active stages’ t′stage over buffer period (128 ms). iA is the average
of involved istage weighted by t′stage. Thanks to the assumption IA = II , qA can be
calculated from Equation 4.12 just as:
qA = (iA|f=fmax − II) / fmax (5.29)
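As a cross-check of the CPU-load column of Table 5.2 (a sketch; it assumes fmax = 48 MHz, the top of the clock-frequency set used in this chapter, and the t′stage values of Table 5.1):

```python
# Per-stage durations t'_stage in ms, taken from Table 5.1.
t_stage_ms = {"large_ar": 0.9, "averaging": 0.8, "freq_ext": 4.0, "snr": 12.8}

def cpu_load(stages, period_ms=128.0):
    """tau: fraction of the 128 ms buffer period spent in the given stages."""
    return sum(t_stage_ms[s] for s in stages) / period_ms

# "SNR computation" partition runs all four stages on the device.
tau = cpu_load(["large_ar", "averaging", "freq_ext", "snr"])
f_tau_mhz = tau * 48.0          # minimum clock frequency that avoids overrun
assert abs(tau - 0.144) < 0.001        # 14.4% in Table 5.2
assert abs(f_tau_mhz - 6.9) < 0.05     # 6.9 MHz in Table 5.2
```

The same computation with fewer stages reproduces the 4.4% and 1.3% rows for the "Frequency extraction" and "Averaging" partitions.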
After that, the optimal clock frequency and overall computation current can be estimated using Equation 4.20. f must be greater than fτ to avoid overruns, so for simplicity a set of available clock frequencies from 1 to 48 MHz in steps of 500 kHz has been defined, and fτ is rounded up to the next value in this set. The computation current iC, clock frequency fclk and all intermediate values for this case scenario can be found in Table 5.2.
Throughput can also be estimated by determining the size of the message that is created every buffer period. The BLE parameter connInterval can then be estimated as the size of a connection message divided by the throughput. The results for this case scenario are shown in Table 5.3.
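This estimate can be sketched as follows (assuming the 128 ms buffer period and 80 B of payload per BLE connection event; the helper name is hypothetical):

```python
# Throughput and required connInterval per partition, as behind Table 5.3.
def partition_link(message_bytes, period_s=0.128, conn_payload=80):
    """Return (throughput in kbit/s, required connInterval in ms)."""
    bytes_per_s = message_bytes / period_s
    throughput_kbps = bytes_per_s * 8 / 1000.0
    conn_interval_ms = conn_payload / bytes_per_s * 1000.0
    return throughput_kbps, conn_interval_ms

# "Frequency extraction" partition: 452 B message every buffer period
kbps, ci = partition_link(452)
assert abs(kbps - 28.25) < 0.01    # 28.25 kbps in Table 5.3
assert abs(ci - 22.7) < 0.1        # 22.7 ms in Table 5.3
```
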
Table 5.2.: Estimation of partition computation consumption. First case scenario

                       iA        τ       fτ        fclk      iC
No processing          7.8 mA    0.0%    0 MHz     1 MHz     3.1 mA
Averaging              17.7 mA   1.3%    0.6 MHz   1 MHz     3.2 mA
Frequency extraction   20.6 mA   4.4%    2.1 MHz   2.5 MHz   3.8 mA
SNR computation        18.3 mA   14.4%   6.9 MHz   7 MHz     5.2 mA
Because all buffers are averaged through weighting, AR does not reduce throughput in this case and is not an issue within the communication-computation trade-off. The AR partition is therefore not considered here.
“Frequency extraction” and “SNR computation” are the only partitions with valid
connInterval values. This means that “No processing” and “Averaging” partitions
should use regular Bluetooth, whose consumption is much greater than the differences in computation consumption. This is a sufficient argument to discard these first two partitions.

Table 5.3.: Estimation of partition throughput. First case scenario

                       Message size   Throughput    connInterval   Comm. current
No processing          4096 B         256.00 kbps   2.5 ms         ∼ 30 mA
Averaging              8192 B         512.00 kbps   1.3 ms         ∼ 30 mA
Frequency extraction   452 B          28.25 kbps    22.7 ms        ∼ 2 mA
SNR computation        4 B            0.25 kbps     2560.1 ms      ∼ 0.3 mA
In the case of SNR computation, the connInterval value is much bigger than the required refresh interval, which would cause an unacceptable delay. Instead, connInterval = 125 ms (the refresh interval) would preferably be used, where the size of the connection message is now only four bytes.
Looking back at Figure 5.1, 125 ms yields a value between 0.1 and 1 mA; let 0.3 mA be an approximation of it. For 22.7 ms, the current is likely to lie between 2 and 3 mA. The difference in computation consumption between the last two partitions is only 1.4 mA, while the difference in communication consumption is apparently greater. Although the available data for BLE consumption only allows such a rough estimation, in this case scenario SNR computation is the partition most likely to be the most efficient.
Second case scenario: Mid-performance hardware
In this second scenario, the available MCU has a Cortex-M4 core with 8 KB RAM.
This core is similar to the Cortex-M4F but lacks a hardware FPU, which may cause
floating point operations to take longer to execute.
The medical requirements are less demanding in this case: the frequency resolution must be around 30 Hz, and the noise is calculated by averaging over a region of 125 Hz around the OAE. Buffers must still be weighted according to Large magnitude AR, and the refresh interval now only needs to be less than 0.2 s.
The sampling frequency should be kept as low as possible. If the buffer length equals 512:

∆f = 16000/512 = 31.25 Hz

Under this ∆f, the 125 Hz observation region corresponds to 5 frequency bins, as in the reference implementation. The frequency resolution cannot be pushed below this figure, as the basic RAM consumption, assuming 16 bits per sample, already takes up more than half of the RAM:
5 × 512 × 2 B = 5 KB
Furthermore, since there is no FPU, all arithmetic should be fixed point. In that case, averaging and frequency extraction should make use of 32-bit registers to handle range issues, making the averaged buffer double in size:

4 × 512 × 2 + 512 × 4 B = 6 KB

Only 2 KB are left for the stack, calibration tables and static variables.
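The RAM budget above can be checked with a few lines (values from the text; variable names are assumptions):

```python
# Second-scenario RAM budget: 512-sample buffers, 8 KB of total RAM.
buf_len = 512
ram_total = 8 * 1024

basic = 5 * buf_len * 2                          # Eq. 5.28, 16-bit samples
with_32bit_avg = 4 * buf_len * 2 + buf_len * 4   # averaged buffer widened to 32 bits

assert basic == 5 * 1024                  # 5 KB, more than half the RAM
assert with_32bit_avg == 6 * 1024         # 6 KB with fixed-point range headroom
assert ram_total - with_32bit_avg == 2 * 1024   # 2 KB left for everything else
```
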
Goertzel shall be used instead of FFT because it is faster in 32-bit fixed point for 5
frequency bins. Weighted sum is also preferably done in-loop for this data type, as
it was in the former case scenario for floating point.
The buffer period is 512/16000 = 32 ms. As a result, the extraction rate has to be set to 6 so that:

refresh interval = 6 × 32 ms = 192 ms < 0.2 s
If Table 5.1 is repeated under these circumstances, it becomes:

Table 5.4.: Summary of stage performance for second case scenario at full clock frequency

                       istage    tstage   t′stage
Large AR               17.7 mA   0.2 ms   0.2 ms
Averaging              16.1 mA   0.5 ms   0.1 ms
Frequency extraction   20.4 mA   1.0 ms   1.0 ms
SNR computation        17.2 mA   0.6 ms   0.6 ms
The only stage experiencing improvement this time is averaging. The first reason for this is that the number of extracted frequency terms remains 5, as in the reference implementation. For AR and Frequency extraction, the explanation is completed by the fact that Goertzel's performance varies very little among data types. As for SNR computation, the conducted experiments have found no big differences when the FPU is not used.
An extraction rate greater than one influences the calculation of clock frequency and average current. For the last two partitions, frequency components and SNR are only obtained every six buffers. τ is then the weighted average of the CPU load during OAE extraction (denoted τext) and that of the “Averaging” partition. The same applies to iA, where iA−ext now describes the averaged current when all stages in the partition are executed.
Table 5.5 gathers some of the values used for that estimation. Note that the operating clock frequency is chosen conservatively, considering τext rather than τ, so that OAEs can be extracted before the buffer period is over.
Table 5.5.: Estimation of partition computation consumption. Second case scenario

                       iA−ext    τext   fext      fclk      iC
No processing          7.8 mA    0%     0 MHz     1 MHz     3.1 mA
Averaging              17.1 mA   1.1%   0.5 MHz   1 MHz     3.2 mA
Frequency extraction   19.5 mA   4.2%   2.0 MHz   2.5 MHz   3.4 mA
SNR computation        18.8 mA   6.0%   2.9 MHz   3 MHz     3.5 mA
Table 5.6.: Estimation of partition throughput. Second case scenario

                       Message size   Throughput    connInterval   Comm. current
No processing          1024 B         256.00 kbps   2.5 ms         ∼ 30 mA
Averaging              2048 B         85.33 kbps    7.5 ms         ∼ 10 mA
Frequency extraction   20 B           0.83 kbps     768.0 ms       ∼ 0.1 mA
SNR computation        4 B            0.17 kbps     3840.1 ms      ∼ 0.1 mA
When examining Table 5.6 for communication consumption, it can be noted that, thanks to the longer refresh interval, the Averaging partition can now be implemented with BLE, as it yields a valid connInterval value. However, it leads to a wireless consumption of 10 mA, which makes it inappropriate.
For Frequency extraction and SNR computation, connInterval is too high and causes latency. In both cases it should be set to the refresh interval, roughly 200 ms, which corresponds to a consumption of around 0.1 mA in Figure 5.1 and is in any case the same value for both. The computation consumption gap between the two partitions therefore remains, and the Frequency extraction partition stands as the new best alternative, although SNR computation does not lie far behind.
6 Conclusions
“Mientras haya un misterio para el
hombre, ¡habrá poesía!”
“As long as there is a mystery for man,
there will be poetry!”
Gustavo A. Bécquer, Rima IV
This study has provided a methodical way to analyze the impact of software choices on the performance of OAE algorithms. As a result, it has led to a model that can predict the behavior of such algorithms when a set of conditions is provided.
It has been concluded that the sampling rate should be set as low as the Nyquist
criterion permits and the buffer length as large as the device’s RAM allows, in order
to get the best results in terms of frequency resolution and energy consumption.
Clock frequency should be adjusted to the optimum value, which will typically lie
close to the full load frequency.
Regarding algorithm implementations, average normalization should be avoided and
Goertzel may be used instead of FFT if the number of extracted frequency compo-
nents does not reach a certain threshold. All these algorithms may also be imple-
mented in fixed point for better efficiency.
In a wireless scenario, algorithm partitioning emerges as a parameter with which to minimize overall energy consumption. The case scenarios show that the best partition strategy implies performing at least Artifact Rejection, averaging and frequency extraction on the device.
Future development
In spite of the accomplished progress, there are still a number of topics that could be explored using this work as a basis. Some of them are listed below:
I Codec profiling: The audio codec, despite being vital for the application, has not been profiled in depth. Its different parts could be integrated into the system and examined regarding both performance and consumption, and this new variable could then be added to the equation to achieve a better solution.
I Wireless implementation: Wireless communication has only been addressed theoretically. Using the predictions from this document, a wireless version of the device can be implemented and the accuracy of the predictions assessed.
I TEOAE analysis: This other method could be studied in a similar manner to DPOAE, providing a global view of OAE algorithms.
I Calibration analysis: In-ear calibration involves processing similar to that of OAE algorithms. On/off-device decisions and the communication-computation trade-off also apply to it, which makes it a suitable research topic.
I Fixed point implementations: Fixed point has been discussed only superficially. If this is a real option, then actual fixed point versions of the algorithms should be implemented (not only profiled) and their accuracy compared against floating point.
I Clinical significance: The clinical performance of the algorithms has been omitted in favor of computational performance. Interdisciplinary work to evaluate the system under realistic parameters, along with real clinical testing, is essential to deliver a reliable final product.
A Appendices
DPOAE with USB. Current and CPU Load Profile
Sampling frequency
[Figure: CPU load [%] and consumption current [mA] (iI, iA, i, τ) against sampling frequency (16000 to 48000 Hz) for the No processing, Artifact Rejection, Averaging, Frequency Extraction and Standalone partitions.]
Buffer length
[Figure: CPU load [%] and consumption current [mA] (iI, iA, i, τ) against buffer length (512, 1024 and 2048 samples) for the No processing, Artifact Rejection, Averaging, Frequency Extraction and Standalone partitions.]
Sample size
[Figure: averaged current, CPU load, active current and idle current for 16-bit and 24-bit samples, for the No process, AR, Averaging, Freq-Ext. and Standalone partitions.]
DPOAE with USB. Energy and Partitions Profile
Energy consumption is averaged for a single processing cycle. Only extraction cycles
are considered.
No process ARAveraging
Freq-Ext.Standalone
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
Ener
gy[m
J]
95.5% 91.8% 86.2% 88.0% 85.4%
4.5% 4.6%8.0%
5.3% 5.2%
1.53 mJ 1.56 mJ 1.57 mJ 1.6 mJ 1.62 mJ
Energy consumption. 16 kHz, 512 samples, 16 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0
20
40
60
80
100
Tim
eocc
upat
ion
[%]
95.8% 93.6% 88.8% 92.4% 90.6%
4.2% 4.3%7.6%
CPU occupation. 16 kHz, 512 samples, 16 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
Ener
gy[m
J]
91.8% 88.6% 86.9% 88.7% 86.2%
8.2% 8.1% 8.1%5.3% 5.2%
1.55 mJ 1.58 mJ 1.59 mJ 1.62 mJ 1.63 mJ
Energy consumption. 16 kHz, 512 samples, 24 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0
20
40
60
80
100
Tim
eocc
upat
ion
[%]
92.4% 90.5% 89.3% 93.0% 91.2%
7.6% 7.6% 7.7%
CPU occupation. 16 kHz, 512 samples, 24 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
62 APPENDICES
No process ARAveraging
Freq-Ext.Standalone
0.0
0.5
1.0
1.5
2.0
2.5
3.0
Ener
gy[m
J]
95.8% 93.0% 87.3% 89.9% 88.6%
4.2% 4.1%7.6%
4.6% 4.5%
3.07 mJ 3.11 mJ 3.14 mJ 3.18 mJ 3.2 mJ
Energy consumption. 16 kHz, 1024 samples, 16 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0
20
40
60
80
100
Tim
eocc
upat
ion
[%]
96.1% 94.5% 89.7% 93.9% 93.0%
7.2%
CPU occupation. 16 kHz, 1024 samples, 16 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0.0
0.5
1.0
1.5
2.0
2.5
3.0
Ener
gy[m
J]
92.2% 89.6% 87.9% 90.5% 89.1%
7.8% 7.7% 7.6%4.5% 4.5%
3.11 mJ 3.14 mJ 3.16 mJ 3.21 mJ 3.23 mJ
Energy consumption. 16 kHz, 1024 samples, 24 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0
20
40
60
80
100
Tim
eocc
upat
ion
[%]
92.8% 91.3% 90.2% 94.4% 93.4%
7.2% 7.2% 7.2%
CPU occupation. 16 kHz, 1024 samples, 24 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0
1
2
3
4
5
6
Ener
gy[m
J]
95.9% 93.5% 87.7% 90.9% 90.2%
4.1%7.6%
4.2% 4.2%
6.13 mJ 6.2 mJ 6.26 mJ 6.35 mJ 6.36 mJ
Energy consumption. 16 kHz, 2048 samples, 16 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0
20
40
60
80
100
Tim
eocc
upat
ion
[%]
96.2% 94.9% 90.0% 94.6% 94.2%
7.2%
CPU occupation. 16 kHz, 2048 samples, 16 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0.0
0.2
0.4
0.6
0.8
1.0
Ener
gy[m
J]
92.9% 87.9%79.6% 82.5% 78.9%
7.1%6.8%
12.0%7.4% 7.3%
1.03 mJ 1.05 mJ 1.07 mJ1.1 mJ 1.12 mJ
Energy consumption. 24 kHz, 512 samples, 16 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
No process ARAveraging
Freq-Ext.Standalone
0
20
40
60
80
100
Tim
eocc
upat
ion
[%]
93.4% 90.4%83.1%
88.6% 86.0%
6.6%6.4%
11.5%4.4% 4.4%
CPU occupation. 24 kHz, 512 samples, 16 bits.
Large AR
Rep. AR
Avg.
DFT
SNR
Tx
Idle
DPOAE with USB. Energy and Partitions Profile 63
[Figure: Energy consumption. 24 kHz, 512 samples, 24 bits.]
[Figure: CPU occupation. 24 kHz, 512 samples, 24 bits.]
[Figure: Energy consumption. 24 kHz, 1024 samples, 16 bits.]
[Figure: CPU occupation. 24 kHz, 1024 samples, 16 bits.]
[Figure: Energy consumption. 24 kHz, 1024 samples, 24 bits.]
[Figure: CPU occupation. 24 kHz, 1024 samples, 24 bits.]
[Figure: Energy consumption. 24 kHz, 2048 samples, 16 bits.]
[Figure: CPU occupation. 24 kHz, 2048 samples, 16 bits.]
[Figure: Energy consumption. 32 kHz, 512 samples, 16 bits.]
[Figure: CPU occupation. 32 kHz, 512 samples, 16 bits.]
[Figure: Energy consumption. 32 kHz, 512 samples, 24 bits.]
[Figure: CPU occupation. 32 kHz, 512 samples, 24 bits.]
[Figure: Energy consumption. 32 kHz, 1024 samples, 16 bits.]
[Figure: CPU occupation. 32 kHz, 1024 samples, 16 bits.]
[Figure: Energy consumption. 32 kHz, 1024 samples, 24 bits.]
[Figure: CPU occupation. 32 kHz, 1024 samples, 24 bits.]
[Figure: Energy consumption. 32 kHz, 2048 samples, 16 bits.]
[Figure: CPU occupation. 32 kHz, 2048 samples, 16 bits.]
[Figure: Energy consumption. 48 kHz, 512 samples, 16 bits.]
[Figure: CPU occupation. 48 kHz, 512 samples, 16 bits.]
[Figure: Energy consumption. 48 kHz, 512 samples, 24 bits.]
[Figure: CPU occupation. 48 kHz, 512 samples, 24 bits.]
[Figure: Energy consumption. 48 kHz, 1024 samples, 16 bits.]
[Figure: CPU occupation. 48 kHz, 1024 samples, 16 bits.]
[Figure: Energy consumption. 48 kHz, 1024 samples, 24 bits.]
[Figure: CPU occupation. 48 kHz, 1024 samples, 24 bits.]
[Figure: Energy consumption. 48 kHz, 2048 samples, 16 bits.]
[Figure: CPU occupation. 48 kHz, 2048 samples, 16 bits.]
Averaging Profile
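For reference, the averaging variants profiled in this appendix can be sketched in a few lines. This is an illustrative Python sketch, not the profiled embedded implementation; it assumes that CMA stands for cumulative moving average and WCMA for its weighted counterpart, as the legend labels suggest.

```python
import numpy as np

def sum_average(blocks):
    """Sum: accumulate all blocks, divide by the count once at the end."""
    acc = np.zeros_like(blocks[0], dtype=np.float64)
    for b in blocks:
        acc += b
    return acc / len(blocks)

def cma(blocks):
    """CMA: cumulative moving average, updated after every block."""
    avg = np.zeros_like(blocks[0], dtype=np.float64)
    for n, b in enumerate(blocks, start=1):
        avg += (b - avg) / n               # running-mean update
    return avg

def wcma(blocks, weights):
    """WCMA: weighted cumulative moving average (weights must be positive)."""
    avg = np.zeros_like(blocks[0], dtype=np.float64)
    wsum = 0.0
    for b, w in zip(blocks, weights):
        wsum += w
        avg += (b - avg) * (w / wsum)      # weighted running-mean update
    return avg
```

All three agree for uniform weights; the difference that matters on the microcontroller is when the division happens (once at the end versus once per block) and how the intermediate accumulator scales, which is what the Sum-Scale and Sum-Shift variants address in fixed point.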
Single precision floating point
Each figure in this profile plots energy [mJ] or execution time [ms] against buffer length (0 to 2048 samples) for the averaging variants WCMA, Weighted Sum, CMA, Sum-Scale and Sum (plus Sum-Shift in the fixed-point profiles).

[Figure: Energy vs. buffer length. Single-precision floating point, straightforward implementation.]
[Figure: Time vs. buffer length. Single-precision floating point, straightforward implementation.]
[Figure: Energy vs. buffer length. Single-precision floating point, optimized implementation.]
[Figure: Time vs. buffer length. Single-precision floating point, optimized implementation.]
[Figure: Energy vs. buffer length. Single-precision floating point, DSP implementation.]
[Figure: Time vs. buffer length. Single-precision floating point, DSP implementation.]
16 bit fixed point
[Figure: Energy vs. buffer length. 16-bit fixed point, straightforward implementation.]
[Figure: Time vs. buffer length. 16-bit fixed point, straightforward implementation.]
[Figure: Energy vs. buffer length. 16-bit fixed point, optimized implementation.]
[Figure: Time vs. buffer length. 16-bit fixed point, optimized implementation.]
[Figure: Energy vs. buffer length. 16-bit fixed point, DSP implementation.]
[Figure: Time vs. buffer length. 16-bit fixed point, DSP implementation.]
32 bit fixed point
[Figure: Energy vs. buffer length. 32-bit fixed point, straightforward implementation.]
[Figure: Time vs. buffer length. 32-bit fixed point, straightforward implementation.]
[Figure: Energy vs. buffer length. 32-bit fixed point, optimized implementation.]
[Figure: Time vs. buffer length. 32-bit fixed point, optimized implementation.]
[Figure: Energy vs. buffer length. 32-bit fixed point, DSP implementation.]
[Figure: Time vs. buffer length. 32-bit fixed point, DSP implementation.]
FFT and Goertzel Profile
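The software Goertzel algorithm profiled below evaluates individual DFT bins at O(N) cost per bin instead of computing the whole spectrum. A minimal single-bin Python sketch (the standard textbook recurrence, not the thesis's fixed-point implementation):

```python
import math

def goertzel(x, k):
    """Compute the k-th DFT bin of x via the Goertzel recurrence."""
    n = len(x)
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:                       # one multiply-add per sample
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Finalize: X[k] = e^{jw} * s[N-1] - s[N-2]
    return complex(math.cos(w), math.sin(w)) * s_prev - s_prev2
```

Evaluating K bins repeats this K times, which is why the Goertzel cost grows linearly with K in the figures while the FFT cost does not depend on K.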
Single precision floating point
Each figure in this profile plots energy [mJ] or duration [ms] against K, the number of Goertzel DFT terms (5 to 30), comparing the FFT, the software Goertzel algorithm and the DSP Goertzel filter.

[Figure: Energy vs. K. SP floating point, N = 256.]
[Figure: Duration vs. K. SP floating point, N = 256.]
[Figure: Energy vs. K. SP floating point, N = 512.]
[Figure: Duration vs. K. SP floating point, N = 512.]
[Figure: Energy vs. K. SP floating point, N = 1024.]
[Figure: Duration vs. K. SP floating point, N = 1024.]
[Figure: Energy vs. K. SP floating point, N = 2048.]
[Figure: Duration vs. K. SP floating point, N = 2048.]
16 bit fixed point
[Figure: Energy vs. K. 16-bit fixed point, N = 256.]
[Figure: Duration vs. K. 16-bit fixed point, N = 256.]
[Figure: Energy vs. K. 16-bit fixed point, N = 512.]
[Figure: Duration vs. K. 16-bit fixed point, N = 512.]
[Figure: Energy vs. K. 16-bit fixed point, N = 1024.]
[Figure: Duration vs. K. 16-bit fixed point, N = 1024.]
[Figure: Energy vs. K. 16-bit fixed point, N = 2048.]
[Figure: Duration vs. K. 16-bit fixed point, N = 2048.]
32 bit fixed point
[Figure: Energy vs. K. 32-bit fixed point, N = 256.]
[Figure: Duration vs. K. 32-bit fixed point, N = 256.]
[Figure: Energy vs. K. 32-bit fixed point, N = 512.]
[Figure: Duration vs. K. 32-bit fixed point, N = 512.]
[Figure: Energy vs. K. 32-bit fixed point, N = 1024.]
[Figure: Duration vs. K. 32-bit fixed point, N = 1024.]
[Figure: Energy vs. K. 32-bit fixed point, N = 2048.]
[Figure: Duration vs. K. 32-bit fixed point, N = 2048.]
FFT comparison
[Figure: FFT Arithmetic comparative analysis. Energy [mJ] vs. FFT size (256 to 2048) for 32-bit fixed point, 16-bit fixed point and SP floating point.]
[Figure: FFT Arithmetic comparative analysis. Duration [ms] vs. FFT size (256 to 2048) for 32-bit fixed point, 16-bit fixed point and SP floating point.]
The last two figures compare FFT performance across the considered data types. The pure FFT is measured, without computing any squared magnitude terms.
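A back-of-the-envelope operation count explains the shape of the FFT-versus-Goertzel comparisons. Assuming roughly one multiply-accumulate per sample per Goertzel bin and (N/2)·log2(N) butterflies for a radix-2 FFT (constants and memory traffic ignored, so this is a model rather than a prediction of the measurements), the crossover point beyond which the full FFT becomes cheaper can be estimated as:

```python
import math

def goertzel_cost(n, k):
    """Rough multiply-accumulate count: one MAC per sample, per bin."""
    return k * n

def fft_cost(n):
    """Rough radix-2 FFT cost: N/2 * log2(N) butterflies."""
    return (n // 2) * int(math.log2(n))

def crossover(n):
    """Smallest number of bins K at which the full FFT becomes cheaper."""
    return next(k for k in range(1, n) if goertzel_cost(n, k) > fft_cost(n))

for n in (256, 512, 1024, 2048):
    print(n, crossover(n))   # crossover grows only logarithmically with N
```

Under this model the crossover sits near log2(N)/2 bins; the measured crossover differs with implementation constants, but the key trend is that Goertzel only pays off for a handful of DFT terms even at large FFT sizes.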