computers & security 29 (2010) 603–618
available at www.sciencedirect.com
journal homepage: www.elsevier.com/locate/cose
Audio CAPTCHA: Existing solutions assessment and a new implementation for VoIP telephony
Yannis Soupionis*, Dimitris Gritzalis
Information Security and Critical Infrastructure Protection Research Group, Dept. of Informatics, Athens University of Economics
& Business (AUEB), 76 Patission Ave., Athens GR-10434, Greece
article info
Article history:
Received 13 September 2009
Received in revised form
4 December 2009
Accepted 7 December 2009
Keywords:
SPIT
Audio CAPTCHA attributes
VoIP
Authentication
Evaluation
Speech Recognition
Turing Test
* Corresponding author. E-mail addresses: [email protected] (Y. Soupionis), [email protected] (D. Gritzalis).
0167-4048/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cose.2009.12.003
abstract
SPam over Internet Telephony (SPIT) is a potential source of future annoyance in Voice
over IP (VoIP) systems. A typical way to launch a SPIT attack is the use of an automated
procedure (i.e., bot), which generates calls and produces unsolicited audio messages.
A known way to protect against SPAM is a Reverse Turing Test, called CAPTCHA
(Completely Automated Public Turing Test to Tell Computers and Humans Apart). In this
paper, we evaluate existing audio CAPTCHA, as this format is more suitable for VoIP
systems and can help them fight bots. To do so, we first suggest specific attributes-requirements
that an audio CAPTCHA should meet in order to be effective. Then, we evaluate a set of
popular audio CAPTCHA against these attributes, and demonstrate that no existing
implementation is suitable enough for VoIP environments. Next, we develop and implement
a new audio CAPTCHA, which is suitable for SIP-based VoIP telephony. Finally, the new
CAPTCHA is tested against users and bots, and is demonstrated to be effective.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction

With the rapid worldwide growth of VoIP services, the spam
issue in VoIP systems becomes increasingly important
(Rosenberg et al., 2006), which is why major companies,
such as NEC and Microsoft, have already developed
mechanisms (Quittek et al., 2007; Graham-Rowe, 2006) to
tackle SPam over Internet Telephony (SPIT). A serious obstacle
when trying to prevent SPIT is identifying VoIP communications
which originate from software robots ("bots"). Alan
Turing's "Turing Test" paper (Turing, 1950) discusses the
special case of a human tester who wishes to distinguish
humans from computer programs. Nowadays, there has been
considerable interest in applying an alternate form of the
Turing Test, the so-called Reverse Turing Test. The term
"Reverse Turing Test" is used to indicate that the tester is not
a human but a machine. In the spam protection world this
kind of computer-administered Reverse Turing Test is also
called CAPTCHA (Completely Automated Public Turing Test to
Tell Computers and Humans Apart). The research interest in
this subject has spurred a number of relevant proposals (Blum
et al., 2000; von Ahn et al., 2003, 2004; Chellapilla et al., 2005;
Yan and El Ahamad, 2009). Commercial examples include
major stakeholders in the field, such as Google and MSN,
which require CAPTCHA (visual or audio), in order to provide
services to users. However, there exist computer programs,
which can break the CAPTCHA that have been proposed so far.
In this paper, an audio CAPTCHA is developed that is
suitable for use in VoIP systems. Specifically, we first present
the background and related work and explain the main
aspects of SPIT and CAPTCHA. Then, we provide the basic
requirements of a CAPTCHA, briefly explain why an audio
CAPTCHA is suitable for VoIP systems, and present an algo-
rithm for selecting a suitable CAPTCHA. In Section 3, a classi-
fication of the characteristics/attributes of audio CAPTCHA is
proposed. In Section 4 a number of popular CAPTCHA is
introduced. In Section 5, the procedure to be followed for
testing a CAPTCHA is described; this includes a bot and
a speech recognition tool. In Section 6 we demonstrate that
the existing audio CAPTCHA implementations are not
adequate enough for a VoIP system. In Section 7, the experi-
mental environment which was used for testing the proposed
CAPTCHA is presented. The VoIP experimental environment
was based on the Session Initiation Protocol (SIP), because it is
one of the best-known and most widely deployed multimedia
protocols for VoIP infrastructures. In Section 8, the new audio CAPTCHA is
presented, which is based on the attributes selected in Section
3. Finally, we provide the reader with the results of the tests
performed with the proposed CAPTCHA.
2. Background
SPIT constitutes an emerging type of threat in VoIP systems. It
exhibits several similarities to email spam. Both spammers
and ‘‘spitters’’ use the Internet, so as to target a group of users
and initiate bulk and unsolicited messages and calls.
Compared to traditional telephony, IP telephony provides
a more effective channel, since messages are sent in bulk and
at a low cost. Individuals can use spam-bots to harvest VoIP
addresses. Furthermore, since call-route tracing over IP is
harder, the potential for fraud is considerably greater.
A CAPTCHA is a method that is widely used to thwart
automated SPAM attacks. The same technique can be used to
mitigate SPIT. According to this, each time a callee receives
a call from an unknown caller, an automated Reverse Turing
Test would be triggered. The ‘‘spit-bot’’ needs to solve this test
in order to complete its attack. Integrating such a technique
into a VoIP system raises two main issues. First, the CAPTCHA
module should be combined with other anti-SPIT controls, i.e.,
not every call should pass through the CAPTCHA challenge,
since each CAPTCHA requires considerable computational
resources. A simultaneous triggering of several CAPTCHA
challenges can soon lead to denial of service. Challenges
would also cause annoyance to users, if they had to solve one
CAPTCHA for every call they make. Second, a CAPTCHA needs
to be friendly and easy to solve (‘‘pass’’) for a human user.
2.1. CAPTCHA
A CAPTCHA is a test that most humans should be able to pass,
but computer programs should not. Such a test is often based
on hard open AI problems, e.g., automatic recognition of dis-
torted text, or of human speech against a noisy background.
Differing from the original Turing Test, CAPTCHA challenges
are automatically generated and graded by a computer. Since
only humans are able to return a sensible response, an auto-
mated Turing Test embedded in a protocol can verify whether
there is a human or a bot behind the challenged computer.
Although the original Turing Test was designed as a measure
of progress for AI, CAPTCHA is rather a human-nature-
authentication mechanism.
This paper focuses on audio CAPTCHA. These were
initially created to enable people who are visually impaired to
register for or make use of a service that requires solving
a CAPTCHA. Today, an audio CAPTCHA would be useful to
defend against automated audio VoIP messages, as visual
CAPTCHA are hard to apply in VoIP systems, mainly due to the
limitations of end-user devices. For example, nowadays not
many people have a home telephony device with a screen
capable of displaying a proper (high resolution) image
CAPTCHA. If an adequate CAPTCHA is used, it should be hard
for a spit-bot to respond correctly and thus manage to initiate
a call. Also, audio CAPTCHA seems attractive, as text-based
CAPTCHA has been demonstrated breakable (Chew and Baird,
2003; Mori and Malik, 2003; Defeated CAPTCHA; Yan and El
Ahmad, 2007; Yan and El Ahamad, 2008).
2.2. Related work
As the audio CAPTCHA technology is practically in its infancy,
the relevant research work is currently limited.
Bigham and Cavender demonstrated that existing audio
CAPTCHA are clearly more difficult and time-consuming to
complete as compared to visual CAPTCHA (Bigham and Cav-
ender, 2009). They compared the existing
CAPTCHA implementations, but did not reach any
conclusion on how their characteristics affect the user
success rate. They developed and evaluated an optimized
interface for non-visual use, which can be added in-place to
an existing audio CAPTCHA. In their published CAPTCHA
evaluation they mentioned that Facebook, Veoh, and Craigs-
list use different CAPTCHA; today, all three of them use
Recaptcha (Recaptcha Audio CAPTCHA).
Tam et al. (2008a,b) described a number of security tests of
audio CAPTCHA. The authors used machine learning tech-
niques, which are similar to the ones used for breaking visual
CAPTCHA. They analyzed three audio CAPTCHA taken
from popular websites (Google (Google Audio CAPTCHA),
Recaptcha (Recaptcha Audio CAPTCHA), Digg (DIGG)). In some
cases they reached correct solutions with an accuracy of up to
71%. The main issue with this work is that they only tested
the audio CAPTCHA implementations and did not analyze
the impact of audio CAPTCHA characteristics on their
performance.
Yan and El Ahmad (2008) worked on the usability issues
that should be taken into consideration when developing
a CAPTCHA. Their work does not specifically focus on audio
CAPTCHA, with the exception of a few characteristics (i.e.,
character set). Their work was concluded with a framework
referring to CAPTCHA usability.
Bursztein and Bethard (2009) developed a prototype audio
CAPTCHA decoder, called decaptcha, which is able to success-
fully break 75% of the eBay audio CAPTCHA. They described an
automated process for downloading audio CAPTCHA, training
the decaptcha bot and finally solving the eBay CAPTCHA.
Finally, Markkola and Lindqvist (2008) proposed a number
of ‘‘voice’’ CAPTCHA for Internet telephony. However, they did
not explain in detail how this could be integrated into an
Internet telephony infrastructure. Also, their work lacks
experimentation results.
2.3. A new approach
In this paper, apart from classifying the audio CAPTCHA
attributes and evaluating the current audio CAPTCHA
implementations, a new audio CAPTCHA for VoIP environments will
be developed. The proposed CAPTCHA must be easy for human
users to solve, easy for a tester machine to generate and grade,
and hard for a software bot to solve. Its performance will be
validated by two means; namely, by user tests
and by a bot configured to solve "difficult" audio CAPTCHA.
The latter requirement implies that a specific kind of test
should be developed; i.e., a test that is easy to generate but
intractable to pass without knowledge that is available to
humans but not to machines. Audio recognition fits in this
category. For example, humans can easily identify words in an
environment, whereas this is usually hard for machines
(Dusan and Rabiner, 2005; von Ahn et al., 2008). Specification-
wise, a CAPTCHA should ideally be 100% effective at identifying
software bots, but it was proved (Chellapilla et al., 2005) that
a CAPTCHA can be designed to fight bots with a low failure
rate (i.e., <0.1%). Generically, a CAPTCHA is effective as long as
the cost of using a software robot remains higher than the cost
of using a human, even when the spammers use cheap labor to
solve CAPTCHA (Trend Micro’s TrendLabs).
In order to develop a new audio CAPTCHA, we followed an
iterative algorithm: (a) we selected a set of attributes that are
appropriate for audio CAPTCHA, (b) we developed a CAPTCHA
that is based on these attributes, and (c) we evaluated the
CAPTCHA by calculating the success rates of a bot and of
a number of users, until the results were adequate (Fig. 1).
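The three-step loop described above can be sketched in Python. This is a hedged illustration only: the helper names and the adequacy thresholds are assumed values, not part of the paper's actual tooling.

```python
# Illustrative sketch of the iterative algorithm (a)-(c) above.
# select_attributes, build_captcha, and evaluate stand in for the
# paper's actual tooling; the thresholds are assumed values.
def develop_captcha(select_attributes, build_captcha, evaluate,
                    min_user_rate=0.90, max_bot_rate=0.02, max_rounds=10):
    for _ in range(max_rounds):
        attributes = select_attributes()          # (a) pick attribute values
        captcha = build_captcha(attributes)       # (b) build the CAPTCHA
        user_rate, bot_rate = evaluate(captcha)   # (c) user and bot tests
        if user_rate >= min_user_rate and bot_rate <= max_bot_rate:
            return captcha                        # results are adequate
    return None                                   # no adequate design found
```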
3. CAPTCHA attributes
A high user success rate is a key factor in deciding whether
a new CAPTCHA is effective or not. This is particularly
important in the case of an audio CAPTCHA, as it does not only
refer to VoIP callers, but also to visually impaired users of
a VoIP service. Equally important is the bot success rate,
which should be kept to a minimum. Both factors depend on
a number of attributes. The main characteristic of these
attributes is that they should all be adjusted in the production
procedure of the CAPTCHA. We classified these attributes into
four categories: (a) vocabulary, (b) background noise, (c) time, and
(d) audio production.
Fig. 1 – A generic CAPTCHA development process.
3.1. Vocabulary attributes
Audio CAPTCHA designs vary, mainly due to the vocabulary
used. Variations depend upon: (a) the set of characters the
audio CAPTCHA consists of, (b) the number of characters of
a single CAPTCHA, and (c) the local settings, e.g., the language
that CAPTCHA characters belong to.
3.1.1. Adequate data field
A data field (called "alphabet") is used as a pool for selecting
the characters to be included in an audio CAPTCHA. In order to
integrate an audio CAPTCHA into a VoIP system, we chose an
alphabet of ten one-digit numbers, i.e., {0, ..., 9}. Such a choice
allows the use of the DTMF method for answering the audio
CAPTCHA. Other examples of audio CAPTCHA that use only
digits are the MSN and the Google ones. Moreover, some
CAPTCHA include beep sounds in their vocabulary, so as to
inform the user that the audio CAPTCHA begins. On the
other hand, a limited alphabet and beep sounds may make an
audio method quite vulnerable to attacks (Chan, 2003).
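Because the alphabet is restricted to the ten digits, a challenge can be checked directly against the caller's DTMF key presses. A minimal sketch follows; the length range and function names are illustrative assumptions, not the paper's implementation.

```python
import random

ALPHABET = "0123456789"  # each symbol maps directly onto a DTMF key

def make_challenge(min_len=4, max_len=7):
    """Draw a variable-length digit string from the ten-digit alphabet."""
    length = random.randint(min_len, max_len)
    return "".join(random.choice(ALPHABET) for _ in range(length))

def check_dtmf_answer(challenge, pressed_digits):
    """The caller answers by keying in the announced digits."""
    return pressed_digits == challenge
```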
3.1.2. Spoken characters variation
In order to make the CAPTCHA even harder for a bot to
solve, we introduce a number of different human speakers
for each digit of the alphabet. For example, if there are X
different speakers for each character, then there will be X
different ways to pronounce each character. This essentially
means that each speaker makes a difference for a bot, but
hardly for a human.
Another drawback for a CAPTCHA implementation is the
use of a fixed number of characters. A non-variable number of
characters, in combination with a limited alphabet, can make
a CAPTCHA vulnerable to attack. For example, if only 3-digit
CAPTCHA are used and a bot can successfully recognize only 2
of the digits, then it can reach a success rate of ~10% just by
guessing the remaining digit. On the other hand, if the number
of digits of a CAPTCHA is not fixed and a bot can successfully
recognize only 2 of them, then the number of remaining digits
is not known to the bot.
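The ~10% figure, and the benefit of a variable length, can be reproduced with a short simulation. The bot model below (perfect recognition of two digits, uniform guessing of the rest) follows the example in the text; the trial count is an arbitrary choice.

```python
import random

ALPHABET = "0123456789"

def bot_guess_rate(lengths, known=2, trials=200_000):
    """Simulate a bot that recognizes `known` digits with certainty and
    guesses every remaining digit uniformly from the ten-digit alphabet."""
    wins = 0
    for _ in range(trials):
        unknown = random.choice(lengths) - known
        # each unknown digit is guessed correctly with probability 1/10
        if all(random.random() < 1 / len(ALPHABET) for _ in range(unknown)):
            wins += 1
    return wins / trials
```

With a fixed length of 3 the rate converges to 1/10; with lengths drawn from {3, 4, 5} it drops to about (0.1 + 0.01 + 0.001)/3 ≈ 3.7%, even before accounting for the bot not knowing how many digits remain.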
3.1.3. Language requirements
Another important factor is the mother tongue of the users, as
it plays a major role in achieving a high human user success
rate. This is particularly important in the case of audio
methods, where identifying spoken characters is hard to do, in
case the mother tongue of the speaker and the user differs.
Therefore, the language should match the scope of the specific
CAPTCHA implementation. As a good practice, the spoken
characters should be no more than a few. The CAPTCHA we
developed can be adjusted for non-English users, as it is
created dynamically and different characters can be added
easily.
3.2. Noise attributes
Noise is yet another important attribute of an audio
CAPTCHA, as it can help to increase the difficulty for an
automated procedure to solve it (Jurafsky and Martin, 2008).
3.2.1. Background noise
The background noise, which can be added during the
production of a voice message, can make CAPTCHA particu-
larly resistant to attacks by automated bots. Application of
background noise requires a great variety of such noises to be
available. These noises should be rotated in an erratic
manner. In our proposal, instead of developing a repository
with noises we chose to proceed with a dynamic production of
them, while ensuring that they are distorted in a random
manner. The way various noises are produced should prevent
their easy elimination by automated programs that use
learning techniques (Tam et al., 2008a). In any case, the final
version of the audio message, resulting from the combined
use of different distortion techniques and added noise, should
be such that the majority of users can easily recognize it. In
the proposed CAPTCHA there was a real-time distortion,
applied in between the characters, as there appears to be no
effective method for evaluating how people understand digits
with distortion.
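The dynamic-noise idea can be illustrated with a few lines of Python that mix freshly generated Gaussian noise into a PCM sample buffer at a target signal-to-noise ratio. This is a hedged sketch: the SNR value and the function name are assumptions for illustration, not the paper's actual production code.

```python
import math
import random

def add_background_noise(samples, snr_db=10.0):
    """Mix freshly generated Gaussian white noise into a list of PCM
    samples, scaled so the mix has roughly the requested SNR (dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_std = math.sqrt(signal_power / 10 ** (snr_db / 10.0))
    return [s + random.gauss(0.0, noise_std) for s in samples]
```

Because the noise is drawn anew on every call, no two snapshots share the same noise track, which is the property the text asks for.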
3.2.2. Intermediate noise
Intermediate noise may prevent an automated program from
correctly isolating the spoken characters in a voice message.
The developer needs to select the scale in which the inter-
mediate noise will be applied, because intermediate noise can
decrease not only the automated bot success rate but also that
of the user (Festa, 2003). Also, as this noise should have the
same characteristics as the background noise, it should be
created dynamically.
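A sketch of this splicing step: per-character clips are concatenated with noise-filled gaps of random length, generated on the fly. The gap range and noise level below are assumed values for illustration only.

```python
import random

def splice_with_intermediate_noise(clips, gap_range=(800, 2400), noise_std=50.0):
    """Concatenate per-character audio clips, filling each gap between
    characters with dynamically generated noise of random length."""
    out = []
    for i, clip in enumerate(clips):
        out.extend(clip)
        if i < len(clips) - 1:                       # no gap after last clip
            gap = random.randint(*gap_range)
            out.extend(random.gauss(0.0, noise_std) for _ in range(gap))
    return out
```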
3.3. Time attributes
A set of variables should be defined during the production of
an audio snapshot (Gibbs et al., 1994). The variables refer to
the length of the audio message, which depends on: (a) the
number of characters spoken, (b) the characters chosen, and
(c) the time required for each character to be announced,
which in turn depends on the speaker of each character. Both,
the beginning and the end of each spoken character, should
also be defined. This depends on the duration of each char-
acter, as well as on the duration of the pause between spoken
characters. If the above time parameters follow specific
patterns, then the resistance of the audio CAPTCHA to a bot
will decrease significantly. In the proposed CAPTCHA we aim
at eliminating such time-related patterns.
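One way to eliminate such patterns is to draw each inter-character pause independently, so that character onsets follow no fixed schedule. The pause range below is an assumed value, not taken from the paper.

```python
import random

def randomized_onsets(char_durations_ms, pause_range_ms=(200, 700)):
    """Return the start time (ms) of each spoken character, with an
    independently drawn pause after every character so that onset
    spacing exhibits no fixed pattern a bot could exploit."""
    onsets, t = [], 0
    for duration in char_durations_ms:
        onsets.append(t)
        t += duration + random.randint(*pause_range_ms)
    return onsets
```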
3.4. Audio production attributes
In principle, an audio CAPTCHA production procedure should
be automated. In practice, an acceptable human interference
could be allowed only for the adjustment of the various
thresholds.
3.4.1. Automated production process
The automation of the CAPTCHA production process is
a desirable, though hard to achieve, property. The various
elements that compose an audio CAPTCHA, such as the
number of characters of a message, the speaker of each
character, the background sound, the timing and the distor-
tion of the message, make the process time-costly and
demanding in terms of hardware resources. Our choice is to
produce audio CAPTCHA periodically, in order: (a) not to
produce them in real-time, and (b) not to produce identical
snapshots for extended time periods.
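One way to realize this periodic production is a pool that rebuilds its whole batch once it is older than a refresh period. The class and parameter names below are illustrative assumptions sketching the idea, not the paper's implementation.

```python
import random

class CaptchaPool:
    """Pre-generate a batch of CAPTCHA snapshots and rebuild the whole
    batch once it is older than `period` seconds, so snapshots are
    neither produced in real time nor reused indefinitely."""
    def __init__(self, generate, size=50, period=3600.0):
        self.generate = generate
        self.size = size
        self.period = period
        self._built_at = float("-inf")
        self._batch = []

    def get(self, now):
        if now - self._built_at > self.period:   # batch expired: rebuild
            self._batch = [self.generate() for _ in range(self.size)]
            self._built_at = now
        return random.choice(self._batch)
```

In a deployment, `now` would be the current monotonic clock; it is passed in explicitly here to keep the sketch testable.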
3.4.2. Audio CAPTCHA reappearance
An audio CAPTCHA should reappear as rarely as possible.
However, with short alphabets every CAPTCHA is actually
expected to reappear after a while. Due to the attributes of the
voice messages (e.g., technical distortion, added noise,
language, speakers, etc.), as well as to the context of the user
(e.g., noisy environment, etc.), a voice message sometimes
cannot be identified by the user on the first attempt. There-
fore, a second chance should be given. In this case, a different
CAPTCHA should be used.
3.4.3. Audio CAPTCHA reproduction
An audio CAPTCHA should be reproduced in a streaming way.
The main reason for this is that most of the bots need
a training session before they are able to solve a CAPTCHA.
Therefore, if the audio reproduction process is not streaming,
then the bot could easily download all audio CAPTCHA that
are needed for the training session.
Fig. 2 refers to all the attributes of an audio CAPTCHA.
4. Audio CAPTCHA evaluation
In this section we evaluate some popular audio CAPTCHA
utilizing the abovementioned characteristics. First, we collected
twelve (12) different audio CAPTCHA, not only from popular
websites (i.e., Google, Hotmail, Recaptcha), but also from other
sources (Secure Image CAPTCHA). For each of them we
downloaded 100 examples (in .wav or .mp3 format), resulting in
a total of 1200 audio files that were used for the evaluation.
Then, for each audio CAPTCHA we provide a short
description of its functionality. We conclude with a table
that includes all these CAPTCHA, together with their
attributes.
Two interesting points, regarding our analysis, are:
1. User’s success rate was calculated by inviting 10 users to
solve 5 CAPTCHA of each implementation. All CAPTCHA
were in English, which was the mother tongue of one (1) of
the participants (as a requirement, all users should speak
English). All users had a university degree. Also, they all use
a PC for more than 20 h/week.
2. The "automated creation" attribute was not put in-place
for the commercial CAPTCHA (Google, MSN), as their rele-
vant algorithms are not publicly available.

Fig. 2 – Audio CAPTCHA attributes.
4.1. Google
The Google Audio CAPTCHA uses a limited data field of ten
digits (0, ..., 9), which seems not adequate for every situation;
however, it is suitable for a VoIP system. The number of digits
for each audio CAPTCHA is not fixed, but it ranges from 5 to 10
digits. Moreover, this CAPTCHA is available in multiple
languages. This CAPTCHA uses background and intermediate
noise. The noise at the beginning is louder and then a different
speaker is used for the announcement of each character. In
addition, the duration of a CAPTCHA ranges from 20 to 50 s
(based on our Google Audio sample). Google uses three beeps
every time an audio CAPTCHA begins. These beeps make the
audio CAPTCHA vulnerable to attacks because it is much easier
for a bot to know when a CAPTCHA begins. Furthermore,
Google Audio CAPTCHA is announced twice in every audio file,
therefore an attacker can process it twice and has multiple
attempts to find the right answer. Finally, the most important
drawback is the user success rate, which is not adequately high.
4.2. MSN
The MSN Audio CAPTCHA uses a limited data field of ten (10)
digits, with a fixed number of spoken characters (10) in each
one. The frequency of the spoken characters varies, since
a number of different speakers are used. That makes MSN
Audio CAPTCHA vulnerable to attacks. Also, it is available in
multiple languages. MSN uses weak and constant background
noise. The distance between the words is, to a large extent,
constant. Moreover, the duration of the CAPTCHA is not always
the same (e.g., one CAPTCHA lasts 0:07, another 0:16). There
are no beeps at the beginning of this audio CAPTCHA. The main
advantage of MSN Audio CAPTCHA is that it is easy for a user to
understand. As a result, the user success rate is high.
4.3. Recaptcha
The Recaptcha Audio CAPTCHA uses a large data field that
includes various phrases. The number of spoken
words varies, and it is available only in English. Recaptcha uses
no background noise. On the other hand, it uses distortion
techniques and multiple speakers, with different pronuncia-
tion and different pace. The user can hear the audio
CAPTCHA twice in one audio file (like Google). Recaptcha does not
use beeps. The duration of this CAPTCHA is almost fixed.
Moreover, the user success rate is significantly low. Recaptcha
Audio CAPTCHA meets most of the requirements for an
effective tool. Its main drawbacks are the vocabulary (includes
more than digits), as well as the user success rate, which is
low. The latter happens because it seems not easy for a user to
understand the words and their combination.
4.4. eBay
The eBay Audio CAPTCHA has a limited data field of ten (10)
digits (0–9). The number of spoken characters is always six (6).
The CAPTCHA uses different speakers and it is available in
several languages, depending on the specific eBay sites (i.e.,
the digits in www.ebay.fr are pronounced in French). More-
over, there is a different background noise for each digit, but
there is no intermediate noise. Finally, the duration of the
CAPTCHA, as well as the speaker pace, are both fixed. The
main advantages of this implementation are the high user
success rate, the lack of beeps at the beginning or end of the
CAPTCHA, and its streaming reproduction.
4.5. Secure Image CAPTCHA
Secure Image CAPTCHA uses an adequate data field of digits
(0–9) and letters (A–Z). The number of spoken characters is
fixed and it is available only in English. On the other hand, this
CAPTCHA uses the same speaker all the time. Moreover, it
uses simple background noise and there is no intermediate
one. Also, the CAPTCHA duration and the speaker pace are
fixed. Secure Image CAPTCHA is an open-source free PHP
CAPTCHA script; therefore most of the attributes can be fine-
tuned. However, there is no functionality allowing the auto-
mated production of new CAPTCHA instances. The main
advantage of this implementation is the high user success
rate.
4.6. Mp3Captcha
This CAPTCHA (Mp3Captcha) uses an adequate data field of
digits (0–9) and letters (A–Z). Also, it is available in multiple
languages, which is very helpful for non-English users.
Moreover, it does not use beeps at the beginning or specific
extra tokens that help the bot understand when the charac-
ters of the CAPTCHA are announced. On the other hand, there
is only one speaker, which makes it easy for a computer-based
audio recognition tool to identify the spoken characters
correctly. Additionally, there is no background noise and there
are no distortion techniques. The
duration of the CAPTCHA is fixed and the time for solving the
CAPTCHA is short. Furthermore, it uses a specific number of
spoken characters and the pace is fixed. Finally, the main
advantage is that the user success rate is high.
4.7. Captchas.net
The Captchas.net audio CAPTCHA (Captchas.net) uses letters
and digits. Also, this implementation is friendly to non-
English users, as it is available in the most popular languages.
When a character in the CAPTCHA is a letter, then a word is
announced and the requested answer is the first letter of this
word. For example, if the announced word is ‘‘horse’’, then the
requested character is ‘‘h’’. The number of spoken characters
is fixed; therefore the CAPTCHA is vulnerable to attacks. The
implementation uses distortion techniques and NATO
pronunciation, but no background noise. The speaker is
always the same person. The pace and the duration of the
CAPTCHA are fixed. There are no beeps at the beginning and
no extra tokens. The user success rate is high and the duration
for solving the CAPTCHA is short.
4.8. Bokehman
Bokehman’s (Bokehman Audio CAPTCHA) data field includes
numbers (0–9), letters (A–Z), and some extra tokens. These
tokens are the words ‘‘capital’’ and ‘‘lower’’, which the user
hears before the announcement of each character, so as to
understand whether the following letter is lowercase or
uppercase. The use of extra tokens makes the CAPTCHA
vulnerable, because a bot can identify them easily and
understand when to expect each character. Moreover, it is
available only in English. The implementation does not use
background noise or distortion techniques. The spoken char-
acters are always four (4). Finally, it always uses the same
speaker, the same pace, and the same duration. The user
success rate is high, but the implementation suffers draw-
backs, due to the use of mainly static characteristics.
4.9. Slashdot
Slashdot audio CAPTCHA (Slashdot) uses a strong data field
that contains letters (A–Z) and words. First the speaker says
the whole word and then he/she spells it. This makes the
CAPTCHA solution easier for the users. Moreover, each word
contains a different number of characters, which makes the
CAPTCHA even harder. Also, this implementation does not
use extra tokens or beeps at the beginning. On the other hand,
it is available only in English, it does not use background
noise, the speaker is always the same and the duration of each
CAPTCHA is almost fixed. Additionally, these CAPTCHA
reappear often. There is no available information about their
production process. Finally, we should mention that the user
success rate is one of the highest (95%).
4.10. Authorize
Authorize audio CAPTCHA (Authorize) data field uses digits
(0–9) and letters (A–Z). The number of spoken characters is
fixed. There is no use of beeps or extra tokens. On the other
hand, it is available only in English. Moreover, there is no
background noise and no use of distortion techniques, which
make the CAPTCHA vulnerable to attacks. Also, the speaker is
always the same and the duration is fixed. Finally, it is easy for
a user to understand.
4.11. AOL
AOL audio CAPTCHA (AOL) data field uses letters (A–Z) and
digits (0–9). The number of spoken characters is fixed. There
are two speakers: one says some of the characters and the
other says the rest. The sequence is not fixed, but changes
from one CAPTCHA to another. It is available only in English. It
uses voices for background noise, but no distortion tech-
niques. The duration is fixed. It does not use extra tokens. It
uses three (3) beeps not only at the beginning, but also at the
end of the CAPTCHA. This makes the CAPTCHA vulnerable to
attacks, as a bot can be programmed to identify when the
CAPTCHA starts and ends. Finally, this CAPTCHA imple-
mentation is easy for a user to understand.
4.12. Digg
The last audio CAPTCHA is Digg (DIGG). It uses an adequate
data field of digits (0–9) and letters (A–Z). The number of
spoken characters is fixed (i.e., 5). Moreover, it is available only
in English. Digg uses a constant background noise, which is
louder at the end. It also uses a pause before the announce-
ment of each character. The speaker is the same and the
duration of the CAPTCHA is fixed. Digg’s developers suggested
a way to defeat a bot; i.e., they randomly put a sound in an
audio CAPTCHA (the background noise for every character),
without including any character. However, this is not hard for
a bot to identify (this sound is always the same) and just
ignore it. This implementation is easy for a user to
understand.
Table 1 depicts the main attributes of the previously
described audio CAPTCHA implementations.
Table 1 – Audio CAPTCHA comparative overview.
(Columns, in order: Google | MSN | Recaptcha | eBay | Secure Image CAPTCHA | Mp3Captcha | Captchas.net | Bokehman | Slashdot | Authorize | AOL | Digg)

User success rate: 60% | 80% | 50% | 95% | 98% | 98% | 98% | 98% | 95% | 95% | 95% | 95%
Background noise: Voice, noise | Voice, noise | Noise | Noise, voice | Noise | None | None | None | None | None | Voice | Noise
Intermediate noise: Noise | Noise | None | None | None | None | None | None | None | None | Noise | None
Data field: 0–9 | 0–9 | Phrases | 0–9 | A–Z, a–z, 0–9 | A–Z, a–z, 0–9 | a–z, 0–9 | A–Z, a–z, 0–9 | Word (a–z) | A–Z, a–z, 0–9 | A–Z, a–z, 0–9 | A–Z, a–z, 0–9
Spoken characters variation: 5–10 | 10 | Yes | 6 | 4 | 4 | 6 | 4 | <9 | 5 | 8 | 5
Streaming reproduction: Yes | Yes | Yes | Yes | No | Yes | Yes | No | No | Yes | Yes | Yes
Rare reappearance: Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes
Production process: Not applicable | Not applicable | Not applicable | Not applicable | Automated | Automated | Automated | Automated | Not applicable | Not applicable | Not applicable | Not applicable
Language requirements: Multiple languages | Multiple languages | en | Multiple languages | en | en, fr, it, de | en, de, it, nl, fr | en | en | en | en | en
Various speakers: Yes | No | Yes | Yes | No | No | No | No | No | No | Yes | No
Duration (sec): 0:10–0:15 | 0:05–0:09 | ~0:04 | ~0:04 | ~0:04 | ~0:04 | ~0:08 | 0:04–0:05 | 0:03–0:04 | 0:05 | 0:10 | 0:08
Beeps (before, after): 3, 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3, 2 | 0
5. CAPTCHA bots
Given the user success rate of a CAPTCHA, one has to test it
against automated audio recognition tools. In this paper
a state-of-the-art open-source speech recognition tool
(SPHINX) was used (Walker et al., 2004; SPHINX). In addition,
a frequency and energy peak detection bot, called devoicecaptcha
(Defeating Audio (Voice) CAPTCHA), was also utilized. The
criteria for selecting those two bots were (a) they have proven
record for audio CAPTCHA solving, especially the devoice-
captcha bot (Bursztein and Bethard, 2009), (b) they are widely
used, and (c) both can easily adapted in a VoIP environment.
5.1. Automatic speech recognition bots
Automatic speech recognition (ASR), also known as computer
speech recognition, is the technology that makes it possible
for a computer to identify the components of human speech.
The process begins with a spoken utterance being captured by
a microphone or an audio file and ends with the recognized
words being output by the system. In particular, the basic
function is to convert the spoken word to properly encoded
data that can be recognized by a computer. The ultimate aim
of this technology is to identify, in real time and with a degree
of success close to 100%, words spoken by humans, regardless
of the size of vocabulary, noise levels, the characteristics of
speaker like pronunciation, and conditions of the channel
through which the human voice is transmitted.
On a practical level, ASR tools can achieve high performance
when used in controlled conditions, i.e., conditions that do not
introduce additional information unrelated to the acoustic signal
being recognized. Thus, the environments in which a high degree
of success is achieved are usually characterized by the absence
of any form of noise or distortion. Depending on the extent of
the various restrictions, there are different levels of perfor-
mance. The closer the conditions are to the optimal ones the
higher the performance is. In order for an ASR to work, it has
to build a Speech Recognition Engine (SRE). An SRE requires
two types of files to recognize speech.
- An acoustic model, which is created by taking audio
recordings of speech and 'compiling' them into statistical
representations of the sounds that make up each word (this
process is called 'training').
- A language model, or grammar file, which contains the
available vocabulary and the probabilities of word sequences.
When the vocabulary is limited, no training is required to
recognize a small number of words (e.g., the ten digits) as
spoken by most speakers. Such systems are popular for
routing incoming phone calls to their destinations in large
organizations. This is why we used these tools without
any special training.
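As a toy illustration of the second file type, the sketch below encodes a digits-only grammar as a uniform unigram model and scores word sequences against it. The dictionary layout and probabilities are illustrative assumptions, not SPHINX's actual model-file format:

```python
import math

# A minimal "grammar file" for a digits-only task: ten equally likely words.
DIGIT_VOCAB = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
# Uniform unigram probabilities: with no training data, every digit is equally likely.
UNIGRAM = {word: 1.0 / len(DIGIT_VOCAB) for word in DIGIT_VOCAB}

def sentence_log_prob(words):
    """Log-probability of a word sequence under the unigram grammar.
    Out-of-vocabulary words get probability 0, i.e. -inf log-probability."""
    total = 0.0
    for w in words:
        p = UNIGRAM.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# A digit sequence is in-grammar; an arbitrary English word is rejected outright.
print(sentence_log_prob(["six", "nine", "two"]))  # finite log-probability
print(sentence_log_prob(["hello"]))               # -inf: out of vocabulary
```

Restricting the grammar this way is exactly why a digits-only task needs no speaker-specific training: the recognizer only ever has to discriminate between ten alternatives.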
5.1.1. SPHINX
There are various ASR methods/tools to recognize the words
spoken in an audio file. Among the most known ones are the
Hidden Markov Model Toolkit (HTK) and SPHINX (the latest
version of Carnegie Mellon University’s repository of Sphinx
speech recognition systems was developed by CMU, SUN
Microsystems and Mitsubishi Electric Research Laboratories).
We decided to use SPHINX because it is open-source – thus easily
configurable – and has a large community of developers who use
and maintain it. HTK was developed at the Cambridge University
Engineering Department, but its source has not been made
publicly available. The major advantage of SPHINX is that it
has pluggable language/grammar and acoustic models.
In our test environment, we used a language model called
HUB4. It uses a large vocabulary and a customized acoustic
model, and expects more than just digits. Other language models,
like TIDIGITS, were not used because, even though their
vocabulary contains only digits, they cannot handle random distortion.
5.2. Frequency and energy peak detection bots
The second category for an automated audio recognition bot
employs frequency and energy peak detection methods. It can
be used for solving audio CAPTCHA, for the following reasons:
- Such bots have been proven effective: demonstrative (though
perhaps not thorough enough) tests of such bots against
popular audio CAPTCHA implementations (e.g., SPIT prevention
infrastructures, registrations for visually impaired people, etc.)
have been successful (Defeating Audio (Voice) CAPTCHA;
Breaking Gmail's audio CAPTCHA).
- Such bots are easy to implement: frequency and energy peak
detection bots are comparatively easy to implement using
open-source software.
- Such bots require limited time to solve a CAPTCHA: fast
CAPTCHA solving is required because most services leave
a small time frame for their users to solve the tests (5–15 s),
especially when VoIP services are considered. The
CAPTCHA solving bot must analyze and reform the solution
into the desired form (SIP message, DTMF, etc.) within
this limited time frame.
- Such bots require a small amount of system resources: an automated
SPAM attack is chosen when its cost is lower than
employing humans. Also, a ''spitter'' performs multiple
attacks simultaneously (e.g., the goal is to initiate many SIP calls or
messages in parallel). Thus, a bot must be inexpensive in
terms of system resources, which allows the spammer/
spitter to run several instances of the bot at the same time.
Regarding time constraints, frequency and energy peak
detection processes are less demanding than approaches
using different methods, such as Hidden Markov Models
(HMM) (HTK).
There are certain drawbacks when using these bots, mainly
because they require a training session. In this session a user
identifies a number of selected CAPTCHA: he/she recognizes
the announced characters and records them in a file, from
which the bot receives the data needed to solve the CAPTCHA.
The set of training audio CAPTCHA might be extensive if the
CAPTCHA data field (alphabet) is large. However, in a VoIP
system the available alphabet is relatively small, as it contains
only digits (0–9), which increases the applicability of the
mechanism.
5.2.1. The bot used
For the purpose of this paper we used the devoicecaptcha bot
developed by Vorm (Defeating Audio (Voice) CAPTCHA). This
bot uses frequency analysis and energy peak detection, in
order to segment and solve an audio CAPTCHA in real-time.
The bot works as follows: first it reads the audio file and skips
as many starting bytes as the user has predefined (to avoid the
starting bells that some implementations have, e.g., Google).
Then, the samples are processed with a hamming window
defined by the user. Each block is transformed into the
frequency domain using Discrete Fourier Transformation. The
frequencies are put in a predefined number of bins (the bins
are not equally wide, the higher the frequency the larger the
band). After that, the bot looks at the highest frequency bin.
Every block that has more energy in a window than the pre-
defined threshold energy is considered a peak (see Fig. 3).
These peaks are used to segment the audio file in the different
spoken digits. Then the bot looks for a number of windows
around the peaks and prints all the frequency bins. This is the
profile of the digit. The profiles of the digits are then compared
to the ones in the training file. The closest match is chosen as
a possible guess for each digit.
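The windowing, DFT and peak-detection steps described above can be sketched as follows. This simplified version thresholds the total energy of each block rather than individual frequency bins, and the window size and threshold are illustrative values, not the bot's actual parameters:

```python
import numpy as np

def find_energy_peaks(samples, rate=8000, win=256, threshold=0.5):
    """Flag energetic blocks as candidate digit positions: Hamming-window each
    block, transform it to the frequency domain, and compare its energy to a
    predefined threshold (a simplification of the per-bin analysis above)."""
    peaks = []
    for start in range(0, len(samples) - win, win):
        block = samples[start:start + win] * np.hamming(win)  # Hamming window
        spectrum = np.abs(np.fft.rfft(block))                 # DFT magnitudes
        energy = float(np.sum(spectrum ** 2)) / win           # block energy
        if energy > threshold:
            peaks.append(start / rate)                        # onset time (s)
    return peaks

# Synthetic check: 1 s of faint noise with a loud 200 ms tone in the middle.
rng = np.random.default_rng(0)
rate = 8000
t = np.arange(rate) / rate
samples = 0.01 * rng.standard_normal(rate)
samples[3200:4800] += np.sin(2 * np.pi * 440 * t[3200:4800])
peaks = find_energy_peaks(samples, rate)
print(peaks)  # only blocks overlapping the tone are reported
```

Segmenting on such peaks is exactly what the intermediate noise introduced later in the paper defeats: extra high-energy regions produce spurious peaks.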
During the training session of the bot the user gives as
input to the bot an audio CAPTCHA. Then, for each profile
of the digit that the bot chooses the user enters which
digit it actually was (this procedure can be automated if
the user gives a name to the audio files accordingly, i.e., if
an audio CAPTCHA file includes digits 6, 9 and 2, the file
name can be ''692.wav''). The larger the number of audio
CAPTCHA in the training set, the higher the bot's
success ratio will be.
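The file-naming convention mentioned above, which automates the labeling step of the training session, can be sketched as follows (the directory layout is hypothetical):

```python
import os

def labels_from_filename(path):
    """Recover the spoken digits from the file-naming convention suggested
    above (e.g. '692.wav' for a CAPTCHA announcing 6, 9 and 2), so the
    training session needs no manual transcription."""
    stem = os.path.splitext(os.path.basename(path))[0]
    if not stem.isdigit():
        raise ValueError("file name must encode the spoken digits, e.g. '692.wav'")
    return [int(ch) for ch in stem]

print(labels_from_filename("training/692.wav"))  # [6, 9, 2]
```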
Fig. 3 – Audio analysis of the bot.
6. CAPTCHA applicability for VoIP environment
In this section we discuss which of the CAPTCHA in Section 4
could be candidates for anti-SPIT purposes. The only requirement
is that the vocabulary be limited to digits {0, ..., 9}, as the audio
CAPTCHA will be used in an SIP-based VoIP system, where DTMF
signals need to be sent. Sending letters to answer a CAPTCHA
could be difficult for an average user: not many users can write
3–4 letters with a phone keypad (e.g., pressing a key multiple
times to get the right letter) in a short time period. An
implementation of this kind should not ignore or underestimate
the digital divide.
Based on the algorithm introduced in Section 2, the user
success rate should be high (>80%). The Google and Recaptcha
CAPTCHA cannot meet this requirement. Nearly the same
user success rates were also presented by Bigham and Cav-
ender (2009). Moreover, the Recaptcha uses phrases (not
digits).
DIGG, AOL, Slashdot and Authorize include characters
other than digits. They are also not open-source; therefore,
their data field cannot be altered. As for the eBay audio
CAPTCHA, it has already been ''cracked'' (Bursztein and
Bethard, 2009).
The problem with the MSN CAPTCHA is the number of digits
each instance includes. In the user tests that we performed
with normal phones, the user success rate decreased
significantly, from 80% to 25%, because it was not easy for
a user to type the digits and hear the CAPTCHA at the same
time, or to remember all 10 digits and type them after the
CAPTCHA ends. The MSN CAPTCHA can be of practical use
only if the telephone device has a microphone and a headphone
separate from the telephone keypad.
The remaining CAPTCHA implementations (Secure Image
CAPTCHA; Captchas.net; Bokehman Audio CAPTCHA;
Mp3Captcha) could be, in principle, used for anti-SPIT
purposes. Even though their vocabulary contains letters, this
can be changed to only digits because they are open-source.
However, in practice only the Secure Image CAPTCHA and the
Captchas.net can be taken into account, because Bokehman
and Mp3Captcha are very similar to the Captchas.net (i.e., no
background noise) and they are both more vulnerable to
attacks (they use only one speaker (Tam et al., 2008a; Chan,
2003)).
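The elimination reasoning of this section can be condensed into a small filter. The attribute values below are distilled from Table 1 and the discussion (MSN is entered with its 25% effective rate on normal phones, and only a subset of the implementations is shown), so this is a summary of the argument, not new data:

```python
# Attribute snapshot distilled from Table 1 and the discussion above.
CAPTCHAS = {
    "Google":       {"success": 0.60, "digits_only": True,  "open_source": False, "cracked": False},
    "MSN":          {"success": 0.25, "digits_only": True,  "open_source": False, "cracked": False},
    "Recaptcha":    {"success": 0.50, "digits_only": False, "open_source": False, "cracked": False},
    "eBay":         {"success": 0.95, "digits_only": True,  "open_source": False, "cracked": True},
    "Secure Image": {"success": 0.98, "digits_only": False, "open_source": True,  "cracked": False},
    "Captchas.net": {"success": 0.98, "digits_only": False, "open_source": True,  "cracked": False},
}

def voip_candidates(captchas):
    """Keep implementations whose user success rate exceeds 80% and whose data
    field either already is digits-only or can be restricted to digits
    (open-source), excluding those that have already been cracked."""
    return sorted(name for name, a in captchas.items()
                  if a["success"] > 0.80
                  and (a["digits_only"] or a["open_source"])
                  and not a["cracked"])

print(voip_candidates(CAPTCHAS))  # ['Captchas.net', 'Secure Image']
```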
Fig. 4 – Evaluation of audio CAPTCHA (success rate, in %, of SPHINX, devoicecaptcha and users against the Secure Image CAPTCHA and Captchas.net).
6.1. Evaluation of selected audio CAPTCHA
At this stage, we have decided upon the two selected
CAPTCHAs. The next step was to evaluate them against the
two bots presented in Section 5.
For the devoicecaptcha bot we had to create a training
session, because it works with a comparison to a training set.
We took 50 audio files of each CAPTCHA as a training set and
tested it with the remaining 50 audio files. The result was
a clear defeat of the two CAPTCHA, as the bot had a 77%
success rate for the Secure Image CAPTCHA and an 81%
success rate for Captchas.net. Both success rates are high
enough for the two CAPTCHA to be considered ineffective.
For the SPHINX test environment, a small custom application
was created to decode multiple wav files in batch form and
output the corresponding results. Even though the SPHINX
success rate was not high, it was large enough (>8%) for
the two implementations to be considered ineffective.
Both experiments were conducted on a Windows XP SP2 PC
with a 2.1 GHz Core2Duo processor and 2 GB of RAM. The
results are depicted in Fig. 4, which also includes the users'
success rates from Table 1.
To sum up, based on the aforementioned tests and the VoIP
system requirements (e.g., only digits in vocabulary), we
concluded that there is practically no existing audio CAPTCHA
implementation that could be considered as efficient enough
for a VoIP system.
7. Audio CAPTCHA experimental environment/integration
We now proceed to the development of a new audio CAPTCHA
implementation. A key question for the development of such
a new CAPTCHA is whether it is applicable to a VoIP system, in
particular in an SIP-based environment. This section describes
our laboratory VoIP system, the development of the new audio
CAPTCHA, the applicability of the bot in the SIP-based VoIP
system, and the results of the user evaluation.
7.1. Experimental lab infrastructure and CAPTCHA integration
The test computing environment is depicted in Fig. 5. It
consists of two SIP proxy servers. The SIP server application is
SIP Express Router (SER 2.0) (SER server version 2.0), a scalable
and reliable open-source software that can act as an SIP
registrar, proxy, or redirect server. Each of the
SIP servers creates a different VoIP domain. Both the bots'
host computer and the users belong to the first domain. The
callee, who is protected by the proposed audio CAPTCHA,
belongs to the second domain. The functionality of the second
domain has been extended, in order to be able to send/stream
an audio CAPTCHA. Each time a call reaches the second
domain, the call is redirected to a media server, which
Fig. 5 – Laboratory infrastructure.
reproduces the audio CAPTCHA and validates the caller’s or
bots’ answer.
The media server is the SIP Express Media server (SEMS)
(SIP express media server version 2.0), which is a reliable
media and application server for SIP-based VoIP services. In
order for the caller (user or bot) to hear the audio CAPTCHA,
a media session should be established by exchanging SIP
messages. The SIP message number of the audio CAPTCHA is
182 and the subject (header field) is ‘‘CAPTCHA’’.
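The 182 provisional response carrying the CAPTCHA marker might look as sketched below; apart from the status code 182 (whose standard SIP reason phrase is ''Queued'') and the Subject header described above, all header values are placeholders rather than the actual lab configuration:

```python
def captcha_182_response(call_id, cseq, from_hdr, to_hdr):
    """Build a provisional 182 response announcing the CAPTCHA. The
    'Subject: CAPTCHA' header matches the marker described above; the Via
    branch and domain names are illustrative placeholders."""
    return "\r\n".join([
        "SIP/2.0 182 Queued",
        "Via: SIP/2.0/UDP proxy.domain2.example;branch=z9hG4bK-demo",
        "From: " + from_hdr,
        "To: " + to_hdr,
        "Call-ID: " + call_id,
        "CSeq: %d INVITE" % cseq,
        "Subject: CAPTCHA",
        "Content-Length: 0",
        "",
        "",
    ])

msg = captcha_182_response("demo-call-1", 1,
                           "<sip:caller@domain1.example>",
                           "<sip:callee@domain2.example>")
print(msg.splitlines()[0])  # SIP/2.0 182 Queued
```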
7.2. Bots’ applicability to SIP-based VoIP
In order to integrate the bots into an SIP-based VoIP system
and examine their applicability, we decided that the
implementation procedure would include three stages (the
procedure and the SIP messages exchanged between the
participating entities are presented in Fig. 6).
Stage 0: this stage is controlled by the administrator of the
callee's domain (Domain2). When the callee's domain receives
an SIP INVITE message, there are three possible distinct
outcomes: (a) forward the message to the callee, (b) reject the
message, and (c) send a CAPTCHA to the caller (UA1). In the
test environment we forward every INVITE message to the
media server, which sends a CAPTCHA to the caller.
Stage 1: an audio CAPTCHA is sent (in the form of a 182
message) to the caller (UA1). In the proposed implementation,
the caller is replaced by a bot. It must record the audio
CAPTCHA, reform it to an appropriate audio format (wav,
8000 Hz, 16 bit) and identify the announced digits. The
procedure depends mainly on the time needed to reform the
message. Moreover, the particular bot needs approximately
0.10 s to identify a 3-digit CAPTCHA and 0.15 s to identify a 4-
digit one.
Stage 2: when the bot has generated an answer, it forms an SIP
message using SIPp (SIPp traffic generator for the SIP
protocol), which includes the DTMF answer. This answer is
sent as a reply to the CAPTCHA. If the caller does not receive
a 200 OK message, a new CAPTCHA is sent and the bot starts
recording again (Stage 1).
The above procedure should be completed within a specific
time frame. The time slot opens when the audio file is received
by the caller and closes when the timeout of the user’s input
expires (defined by the CAPTCHA service provider; see Fig. 7). The
duration of the CAPTCHA playback does not affect the time
frame, because the waiting time for an answer starts when the
playback is complete. If an answer arrives before the timeout,
then it is validated by the CAPTCHA service (and if it is correct the
call is established), otherwise the bot has another try. In our
implementation, the bot is given 6 s to respond to the
CAPTCHA, whereas the maximum number of attempts is set
to three (3).
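The timeout-and-retry procedure above can be sketched as a small challenge loop; the three callback hooks are placeholders for the media-server-side operations, not an actual SEMS API:

```python
import time

def run_captcha_challenge(send_captcha, await_answer, validate,
                          timeout_s=6.0, max_attempts=3):
    """Challenge loop mirroring the procedure above: the answer timer starts
    once playback is complete, and the caller gets at most three attempts.
    send_captcha/await_answer/validate stand in for the media-server hooks."""
    for _ in range(max_attempts):
        expected = send_captcha()              # Stage 1: stream a fresh CAPTCHA
        deadline = time.monotonic() + timeout_s
        answer = await_answer(deadline)        # DTMF digits, or None on timeout
        if answer is not None and time.monotonic() <= deadline \
                and validate(expected, answer):
            return True                        # 200 OK: establish the call
    return False                               # reject after three failed tries

# Toy run: a caller that answers every challenge correctly on the first try.
print(run_captcha_challenge(lambda: "692",
                            lambda deadline: "692",
                            lambda expected, got: expected == got))  # True
```

Note that a fresh CAPTCHA is generated on every attempt, matching the retry behavior described above.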
Table 2 illustrates the time required by the various stages
in the proposed implementation. The selected bot can properly
answer the CAPTCHA puzzle in much less time than the time
Fig. 6 – SIP message exchange for CAPTCHA.
Fig. 7 – A CAPTCHA time frame.
frame. Since the CAPTCHA should be easy for users, we
suggest that the time frame in which the caller should answer
the CAPTCHA puzzle be no less than 3 s. This is
because many groups of users, such as minors or the elderly, may
not be able to respond promptly. Finally, we note that our bots'
host computer can complete the two stages for 82 CAPTCHA
simultaneously.
7.3. User applicability
Thirty-two users were invited to solve the CAPTCHA samples,
most of them aged between 20 and 30 years old. Most were
university students (21 out of 32), and 6 participants were older
than 40. All CAPTCHA were in English, which was the mother
tongue of 1 of the participants (every user was required to speak
English). In order for the users to take the tests, all users' PCs
(depicted in Fig. 5 as
for the user to take the tests, all users’ PCs (in Fig. 5 depicted as
the caller) were equipped with soft-IP-phones (X-lite and
Twinkle). These phones were used to initiate a call, to listen to
the CAPTCHA, and to send the answer in a DTMF tone format.
8. Audio CAPTCHA implementation process
In this section, the details of the development of a new audio
CAPTCHA will be explained.
8.1. Selected attributes
In order to develop an effective new audio CAPTCHA, we
decided upon the following attributes:
Different announcers (speakers): the announcer (speaker) of each
and every digit is selected randomly among a given set of
(more than one) speakers.
Random positioning of each digit in the CAPTCHA: the digits used
by the CAPTCHA are physically distributed randomly in the
available space.
Background noise of each digit: background noise, randomly
selected, is added to each and every digit of the audio
CAPTCHA.

Table 2 – Stage duration.

| Stage | Step | Duration (s) |
|-------|------|--------------|
| 1 | Reform audio | ~1.00 |
| 1 | Identify digits | ~0.15 |
| 2 | Create SIPp message | ~0.40 |
| 2 | Send SIPp message | ~0.00 |
|   | Total duration | ~1.55 |

The audio noise files are segments (from 1 to 3 s) of
randomly selected music files. They are not auto-generated by
other methods (e.g., creation of white noise). We tried to
ensure that the noise will be least annoying for the user to
listen to. The background and intermediate noises were
automatically generated in-line with the requirements set
forth by a statistical analysis. The volume level of the noise is
lower than the level of the digits, so that they remain audible
to the users.
Loud noise between digits: loud noise is introduced between the
digits (the noise is not very loud, in order to minimize the
discomfort of the user).
Different duration and file size: each audio CAPTCHA file has
different duration and different size.
Vocabulary: the vocabulary was limited to digits {0, ., 9},
because the audio CAPTCHA was designed for an SIP-based
VoIP system where DTMF signals need to be sent.
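A generator combining the attributes above might be sketched as follows; the data layout (`digit_bank` mapping each digit to one recording per speaker) and the mixing gains are illustrative assumptions, not the paper's actual implementation:

```python
import random
import numpy as np

def _loop_to(noise, n):
    """Loop a noise clip to exactly n samples."""
    return np.tile(noise, -(-n // len(noise)))[:n]

def build_audio_captcha(digit_bank, noise_bank, n_digits=4, rate=8000):
    """Assemble one audio CAPTCHA with the attributes listed above: a random
    announcer per digit, quiet background noise behind each digit, louder
    noise between digits, and random spacing (so every file has a different
    duration and size). Returns (samples, answer)."""
    answer, pieces = [], []
    for _ in range(n_digits):
        digit = random.randrange(10)
        answer.append(digit)
        voiced = random.choice(digit_bank[digit])             # random announcer
        background = _loop_to(random.choice(noise_bank), len(voiced))
        pieces.append(voiced + 0.2 * background)              # digit stays audible
        gap_len = random.randint(rate // 4, rate)             # 0.25-1.0 s spacing
        pieces.append(0.6 * _loop_to(random.choice(noise_bank), gap_len))
    return np.concatenate(pieces[:-1]), answer                # drop trailing gap
```

Because speakers, noise clips, gap lengths and digit positions are all drawn at random, two calls practically never produce the same waveform, which matches the "different duration and file size" attribute.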
8.2. CAPTCHA development
The audio CAPTCHA development was carried out in five
stages, in terms of the number of attributes adopted. Each
development stage was tested and evaluated for its efficiency,
according to the success rate of the bot and the success
rate of human users.
During Stage 1, the produced audio CAPTCHA was
pronounced by a single announcer. It did not include additional
features, such as background noise or noise between
the digits. The first digit of every CAPTCHA started at the exact
same point, and the time difference between two consecutive
digits was fixed. The waveforms of the resulting 3- and 4-digit
CAPTCHA appear in Fig. 8a and b. In
such a simple audio CAPTCHA, a bot can use a detection
method (e.g., energy peak detection) and easily segment and
recognize the digits. An important factor in this process is the
number of audio CAPTCHA that was used during the training
of the devoicecaptcha bot. If a small number was used, then
there is a high chance that not all digits are given as an input
to the training process; thus, the bot may have a low success
rate. That is the case with the 4-digit CAPTCHA (Fig. 8b). The
random training sequence did not involve many instances of
some digits (such as 8 and 9); therefore, even though the bot
recognized successfully a large number of CAPTCHA, it failed
to recognize others and resulted in a relatively low (69%) bot
success rate.
The SPHINX software did not achieve such a high success
rate; it reached only 27%. The main reason for
this is that there was no background and intermediate noise
within the CAPTCHA.
During Stage 2, the audio CAPTCHA was produced by using
7 different announcers. Each digit was pronounced by
a randomly selected announcer. Even though this affected the
Fig. 8 – a) Stage 1 (3 digits). b) Stage 1 (4 digits).
Fig. 9 – a) Stage 2 (3 digits). b) Stage 2 (4 digits).
success of the devoicecaptcha bot in the case of 3-digit
CAPTCHA, it did not do so in the case of the 4-digit ones. This
mainly hinges upon the training set. Moreover, for the same
number of training CAPTCHA instances, 4-digit ones offer
more digits to the training procedure. For example, if 100 3-digit
CAPTCHA are used for training, 300 digits are recorded,
whereas with the same number of 4-digit CAPTCHA 400 digits
are recorded.
The SPHINX software's success rate decreased dramatically (i.e.,
0.9% for the 3-digit CAPTCHA and 0.7% for the 4-digit
CAPTCHA). This is because there was considerable background
noise, due to the microphone recording. Fig. 9a and
b shows the waveforms of the produced digits.
In Stage 3 background noise was added behind each digit.
This suppressed the success rate of the devoicecaptcha bot
to 30% for the 3-digit CAPTCHA and 55% for the 4-digit ones,
but it still remained relatively high. Fig. 10a and
b shows the waveforms of the produced digits with the
background noise. The high success rate is due to the ability of
the frequency bot to cut off the low-energy sounds (i.e., the
noise) by checking against a certain energy threshold. In
that way, it can – in most cases – isolate the noise behind each
digit. The difference between the successes of 3- and 4-digit
CAPTCHA is due to the difference in the training sets. In this
case, a training set of 50 audio CAPTCHA was allowed for the
3-digit ones and 150 for the 4-digit ones. As a result, the available
digits taking part in the training process were 150 and 600,
respectively.
The SPHINX software repeated the same low success rate,
because the background noise added further difficulty for
solving the CAPTCHA.
In Stage 4 the volume of the background noise of each digit
was raised. Although the devoicecaptcha bot’s success rate
fell noticeably (10–15% success), and the SPHINX software was
unable to solve any CAPTCHA correctly, the produced audio
CAPTCHA was too difficult for the users to solve, as the loud
background noise made it hard to distinguish the spoken
digits. For that reason, loud background noise was not
included in our final strategy.
In Stage 5 loud noise was introduced between every pair
of digits (intermediate noise) (Fig. 11a and b). This resulted in
the devoicecaptcha bot being unable to segment the audio file
correctly. This happened because there were more energy
peaks than the digits spoken. The loud intermediate noises
were recognized as additional digits, because they produce
high energy peaks as well, when transformed with the Discrete
Fourier Transformation. As a consequence, this bot could not
be trained, as it failed to successfully recognize any digits.
The SPHINX software repeated the same low success rate.
The main issue remains that such speech recognition tools are
Fig. 10 – a) Stage 3 (3 digits). b) Stage 3 (4 digits).
Fig. 11 – a) Stage 5 (3 digits). b) Stage 5 (4 digits).
effective only in ‘‘controlled’’ conditions, such as with only
one speaker and without any noise (Section 3).
Stage 5 is described in more detail in Fig. 12, where the
CAPTCHA includes intermediate noise between the digits.
When the bot transforms such an audio file into the frequency
domain, the energy peaks that it finds correspond to both digits
and noise. As a result, the bot recognizes more digits than are
actually included in the file. One possible countermeasure for
the devoicecaptcha bot would be to raise or lower the energy
threshold. In either case (Fig. 12), the bot would still fail. If the
threshold energy is very high, the bot misses some of the digits
in the CAPTCHA, while still recognizing some intermediate
noise as digits. On the other hand, if the threshold energy is
lowered, the bot recognizes all digits, but all intermediate
noises are also considered digits. Thus, the bot would assume
that there were 12–15 digits in the CAPTCHA.
8.3. CAPTCHA testing
Users' and bots' success rates are the main factors that
determine whether a CAPTCHA is efficient or not. The
corresponding success rates, as per the CAPTCHA described in
Section 5.2, appear in Fig. 13a–c. Each added attribute improved
the efficiency of the CAPTCHA and directly affected the user and
bot success rates. The CAPTCHA developed in Stage 5 had an
average user success rate of 87%, with an average bots’
success rate of less than 1%.
8.4. CAPTCHA implementation
During the implementation of the proposed audio CAPTCHA,
the audio files had the following attributes:
a) They were produced automatically; therefore, they can be
updated at random time periods without human inter-
vention. The overall process for creating a full set of 3-digit
CAPTCHA took 8 s, whereas creating a full set of 4-digit
CAPTCHA took 107 s. Thus, the reproduction of the whole
set of CAPTCHA does not cause significant overload to our
VoIP system (the VoIP server was a 2.1 GHz Core2Duo, with
2 GB RAM).
b) All constituent parts of the audio CAPTCHA, such as the
digits and the noise, lie in different folders. Moreover, each
time a set of CAPTCHA is produced, the program selects
randomly each digit from a different announcer, as well as
a random background noise.
c) The noise between the digits is selected randomly and has
different volume and energy.
d) The noise and the pronounced digits have random dura-
tion, which results in a random duration of each audio
CAPTCHA.
Table 3 depicts the attributes of the proposed VoIP
CAPTCHA implementation. The attributes are the same as
those in Table 1. It is clear that all the requirements stated in
Section 6 for a VoIP CAPTCHA were fulfilled and, moreover,
that the proposed CAPTCHA is bot-resistant.
9. Discussion and limitations
The evaluation process of the current CAPTCHA implementations
included the positive and negative characteristics
of each one. Moreover, the user success rate for every
CAPTCHA was presented, but the bot success rate was
provided only for those that are easily applicable to a VoIP
infrastructure. The remaining CAPTCHA could be evaluated
for their resistance against bots in future work.
Additionally, the testing environment for the proposed
VoIP CAPTCHA is a lab environment; therefore, there might be
issues when integrating the proposed CAPTCHA into the
overall security infrastructure of a VoIP provider. However,
Fig. 12 – Demonstration of the devoicecaptcha bot failing to solve the CAPTCHA.
Fig. 13 – a) SPHINX success rates. b) Devoicecaptcha bot success rates. c) Users success rates (success rate, in %, per stage, for 3- and 4-digit CAPTCHA).
Table 3 – Proposed VoIP CAPTCHA attributes.

| Attribute | Value |
|---|---|
| User success rate | 88% |
| Background noise | Music, noise |
| Intermediate noise | Voice, music, noise |
| Data field | 0–9 |
| Spoken characters variation | 3–4 |
| Streaming reproduction | Yes |
| Rare reappearance | Yes |
| Production process | Automated |
| Language requirements | Multiple languages |
| Various speakers | Yes |
| Duration (sec) | 2–6 |
| Beeps (before, after) | 0 |
further experimentation clearly requires the co-operation of
a major SIP-based VoIP service provider, especially for business
purposes, since the applicability of the mechanism has
been introduced and justified in this paper.
A limitation of the proposed CAPTCHA is that its effectiveness
and its attributes could not be evaluated against additional
audio/speech recognition tools, such as those introduced by
Tam et al. (2008a).
Another possible limitation was the sample of users used
for experimentation. The experimental procedure could
consider different populations of users and take into
consideration the specific requirements of each group.
1 The pseudo-random C function rand was used for producing the CAPTCHA.
10. Conclusions
CAPTCHA are expected to play a key role in preventing email
spam and voice spam (SPIT) in the near future. In order for
them to be effective, they must be easy to solve for the users,
while at the same time very hard for bots to pass.
In this paper, we provided the reader with an overview of
existing audio CAPTCHA implementations, in order to identify
their main characteristics. Based on these characteristics, we
identified two of them that might, in principle, be
appropriate audio CAPTCHA for a VoIP system. After an
evaluation process, which included a test procedure by two
speech recognition tools, we demonstrated that the existing
audio CAPTCHA implementations are clearly inadequate
candidates for a VoIP system.
As a result of the aforementioned facts, we proposed a new
audio CAPTCHA implementation. This CAPTCHA incorporates
several attributes, such as different digit announcers, background
noise behind each digit, and noise between digits, all of
them applied in a random1 and automated way.
Then, we produced a number of audio CAPTCHA, which are
regularly refreshed, with a limited chance of creating the
same instance of an audio CAPTCHA more than once, and
which are reproduced in streaming mode. The production of the
CAPTCHA was done in five stages. At each stage the CAPTCHA
was tested not only by a number of users, but also by two
automated speech recognition tools (SPHINX and the
devoicecaptcha bot). The bots managed to achieve a high success rate
during the first four stages (up to 98%), but that rate dropped
dramatically at the last one (less than 2%). That was mainly
due to the addition of intermediate noises, which made the
bot unable to segment properly the audio file, to be trained
properly, and thus to solve the CAPTCHA.
We also determined an appropriate level of background
noise for each digit, so that the CAPTCHA remains solvable by
users and difficult for bots to break. However, such a low bot
success rate could not have been achieved without the
combination of all the above-mentioned attributes. Each
attribute alone is not enough to make a CAPTCHA robust; it is
the combination of the features that makes the CAPTCHA
resistant.
r e f e r e n c e s
Authorize, www.authorize.net/application/ [retrieved 07.05.09].AOL, http://my.aol.com/ [retrieved 07.05.09].von Ahn L, Blum M, Hopper N, Langford J. CAPTCHA: using hard
AI problems for security. In: Biham E, editor. Proceedings ofthe international conference on the theory and applications ofcryptographic techniques (EUROCRYPT ’03). Poland: Springer;2003. p. 294–311 (LNCS 2656).
von Ahn L, Blum M, Langford J. Telling humans and computerapart automatically. Communications of the ACM 2004;47(2):57–60.
von Ahn L, Maurer B, McMillen C, Abraham D, Blum M.reCAPTCHA: human-based character recognition via websecurity measures. Science 2008;321(5895):1465–8.
Blum M, von Ahn L, Langford J, Hopper N. The CAPTCHA project,USA, November 2000.
Bigham J, Cavender A. Evaluating existing audio CAPTCHAoptimized for non-visual use. In: Proceedings of the ACMconference on human factors in systems (CHI 2009), USA;2009, p. 1829–38.
Breaking Gmail’s Audio CAPTCHA, http://blog.wintercore.com/?p¼11 [retrieved 10.10.08].
Bursztein E, Bethard S. Decaptcha: breaking 75% of eBay audio CAPTCHAs. In: Proceedings of the 3rd USENIX workshop on offensive technologies (WOOT ’09), Canada; 2009.
Bokehman Audio CAPTCHA, http://bokehman.com/captcha_verification.php [retrieved 5.05.09].
Chellapilla K, Larson K, Simard P, Czerwinski M. Building segmentation based human friendly human interaction proofs. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM Press; 2005. p. 711–20.
Chew M, Baird H. Baffletext: a human interactive proof. In: Proceedings of the 10th SPIE/IS&T document recognition and retrieval conference, USA; 2003, p. 305–16.
Chan T-Y. Using a text-to-speech synthesizer to generate a Reverse Turing Test. In: Proceedings of the 15th IEEE international conference on tools with artificial intelligence (ICTAI ’03); 2003, p. 226.
Captchas.net, http://captchas.net/ [retrieved 02.05.09].
Dusan S, Rabiner L. On integrating insights from human speech perception into automatic speech recognition. In: INTERSPEECH, Portugal; 2005, p. 1233–6.
Defeated CAPTCHA, http://libcaca.zoy.org/wiki/PWNtcha [retrieved 18.05.08].
DIGG, http://digg.com/ [retrieved 07.05.09].
Defeating Audio (Voice) CAPTCHA, http://vorm.net/captchas/ [retrieved 30.08.09].
eBay Audio CAPTCHA, https://scgi.ebay.com/ws/eBayISAPI.dll?RegisterEnterInfo [retrieved 03.07.09].
Festa P. Spam-bot tests flunk the blind. CNET, News.com. Available at: www.news.com/2100-1032-1022814.html; July 2, 2003.
Gibbs S, Breiteneder C, Tsichritzis D. Data modeling of time-based media. In: Proceedings of the ACM SIGMOD international conference on management of data, USA; 1994, p. 91–102.
Google Audio CAPTCHA, www.google.com/accounts/NewAccount [retrieved 26.03.09].
Graham-Rowe D. A sentinel to screen phone calls. MIT Technology Review; 2006 [accessed 08.11.09].
HTK: hidden Markov model toolkit, http://htk.eng.cam.ac.uk/ [retrieved 10.10.08].
Jurafsky D, Martin J. Speech and language processing: an introduction to natural language processing, computational linguistics and speech recognition. Prentice-Hall; 2008.
Mori G, Malik J. Recognizing objects in adversarial clutter: breaking a visual CAPTCHA. In: Proceedings of the computer vision and pattern recognition conference. IEEE Press; 2003. p. 134–41.
Markkola A, Lindqvist J. Accessible voice CAPTCHAs for Internet telephony. In: Proceedings of the 2008 symposium on accessible privacy and security (SOAPS 2008), USA; 2008.
Mp3Captcha, http://scripts.titude.nl/ [retrieved 02.05.09].
Quittek J, Niccolini S, Tartarelli S, Stiemerling M, Brunner M, Ewald T. Detecting SPIT calls by checking human communication patterns. In: Proceedings of the IEEE international conference on communications (ICC ’07), United Kingdom; 2007, p. 1979–84.
MSN Audio CAPTCHA, https://signup.live.com/ [retrieved 26.03.09].
Recaptcha Audio CAPTCHA, http://recaptcha.net/learnmore.html[retrieved 03.07.09].
Rosenberg J, Jennings C, Peterson J. The session initiation protocol (SIP) and spam. IETF Internet-Draft draft-ietf-sipping-spam-02; March 6, 2006.
Secure Image CAPTCHA, www.phpcaptcha.org [retrieved 28.03.09].
Slashdot, http://slashdot.org/login.pl?op=newuserform [retrieved 5.05.09].
SPHINX: the CMU Sphinx group open source speech recognition engines, http://cmusphinx.sourceforge.net/html/cmusphinx.php [retrieved 02.06.09].
SER server version 2.0, www.iptel.org/ser [retrieved 20.03.09].
SIP express media server version 2.0, www.iptel.org/sems [retrieved 20.03.09].
SIPp traffic generator for the SIP protocol, http://sipp.sourceforge.net/ [retrieved 30.09.08].
Turing A. Computing machinery and intelligence. Mind October 1950;LIX(236):433–60.
Tam J, Simsa J, Hyde S, von Ahn L. Breaking audio CAPTCHAs. In: Advances in neural information processing systems (NIPS); 2008.
Tam J, Huggins-Daines JD, von Ahn L, Blum M. Improving audio CAPTCHAs. In: Proceedings of the 2008 symposium on accessible privacy and security (SOAPS 2008), USA; July 2008.
Trend Micro’s TrendLabs, threat reports, http://us.trendmicro.com/imperia/md/content/us/trend-watch/researchandanalysis/threat_roundup_may_2009.pdf; May 2009.
Walker W, Lamere P, Kwok P, Raj B, Singh R, Gouvea E, et al. Sphinx-4: a flexible open source framework for speech recognition. Sun Microsystems, Technical Report TR-2004-139; November 2004.
Yan J, El Ahmad A. CAPTCHA security: a case study. IEEE Security and Privacy July/August 2009;7(4):22–8.
Yan J, El Ahmad A. Breaking visual CAPTCHAs with naive pattern recognition algorithms. In: Samarati P, et al., editors. Proceedings of the 23rd annual computer security applications conference (ACSAC ’07). USA: IEEE Computer Society; 2007. p. 279–91.
Yan J, El Ahmad A. A low-cost attack on a Microsoft CAPTCHA. In: Proceedings of the 15th ACM conference on computer and communications security (CCS 2008), Virginia, USA; October 2008, p. 543–54.
Yan J, El Ahmad A. Usability of CAPTCHAs or usability issues in CAPTCHA design. In: Proceedings of the 2008 symposium on accessible privacy and security (SOAPS 2008), USA; 2008, p. 44–52.
Yannis Soupionis ([email protected]) is a Researcher and a Ph.D. student with the Information Security and Critical Infrastructure Protection Research Group of the Dept. of Informatics, Athens University of Economics and Business (AUEB), Greece. He holds a B.Sc. (Informatics and Telecommunications, Univ. of Athens) and an M.Sc. (Information Systems, AUEB). His current research interests include information systems security management, formal security policies, security and privacy in Voice over IP (VoIP) telephony, and information systems risk assessment/management.
Dimitris Gritzalis ([email protected]) is a Professor of ICT Security and the Director of the Information Security and Critical Infrastructure Protection Research Group, Dept. of Informatics, Athens University of Economics and Business (AUEB), Greece. He holds a B.Sc. (Mathematics, Univ. of Patras), an M.Sc. (Computer Science, City University of New York) and a Ph.D. (Critical Information Systems Security, Univ. of the Aegean). He has published 7 books and more than 120 technical papers. His current research interests focus on security in AmI, VoIP systems security, and critical infrastructure protection. He has served as Associate Commissioner of the Greek Data Protection Commission, as well as the President of the Greek Computer Society. He is the Editor of the Computers & Security Journal.