computers & security 29 (2010) 603–618
available at www.sciencedirect.com
journal homepage: www.elsevier.com/locate/cose
Audio CAPTCHA: Existing solutions assessment and a new implementation for VoIP telephony
Yannis Soupionis*, Dimitris Gritzalis
Information Security and Critical Infrastructure Protection Research Group, Dept. of Informatics, Athens University of Economics
& Business (AUEB), 76 Patission Ave., Athens GR-10434, Greece
article info
Article history:
Received 13 September 2009
Received in revised form
4 December 2009
Accepted 7 December 2009
Keywords:
SPIT
Audio CAPTCHA attributes
VoIP
Authentication
Evaluation
Speech Recognition
Turing Test
* Corresponding author. E-mail addresses: [email protected] (Y. Soupionis), [email protected] (D. Gritzalis).
0167-4048/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cose.2009.12.003
abstract
SPam over Internet Telephony (SPIT) is a potential source of future annoyance in Voice
over IP (VoIP) systems. A typical way to launch a SPIT attack is the use of an automated
procedure (i.e., bot), which generates calls and produces unsolicited audio messages.
A known way to protect against SPAM is a Reverse Turing Test, called CAPTCHA
(Completely Automated Public Turing Test to Tell Computers and Humans Apart). In this
paper, we evaluate existing audio CAPTCHA, as this format is more suitable for VoIP
systems and can help them fight bots. To do so, we first suggest specific attributes-requirements
that an audio CAPTCHA should meet in order to be effective. Then, we evaluate a set of
popular audio CAPTCHA against these attributes, and demonstrate that no existing
implementation is suitable enough for VoIP environments. Next, we develop and implement
a new audio CAPTCHA, which is suitable for SIP-based VoIP telephony. Finally, the new
CAPTCHA is tested against users and bots, and is demonstrated to be effective.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction

With the rapid worldwide growth of VoIP services, the spam
issue in VoIP systems becomes increasingly important
(Rosenberg et al., 2006), which is why major companies,
such as NEC and Microsoft, have already developed
mechanisms (Quittek et al., 2007; Graham-Rowe, 2006) to
tackle SPam over Internet Telephony (SPIT). A serious obstacle
when trying to prevent SPIT is identifying VoIP communications
which originate from software robots ("bots"). Alan
Turing's "Turing Test" paper (Turing, 1950) discusses the
special case of a human tester who wishes to distinguish
humans from computer programs. Nowadays, there has been
considerable interest in applying an alternate form of the
Turing Test, the so-called Reverse Turing Test. The term
"Reverse Turing Test" is used to indicate that the tester is not
a human but a machine. In the spam protection world this
kind of computer-administered Reverse Turing Test is also
called CAPTCHA (Completely Automated Public Turing Test to
Tell Computers and Humans Apart). The research interest in
this subject has spurred a number of relevant proposals (Blum
et al., 2000; von Ahn et al., 2003, 2004; Chellapilla et al., 2005;
Yan and El Ahamad, 2009). Commercial examples include
major stakeholders in the field, such as Google and MSN,
which require CAPTCHA (visual or audio), in order to provide
services to users. However, there exist computer programs,
which can break the CAPTCHA that have been proposed so far.
In this paper, an audio CAPTCHA is developed that is
suitable for use in VoIP systems. Specifically, we first present
the background and related work and explain the main
aspects of SPIT and CAPTCHA. Then, we provide the basic
requirements of a CAPTCHA, briefly explain why an audio
CAPTCHA is suitable for VoIP systems, and present an algo-
rithm for selecting a suitable CAPTCHA. In Section 3, a classi-
fication of the characteristics/attributes of audio CAPTCHA is
proposed. In Section 4 a number of popular CAPTCHA is
introduced. In Section 5, the procedure to be followed for
testing a CAPTCHA is described; this includes a bot and
a speech recognition tool. In Section 6 we demonstrate that
the existing audio CAPTCHA implementations are not
adequate enough for a VoIP system. In Section 7, the experi-
mental environment which was used for testing the proposed
CAPTCHA is presented. The VoIP experimental environment
was based on the Session Initiation Protocol (SIP), because it is
one of the best-known and most widely deployed multimedia
protocols for VoIP infrastructures. In Section 8, the new audio CAPTCHA is
presented, which is based on the attributes selected in Section
3. Finally, we provide the reader with the results of the tests
performed with the proposed CAPTCHA.
2. Background
SPIT constitutes an emerging type of threat in VoIP systems. It
exhibits several similarities to email spam. Both spammers
and ‘‘spitters’’ use the Internet, so as to target a group of users
and initiate bulk and unsolicited messages and calls.
Compared to traditional telephony, IP telephony provides
a more effective channel, since messages are sent in bulk and
at a low cost. Individuals can use spam-bots to harvest VoIP
addresses. Furthermore, since call-route tracing over IP is
harder, the potential for fraud is considerably greater.
A CAPTCHA is a method that is widely used to thwart
automated SPAM attacks. The same technique can be used to
mitigate SPIT. According to this, each time a callee receives
a call from an unknown caller, an automated Reverse Turing
Test would be triggered. The ‘‘spit-bot’’ needs to solve this test
in order to complete its attack. Integrating such a technique
into a VoIP system raises two main issues. First, the CAPTCHA
module should be combined with other anti-SPIT controls, i.e.,
not every call should pass through the CAPTCHA challenge,
since each CAPTCHA requires considerable computational
resources. A simultaneous triggering of several CAPTCHA
challenges can soon lead to denial of service. Challenges
would also cause annoyance to users, if they had to solve one
CAPTCHA for every call they make. Second, a CAPTCHA needs
to be friendly and easy to solve (‘‘pass’’) for a human user.
2.1. CAPTCHA
A CAPTCHA is a test that most humans should be able to pass,
but computer programs should not. Such a test is often based
on hard open AI problems, e.g., automatic recognition of dis-
torted text, or of human speech against a noisy background.
Differing from the original Turing Test, CAPTCHA challenges
are automatically generated and graded by a computer. Since
only humans are able to return a sensible response, an auto-
mated Turing Test embedded in a protocol can verify whether
there is a human or a bot behind the challenged computer.
Although the original Turing Test was designed as a measure
of progress for AI, CAPTCHA is rather a human-nature-
authentication mechanism.
This paper focuses on audio CAPTCHA. These were
initially created to enable people who are visually impaired to
register for or make use of a service that requires solving
a CAPTCHA. Today, an audio CAPTCHA would be useful to
defend against automated audio VoIP messages, as visual
CAPTCHA are hard to apply in VoIP systems, mainly due to the
limitations of end-user devices. For example, nowadays not
many people have a home telephony device with a screen
capable of displaying a proper (high resolution) image
CAPTCHA. If an adequate CAPTCHA is used, it should be hard
for a spit-bot to respond correctly and thus manage to initiate
a call. Also, audio CAPTCHA seems attractive, as text-based
CAPTCHA has been demonstrated breakable (Chew and Baird,
2003; Mori and Malik, 2003; Defeated CAPTCHA; Yan and El
Ahmad, 2007; Yan and El Ahamad, 2008).
2.2. Related work
As the audio CAPTCHA technology is practically in its infancy,
the relevant research work is currently limited.
Bigham and Cavender demonstrated that existing audio
CAPTCHA are clearly more difficult and time-consuming to
complete as compared to visual CAPTCHA (Bigham and Cav-
ender, 2009). They compared the existing
CAPTCHA implementations, but did not reach any
conclusion on how their characteristics affect the user
success rate. They developed and evaluated an optimized
interface for non-visual use, which can be added in-place to
an existing audio CAPTCHA. In their published CAPTCHA
evaluation they mentioned that Facebook, Veoh, and Craigs-
list use different CAPTCHA; today, all three of them use
Recaptcha (Recaptcha Audio CAPTCHA).
Tam et al. (2008a,b) described a number of security tests of
audio CAPTCHA. The authors used machine learning tech-
niques, which are similar to the ones used for breaking visual
CAPTCHA. They analyzed three audio CAPTCHA taken
from popular websites (Google (Google Audio CAPTCHA),
Recaptcha (Recaptcha Audio CAPTCHA), Digg (DIGG)). In some
cases they reached correct solutions with an accuracy of up to
71%. The main issue with this work is that they only tested
the audio CAPTCHA implementations and did not analyze
the impact of audio CAPTCHA characteristics on their
performance.
Yan and El Ahmad (2008) worked on the usability issues
that should be taken into consideration when developing
a CAPTCHA. Their work does not specifically focus on audio
CAPTCHA, with the exception of a few characteristics (i.e.,
character set). Their work was concluded with a framework
referring to CAPTCHA usability.
Bursztein and Bethard (2009) developed a prototype audio
CAPTCHA decoder, called decaptcha, which is able to success-
fully break 75% of the eBay audio CAPTCHA. They described an
automated process for downloading audio CAPTCHA, training
the decaptcha bot and finally solving the eBay CAPTCHA.
Finally, Markkola and Lindqvist (2008) proposed a number
of ‘‘voice’’ CAPTCHA for Internet telephony. However, they did
not explain in detail how this could be integrated into an
Internet telephony infrastructure. Also, their work lacks
experimentation results.
2.3. A new approach
In this paper, apart from classifying the audio CAPTCHA
attributes and evaluating the current audio CAPTCHA
implementations, a new audio CAPTCHA for VoIP environments will
be developed. The proposed CAPTCHA must be easy for human
users to solve, easy for a tester machine to generate and grade,
and hard for a software bot to solve. Its performance will be
validated by two means; namely, by user tests
and by a bot configured to solve "difficult" audio CAPTCHA.
The latter requirement implies that a specific kind of test
should be developed; i.e., a test that is easy to generate but
intractable to pass without knowledge that is available to
humans but not to machines. Audio recognition fits in this
category. For example, humans can easily identify words in an
environment, whereas this is usually hard for machines
(Dusan and Rabiner, 2005; von Ahn et al., 2008). Specification-
wise, a CAPTCHA should ideally be 100% effective at identifying
software bots, but it was proved (Chellapilla et al., 2005) that
a CAPTCHA can be designed to fight bots with a low failure
rate (i.e., <0.1%). Generically, a CAPTCHA is effective as long as
the cost of using a software robot remains higher than the cost
of using a human, even when the spammers use cheap labor to
solve CAPTCHA (Trend Micro’s TrendLabs).
In order to develop a new audio CAPTCHA, we followed an
iterative algorithm: (a) we selected a set of attributes that are
appropriate for audio CAPTCHA, (b) we developed a CAPTCHA
that is based on these attributes, and (c) we evaluated the
CAPTCHA by calculating the success rates of a bot and of
a number of users, until the results were adequate (Fig. 1).
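The three-step loop described above can be sketched in Python. This is a hedged illustration only: the helper names and the adequacy thresholds are assumed values, not part of the paper's actual tooling.

```python
# Illustrative sketch of the iterative algorithm (a)-(c) above.
# select_attributes, build_captcha, and evaluate stand in for the
# paper's actual tooling; the thresholds are assumed values.
def develop_captcha(select_attributes, build_captcha, evaluate,
                    min_user_rate=0.90, max_bot_rate=0.02, max_rounds=10):
    for _ in range(max_rounds):
        attributes = select_attributes()          # (a) pick attribute values
        captcha = build_captcha(attributes)       # (b) build the CAPTCHA
        user_rate, bot_rate = evaluate(captcha)   # (c) user and bot tests
        if user_rate >= min_user_rate and bot_rate <= max_bot_rate:
            return captcha                        # results are adequate
    return None                                   # no adequate design found
```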
3. CAPTCHA attributes
A high user success rate is a key factor in deciding whether
a new CAPTCHA is effective or not. This is particularly
important in the case of an audio CAPTCHA, as it does not only
refer to VoIP callers, but also to visually impaired users of
a VoIP service. Equally important is the bot success rate,
which should be kept to a minimum. Both factors depend on
a number of attributes. The main characteristic of these
attributes is that they should all be adjusted in the production
procedure of the CAPTCHA. We classified these attributes into
four categories: (a) vocabulary, (b) background noise, (c) time, and
(d) audio production.
Fig. 1 – A generic CAPTCHA development process.
3.1. Vocabulary attributes
Audio CAPTCHA designs vary, mainly due to the vocabulary
used. Variations depend upon: (a) the set of characters the
audio CAPTCHA consists of, (b) the number of characters of
a single CAPTCHA, and (c) the local settings, e.g., the language
that CAPTCHA characters belong to.
3.1.1. Adequate data field
A data field (called "alphabet") is used as a pool for selecting
the characters to be included in an audio CAPTCHA. In order to
integrate an audio CAPTCHA into a VoIP system, we chose an
alphabet of ten one-digit numbers, i.e., {0, ..., 9}. Such a choice
allows the use of the DTMF method for answering the audio
CAPTCHA. Other examples of audio CAPTCHA that use only
digits are the MSN and the Google ones. Moreover, some
CAPTCHA include beep sounds in their vocabulary, so as to
inform the user that the audio CAPTCHA begins. On the
other hand, a limited alphabet and beep sounds may make an
audio method quite vulnerable to attacks (Chan, 2003).
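Because the alphabet is restricted to the ten digits, a challenge can be checked directly against the caller's DTMF key presses. A minimal sketch follows; the length range and function names are illustrative assumptions, not the paper's implementation.

```python
import random

ALPHABET = "0123456789"  # each symbol maps directly onto a DTMF key

def make_challenge(min_len=4, max_len=7):
    """Draw a variable-length digit string from the ten-digit alphabet."""
    length = random.randint(min_len, max_len)
    return "".join(random.choice(ALPHABET) for _ in range(length))

def check_dtmf_answer(challenge, pressed_digits):
    """The caller answers by keying in the announced digits."""
    return pressed_digits == challenge
```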
3.1.2. Spoken characters variation
In order to make the CAPTCHA even harder for a bot to
solve, we introduce a number of different human speakers
for each digit of the alphabet. For example, if there are X
different speakers for each character, then there will be X
different ways to pronounce each character. This essentially
means that each speaker makes a difference for a bot, but
hardly for a human.
Another drawback for a CAPTCHA implementation is the
use of a fixed number of characters. A non-variable number of
characters, in combination with a limited alphabet, can make
a CAPTCHA vulnerable to attack. For example, if only 3-digit
CAPTCHA are used and a bot can successfully recognize only 2
of the digits, then it can reach a success rate of ~10% just by
guessing the remaining digit. On the other hand, if the number
of digits of a CAPTCHA is not fixed and a bot can successfully
recognize only 2 of them, then the number of remaining digits
is not known to the bot.
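The ~10% figure, and the benefit of a variable length, can be reproduced with a short simulation. The bot model below (perfect recognition of two digits, uniform guessing of the rest) follows the example in the text; the trial count is an arbitrary choice.

```python
import random

ALPHABET = "0123456789"

def bot_guess_rate(lengths, known=2, trials=200_000):
    """Simulate a bot that recognizes `known` digits with certainty and
    guesses every remaining digit uniformly from the ten-digit alphabet."""
    wins = 0
    for _ in range(trials):
        unknown = random.choice(lengths) - known
        # each unknown digit is guessed correctly with probability 1/10
        if all(random.random() < 1 / len(ALPHABET) for _ in range(unknown)):
            wins += 1
    return wins / trials
```

With a fixed length of 3 the rate converges to 1/10; with lengths drawn from {3, 4, 5} it drops to about (0.1 + 0.01 + 0.001)/3 ≈ 3.7%, even before accounting for the bot not knowing how many digits remain.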
3.1.3. Language requirements
Another important factor is the mother tongue of the users, as
it plays a major role in achieving a high human user success
rate. This is particularly important in the case of audio
methods, where identifying spoken characters is hard to do, in
case the mother tongue of the speaker and the user differs.
Therefore, the language should match the scope of the specific
CAPTCHA implementation. As a good practice, the spoken
characters should be no more than a few. The CAPTCHA we
developed can be adjusted for non-English users, as it is
created dynamically and different characters can be added
easily.
3.2. Noise attributes
Noise is yet another important attribute of an audio
CAPTCHA, as it can help to increase the difficulty for an
automated procedure to solve it (Jurafsky and Martin, 2008).
3.2.1. Background noise
The background noise, which can be added during the
production of a voice message, can make CAPTCHA particu-
larly resistant to attacks by automated bots. Application of
background noise requires a great variety of such noises to be
available. These noises should be rotated in an erratic
manner. In our proposal, instead of developing a repository
with noises we chose to proceed with a dynamic production of
them, while ensuring that they are distorted in a random
manner. The way various noises are produced should prevent
their easy elimination by automated programs that use
learning techniques (Tam et al., 2008a). In any case, the final
version of the audio message, resulting from the combined
use of different distortion techniques and added noise, should
be such that the majority of users can easily recognize it. In
the proposed CAPTCHA there was a real-time distortion,
applied in between the characters, as there appears to be no
effective method for evaluating how people understand digits
with distortion.
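The dynamic-noise idea can be illustrated with a few lines of Python that mix freshly generated Gaussian noise into a PCM sample buffer at a target signal-to-noise ratio. This is a hedged sketch: the SNR value and the function name are assumptions for illustration, not the paper's actual production code.

```python
import math
import random

def add_background_noise(samples, snr_db=10.0):
    """Mix freshly generated Gaussian white noise into a list of PCM
    samples, scaled so the mix has roughly the requested SNR (dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_std = math.sqrt(signal_power / 10 ** (snr_db / 10.0))
    return [s + random.gauss(0.0, noise_std) for s in samples]
```

Because the noise is drawn anew on every call, no two snapshots share the same noise track, which is the property the text asks for.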
3.2.2. Intermediate noise
Intermediate noise may prevent an automated program from
correctly isolating the spoken characters in a voice message.
The developer needs to select the scale in which the inter-
mediate noise will be applied, because intermediate noise can
decrease not only the automated bot success rate but also that
of the user (Festa, 2003). Also, as this noise should have the
same characteristics as the background noise, it should be
created dynamically.
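A sketch of this splicing step: per-character clips are concatenated with noise-filled gaps of random length, generated on the fly. The gap range and noise level below are assumed values for illustration only.

```python
import random

def splice_with_intermediate_noise(clips, gap_range=(800, 2400), noise_std=50.0):
    """Concatenate per-character audio clips, filling each gap between
    characters with dynamically generated noise of random length."""
    out = []
    for i, clip in enumerate(clips):
        out.extend(clip)
        if i < len(clips) - 1:                       # no gap after last clip
            gap = random.randint(*gap_range)
            out.extend(random.gauss(0.0, noise_std) for _ in range(gap))
    return out
```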
3.3. Time attributes
A set of variables should be defined during the production of
an audio snapshot (Gibbs et al., 1994). The variables refer to
the length of the audio message, which depends on: (a) the
number of characters spoken, (b) the characters chosen, and
(c) the time required for each character to be announced,
which in turn depends on the speaker of each character. Both,
the beginning and the end of each spoken character, should
also be defined. This depends on the duration of each char-
acter, as well as on the duration of the pause between spoken
characters. If the above time parameters follow specific
patterns, then the resistance of the audio CAPTCHA to a bot
will decrease significantly. In the proposed CAPTCHA we aim
at eliminating such time-related patterns.
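One way to eliminate such patterns is to draw each inter-character pause independently, so that character onsets follow no fixed schedule. The pause range below is an assumed value, not taken from the paper.

```python
import random

def randomized_onsets(char_durations_ms, pause_range_ms=(200, 700)):
    """Return the start time (ms) of each spoken character, with an
    independently drawn pause after every character so that onset
    spacing exhibits no fixed pattern a bot could exploit."""
    onsets, t = [], 0
    for duration in char_durations_ms:
        onsets.append(t)
        t += duration + random.randint(*pause_range_ms)
    return onsets
```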
3.4. Audio production attributes
In principle, an audio CAPTCHA production procedure should
be automated. In practice, an acceptable human interference
could be allowed only for the adjustment of the various
thresholds.
3.4.1. Automated production process
The automation of the CAPTCHA production process is
a desirable, though hard to achieve, property. The various
elements that compose an audio CAPTCHA, such as the
number of characters of a message, the speaker of each
character, the background sound, the timing and the distor-
tion of the message, make the process time-costly and
demanding in terms of hardware resources. Our choice is to
produce audio CAPTCHA periodically, in order: (a) not to
produce them in real-time, and (b) not to produce identical
snapshots for extended time periods.
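One way to realize this periodic production is a pool that rebuilds its whole batch once it is older than a refresh period. The class and parameter names below are illustrative assumptions sketching the idea, not the paper's implementation.

```python
import random

class CaptchaPool:
    """Pre-generate a batch of CAPTCHA snapshots and rebuild the whole
    batch once it is older than `period` seconds, so snapshots are
    neither produced in real time nor reused indefinitely."""
    def __init__(self, generate, size=50, period=3600.0):
        self.generate = generate
        self.size = size
        self.period = period
        self._built_at = float("-inf")
        self._batch = []

    def get(self, now):
        if now - self._built_at > self.period:   # batch expired: rebuild
            self._batch = [self.generate() for _ in range(self.size)]
            self._built_at = now
        return random.choice(self._batch)
```

In a deployment, `now` would be the current monotonic clock; it is passed in explicitly here to keep the sketch testable.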
3.4.2. Audio CAPTCHA reappearance
An audio CAPTCHA should reappear as rarely as possible.
However, with short alphabets every CAPTCHA is actually
expected to reappear after a while. Due to the attributes of the
voice messages (e.g., technical distortion, added noise,
language, speakers, etc.), as well as to the context of the user
(e.g., noisy environment, etc.), a voice message sometimes
cannot be identified by the user on the first attempt. There-
fore, a second chance should be given. In this case, a different
CAPTCHA should be used.
3.4.3. Audio CAPTCHA reproduction
An audio CAPTCHA should be reproduced in a streaming way.
The main reason for this is that most of the bots need
a training session before they are able to solve a CAPTCHA.
Therefore, if the audio reproduction process is not streaming,
then the bot could easily download all audio CAPTCHA that
are needed for the training session.
Fig. 2 refers to all the attributes of an audio CAPTCHA.
4. Audio CAPTCHA evaluation
In this section we evaluate some popular audio CAPTCHA
utilizing the abovementioned characteristics. First, we collected
twelve (12) different audio CAPTCHA, not only from popular
websites (i.e., Google, Hotmail, Recaptcha), but also from other
sources (Secure Image CAPTCHA). For each of them we
downloaded 100 examples (in .wav or .mp3 format), resulting in
a total of 1200 audio files that were used for the evaluation.
Then, for each audio CAPTCHA we provide a short
description of its functionality. We conclude with a table
that includes all these CAPTCHA, together with their
attributes.
Two interesting points, regarding our analysis, are:
1. User’s success rate was calculated by inviting 10 users to
solve 5 CAPTCHA of each implementation. All CAPTCHA
were in English, which was the mother tongue of one (1) of
the participants (as a requirement, all users should speak
English). All users had a university degree. Also, they all use
a PC for more than 20 h/week.
2. The "automated creation" attribute was not put in-place
for the commercial CAPTCHA (Google, MSN), as their rele-
vant algorithms are not publicly available.

Fig. 2 – Audio CAPTCHA attributes.
4.1. Google
The Google Audio CAPTCHA uses a limited data field of ten
digits (0, ..., 9), which seems not adequate for every situation;
however, it is suitable for a VoIP system. The number of digits
for each audio CAPTCHA is not fixed, but it ranges from 5 to 10
digits. Moreover, this CAPTCHA is available in multiple
languages. This CAPTCHA uses background and intermediate
noise. The noise at the beginning is louder and then a different
speaker is used for the announcement of each character. In
addition, the duration of a CAPTCHA ranges from 20 to 50 s
(based on our Google Audio sample). Google uses three beeps
every time an audio CAPTCHA begins. These beeps make the
audio CAPTCHA vulnerable to attacks because it is much easier
for a bot to know when a CAPTCHA begins. Furthermore,
Google Audio CAPTCHA is announced twice in every audio file,
therefore an attacker can process it twice and has multiple
attempts to find the right answer. Finally, the most important
drawback is the user success rate, which is not adequately high.
4.2. MSN
The MSN Audio CAPTCHA uses a limited data field of ten (10)
digits, with a fixed number of spoken characters (10) in each
one. The frequency of the spoken characters varies, since
a number of different speakers are used. That makes MSN
Audio CAPTCHA vulnerable to attacks. Also, it is available in
multiple languages. MSN uses weak and constant background
noise. The distance between the words is, to a large extent,
constant. Moreover, the duration of the CAPTCHA is not always
the same (e.g., one CAPTCHA lasts 0:07, another 0:16). There
are no beeps at the beginning of this audio CAPTCHA. The main
advantage of MSN Audio CAPTCHA is that it is easy for a user to
understand. As a result, the user success rate is high.
4.3. Recaptcha
The Recaptcha Audio CAPTCHA uses a large data field that
includes various phrases. The number of spoken
words varies, and it is available only in English. Recaptcha uses
no background noise. On the other hand, it uses distortion
techniques and multiple speakers, with different pronuncia-
tion and different pace. The user can hear the audio
CAPTCHA twice in one audio file (like Google). Recaptcha does not
use beeps. The duration of this CAPTCHA is almost fixed.
Moreover, the user success rate is significantly low. Recaptcha
Audio CAPTCHA meets most of the requirements for an
effective tool. Its main drawbacks are the vocabulary (includes
more than digits), as well as the user success rate, which is
low. The latter happens because it seems not easy for a user to
understand the words and their combination.
4.4. eBay
The eBay Audio CAPTCHA has a limited data field of ten (10)
digits (0–9). The number of spoken characters is always six (6).
The CAPTCHA uses different speakers and it is available in
several languages, depending on the specific eBay sites (i.e.,
the digits in www.ebay.fr are pronounced in French). More-
over, there is a different background noise for each digit, but
there is no intermediate noise. Finally, the duration of the
CAPTCHA, as well as the speaker pace, are both fixed. The
main advantages of this implementation are the high user
success rate, the lack of beeps at the beginning or end of the
CAPTCHA, and its streaming reproduction.
4.5. Secure Image CAPTCHA
Secure Image CAPTCHA uses an adequate data field of digits
(0–9) and letters (A–Z). The number of spoken characters is
fixed and it is available only in English. On the other hand, this
CAPTCHA uses the same speaker all the time. Moreover, it
uses simple background noise and there is no intermediate
one. Also, the CAPTCHA duration and the speaker pace are
fixed. Secure Image CAPTCHA is an open-source free PHP
CAPTCHA script; therefore most of the attributes can be fine-
tuned. However, there is no functionality allowing the auto-
mated production of new CAPTCHA instances. The main
advantage of this implementation is the high user success
rate.
4.6. Mp3Captcha
This CAPTCHA (Mp3Captcha) uses an adequate data field of
digits (0–9) and letters (A–Z). Also, it is available in multiple
languages, which is very helpful for non-English users.
Moreover, it does not use beeps at the beginning or specific
extra tokens that help the bot understand when the charac-
ters of the CAPTCHA are announced. On the other hand, there
is only one speaker, which makes it easy for a computer-based
audio recognition tool to identify the spoken characters
correctly. Additionally, there is no background noise and there
are no distortion techniques. The
duration of the CAPTCHA is fixed and the time for solving the
CAPTCHA is short. Furthermore, it uses a specific number of
spoken characters and the pace is fixed. Finally, the main
advantage is that the user success rate is high.
4.7. Captchas.net
The Captchas.net audio CAPTCHA (Captchas.net) uses letters
and digits. Also, this implementation is friendly to non-
English users, as it is available in the most popular languages.
When a character in the CAPTCHA is a letter, then a word is
announced and the requested answer is the first letter of this
word. For example, if the announced word is ‘‘horse’’, then the
requested character is ‘‘h’’. The number of spoken characters
is fixed; therefore the CAPTCHA is vulnerable to attacks. The
implementation uses distortion techniques and NATO
pronunciation, but no background noise. The speaker is
always the same person. The pace and the duration of the
CAPTCHA are fixed. There are no beeps at the beginning and
no extra tokens. The user success rate is high and the duration
for solving the CAPTCHA is short.
4.8. Bokehman
Bokehman’s (Bokehman Audio CAPTCHA) data field includes
numbers (0–9), letters (A–Z), and some extra tokens. These
tokens are the words ‘‘capital’’ and ‘‘lower’’, which the user
hears before the announcement of each character, so as to
understand whether the following letter is lowercase or
uppercase. The use of extra tokens makes the CAPTCHA
vulnerable, because a bot can identify them easily and
understand when to expect each character. Moreover, it is
available only in English. The implementation does not use
background noise or distortion techniques. The spoken char-
acters are always four (4). Finally, it always uses the same
speaker, the same pace, and the same duration. The user
success rate is high, but the implementation suffers draw-
backs, due to the use of mainly static characteristics.
4.9. Slashdot
Slashdot audio CAPTCHA (Slashdot) uses a strong data field
that contains letters (A–Z) and words. First the speaker says
the whole word and then he/she spells it. This makes the
CAPTCHA solution easier for the users. Moreover, each word
contains a different number of characters, which makes the
CAPTCHA even harder. Also, this implementation does not
use extra tokens or beeps at the beginning. On the other hand,
it is available only in English, it does not use background
noise, the speaker is always the same and the duration of each
CAPTCHA is almost fixed. Additionally, these CAPTCHA
reappear often. There is no available information about their
production process. Finally, we should mention that the user
success rate is one of the highest (95%).
4.10. Authorize
Authorize audio CAPTCHA (Authorize) data field uses digits
(0–9) and letters (A–Z). The number of spoken characters is
fixed. There is no use of beeps or extra tokens. On the other
hand, it is available only in English. Moreover, there is no
background noise and no use of distortion techniques, which
make the CAPTCHA vulnerable to attacks. Also, the speaker is
always the same and the duration is fixed. Finally, it is easy for
a user to understand.
4.11. AOL
AOL audio CAPTCHA (AOL) data field uses letters (A–Z) and
digits (0–9). The number of spoken characters is fixed. There
are two speakers: one says some of the characters and the
other says the rest. The sequence is not fixed, but changes
from one CAPTCHA to another. It is available only in English. It
uses voices for background noise, but no distortion tech-
niques. The duration is fixed. It does not use extra tokens. It
uses three (3) beeps not only at the beginning, but also at the
end of the CAPTCHA. This makes the CAPTCHA vulnerable to
attacks, as a bot can be programmed to identify when the
CAPTCHA starts and ends. Finally, this CAPTCHA imple-
mentation is easy for a user to understand.
4.12. Digg
The last audio CAPTCHA is Digg (DIGG). It uses an adequate
data field of digits (0–9) and letters (A–Z). The number of
spoken characters is fixed (i.e., 5). Moreover, it is available only
in English. Digg uses a constant background noise, which is
louder at the end. It also uses a pause before the announce-
ment of each character. The speaker is the same and the
duration of the CAPTCHA is fixed. Digg’s developers suggested
a way to defeat a bot; i.e., they randomly put a sound in an
audio CAPTCHA (the background noise for every character),
without including any character. However, this is not hard for
a bot to identify (this sound is always the same) and just
ignore it. This implementation is easy for a user to
understand.
Table 1 depicts the main attributes of the previously
described audio CAPTCHA implementations.
Table 1 – Audio CAPTCHA comparative overview.
(Columns, in order: Google | MSN | Recaptcha | eBay | Secure Image CAPTCHA | Mp3Captcha | Captchas.net | Bokehman | Slashdot | Authorize | AOL | Digg)

User success rate: 60% | 80% | 50% | 95% | 98% | 98% | 98% | 98% | 95% | 95% | 95% | 95%
Background noise: Voice, noise | Voice, noise | Noise | Noise, voice | Noise | None | None | None | None | None | Voice | Noise
Intermediate noise: Noise | Noise | None | None | None | None | None | None | None | None | Noise | None
Data field: 0–9 | 0–9 | Phrases | 0–9 | A–Z, a–z, 0–9 | A–Z, a–z, 0–9 | a–z, 0–9 | A–Z, a–z, 0–9 | Word (a–z) | A–Z, a–z, 0–9 | A–Z, a–z, 0–9 | A–Z, a–z, 0–9
Spoken characters variation: 5–10 | 10 | Yes | 6 | 4 | 4 | 6 | 4 | <9 | 5 | 8 | 5
Streaming reproduction: Yes | Yes | Yes | Yes | No | Yes | Yes | No | No | Yes | Yes | Yes
Rare reappearance: Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes
Production process: Not applicable | Not applicable | Not applicable | Not applicable | Automated | Automated | Automated | Automated | Not applicable | Not applicable | Not applicable | Not applicable
Language requirements: Multiple languages | Multiple languages | en | Multiple languages | en | en, fr, it, de | en, de, it, nl, fr | en | en | en | en | en
Various speakers: Yes | No | Yes | Yes | No | No | No | No | No | No | Yes | No
Duration (sec): 0:10–0:15 | 0:05–0:09 | ~0:04 | ~0:04 | ~0:04 | ~0:04 | ~0:08 | 0:04–0:05 | 0:03–0:04 | 0:05 | 0:10 | 0:08
Beeps (before, after): 3, 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3, 2 | 0
5. CAPTCHA bots
Given the user success rate of a CAPTCHA, one has to test it
against automated audio recognition tools. In this paper
a state-of-the-art open-source speech recognition tool
(SPHINX) was used (Walker et al., 2004; SPHINX). In addition,
a frequency and energy peak detection bot, called devoicecaptcha
(Defeating Audio (Voice) CAPTCHA), was also utilized. The
criteria for selecting those two bots were (a) they have proven
record for audio CAPTCHA solving, especially the devoice-
captcha bot (Bursztein and Bethard, 2009), (b) they are widely
used, and (c) both can easily adapted in a VoIP environment.
5.1. Automatic speech recognition bots
Automatic speech recognition (ASR), also known as computer
speech recognition, is the technology that makes it possible
for a computer to identify the components of human speech.
The process begins with a spoken utterance being captured by
a microphone or an audio file and ends with the recognized
words being output by the system. In particular, the basic
function is to convert the spoken word to properly encoded
data that can be recognized by a computer. The ultimate aim
of this technology is to identify, in real time and with a degree
of success close to 100%, words spoken by humans, regardless
of the size of vocabulary, noise levels, the characteristics of
speaker like pronunciation, and conditions of the channel
through which the human voice is transmitted.
On a practical level, ASR tools can achieve high performance
when used in controlled conditions, i.e., conditions that do not
introduce additional information unrelated to the acoustic signal
being recognized. Thus, the environments in which a high degree
of success is achieved are usually characterized by the absence
of any form of noise or distortion. Depending on the extent of
the various restrictions, there are different levels of perfor-
mance. The closer the conditions are to the optimal ones the
higher the performance is. In order for an ASR to work, it has
to build a Speech Recognition Engine (SRE). An SRE requires
two types of files to recognize speech.
- An acoustic model, which is created by taking audio
recordings of speech and 'compiling' them into statistical
representations of the sounds that make up each word (this
process is called 'training').
- A language model, or grammar file, which contains the
available vocabulary and the probabilities of word sequences.
When the vocabulary is limited, no training is required to
recognize a small number of words (e.g., the ten digits) as
spoken by most speakers. Such systems are popular for
routing incoming phone calls to their destinations in large
organizations. This is why we used these tools without
any special training.
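As a toy illustration of the second file type, the sketch below encodes a digits-only grammar as a uniform unigram model and scores word sequences against it. The dictionary layout and probabilities are illustrative assumptions, not SPHINX's actual model-file format:

```python
import math

# A minimal "grammar file" for a digits-only task: ten equally likely words.
DIGIT_VOCAB = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
# Uniform unigram probabilities: with no training data, every digit is equally likely.
UNIGRAM = {word: 1.0 / len(DIGIT_VOCAB) for word in DIGIT_VOCAB}

def sentence_log_prob(words):
    """Log-probability of a word sequence under the unigram grammar.
    Out-of-vocabulary words get probability 0, i.e. -inf log-probability."""
    total = 0.0
    for w in words:
        p = UNIGRAM.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# A digit sequence is in-grammar; an arbitrary English word is rejected outright.
print(sentence_log_prob(["six", "nine", "two"]))  # finite log-probability
print(sentence_log_prob(["hello"]))               # -inf: out of vocabulary
```

Restricting the grammar this way is exactly why a digits-only task needs no speaker-specific training: the recognizer only ever has to discriminate between ten alternatives.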
5.1.1. SPHINX
There are various ASR methods/tools to recognize the words
spoken in an audio file. Among the most known ones are the
Hidden Markov Model Toolkit (HTK) and SPHINX (the latest
version of Carnegie Mellon University’s repository of Sphinx
speech recognition systems was developed by CMU, SUN
Microsystems and Mitsubishi Electric Research Laboratories).
We decided to use SPHINX because it is open-source – thus easily
configurable – and has a large community of developers who use
and maintain it. HTK was developed at the Cambridge University
Engineering Department, but its source has not been made
publicly available. The major advantage of SPHINX is that it
has pluggable language/grammar and acoustic models.
In our test environment, we used a language model called
HUB4. It uses a large vocabulary and a customized acoustic
model, and expects more than just digits. Other language models,
like TIDIGITS, were not used because, even though their
vocabulary contains only digits, they cannot handle random distortion.
5.2. Frequency and energy peak detection bots
The second category for an automated audio recognition bot
employs frequency and energy peak detection methods. It can
be used for solving audio CAPTCHA, for the following reasons:
- Such bots have been proven effective: demonstrative (though
perhaps not thorough enough) tests of such bots against
popular audio CAPTCHA implementations (e.g., SPIT prevention
infrastructures, registrations for visually impaired people, etc.)
have been successful (Defeating Audio (Voice) CAPTCHA;
Breaking Gmail's audio CAPTCHA).
- Such bots are easy to implement: frequency and energy peak
detection bots are comparatively easy to implement using
open-source software.
- Such bots require limited time to solve a CAPTCHA: fast
CAPTCHA solving is required because most services leave
a small time frame for their users to solve the tests (5–15 s),
especially when VoIP services are considered. The
CAPTCHA solving bot must analyze and reform the solution
into the desired form (SIP message, DTMF, etc.) within
this limited time frame.
- Such bots require a small amount of system resources: an automated
SPAM attack is chosen when its cost is lower than
employing humans. Also, a ''spitter'' performs multiple
attacks simultaneously (e.g., the goal is to initiate many SIP calls or
messages in parallel). Thus, a bot must be inexpensive in
terms of system resources, which allows the spammer/
spitter to run several instances of the bot at the same time.
Regarding time constraints, frequency and energy peak
detection processes are less demanding than approaches
using different methods, such as Hidden Markov Models
(HMM) (HTK).
There are certain drawbacks when using these bots, mainly
because they require a training session. In this session a user
identifies a number of selected CAPTCHA: he/she recognizes
the announced characters and records them in a file, from
which the bot receives the data needed to solve the CAPTCHA.
The set of training audio CAPTCHA might be extensive if the
CAPTCHA data field (alphabet) is large. However, in a VoIP
system the available alphabet is relatively small, as it contains
only digits (0–9), which increases the applicability of the
mechanism.
5.2.1. The bot used
For the purpose of this paper we used the devoicecaptcha bot
developed by Vorm (Defeating Audio (Voice) CAPTCHA). This
bot uses frequency analysis and energy peak detection, in
order to segment and solve an audio CAPTCHA in real-time.
The bot works as follows: first it reads the audio file and skips
as many starting bytes as the user has predefined (to avoid the
starting bells that some implementations have, e.g., Google).
Then, the samples are processed with a hamming window
defined by the user. Each block is transformed into the
frequency domain using Discrete Fourier Transformation. The
frequencies are put in a predefined number of bins (the bins
are not equally wide, the higher the frequency the larger the
band). After that, the bot looks at the highest frequency bin.
Every block that has more energy in a window than the pre-
defined threshold energy is considered a peak (see Fig. 3).
These peaks are used to segment the audio file in the different
spoken digits. Then the bot looks for a number of windows
around the peaks and prints all the frequency bins. This is the
profile of the digit. The profiles of the digits are then compared
to the ones in the training file. The closest match is chosen as
a possible guess for each digit.
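The windowing, DFT and peak-detection steps described above can be sketched as follows. This simplified version thresholds the total energy of each block rather than individual frequency bins, and the window size and threshold are illustrative values, not the bot's actual parameters:

```python
import numpy as np

def find_energy_peaks(samples, rate=8000, win=256, threshold=0.5):
    """Flag energetic blocks as candidate digit positions: Hamming-window each
    block, transform it to the frequency domain, and compare its energy to a
    predefined threshold (a simplification of the per-bin analysis above)."""
    peaks = []
    for start in range(0, len(samples) - win, win):
        block = samples[start:start + win] * np.hamming(win)  # Hamming window
        spectrum = np.abs(np.fft.rfft(block))                 # DFT magnitudes
        energy = float(np.sum(spectrum ** 2)) / win           # block energy
        if energy > threshold:
            peaks.append(start / rate)                        # onset time (s)
    return peaks

# Synthetic check: 1 s of faint noise with a loud 200 ms tone in the middle.
rng = np.random.default_rng(0)
rate = 8000
t = np.arange(rate) / rate
samples = 0.01 * rng.standard_normal(rate)
samples[3200:4800] += np.sin(2 * np.pi * 440 * t[3200:4800])
peaks = find_energy_peaks(samples, rate)
print(peaks)  # only blocks overlapping the tone are reported
```

Segmenting on such peaks is exactly what the intermediate noise introduced later in the paper defeats: extra high-energy regions produce spurious peaks.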
During the training session of the bot the user gives as
input to the bot an audio CAPTCHA. Then, for each profile
of the digit that the bot chooses the user enters which
digit it actually was (this procedure can be automated if
the user gives a name to the audio files accordingly, i.e., if
an audio CAPTCHA file includes digits 6, 9 and 2, the file
name can be ''692.wav''). The larger the number of audio
CAPTCHA in the training set, the higher the bot's
success ratio will be.
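The file-naming convention mentioned above, which automates the labeling step of the training session, can be sketched as follows (the directory layout is hypothetical):

```python
import os

def labels_from_filename(path):
    """Recover the spoken digits from the file-naming convention suggested
    above (e.g. '692.wav' for a CAPTCHA announcing 6, 9 and 2), so the
    training session needs no manual transcription."""
    stem = os.path.splitext(os.path.basename(path))[0]
    if not stem.isdigit():
        raise ValueError("file name must encode the spoken digits, e.g. '692.wav'")
    return [int(ch) for ch in stem]

print(labels_from_filename("training/692.wav"))  # [6, 9, 2]
```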
Fig. 3 – Audio analysis of the bot.
6. CAPTCHA applicability for VoIP environment
In this section we discuss which of the CAPTCHA in Section 4
could be candidates for anti-SPIT purposes. The only requirement
is that the vocabulary be limited to digits {0, ..., 9}, as the audio
CAPTCHA will be used in an SIP-based VoIP system, where DTMF
signals need to be sent. Sending letters to answer a CAPTCHA
could be difficult for an average user: not many users can write
3–4 letters with a phone keypad (e.g., pressing a key multiple
times to get the right letter) in a short time period. An
implementation of this kind should not ignore or underestimate
the digital divide.
Based on the algorithm introduced in Section 2, the user
success rate should be high (>80%). The Google and Recaptcha
CAPTCHA cannot meet this requirement. Nearly the same
user success rates were also presented by Bigham and Cav-
ender (2009). Moreover, the Recaptcha uses phrases (not
digits).
DIGG, AOL, Slashdot and Authorize include characters
other than digits. They are also not open-source; therefore,
their data field cannot be altered. As for the eBay audio
CAPTCHA, it has already been ''cracked'' (Bursztein and
Bethard, 2009).
The problem with the MSN CAPTCHA is the number of digits
each instance includes. In the user tests that we performed
with normal phones, the user success rate decreased
significantly, from 80% to 25%, because it was not easy for
a user to type the digits and hear the CAPTCHA at the same
time, or to remember all 10 digits and type them after the
CAPTCHA ends. The MSN CAPTCHA can be of practical use
only if the telephone device has a microphone and a headphone
separate from the telephone keypad.
The remaining CAPTCHA implementations (Secure Image
CAPTCHA; Captchas.net; Bokehman Audio CAPTCHA;
Mp3Captcha) could be, in principle, used for anti-SPIT
purposes. Even though their vocabulary contains letters, this
can be changed to only digits because they are open-source.
However, in practice only the Secure Image CAPTCHA and the
Captchas.net can be taken into account, because Bokehman
and Mp3Captcha are very similar to the Captchas.net (i.e., no
background noise) and they are both more vulnerable to
attacks (they use only one speaker (Tam et al., 2008a; Chan,
2003)).
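The elimination reasoning of this section can be condensed into a small filter. The attribute values below are distilled from Table 1 and the discussion (MSN is entered with its 25% effective rate on normal phones, and only a subset of the implementations is shown), so this is a summary of the argument, not new data:

```python
# Attribute snapshot distilled from Table 1 and the discussion above.
CAPTCHAS = {
    "Google":       {"success": 0.60, "digits_only": True,  "open_source": False, "cracked": False},
    "MSN":          {"success": 0.25, "digits_only": True,  "open_source": False, "cracked": False},
    "Recaptcha":    {"success": 0.50, "digits_only": False, "open_source": False, "cracked": False},
    "eBay":         {"success": 0.95, "digits_only": True,  "open_source": False, "cracked": True},
    "Secure Image": {"success": 0.98, "digits_only": False, "open_source": True,  "cracked": False},
    "Captchas.net": {"success": 0.98, "digits_only": False, "open_source": True,  "cracked": False},
}

def voip_candidates(captchas):
    """Keep implementations whose user success rate exceeds 80% and whose data
    field either already is digits-only or can be restricted to digits
    (open-source), excluding those that have already been cracked."""
    return sorted(name for name, a in captchas.items()
                  if a["success"] > 0.80
                  and (a["digits_only"] or a["open_source"])
                  and not a["cracked"])

print(voip_candidates(CAPTCHAS))  # ['Captchas.net', 'Secure Image']
```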
Fig. 4 – Evaluation of audio CAPTCHA (success rate, in %, of SPHINX, devoicecaptcha and users against the Secure Image CAPTCHA and Captchas.net).
6.1. Evaluation of selected audio CAPTCHA
At this stage, we have decided upon the two selected
CAPTCHAs. The next step was to evaluate them against the
two bots presented in Section 5.
For the devoicecaptcha bot we had to create a training
session, because it works with a comparison to a training set.
We took 50 audio files of each CAPTCHA as a training set and
tested it with the remaining 50 audio files. The result was
a clear defeat of the two CAPTCHA, as the bot had a 77%
success rate for the Secure Image CAPTCHA and an 81%
success rate for Captchas.net. Both success rates are high
enough for the two CAPTCHA to be considered ineffective.
For the SPHINX test environment, a small custom application
was created to decode multiple wav files in batch form and
output the corresponding results. Even though the SPHINX
success rate was not high, it was large enough (>8%) for
the two implementations to be considered ineffective.
Both experiments were conducted on a Windows XP SP2 PC
with a 2.1 GHz Core2Duo processor and 2 GB of RAM. The
results are depicted in Fig. 4, which also includes the users'
success rates from Table 1.
To sum up, based on the aforementioned tests and the VoIP
system requirements (e.g., only digits in vocabulary), we
concluded that there is practically no existing audio CAPTCHA
implementation that could be considered as efficient enough
for a VoIP system.
7. Audio CAPTCHA experimental environment/integration
We now proceed to the development of a new audio CAPTCHA
implementation. A key question for the development of such
a new CAPTCHA is whether it is applicable to a VoIP system, in
particular in an SIP-based environment. This section describes
our laboratory VoIP system, the development of the new audio
CAPTCHA, the applicability of the bot in the SIP-based VoIP
system, and the results of the user evaluation.
7.1. Experimental lab infrastructure and CAPTCHA integration
The test computing environment is depicted in Fig. 5. It
consists of two SIP proxy servers. The SIP server application is
SIP Express Router (SER 2.0) (SER server version 2.0), a scalable
and reliable open-source software that can act as an SIP
registrar, proxy, or redirect server. Each of the
SIP servers creates a different VoIP domain. Both the bots'
host computer and the users belong to the first domain. The
callee, who is protected by the proposed audio CAPTCHA,
belongs to the second domain. The functionality of the second
domain has been extended, in order to be able to send/stream
an audio CAPTCHA. Each time a call reaches the second
domain, the call is redirected to a media server, which
Fig. 5 – Laboratory infrastructure.
reproduces the audio CAPTCHA and validates the caller’s or
bots’ answer.
The media server is the SIP Express Media server (SEMS)
(SIP express media server version 2.0), which is a reliable
media and application server for SIP-based VoIP services. In
order for the caller (user or bot) to hear the audio CAPTCHA,
a media session should be established by exchanging SIP
messages. The SIP message number of the audio CAPTCHA is
182 and the subject (header field) is ‘‘CAPTCHA’’.
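The 182 provisional response carrying the CAPTCHA marker might look as sketched below; apart from the status code 182 (whose standard SIP reason phrase is ''Queued'') and the Subject header described above, all header values are placeholders rather than the actual lab configuration:

```python
def captcha_182_response(call_id, cseq, from_hdr, to_hdr):
    """Build a provisional 182 response announcing the CAPTCHA. The
    'Subject: CAPTCHA' header matches the marker described above; the Via
    branch and domain names are illustrative placeholders."""
    return "\r\n".join([
        "SIP/2.0 182 Queued",
        "Via: SIP/2.0/UDP proxy.domain2.example;branch=z9hG4bK-demo",
        "From: " + from_hdr,
        "To: " + to_hdr,
        "Call-ID: " + call_id,
        "CSeq: %d INVITE" % cseq,
        "Subject: CAPTCHA",
        "Content-Length: 0",
        "",
        "",
    ])

msg = captcha_182_response("demo-call-1", 1,
                           "<sip:caller@domain1.example>",
                           "<sip:callee@domain2.example>")
print(msg.splitlines()[0])  # SIP/2.0 182 Queued
```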
7.2. Bots’ applicability to SIP-based VoIP
In order to integrate the bots into an SIP-based VoIP system
and examine their applicability, we decided that the
implementation procedure would include three stages (the
procedure and the SIP messages exchanged between the
participating entities are presented in Fig. 6).
Stage 0: this stage is controlled by the administrator of the
callee's domain (Domain2). When the callee's domain receives
an SIP INVITE message, there are three possible distinct
outcomes: (a) forward the message to the callee, (b) reject the
message, and (c) send a CAPTCHA to the caller (UA1). In the
test environment we forward every INVITE message to the
media server, which sends a CAPTCHA to the caller.
Stage 1: an audio CAPTCHA is sent (in the form of a 182
message) to the caller (UA1). In the proposed implementation,
the caller is replaced by a bot. It must record the audio
CAPTCHA, reform it to an appropriate audio format (wav,
8000 Hz, 16 bit) and identify the announced digits. The
procedure depends mainly on the time needed to reform the
message. Moreover, the particular bot needs approximately
0.10 s to identify a 3-digit CAPTCHA and 0.15 s to identify a 4-
digit one.
Stage 2: when the bot has generated an answer, it forms an SIP
message using SIPp (SIPp traffic generator for the SIP
protocol), which includes the DTMF answer. This answer is
sent as a reply to the CAPTCHA. If the caller does not receive
a 200 OK message, a new CAPTCHA is sent and the bot starts
recording again (Stage 1).
The above procedure should be completed within a specific
time frame. The time slot opens when the audio file is received
by the caller and closes when the timeout of the user’s input
expires (defined by the CAPTCHA service provider; see Fig. 7). The
duration of the CAPTCHA playback does not affect the time
frame, because the waiting time for an answer starts when the
playback is complete. If an answer arrives before the timeout,
then it is validated by the CAPTCHA service (and if it is correct the
call is established), otherwise the bot has another try. In our
implementation, the bot is given 6 s to respond to the
CAPTCHA, whereas the maximum number of attempts is set
to three (3).
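The timeout-and-retry procedure above can be sketched as a small challenge loop; the three callback hooks are placeholders for the media-server-side operations, not an actual SEMS API:

```python
import time

def run_captcha_challenge(send_captcha, await_answer, validate,
                          timeout_s=6.0, max_attempts=3):
    """Challenge loop mirroring the procedure above: the answer timer starts
    once playback is complete, and the caller gets at most three attempts.
    send_captcha/await_answer/validate stand in for the media-server hooks."""
    for _ in range(max_attempts):
        expected = send_captcha()              # Stage 1: stream a fresh CAPTCHA
        deadline = time.monotonic() + timeout_s
        answer = await_answer(deadline)        # DTMF digits, or None on timeout
        if answer is not None and time.monotonic() <= deadline \
                and validate(expected, answer):
            return True                        # 200 OK: establish the call
    return False                               # reject after three failed tries

# Toy run: a caller that answers every challenge correctly on the first try.
print(run_captcha_challenge(lambda: "692",
                            lambda deadline: "692",
                            lambda expected, got: expected == got))  # True
```

Note that a fresh CAPTCHA is generated on every attempt, matching the retry behavior described above.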
Table 2 illustrates the time required by the various stages
in the proposed implementation. The selected bot can properly
answer the CAPTCHA puzzle in much less time than the time
Fig. 6 – SIP message exchange for CAPTCHA.
Fig. 7 – A CAPTCHA time frame.
frame. Since the CAPTCHA should be easy for users, we
suggest that the time frame in which the caller should answer
the CAPTCHA puzzle be no less than 3 s. This is
because many groups of users, such as minors or the elderly, may
not be able to respond promptly. Finally, we note that our bots'
host computer can complete the two stages for 82 CAPTCHA
simultaneously.
7.3. User applicability
Thirty-two users were invited to solve the CAPTCHA samples,
most of them aged between 20 and 30 years old. Most were
university students (21 out of 32), and 6 participants were older
than 40. All CAPTCHA were in English, which was the mother
tongue of 1 of the participants (every user was required to speak
English). In order for the users to take the tests, all users' PCs
(depicted in Fig. 5 as
for the user to take the tests, all users’ PCs (in Fig. 5 depicted as
the caller) were equipped with soft-IP-phones (X-lite and
Twinkle). These phones were used to initiate a call, to listen to
the CAPTCHA, and to send the answer in a DTMF tone format.
8. Audio CAPTCHA implementation process
In this section, the details of the development of a new audio
CAPTCHA will be explained.
8.1. Selected attributes
In order to develop an effective new audio CAPTCHA, we
decided upon the following attributes:
Different announcers (speakers): the announcer (speaker) of each
and every digit is selected randomly among a given set of
(more than one) speakers.
Random positioning of each digit in the CAPTCHA: the digits used
by the CAPTCHA are physically distributed randomly in the
available space.
Background noise of each digit: background noise, randomly
selected, is added to each and every digit of the audio
CAPTCHA.

Table 2 – Stage duration.

| Stage | Step | Duration (s) |
|-------|------|--------------|
| 1 | Reform audio | ~1.00 |
| 1 | Identify digits | ~0.15 |
| 2 | Create SIPp message | ~0.40 |
| 2 | Send SIPp message | ~0.00 |
|   | Total duration | ~1.55 |

The audio noise files are segments (from 1 to 3 s) of
randomly selected music files. They are not auto-generated by
other methods (e.g., creation of white noise). We tried to
ensure that the noise will be least annoying for the user to
listen to. The background and intermediate noises were
automatically generated in-line with the requirements set
forth by a statistical analysis. The volume level of the noise is
lower than the level of the digits, so that they remain audible
to the users.
Loud noise between digits: loud noise is introduced between the
digits (the noise is not very loud, in order to minimize the
discomfort of the user).
Different duration and file size: each audio CAPTCHA file has
different duration and different size.
Vocabulary: the vocabulary was limited to digits {0, ., 9},
because the audio CAPTCHA was designed for an SIP-based
VoIP system where DTMF signals need to be sent.
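A generator combining the attributes above might be sketched as follows; the data layout (`digit_bank` mapping each digit to one recording per speaker) and the mixing gains are illustrative assumptions, not the paper's actual implementation:

```python
import random
import numpy as np

def _loop_to(noise, n):
    """Loop a noise clip to exactly n samples."""
    return np.tile(noise, -(-n // len(noise)))[:n]

def build_audio_captcha(digit_bank, noise_bank, n_digits=4, rate=8000):
    """Assemble one audio CAPTCHA with the attributes listed above: a random
    announcer per digit, quiet background noise behind each digit, louder
    noise between digits, and random spacing (so every file has a different
    duration and size). Returns (samples, answer)."""
    answer, pieces = [], []
    for _ in range(n_digits):
        digit = random.randrange(10)
        answer.append(digit)
        voiced = random.choice(digit_bank[digit])             # random announcer
        background = _loop_to(random.choice(noise_bank), len(voiced))
        pieces.append(voiced + 0.2 * background)              # digit stays audible
        gap_len = random.randint(rate // 4, rate)             # 0.25-1.0 s spacing
        pieces.append(0.6 * _loop_to(random.choice(noise_bank), gap_len))
    return np.concatenate(pieces[:-1]), answer                # drop trailing gap
```

Because speakers, noise clips, gap lengths and digit positions are all drawn at random, two calls practically never produce the same waveform, which matches the "different duration and file size" attribute.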
8.2. CAPTCHA development
The audio CAPTCHA development was carried out in five
stages, in terms of the number of attributes adopted. Each
development stage was tested and evaluated for its efficiency,
according to the success rate of the bot and the success
rate of human users.
During Stage 1, the produced audio CAPTCHA was
pronounced by a single announcer. It did not include additional
features, such as background noise or noise between
the digits. The first digit of every CAPTCHA started at the exact
same point, and the time difference between two consecutive
digits was fixed. The waveforms of the resulting 3- and 4-digit
CAPTCHA appear in Fig. 8a and b. In
such a simple audio CAPTCHA, a bot can use a detection
method (e.g., energy peak detection) and easily segment and
recognize the digits. An important factor in this process is the
number of audio CAPTCHA that was used during the training
of the devoicecaptcha bot. If a small number was used, then
there is a high chance that not all digits are given as an input
to the training process; thus, the bot may have a low success
rate. That is the case with the 4-digit CAPTCHA (Fig. 8b). The
random training sequence did not involve many instances of
some digits (such as 8 and 9); therefore, even though the bot
recognized successfully a large number of CAPTCHA, it failed
to recognize others and resulted in a relatively low (69%) bot
success rate.
The SPHINX software did not achieve such a high success
rate; it reached only 27%. The main reason for
this is that there was no background and intermediate noise
within the CAPTCHA.
During Stage 2, the audio CAPTCHA was produced by using
7 different announcers. Each digit was pronounced by
a randomly selected announcer. Even though this affected the
Fig. 8 – a) Stage 1 (3 digits). b) Stage 1 (4 digits).
Fig. 9 – a) Stage 2 (3 digits). b) Stage 2 (4 digits).
success of the devoicecaptcha bot in the case of 3-digit
CAPTCHA, it did not do so in the case of the 4-digit ones. This
mainly hinges upon the training set. Moreover, for the same
number of training CAPTCHA instances, 4-digit ones offer
more digits to the training procedure. For example, if 100 3-digit
CAPTCHA are used for training, 300 digits are recorded,
whereas with the same number of 4-digit CAPTCHA 400 digits
are recorded.
The SPHINX software's success rate decreased dramatically (i.e.,
0.9% for the 3-digit CAPTCHA and 0.7% for the 4-digit
CAPTCHA). This is because there was considerable background
noise, due to the microphone recording. Fig. 9a and
b shows the waveforms of the produced digits.
In Stage 3 background noise was added behind each digit.
This suppressed the success rate of the devoicecaptcha bot
to 30% for the 3-digit CAPTCHA and 55% for the 4-digit ones,
but it still remained relatively high. Fig. 10a and
b shows the waveforms of the produced digits with the
background noise. The high success rate is due to the ability of
the frequency bot to cut off the low-energy sounds (i.e., the
noise) by checking against a certain energy threshold. In
that way, it can – in most cases – isolate the noise behind each
digit. The difference between the successes of 3- and 4-digit
CAPTCHA is due to the difference in the training sets. In this
case, a training set of 50 audio CAPTCHA was allowed for the
3-digit ones and 150 for the 4-digit ones. As a result, the available
digits taking part in the training process were 150 and 600,
respectively.
The SPHINX software repeated the same low success rate,
because the background noise added further difficulty for
solving the CAPTCHA.
In Stage 4 the volume of the background noise of each digit
was raised. Although the devoicecaptcha bot’s success rate
fell noticeably (10–15% success), and the SPHINX software was
unable to solve any CAPTCHA correctly, the produced audio
CAPTCHA was too difficult for the users to solve, as the loud
background noise made it hard to distinguish the spoken
digits. For that reason, loud background noise was not
included in our final strategy.
In Stage 5 loud noise was introduced between every pair
of digits (intermediate noise) (Fig. 11a and b). This resulted in
the devoicecaptcha bot being unable to segment the audio file
correctly. This happened because there were more energy
peaks than the digits spoken. The loud intermediate noises
were recognized as additional digits, because they produce
high energy peaks as well, when transformed with the Discrete
Fourier Transformation. As a consequence, this bot could not
be trained, as it failed to successfully recognize any digits.
The SPHINX software repeated the same low success rate.
The main issue remains that such speech recognition tools are
Fig. 10 – a) Stage 3 (3 digits). b) Stage 3 (4 digits).
Fig. 11 – a) Stage 5 (3 digits). b) Stage 5 (4 digits).
effective only in ‘‘controlled’’ conditions, such as with only
one speaker and without any noise (Section 3).
Stage 5 is described in more detail in Fig. 12, where the
CAPTCHA includes intermediate noise between the digits.
When the bot transforms such an audio file into the frequency
domain, the energy peaks that it finds correspond to both digits
and noise. As a result, the bot recognizes more digits than are
actually included in the file. One possible countermeasure for
the devoicecaptcha bot would be to raise or lower the energy
threshold. In either case (Fig. 12), the bot would still fail. If the
threshold energy is very high, the bot misses some of the digits
in the CAPTCHA, while still recognizing some intermediate
noise as digits. On the other hand, if the threshold energy is
lowered, the bot recognizes all digits, but all intermediate
noises are also considered digits. Thus, the bot would assume
that there were 12–15 digits in the CAPTCHA.
8.3. CAPTCHA testing
Users' and bots' success rates are the main factors that
determine whether a CAPTCHA is efficient or not. The
corresponding success rates, as per the CAPTCHA described in
Section 5.2, appear in Fig. 13a–c. Each added attribute improved
the efficiency of the CAPTCHA and directly affected the user and
bot success rates. The CAPTCHA developed in Stage 5 had an
average user success rate of 87%, with an average bots’
success rate of less than 1%.
8.4. CAPTCHA implementation
During the implementation of the proposed audio CAPTCHA,
the audio files had the following attributes:
a) They were produced automatically; therefore, they can be
updated at random time periods without human inter-
vention. The overall process for creating a full set of 3-digit
CAPTCHA took 8 s, whereas creating a full set of 4-digit
CAPTCHA took 107 s. Thus, the reproduction of the whole
set of CAPTCHA does not cause significant overload to our
VoIP system (the VoIP server was a 2.1 GHz Core2Duo, with
2 GB RAM).
b) All constituent parts of the audio CAPTCHA, such as the
digits and the noise, lie in different folders. Moreover, each
time a set of CAPTCHA is produced, the program selects
randomly each digit from a different announcer, as well as
a random background noise.
c) The noise between the digits is selected randomly and has
different volume and energy.
d) The noise and the pronounced digits have random dura-
tion, which results in a random duration of each audio
CAPTCHA.
Table 3 depicts the attributes of the proposed VoIP
CAPTCHA implementation. The attributes are the same as
those in Table 1. It is clear that all the requirements stated in
Section 6 for a VoIP CAPTCHA were fulfilled and, moreover,
that the proposed CAPTCHA is bot-resistant.
9. Discussion and limitations
The evaluation process of the current CAPTCHA implementations
included the positive and negative characteristics
of each one. Moreover, the user success rate for every
CAPTCHA was presented, but the bot success rate was
provided only for those that are easily applicable to a VoIP
infrastructure. The remaining CAPTCHA could be evaluated
for their resistance against bots in future work.
Additionally, the testing environment for the proposed
VoIP CAPTCHA is a lab environment; therefore, there might be
issues when integrating the proposed CAPTCHA into the
overall security infrastructure of a VoIP provider. However,
Fig. 12 – Demonstration of the devoicecaptcha bot failing to solve the CAPTCHA.
Fig. 13 – a) SPHINX success rates. b) Devoicecaptcha bot success rates. c) Users success rates (success rate, in %, per stage, for 3- and 4-digit CAPTCHA).
Table 3 – Proposed VoIP CAPTCHA attributes.

| Attribute | Value |
|---|---|
| User success rate | 88% |
| Background noise | Music, noise |
| Intermediate noise | Voice, music, noise |
| Data field | 0–9 |
| Spoken characters variation | 3–4 |
| Streaming reproduction | Yes |
| Rare reappearance | Yes |
| Production process | Automated |
| Language requirements | Multiple languages |
| Various speakers | Yes |
| Duration (sec) | 2–6 |
| Beeps (before, after) | 0 |
further experimentation clearly requires the co-operation of
a major SIP-based VoIP service provider, especially for business
purposes, since the applicability of the mechanism has
been introduced and justified in this paper.
A limitation of the proposed CAPTCHA is that its effectiveness
and its attributes could not be evaluated against additional
audio/speech recognition tools, such as those introduced by
Tam et al. (2008a).
Another possible limitation was the sample of users used
for experimentation. The experimental procedure could
consider different populations of users and take into
consideration the specific requirements of each group.
1 The pseudo-random C function rand was used for producing the CAPTCHA.
10. Conclusions
CAPTCHA are expected to play a key role in preventing email
spam and voice spam (SPIT) in the near future. In order for
them to be effective, they must be easy to solve for the users,
while at the same time very hard for bots to pass.
In this paper, we provided the reader with an overview of
existing audio CAPTCHA implementations, in order to identify
their main characteristics. Based on these characteristics, we
identified two of them that might, in principle, be
appropriate audio CAPTCHA for a VoIP system. After an
evaluation process, which included a test procedure by two
speech recognition tools, we demonstrated that the existing
audio CAPTCHA implementations are clearly inadequate
candidates for a VoIP system.
As a result of the aforementioned facts, we proposed a new
audio CAPTCHA implementation. This CAPTCHA incorporates
several attributes, such as different digit announcers, background
noise behind each digit, and noise between digits, all of
them applied in a random1 and automated way.
Then, we produced a number of audio CAPTCHA, which are
regularly refreshed, with a limited chance of creating the
same instance of an audio CAPTCHA more than once, and
which are reproduced in streaming mode. The production of the
CAPTCHA was done in five stages. At each stage the CAPTCHA
was tested not only by a number of users, but also by two
automated speech recognition tools (SPHINX and the
devoicecaptcha bot). The bots managed to achieve a high success rate
during the first four stages (up to 98%), but that rate dropped
dramatically at the last one (less than 2%). That was mainly
due to the addition of intermediate noises, which made the
bot unable to segment properly the audio file, to be trained
properly, and thus to solve the CAPTCHA.
We also determined an appropriate level of background
noise for each digit, so that the CAPTCHA remains solvable by
users and difficult for bots to break. However, such a low bot
success rate could not have been achieved without the
combination of all the above-mentioned attributes. Each
attribute alone is not enough to make a CAPTCHA robust; it is
the combination of the features that makes the CAPTCHA
resistant.
r e f e r e n c e s
Authorize, www.authorize.net/application/ [retrieved 07.05.09].AOL, http://my.aol.com/ [retrieved 07.05.09].von Ahn L, Blum M, Hopper N, Langford J. CAPTCHA: using hard
AI problems for security. In: Biham E, editor. Proceedings ofthe international conference on the theory and applications ofcryptographic techniques (EUROCRYPT ’03). Poland: Springer;2003. p. 294–311 (LNCS 2656).
von Ahn L, Blum M, Langford J. Telling humans and computerapart automatically. Communications of the ACM 2004;47(2):57–60.
von Ahn L, Maurer B, McMillen C, Abraham D, Blum M.reCAPTCHA: human-based character recognition via websecurity measures. Science 2008;321(5895):1465–8.
Blum M, von Ahn L, Langford J, Hopper N. The CAPTCHA project,USA, November 2000.
Bigham J, Cavender A. Evaluating existing audio CAPTCHAoptimized for non-visual use. In: Proceedings of the ACMconference on human factors in systems (CHI 2009), USA;2009, p. 1829–38.
Breaking Gmail’s Audio CAPTCHA, http://blog.wintercore.com/?p¼11 [retrieved 10.10.08].
Bursztein E, Bethard S. Decaptcha: breaking 75% of eBay audio CAPTCHAs. In: Proceedings of the 3rd USENIX workshop on offensive technologies (WOOT ’09), Canada; 2009.
Bokehman Audio CAPTCHA, http://bokehman.com/captcha_verification.php [retrieved 5.05.09].
Chellapilla K, Larson K, Simard P, Czerwinski M. Building segmentation based human friendly human interaction proofs. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM Press; 2005. p. 711–20.
Chew M, Baird H. Baffletext: a human interactive proof. In: Proceedings of the 10th SPIE/IS&T document recognition and retrieval conference, USA; 2003, p. 305–16.
Chan T-Y. Using a text-to-speech synthesizer to generate a Reverse Turing Test. In: Proceedings of the 15th IEEE international conference on tools with artificial intelligence (ICTAI ’03); 2003, p. 226.
Captchas.net, http://captchas.net/ [retrieved 02.05.09].
Dusan S, Rabiner L. On integrating insights from human speech perception into automatic speech recognition. In: INTERSPEECH, Portugal; 2005, p. 1233–6.
Defeated CAPTCHA, http://libcaca.zoy.org/wiki/PWNtcha [retrieved 18.05.08].
DIGG, http://digg.com/ [retrieved 07.05.09].
Defeating Audio (Voice) CAPTCHA, http://vorm.net/captchas/ [retrieved 30.08.09].
eBay Audio CAPTCHA, https://scgi.ebay.com/ws/eBayISAPI.dll?RegisterEnterInfo [retrieved 03.07.09].
Festa P. Spam-bot tests flunk the blind. CNET, News.com. Available at: www.news.com/2100-1032-1022814.html; July 2, 2003.
Gibbs S, Breiteneder C, Tsichritzis D. Data modeling of time-based media. In: Proceedings of the ACM SIGMOD international conference on management of data, USA; 1994, p. 91–102.
Google Audio CAPTCHA, www.google.com/accounts/NewAccount [retrieved 26.03.09].
Graham-Rowe D. A sentinel to screen phone calls. MIT Technology Review; 2006 [accessed 08.11.09].
HTK: hidden Markov model toolkit, http://htk.eng.cam.ac.uk/ [retrieved 10.10.08].
Jurafsky D, Martin J. Speech and language processing: an introduction to natural language processing, computational linguistics and speech recognition. Prentice-Hall; 2008.
Mori G, Malik J. Recognizing objects in adversarial clutter: breaking a visual CAPTCHA. In: Proceedings of the computer vision and pattern recognition conference. IEEE Press; 2003. p. 134–41.
Markkola A, Lindqvist J. Accessible voice CAPTCHAs for Internet telephony. In: Proceedings of the 2008 symposium on accessible privacy and security (SOAPS 2008), USA; 2008.
Mp3Captcha, http://scripts.titude.nl/ [retrieved 02.05.09].
Quittek J, Niccolini S, Tartarelli S, Stiemerling M, Brunner M, Ewald T. Detecting SPIT calls by checking human communication patterns. In: Proceedings of the IEEE international conference on communications (ICC ’07), United Kingdom; 2007, p. 1979–84.
MSN Audio CAPTCHA, https://signup.live.com/ [retrieved 26.03.09].
Recaptcha Audio CAPTCHA, http://recaptcha.net/learnmore.html[retrieved 03.07.09].
Rosenberg J, Jennings C, Peterson J. The session initiation protocol (SIP) and spam. IETF Internet-Draft draft-ietf-sipping-spam-02; March 6, 2006.
Secure Image CAPTCHA, www.phpcaptcha.org [retrieved 28.03.09].
Slashdot, http://slashdot.org/login.pl?op=newuserform [retrieved 5.05.09].
SPHINX: the CMU Sphinx group open source speech recognition engines, http://cmusphinx.sourceforge.net/html/cmusphinx.php [retrieved 02.06.09].
SER server version 2.0, www.iptel.org/ser [retrieved 20.03.09].
SIP express media server version 2.0, www.iptel.org/sems [retrieved 20.03.09].
SIPp traffic generator for the SIP protocol, http://sipp.sourceforge.net/ [retrieved 30.09.08].
Turing A. Computing machinery and intelligence. Mind October 1950;LIX(236):433–60.
Tam J, Simsa J, Hyde S, von Ahn L. Breaking audio CAPTCHAs. In: Advances in neural information processing systems (NIPS); 2008.
Tam J, Huggins-Daines JD, von Ahn L, Blum M. Improving audio CAPTCHAs. In: Proceedings of the 2008 symposium on accessible privacy and security (SOAPS 2008), USA; July 2008.
Trend Micro’s TrendLabs, threat reports, http://us.trendmicro.com/imperia/md/content/us/trend-watch/researchandanalysis/threat_roundup_may_2009.pdf; May 2009.
Walker W, Lamere P, Kwok P, Raj B, Singh R, Gouvea E, et al. Sphinx-4: a flexible open source framework for speech recognition. Sun Microsystems, Technical Report TR-2004-139; November 2004.
Yan J, El Ahmad A. CAPTCHA security: a case study. IEEE Security and Privacy July/August 2009;7(4):22–8.
Yan J, El Ahmad A. Breaking visual CAPTCHAs with naive pattern recognition algorithms. In: Samarati P, et al., editors. Proceedings of the 23rd annual computer security applications conference (ACSAC ’07). USA: IEEE Computer Society; 2007. p. 279–91.
Yan J, El Ahmad A. A low-cost attack on a Microsoft CAPTCHA. In: Proceedings of the 15th ACM conference on computer and communications security (CCS 2008), Virginia, USA; October 2008, p. 543–54.
Yan J, El Ahmad A. Usability of CAPTCHAs or usability issues in CAPTCHA design. In: Proceedings of the 2008 symposium on accessible privacy and security (SOAPS 2008), USA; 2008, p. 44–52.
Yannis Soupionis ([email protected]) is a Researcher and a Ph.D. student with the Information Security and Critical Infrastructure Protection Research Group of the Dept. of Informatics, Athens University of Economics and Business (AUEB), Greece. He holds a B.Sc. (Informatics and Telecommunications, Univ. of Athens) and an M.Sc. (Information Systems, AUEB). His current research interests include information systems security management, formal security policies, security and privacy in Voice over IP (VoIP) telephony, and information systems risk assessment/management.
Dimitris Gritzalis ([email protected]) is a Professor of ICT Security and the Director of the Information Security and Critical Infrastructure Protection Research Group, Dept. of Informatics, Athens University of Economics and Business (AUEB), Greece. He holds a B.Sc. (Mathematics, Univ. of Patras), an M.Sc. (Computer Science, City University of New York) and a Ph.D. (Critical Information Systems Security, Univ. of the Aegean). He has published 7 books and more than 120 technical papers. His current research interests focus on security in AmI, VoIP systems security, and critical infrastructure protection. He has served as Associate Commissioner of the Greek Data Protection Commission, as well as the President of the Greek Computer Society. He is the Editor of the Computers & Security Journal.