udfp1556-72811556-7346journal of digital forensic practice, …liuwy/publications/regap-j of... ·...

Journal of Digital Forensic Practice, 1:1–15, 2006Copyright © Taylor & Francis Group, LLCISSN: 1556-7281 print / 1556-7346 online DOI: 10.1080/15567280600995501

1

UDFP1556-72811556-7346Journal of Digital Forensic Practice, Vol. 1, No. 2, October 2006: pp. 1–36Journal of Digital Forensic Practice ARTICLE

REGAP: A Tool for Unicode-BasedWeb Identity Fraud Detection

REGAP: A Tool for Unicode-Based Web Identity Fraud Detection DeA. Y. Fu et al.Anthony Y. Fu, Xiaotie Deng, and Liu WenyinDepartment of Computer Science, City University of Hong Kong, 83, Tat Chee Avenue, Kow Long Tong, Hong Kong SAR, P. R. China

ABSTRACT We anticipate the widespread usage of an internationalizedresource identifier (IRI)1 or internationalized domain name (IDN)2 on the web ascomplement to universal resource identifier (URI). IRI/IDN is composed ofcharacters in a subset of Unicode, such that a Unicode attack3 to IRI/IDNcould happen. Hence, visually or semantically, certain phishing IRI/IDNsmay show high similarity to the real ones. The potential phishing attacksbased on this strategy are very likely to happen in the near future with theboosting utilization of IRI/IDN. We invented a method to detect such phish-ing attack. We constructed a unicode character similarity list (UC-SimList) basedon char-char visual and semantic similarities and use a nondeterministic finiteautomaton (NFA)4 to identify the potential IRI/IDN-based phishing patterns.We implemented a phishing IRI/IDN pattern generation tool, REGAP, bywhich phishing IRI/IDN patterns can be generated into regular expressions(RE) for phishing IRI/IDN detection. We also address how such a tool can beapplied to investigations.

KEYWORDS phishing, internationalized resource identifier (IRI), internationalizeddomain name (IDN), nondeterministic finite automaton (NFA), regular expression (RE),unicode attack, homograph attack, REGAP

INTRODUCTIONPhishing web pages are forged to mimic the legitimate ones to spoof users

into leaking their security sensitive information. Phishers usually use visuallyand/or semantically similar user interfaces to spoof users. The user interfacesinclude web links in address bars and web pages. Unwary Internet users couldbe induced by the phishing web pages to expose their bank accounts, pass-words, credit card numbers, etc.

Phishing web pages have recently been found to be increasing at a fasterpace. More attention has been paid by the industry, academia, news media, aswell as legislative entities. There were 20,209 phishing attacks reported to theAnti-Phishing Working Group (APWG)5 in May 2006 (according to the latestreport from APWG). While the users have been gradually educated to suchscams, phishers are resorting to more and more sophisticated tricks to avoiddetection and people’s suspicion.

5

1015

20

25

30

35

40

45

A. Y. Fu et al. 2

In this article, we review the Unicode-based webidentity fraud (Unicode attack to IRI/IDN)6 and pro-pose a counter measure to detect it. We first generatethe similarity list (UC-SimListv), which contains eachpair of characters’ visual similarity. We investigate thesemantic relationships of characters in a universal char-acter set (UCS)7 and create semantic similarity list ofUCS (UC-SimLists). We use the two lists to generatevisual/semantic similarity list of UCS (UC-SimList).The UC-SimList can be used to find similar charactersof a given character.

We used NFA as the tool to construct phishing IRI/IDN patterns. We constructed NFA with the keywordsof IRI/IDNs that we want to protect. If an IRI/IDN issuspected, we use the NFA to check whether the IRI/IDN is phishing. We used a regular expression to rep-resent potential phishing IRI/IDN patterns. Weimplemented a phishing IRI/IDN pattern generator(REGAP) to generate the regular expressions of phish-ing IRI/IDN patterns and proposed a framework tobuild anti-phishing systems with REGAP. We alsoaddress the way to apply such a technique to theinvestigation.

The rest of this article is organized as follows. Thefollowing introduces related work. Next, we present aUnicode attack against the IRI/IDN and discuss Uni-code character similarity and the generation, NFA-based phishing IRI/IDN pattern construction, and thephishing pattern generator, REGAP. Then we addressthe application to investigation. Finally, we concludeour work and discuss future work. To facilitate thedeeper understanding of the article, we address thetechnical detail in appendixes A to C. We also intro-duce the usage of REGAP 0.30 in Appendix D.

RELATED WORKA uniform resource identifier (URI)8 is composed

of a sequence of characters from a subset of ASCII.9

Hence, URI is Latin language oriented, and it is usedto represent Latin character–based semantics, suchthat URI provides a smooth way for people who arefamiliar with ASCII characters to understand andidentify web resources.

Non-Latin language scripts usually have a directmapping from their characters to A–Z throughphonetic translations, such as Chinese Pinyin andJapanese Romaji. Therefore, it is not a big problem forURI to represent non-Latin languages. However, most

people require being oriented and are used to local-ized operating systems, applications, and Internet ser-vices that can provide a natural way of humancomputer interaction. In fact most users are eager touse information systems in their language scripts. As aresult, more protocols, standards, and systems arerequired to use a wider range of character sets. IRI/IDN is a complementary standard to URI. An IRI/IDN is a sequence of characters from a subset of UCS.

The most popular version of UCS (ver. 2.1) uses 16bits to represent a character versus 8 bits for ASCII,and it covers characters that represent most of themajor languages’ scripts. IRI/IDN is more convenientfor non-English speakers since it provides a more natu-ral way for people who know Chinese but not Englishto input rather than www.bank-of-china.com 10 and for people who know Japanese butnot English to input ratherthan www.bank-of-tokyo-mitsubisi.com.11 AlthoughUnicode has advantages of character representation,to our knowledge, there is no true-type font that cancover all characters in the latest version of UCS (Ver.4.3).12 The most complete font that we can find isArial Unicode Font MS,13 which is a complete Uni-code font of Unicode Standard 2.1.12 It contains allthe characters in Arial plus full fonts for Japanese,Chinese, Korean, Arabic, Hebrew, and different sym-bol characters and character ranges.14

With the emergence of phishing scams in recentyears, phishers have used a number of spoofing tech-niques to make the appearance of web links and webpages visually and semantically similar to the realones. Many anti-phishing methods have been pro-posed. Previous research provides various duplicatedocument detection approaches that focus on plaintext documents and use pure text features in similar-ity measure.15–17 The most effective strategy fordetecting phishing is probably an active approachbased on visual comparison of DOM18,19 generatedfrom HTML, such as the one proposed by Liu et al.20

that uses the region-based approach to visual similar-ity of web pages. Another visual-based phishingdetections technique is proposed by Fu et al.,21

which is a phishing detection approach based onimage level visual similarity assessment using theEMD algorithm. People also proposed anti-phishingmethodologies from the cryptography perspec-tive.22,23 Strong authentication24 of web pages iswidely used in security-oriented websites. Legislative

5

6

50

55

60

65

70

75

80

85

90

95

100

105

110

115

120

125

130

135

140

3 REGAP: A Tool for Unicode-Based Web Identity Fraud Detection

action against Internet fraud is also being introducedin different countries.

The threat of Unicode attack and its methodologyof counter measure are addressed in the literature.25,26

Following these research results, in this article, wereview this attack to IRI/IDN and propose to use aregular generator to generate phishing patterns fromword level (semantically) to character level (visuallyand semantically) automatically.

UNICODE ATTACK TO IRI/IDNRecall that the IRN/IDN, a complement to the

URI, is a sequence of characters of Unicode.12 The uti-lization of Unicode attack26 on an IRI/IDN that maybe susceptible to a phishing attack could bring severesecurity problems since the universal character set(UCS)12 relies on a lot of visually and semanticallysimilar characters. Figure 1 shows some of the similarcharacters for “3” and “bank.”

It is very easy for phishers to spoof users by replacingcharacters in a target IRI/IDN with similar ones.Although a phishing IRI/IDN looks very similar to orexactly the same to the targeted one, they are different atcoding level. People could easily be deceived by this typeof scam.

On the character level, Unicode attack is similar tohomograph attack.27 However, on the word level, we

would like to consider the following three as the mostpotential types of semantic obfuscation:

a. Pronunciation-based replacement: A phisher mayreplace some words of real IRI/IDN with otherstrings without changing the pronunciation. Asshown in Figure 2, a phisher can change Chinesewords to Pinyin, Japanese words to correspondingRomaji, and English to other commonly usedforms.

b. Abbreviation-based replacement: A phisher may use theabbreviations of the keywords or acronyms of thereal IRI/IDN for obfuscation. As shown in Figure 3,a phisher may change the keywords to their abbre-viations, or conversely replace abbreviations intofull names.

c. Translation-based replacement: A phisher may replacesome words in a real IRI/IDN into the words ofother languages for obfuscation. As shown in Figure 4,there could be many mutations of each part of anIRI/IDN through translation-based replacement.

ANTI-PHISHING PATTERN GENERATION

Unicode Character SimilarityWe investigate the similarity of characters in UCS

from two aspects: visual similarity and semantic similarity.

FIGURE 1 Similar characters of “3” and “bank” in ArialUnicode MS font (adapted from UC-SimList28, and the codeunder each character are in hexadecimal form).

FIGURE 2 Examples of IRI/IDN-based phishing obfuscationusing the same pronunciations.

8

37

145

150

155

160

165

170

175

180

185

190

39

A. Y. Fu et al. 4

Visual similarity of two characters is measured byimage similarity assessment algorithms. As our datasetis quite large, consisting of thousands of readable char-acters, we may prefer the simple, fast, but effectivepixel-overlapping method, as discussed in appendix A.We denote the visual similarity list of Unicode charac-ters as UC-SimListv.

Semantic similarity of two characters is the mea-surement of character similarity in terms of meaning.

It is quite common that one character has other corre-sponding representations. English has uppercase andlowercase letters and uppercase “BANK” correspondsto lowercase “bank”; Chinese characters can be writtenin simplified style and traditional style; e.g., simplifiedChinese “ ” (bank) corresponds to traditionalChinese “ ” (bank); and Japanese katakana“ ” (bank) corresponds to Japanese hiragana“ ” (bank). We denote the semantic similar-ity list of Unicode characters as UC-SimLists. Althoughthe construction of UC-SimLists is very important, it isalso complicated. We have constructed the UC-Sim-Lists of three languages—English, Chinese, and Japa-nese—as discussed in appendix A. Finally, we mergeUC-SimListv and UC-SimLists to generate the UC-SimList, as also presented in appendix A. UC-SimListcontains both visually and semantically similar charac-ters of any given character in UCS.

Figure 5 shows the similar characters of Latin “A” and“a” in the UC-SimListv 0.8 (0.8 is the similarity thresholdas defined in appendix A). In UC-SimLists, the Latin “A”and “a” are semantically similar. Hence the similar char-acters of “A” in UC-SimList 0.8 are as shown in Figure 6.

Modeling the IRI/IDN-based Phishing Patterns with NFA

We used NFA to model the potential IRI/IDN-based phishing obfuscation patterns. In general, the

FIGURE 3 Examples of IRI/IDN-based phishing obfuscationusing abbreviations.

FIGURE 4 Examples of IRI/IDN-based phishing obfuscationusing translation.

FIGURE 5 The visually similar characters of “A (0041)” and “a(0061)” in UC-SimListv with threshold 0.8.

195

200

205

210

215

220

225

230


modeling process is to convert the IRI/IDN list thatwe need to protect (e.g., banks’ domain names) intoan NFA, which can be utilized to perform phishingIRI/IDN detection. It can be constructed by the fol-lowing procedures: (a) we manually construct an NFAon the semantic level; (b) we replace each nonemptytransition in the NFA with parallel-connected similarcharacters of this letter; (c) we replace each emptytransition in the NFA with parallel-connected symbolsthat are valid in IRI. This final NFA is the one used toperform phishing IRI/IDN detection. The detailedprocess is available at appendix B.

Phishing IRI/IDN Pattern GeneratorWe built up a tool, regular expression generator for

anti-phishing (REGAP ver. 0.30),28 to convert fromthe IRI/IDN NFA script directly to phishing detectionregular expression. The IRI/IDN NFA script is anXML-style text string. It is a tree-like structure. Forexample, Figure 7 shows an IRI/IDN NFA script defi-nition of “www.citibank.com”. In this script, we haveone IRI (“www.citibank.com”), which includes twowords (“citi” and “bank”). For each word, there areseveral representations. “Citi” can be represented as“citi,” “city,” “ ,” “ ,” “ ,” and “ .”“Bank” can be represented as “bank,” “ ,” “ ,”“ ,” and “ .” The generated regular expres-sion is shown in Figure 8. The delimiter-like character

set, S, is defined in Figure 9. {0,2} indicates (a standardrepresentation of regular expression) that the delimitercan repeat twice at the most at each position. Weaddress the user interface and usage of REGAP inappendix D in detail. The regular expressions gener-ated by REGAP are standard, such that it can be easilyadapted to Perl, Java, C#, Python, PCRE, PHP, VI,JavaScript, Shell tools, etc. REGAP also uses the regu-lar expression to detect phishing IRI/IDNs, as shownin appendix D.

FIGURE 6 The similar characters of “A (0041)” in UC-SimList0.8.

FIGURE 7 An example to IRI/IDN NFAkw definition script.

FIGURE 8 Regular expression generated by REGAP from theNFAkw definition in Figure 7.

7

8

235

240

250

255

260

265

A. Y. Fu et al. 6

APPLICATION TO INVESTIGATIONWith the popularity of IRI/IDN, many web sites

come to utilize this technology, especially in the coun-tries that are not using Latin characters in their lan-guage scripts. This phenomenon opens a door forUnicode-based identity fraud. People can easily register

any IRI/IDN from most of the DNS registrars. We wouldlike to address three ways of application to investiga-tions as follows.

Help Investigators to Detect Similar IRI/IDN

There are around 70 million domain names in theworld.29 We take “www.citibank.com” as an example.Figure 10 shows the 100% visually/semantically simi-lar characters to “www.citibank.com.” There are 4(w)* 4(w) * 4(w) * 9(c) * 14(i) * 6(t) * 14(i) * 5(b) * 7(a) *5(n) * 5(k) * 9(c) * 7(o) * 5(m) – 1=186,701,769,999possible faked IRI/IDN combinations in total.When we count in the obfuscations on the wordlevel (as discussed previously) or reduce the similar-ity threshold of characters, there will be a lot moremutations of “www.citibank.com.” Phishing IRI/IDNs can be hidden anywhere. Hence, it is impossi-ble for a human to manually detect similar IRI/IDNfrom such a large domain name pool. REGAP is atool that can help a lot. Our experiments (in C#code) show that 70 million domain names can bescanned by the CitiBank pattern, as shown in Figure8, in 80 minutes on a Pentium M 1.3-GHz 1GBRAM laptop.

FIGURE 9 Delimiter-like character set S.

FIGURE 10 One hundred percent visually/semantically similar characters to “www.citibank.com.”

Latin Character 100% Visually/Semantically Similar Characters

270

275

280

285

290

295


For example, “www.citibank.com” is a security-sensitive web site, and we want to find all possiblephishing IRI/IDNs from the domain name list. Withthe help of REGAP, we can generate a regular expres-sion or represent the potential phishing IRI/IDN pat-tern, as shown in Figure 8. Figure 11 shows someexamples of different types of obfuscation that can berecognized by this pattern.

Help Investigators to Collect Evidence through DNS RegistrarsThe application just presented is relatively passive,

because we can only dig out the existing phishing IRI/IDNs. The phishing IRI/IDNs may have been there forquite a while. There are around 1 million domain namesregistered in one day.29 Hence, it could be a goodapproach to install the phishing IRI/IDN detection mod-ule to DNS servers and help monitor the domain nameregistration. We can create a list of the security-sensitiveweb sites’ patterns and use the phishing IRI/IDN detec-tion module to detect and/or trace the potential attack-ers. Hence the attackers could be sued in real-time.

Our experiments (in C# code) show that a laptopwith Pentium M 1.3-GHz and 1GB RAM can helpregister one new domain name while protecting14,000 security-sensitive domain names in one sec-ond. Therefore, such an application performance isfeasible from a technical aspect.

Help DNS Registrars to Provide a New Type of Service

To some extent, it should be reasonable to blockthe registration of potential phishing IRI/IDNs ratherthan to allow them to register and then catch them.REGAP can be very helpful to provide such service forDNS registrars that people are not allowed to registervisually/semantically similar IRI/IDNs, and the per-formance is not a problem (similar to the performancejust described).

However, in the real world, we may be concernedmore about freedom of speech. People are not expect-ing such limitation to registering new domain names.Therefore, this kind of application is still conditionalto some important aspects, such as the policy of thegovernment, humanity, Internet security requirement,industrial promotion, etc.

CONCLUSION AND FUTURE WORKIn this article, we addressed the Unicode attack to

IRI/IDN. We built up the UC-SimList, which can beused to find similar characters for a specified one eas-ily. We proposed an effective detection approach tothis problem using NFA and address its constructionmethod. We built up a potential phishing IRI/IDNpattern generator REGAP Ver. 0.30, which can facili-tate the generation of regular expression from thekeyword-level NFA.

FIGURE 11 Examples of phishing IRI/IDNs that can be detected (in these examples, the prefix [“www.”] and suffix [“.com”] areomitted).

Type of phishing IRI/IDN

Character level visually similar

Character level semantically similar

Word level pronunciation

Word level different language

Word level abbreviation Noise insertion

Phishing IRI/IDN

10

9

300

305

310

315

320

325

330

335

340

345

350

A. Y. Fu et al. 8

We used the simple pixel-overlapping method tomeasure visual similarity of characters. More accuratemethods, like OCR, could definitely be employed.The current version of UC-SimLists only includesEnglish, Chinese, and Japanese characters. We are expect-ing that more languages in UCS could be investigatedin future work. We constructed the keyword-levelNFA manually at this stage. Although the atomizationof this process may involve more complicated seman-tic analysis, it is still possible to find an automatic wayto do it, such as to construct a word–word similaritylist and use it make word replacement.

We discussed three possible approaches to helpsearch phishing IRI/IDNs, collect phishing evidence,and/or provide a new domain name registration ser-vice. The anti-phishing framework we proposed previ-ously could be easily adapted to various systems thatcan process the content of email, instant messages,BBS, chat rooms, files, etc., to perform phishing IRI/IDN searching and/or monitoring. It will be quite use-ful for evidence collection in investigations. Anothergood usage of REGAP is to use it as a tool to provide aspecial domain name registration service that registrarscan help you protect all domain names that areacceptable to the potential phishing IRI/IDN patterns.However, it could cause some limitations to domainname registration.

All of our experiments on testing the performanceare in C# code. Hence, the performance of any appli-cations discussed should hopefully be much fasterwhen we code in C. More performance evaluationshould be carried out in future work.

Although the phishing IRI/IDN attack is not usual,we consider the countermeasure proposed in this arti-cle as a paradigm that can be used in the future. Wealso need to evaluate the false-positive detection whenwe can collect a number of real attack cases.

ACKNOWLEDGEMENTSThe work described in this article was fully sup-

ported by grants from the City University of HongKong (Project No. 7001842 and 7001975) and theChina Semantic Grid Research Plan (National GrandFundamental Research 973 Program, Project No.2003CB317002). We thank Markus Jakobsson andJudie Mulholland for the useful discussion. We alsothank the four anonymous referees for their construc-tive comments and suggestions.

NOTES1. IRI is a generalization of the uniform resource identifier (URI),

which is in turn a generalization of the uniform resource locator(URL). While URIs are limited to a subset of the ASCII character set,IRIs may contain characters from the universal character set (Uni-code/ISO 10646). Basically, an IRI is the internationalized version ofa URI

2. IDN is an Internet domain name that (potentially) contains non-ASCII characters. Such domain names could contain letters withdiacritics, as required by many European languages, or charactersfrom non-Latin scripts such as Arabic or Chinese.

3. Unicode attack is caused by the coexistence of a large number ofvisual/semantically similar Unicode strings. On the character level,the visually similar Unicode attack is homograph attack.

4. NFA is a finite state machine where for each pair of state and inputsymbol there may be several possible next states. We can use it torecognize a string of a certain pattern. When the last input symbolis consumed the NFA accepts if and only if there is some set oftransitions it could make that will take it to an accepting state.Equivalently, it rejects if no matter what choices it makes it wouldnot end in an accepting state.

5. Anti-Phishing Working Group, http://www.antiphishing.org6. M. Duerst and M. Suignard, RFC 3987: Internationalized Resource

Identifiers (IRIs) (The Internet Society, 2005).7. UCS is a character encoding defined in international standard ISO/

IEC 10646. It contains nearly a hundred thousand abstract charac-ters, each identified by an unambiguous name and an integernumber called its code point.

8. T. Berners-Lee, R. Fielding, and L. Masinter, RFC 3986: UniformResource Identifier (URI): Generic Syntax (The Internet Society,2005).

9. ANSI, “American Standard Code for Information Interchange,”http://www.ansi.org

10. “ ” is pronounced “Zhong Guo Yin Hang” and stands for“Bank of China”; “ ” is pronounced “Gong Si” and for“Company.”

11. “ ” is pronounced “Tou Kyou Mitsu Bishi Gin Ko”and stands for “Bank of Tokyo-Mitsubishi”; “ ” is pro-nounced “Kai Shya” and for “Company.”

12. The Unicode Consortium, http://www.unicode.org/ 13. Microsoft, Arial Unicode MS, v. 1.01 (2000).14. Microsoft, “Description of the Arial Unicode MS Font in Word

2002,” http://support.microsoft.com/kb/q287247/15. A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe, “Collec-

tion Statistics for Fast Duplicate Document Detection,” ACMTransactions on Information Systems, 20, no. 2 (2002): 171–191.

16. T. C. Hoad and J. Zobel, “Methods for Identifying Versioned andPlagiarized Documents,” Journal of the American Society for Infor-mation Science and Technology, 54, no. 3 (2003): 203–215.

17. T. Nanno, S. Saito, and M. Okumura, “Structuring Webpages Basedon repetition of Elements,” in Proceedings of the 7th InternationalConference on Document Analysis and Recognition (2003).

18. DOM is a description of how an HTML or XML document is repre-sented in a tree structure. DOM provides an object oriented appli-cation programming interface that allows parsing HTML or XMLinto a well defined tree structure and operating on its contents.

19. L. Wood, “Document Object Model (DOM) Level 1 Specification,”http://www.w3.org/TR/REC-DOM-Level-1

20. W. Liu, X. Deng, G. Huang, and A. Y. Fu, An Anti-Phishing Strat-egy Based on Visual Similarity Assessment, IEEE Internet Comput-ing 10, no. 2 (2006): 58–65.

21. A. Y. Fu, W. Liu, and X. Deng, “Detecting Phishing Web Pages withVisual Similarity Assessment based on Earth Mover's Distance (EMD),”to appear in IEEE Transactions on Dependable and Secure Computing

22. R. Dhamija, and J. D. Tygar, “The Battle against Phishing: DynamicSecurity Skins,” in Proceedings of Symposium On Usable Privacyand Security 2005 (2005).

11

13

14

15

16

17

18

19

20

21

22

23

12

355

360

365

370

375

380

385

390

395

400

405

410

415

425

445

450

460


23. M. Jakobsson, “Modeling and Preventing Phishing Attacks,” inProceedings of Phishing Panel of Financial Cryptography 2005(2005).

24. Netscape, “SSL,” http://wp.netscape.com/eng/ssl3/25. A. Y. Fu, X. Deng, and W. Liu, “A Potential IRI Based Phishing

Strategy,” in Proceedings of the 6th International Conference onWeb Information Systems Engineering (WISE 2005) (New York:2005), 618–619.

26. A. Y. Fu, X. Deng, W. Liu, and G. Little, “The Methodology and anApplication to Fight against Unicode Attacks,” in Proceedings ofSOUPS 2006 Pittsburgh: CMU, (2006).

27. E. Gabrilovich and A. Gontmakher, “The Homograph Attack,”Communications of the ACM 45, no. 2 (2002): 128.

28. Anti-Phishing Group at CityU, http://antiphishing.cs.cityu.edu.hk/IRI, June 2005.

29. http://www.domaintools.com/internet-statistics/30. P. Linz, An Introduction to Formal Languages and Automata, 3rd

ed., (Jones and Bartlett Publishers, Inc., 2001).

APPENDIX A. UC-SIMLIST GENERATION

The visual similarity calculated with this method isempirically general for most fonts since charactershave the same style in the same font, such that similarcharacter pairs are still similar when we use anotherfont to calculate the similarity of the pairs of charac-ters. We used Unicode Arial MS Font Ver.1.0113 asour character image generation font since, among allfonts we could find, this font covers the largest num-ber of characters. It covers most of the characters inthe latest Unicode standard Ver. 4.312 and all charac-ters of Unicode Ver. 2.1.12 We calculate the similarityof each pair of characters using the pixel overlappingpercentage of them. For example, when we want tocalculate the similarity of two characters, we one char-acter onto the other and align them and count thenumber of the mutual pixels in the two glyphs; thesimilarity value is the mutual pixel number divided bythe larger glyph’s pixel number. Our experimentsshow that when we set the size of Arial Unicode Fontto 30 points, it performs well to calculate the similari-ties of character pairs without losing accuracy. It takesabout one minute to compute the similarities of onecharacter to all the others. Hence, it took 29 days togenerate the entire UC-SimListv on an ordinary PC(Pentium IV 2.4-GHz single CPU and 512-MB RAM).We selected a threshold to define whether a pair ofcharacters is similar. We also found from experimentsthat 0.8 is a good choice to simulate the human eye’sclassification curve. UC-SimListv with different thresh-olds are available.28

The current version of UC-SimLists includesEnglish, Chinese, and Japanese and is available online.28

We leave other language parts as an open engineeringproblem and expect people could add and more lan-guages to UC-SimLists as future work.

APPENDIX B. THE THREE STEPS TO CONSTRUCT IRI/IDN NFA

Construct an NFA on the Semantic Level

Suppose we have a list of IRIs LIRI that need to beprotected. We construct the keyword-level patternNFA NFAkw from LIRI; i.e., LIRI ® NFAkw. We findthe keywords for each IRI, expand the keywordssemantically, connect them properly with empty tran-sitions to form an isolated NFA, and then merge theisolated NFAs into one NFA NFAkw in a parallel way.

Suppose we have an IRIi Î LIRI. We convert i intosequences of separated words using the three mostpotential types of semantic expansion of keyword inthe Unicode Attack to IRI/IDN section to constructNFAkw. We denote the keyword-level pattern NFA as

where

is the keyword-level pattern NFA of i, Ni is the num-ber of possible sequential combinations of theexpanded keywords ofi, Vm denotes one such combina-tion, where 1 £ m £ Ni, and “È” denotes parallel connec-tion relationship of NFAs. We construct NFAkw

manually because the number of IRIs that need to beprotected is not big and we can construct NFAkw, foreach i Î LIRI carefully. In a practical system, we use XMLto represent NFAkw, as is discussed in previously. TheNFAkw is generated by merging all NFAkwi, i Î LIRI. Themerging process can be accomplished by merging all ini-tial state into a unique q0, and final states into a unique

. We construct the character-level pattern NFANFA from NFAkw; i.e., NFAkw ® NFAc. This pro-cess can be accomplished as in the following twosections.

Replace the Nonempty TransitionsReplace nonempty characters in NFAkw with

parallel-connected corresponding similar charac-ters that can be found from UC-SimList by emptystrings, and we denote it as NFAc; i.e., NFAkw ®

NFA NFAkw kwii LIRI

=Î∪ NFA NFAkwi m

m Ni

=£ £

ς1

∪

◎

24

25

26

27

28

29

30

31

32

31

475

485

490

495

505

510

515

520

525

530

535

540

545

550

555

560

A. Y. Fu et al. 10

NFAc. We get the similar characters of each character of{c,i,t,y,b,a,n,k, }from UC-SimList 0.8. Figure B1 shows the similarcharacter sets of these characters; each element is rep-resented by a character followed by its hexadecimalcode. We denote the NFA formed by similar characterset of characteras. Figure B2 shows the examples of

. The similar characters in each setare connected in a parallel way. We replace each char-acter char in NFAkw with S(char) to generate asshown in Figure B2. There could be more IRIs otherthan i=“www.citybank.com” in LIRI. They are repre-sented in a simplified way at the upper and lower posi-tion of where we denote the NFA generatedfrom other IRIs with NFAkw.

Replace the Empty TransitionsWe replace each empty transition in

with a transition with all delimiter characters in aparallel way and call it a delimiter transition (DT).The delimiter-like character set is denoted as S,which includes the empty string l and all characterslook like the 22 valid symbols (‘-’, ‘.’, ‘_’, ‘~’, ‘:’, ‘/’,‘?’, ‘#’, ‘[‘, ‘]’, ‘@’, ‘!’, ‘$’, ‘&’, ‘'’, ‘(‘,’)’, ‘*’, ‘+’, ‘,’,

‘;’, ‘=’) defined in RFC39868 and RFC3987;6 i.e.,S={l, S(–), S(.)…}. We denoted the generated NFA asNFAc; i.e., as shown in Figure B2.More concatenated DTs can be used to replace theempty string in and each DT means oneappearance of delimiter appearance in the context. Weused one DT in our approach for simplicity. So far, we

FIGURE B1 Similar character sets of each character in {c,i,t,y,b,a,n,k,,} (“ ” denotes that Form 2 is thesame as Form 1, and “ ” denotes the omitted content).

NFAc’

NFAkw.

λ¾ ®¾ NFAc¢

NFA NFAc c¢ ®

NFAc¢

33

565

570

575

580

585

590

595

600


have the key character pattern NFA of LIRI; i.e., NFAc,which is the final NFA that can be used to detectphishing IRI.

APPENDIX C. GENERATE THE PHISHING IRI/IDN PATTERN IN

REGULAR EXPRESSIONRegular expression (RE) is equivalent to NFA. We

convert NFAc to regular expression to ease the adapta-tion approach to anti-phishing systems. As NFAc isgenerated from LIRI with specified rules, the structureof NFAc is not complicated. We can apply the follow-ing procedure to convert NFAc to a regular expression.We use “|” to denote union, “” (empty string) for con-catenation, “{m,n}” for repetition number between mand n. In our practical approach (See Conclusion andFuture Work), we configure m to be 0 and n to be 1 for

simplicity. Figure C1 shows the regular expressiongeneration procedure in our approach.

Figure C2 shows the regular expression generatedfrom NFAc, where i = “www.citibank.com.” We repre-sent the regular expression string generated in step c asG (char), wherechar is the same character in S (char).Figure C3 shows an example of G (char) when char =“c” where we use “u” + Unicode to represent a charac-ter. The regular expression of NFAc is the concatena-tion of all regular expressions of (i Î LIRI) with“|”, as shown in step h of the regular expression gener-ation procedure.

APPENDIX D. HOW TO USE REGAPREGAP is a tool that can help generate regular

expressions to detect phishing IRI/IDNs. The currentversion of REGAP is 0.30. We demonstrate the usageof REGAP in the following steps.

FIGURE B2 NFA representation (qx and qy denote two states).

NFAci

605

610

615

620

625

630

635

A. Y. Fu et al. 12

Step 1. We can follow the format as shown in Figure D1to write the IRI/IDN NFA script. One script caninclude many IRI/IDNs. Each IRI/IDN has a name(between “<IRINAME>” and “</IRINAME>”)and is composed of one or more semantic word(s).Each semantic word has a name (between“<WORDNAME>” and “</WORDNAME>”)and is composed of one or more sub-word(s). Com-ments can be added anywhere except the areabetween any pair of “<SW> </SW>” tags.

Step 2. When we finish step 1, we can generate theregular expression according to the given IRI/IDNNFA script. We go to the “Regular Expression” tab-pad and click “Generate>>” button. The generatedregular expression is shown in Figure D2. We canalso control the generation precession using differ-ent configurations. The “Replace S with delimiters”checkbox controls the representation form ofdelimiters. When we uncheck it, the generated regu-lar expression will look like that in Figure D3, where

each long expression, as shown in Figure 9, isreplaced by S to simplify the representation. The“Display Code” check box controls whether to dis-play the code of each character in the regularexpression, as shown in Figure D4. “Range ofdelimiter number” controls the number of immedi-ate delimiters between each pair of characters; e.g.,when we set it to 2, then an IRI like “www.c--i--t--i-b-a.-n--k” can be recognized while “www.c--i--t--i-b-a.-n--k” cannot. “UC-SimList” controls the character–character similarity that you want to use.

Step 3. After generating the regular expression, we canuse REGAP to detect phishing IRI/IDNs. We cansimply copy and paste the IRI/IDNs for detectioninto the “IRI/IDN List” textbox in “Phishing IRI/IDN Detection” tabpad and click the “Detect>>”button. The detected potential phishing IRI/IDNswill be displayed in the “Detected List” textbox.Figure D5 shows that we have detected 10 phishingIRI/IDNs from 1656 IRI/IDNs.

FIGURE C1 Regular expression generation procedure.

a. Get the sub-NFA generated from IRILi ∈ in cNFA which has not been processed,

and denote it asicNFA ;

b. Get the )(charS (similar character set of character char ), concatenate each character in )(charS with “|”, and bracket the generated string;

c. If not all )(charS has been processed then goto b; otherwise, continue; d. Concatenate strings in parallel relationships with “|”, and bracket the generated

string; e. If not all

icNFA has been processed then goto a, otherwise, continue;

f. Replace all delimiter transitions (DTs) in cNFA to delimiter-like character set

repetition },{ nmΣ ;g. Concatenate strings in parallel relationships with “|”; h. Return the generated string.

FIGURE C2 Regular expression of when i = “www.citybank.com.”NFAci

)

))))(},{)(}((,{

)))(},{)((|))(},{)((((

|

))))(}((,{

)))((((

|

))))(},{)(},{)((|))(},{)(},{)(},{)(}((,{

)))(},{)(},{)((|))(},{)(},{)(},{)((|))(},{)(},{)(},{)((((

(

ΓΣΓΣΓΣΓΓΣΓ

ΓΣΓ

ΓΣΓΣΓΓΣΓΣΓΣΓΣΓΣΓΣΓΓΣΓΣΓΣΓΓΣΓΣΓΣΓ

nmnm

nmnm

nm

nmnmknmnnmanmbnm

nmnminmtnminmcynmtnminmc

FIGURE C3 G (char) (RE of S(char), when char = “c”).

(u0187))|(u03DA)|(u00C7)|(u10BA)|(u04AA)|(u0480)|(u0404)|(uFF23)|(u0421)|(u216D)|(u0043)|

(u0188)|(u0481)|(u0255)(u00E7)|(u04AB)|(u0454)|(u217D)|(u03F2)|(uFF43)|(u4410)|((u0063)|

34

35

35

34

640

645

650

655

660

665

670

675


In addition, we can edit the delimiters that we wantto detect in the “Delimiter List” tabpad, one line foreach delimiter. The default delimiters are the 22 legal

symbols in IRI/IDN standard. The REGAP also usesthe similar characters from UC-SimList for the regularexpression generation.

FIGURE D1 Input the IRI/IDN NFA script to the textbox in “Semantic IRI/IDN” tabpad.

FIGURE D2 Generate the regular expression of IRI/IDN NFA.

36

680 685690

A. Y. Fu et al. 14

FIGURE D3 Represent delimiter list with S.

FIGURE D4 Display the code of each character.


FIGURE D5 Phishing IRI/IDN detection.

FIGURE D6 Delimiter list.

38