

Page 1: A Statistical Approach to Typosquatting Detection DNS Ops Workshop, 4-5 June 2008 Alessandro Linari alessandro@nominet.org.uk and Oxford Brookes University

A Statistical Approach to Typosquatting Detection

DNS Ops Workshop, 4-5 June 2008

Alessandro Linari

alessandro@nominet.org.uk

and Oxford Brookes University


Introduction

Typosquatting is the practice of registering a domain name that differs by a typographical error from the name of a trademark or a well-known domain

• Growing phenomenon over the Internet

– Well-understood from a legal point of view

– Lack of a technical characterisation

• First attempt at a

– Technical definition

– Statistical characterisation


Typosquatting: gooogle.co.uk


Syntactic and Confusing Similarity

• Syntactically similar

– gooogle.co.uk

– googgle.co.uk

– bgoogle.co.uk

• Confusingly similar

– google-news.co.uk

– google-groups.co.uk

– askgoogles.co.uk

• Visually similar

– GOOGL3.co.uk

– GO0GLE.co.uk
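
One way to make the "visually similar" category concrete is to fold look-alike characters before comparing names. The sketch below is only an illustration: the substitution table `LOOKALIKES` and both function names are assumptions, not something defined in the talk.

```python
# Illustrative look-alike table (an assumption, not from the talk):
# digits that are commonly read as letters.
LOOKALIKES = {"0": "o", "1": "l", "3": "e", "5": "s"}


def fold_lookalikes(label: str) -> str:
    """Lower-case a label and replace look-alike digits with letters."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in label.lower())


def visually_similar(candidate: str, target: str) -> bool:
    """True if two different labels become identical after folding."""
    if candidate.lower() == target.lower():
        return False  # same name, not a look-alike
    return fold_lookalikes(candidate) == fold_lookalikes(target)


print(visually_similar("GOOGL3", "google"))   # True
print(visually_similar("GO0GLE", "google"))   # True
print(visually_similar("gooogle", "google"))  # False: a typo, not a look-alike
```

A real detector would need a much larger table; this one only covers the two examples on the slide plus two other common digit swaps.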


Syntactic Neighbourhood

Given a domain D, the syntactic neighbourhood of D is the set of all domains in the registry whose edit distance from D is equal to 1

[Diagram: domain D inside the registry, with its neighbourhood N at dist = 1]


• Edit distance

– Minimum number of operations needed to transform one string into the other

– An operation is an insertion, deletion, or substitution of a single character
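
The edit distance defined above is the classic Levenshtein distance. A minimal sketch of the standard dynamic-programming computation follows; the function name `edit_distance` is an assumption, and the talk does not say which implementation was actually used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: single-character insertions, deletions, substitutions."""
    # previous[j] = distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(edit_distance("google", "gooogle"))   # 1 (one insertion)
print(edit_distance("google", "bgoogle"))   # 1 (one insertion)
print(edit_distance("nominet", "onminet"))  # 2 (a swap counts as two operations)
```

Because a transposition of adjacent characters is not one of the three operations, swapped letters land at distance 2 rather than 1.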


[Diagram: example names at d=1 and d=2 from the domain; legible labels include ominet, mominet, onminet, noominet]
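
For a single domain, the neighbourhood can also be found without scanning the registry: generate every string at edit distance 1 and keep those that are registered. This is a sketch under assumptions: the alphabet (letters, digits, hyphen), the function names, and the toy registry are all hypothetical.

```python
import string

# Characters allowed in a label: letters, digits, hyphen (an assumption).
ALPHABET = string.ascii_lowercase + string.digits + "-"


def one_edit_variants(label: str) -> set[str]:
    """All strings at edit distance exactly 1 from label."""
    variants = set()
    for i in range(len(label) + 1):
        for ch in ALPHABET:
            variants.add(label[:i] + ch + label[i:])      # insertion
    for i in range(len(label)):
        variants.add(label[:i] + label[i + 1:])           # deletion
        for ch in ALPHABET:
            variants.add(label[:i] + ch + label[i + 1:])  # substitution
    variants.discard(label)  # substituting a character with itself is distance 0
    return variants


def neighbourhood(label: str, registry: set[str]) -> set[str]:
    """Registered labels at edit distance 1 from label."""
    return one_edit_variants(label) & registry


# Toy registry, for illustration only.
toy_registry = {"nominet", "mominet", "noominet", "onminet", "ominet", "bbc"}
print(sorted(neighbourhood("nominet", toy_registry)))
# ['mominet', 'noominet', 'ominet']
```

Here onminet is excluded because swapping two letters costs two operations.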


Outline

• Correlation between popularity of a domain name and size of its neighbourhood

• Presence of “typosquatter-friendly” registrars in the neighbourhood of popular domains

[Diagram: ANY-DOMAIN (.co.uk) with its NEIGHBOURHOOD (anyYdomain, anyA-domain, Aany-domain) set apart from the REST-OF-THE-REGISTRY (amazon, bbc, nominet, yahoo, ebay, theregister, aol)]


Experimental Setting

• Choose a domain name X

• Compute the distance between X and all domains in the registry

• Compute the size of X’s neighbourhood

• Compute the average size of a neighbourhood for domains of each length

– E.g., bbc.co.uk and allianceandleicester.co.uk have different distributions
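
The steps above can be sketched as a direct scan of the registry. Since only distance 1 matters, a linear-time "exactly one edit" check is enough. All names and the toy data are assumptions made for illustration; the talk does not describe its implementation.

```python
from collections import defaultdict


def is_one_edit(a: str, b: str) -> bool:
    """True if the edit distance between a and b is exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                     # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:  # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]   # one substitution at position i
    return a[i:] == b[i + 1:]           # one extra character in b at position i


def neighbourhood_size(x: str, registry: list[str]) -> int:
    """Number of registry labels at distance 1 from x."""
    return sum(is_one_edit(x, other) for other in registry)


def average_size_by_length(sample: list[str], registry: list[str]) -> dict[int, float]:
    """Average neighbourhood size, grouped by the length of the sampled label."""
    sizes = defaultdict(list)
    for x in sample:
        sizes[len(x)].append(neighbourhood_size(x, registry))
    return {n: sum(v) / len(v) for n, v in sorted(sizes.items())}


# Toy data, for illustration only.
toy_registry = ["bbc", "bbd", "abc", "bbcc", "ebay", "ebuy", "yahoo"]
print(average_size_by_length(["bbc", "ebay"], toy_registry))
# {3: 3.0, 4: 1.0}
```

Grouping by length matters because short labels sit in a much denser part of the namespace than long ones.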


Experimental Setting

• Only .co.uk web sites considered (March 2008)

– Length refers to the third-level label

• Set of random domains (expected behaviour)

– 1000 domains for each length (random sample)

• Set of top-1000 popular domains (source NetCraft.com)

– Band A: domains with ranking in [1,100]

– Band B: domains with ranking in [101,500]

– Band C: domains with ranking in [501,1000]
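
The two data sets could be prepared roughly as follows. `band` and `sample_per_length` are hypothetical names, and the fixed seed is an assumption added only so the sketch is reproducible.

```python
import random
from collections import defaultdict


def band(rank: int) -> str:
    """Map a popularity ranking in [1, 1000] to its band."""
    if 1 <= rank <= 100:
        return "A"
    if 101 <= rank <= 500:
        return "B"
    if 501 <= rank <= 1000:
        return "C"
    raise ValueError(f"rank {rank} is outside the top 1000")


def sample_per_length(labels, per_length=1000, seed=0):
    """Draw up to per_length random labels for each label length."""
    by_length = defaultdict(list)
    for label in labels:
        by_length[len(label)].append(label)
    rng = random.Random(seed)
    return {
        n: rng.sample(group, min(per_length, len(group)))
        for n, group in sorted(by_length.items())
    }


print(band(1), band(101), band(1000))  # A B C
```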


[Chart: Size of the neighbourhood for popular domains. x-axis: Num. chars (3rd level domain), 3 to 12; y-axis: Avg. num. domains, 0 to 150; series: Random, Band A, Band B, Band C; annotation: google, amazon]

Neighbourhood and Popularity


Distribution of Registrars

• Fraction of domain names owned by each registrar


Experimental Setting

• Consider only domains in Band A’s neighbourhood

– i.e., any domain at dist=1 from at least one domain in Band A

• Compute the number of domains owned by each registrar (distribution)

• For each registrar, compute the percent increase with respect to the registry-wide distribution

$$I\% = \frac{\mathrm{FracDom}(\mathrm{BandA}) - \mathrm{FracDom}(\mathrm{registry})}{\mathrm{FracDom}(\mathrm{registry})} \cdot 100$$
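
As a worked illustration of the percent increase, the sketch below computes each registrar's share of the whole registry and of the Band A neighbourhood, then applies the formula. The function names, the `registrar_of` mapping, and the toy data are all hypothetical.

```python
from collections import Counter


def fractions(registrar_of: dict[str, str], domains) -> dict[str, float]:
    """Fraction of the given domains held by each registrar."""
    counts = Counter(registrar_of[d] for d in domains)
    total = sum(counts.values())
    return {r: c / total for r, c in counts.items()}


def percent_increase(registrar_of: dict[str, str], band_a_neighbourhood) -> dict[str, float]:
    """I% per registrar: relative change of its share inside the neighbourhood."""
    whole = fractions(registrar_of, registrar_of)          # registry-wide shares
    local = fractions(registrar_of, band_a_neighbourhood)  # shares in the neighbourhood
    return {r: (local.get(r, 0.0) - f) / f * 100 for r, f in whole.items()}


# Toy registry: registrar R1 holds 8 of 10 domains, R2 holds 2.
registrar_of = {f"shop{i}": "R1" for i in range(7)}
registrar_of.update({"bbcc": "R1", "gooogle": "R2", "ebayy": "R2"})
band_a_neighbourhood = ["bbcc", "gooogle"]  # one domain from each registrar

for registrar, value in sorted(percent_increase(registrar_of, band_a_neighbourhood).items()):
    print(registrar, round(value, 1))
```

In this toy case R2 holds 20% of the registry but 50% of the neighbourhood, so its increase is about +150%, while R1 drops by about 37.5%.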


Distribution of Registrars (Band A)

[Chart: Presence of a registrar in the Band A neighbourhood. x-axis: Registrars; y-axis: Increase percent, 0% to 10000%]

[Chart: Registered domain names (total). x-axis: Registrars; y-axis: Num. domains, ticks at 1, 100, 10000]


Discussion

• Manual analysis of 25 registrars, each holding between 100 and 1000 domains

– Big registrars are complex to analyse (not present in this chart)

– Small registrars do not contribute to reliable statistics


Discussion

• One of the big domain names owns the majority of its neighbourhood

• Interesting activity for 6 5 registrars

– A large fraction of their domains are syntactically or confusingly similar to popular domain names

• Normal activity for 8 registrars (false positives)

• No relevant findings in the other cases


Further Research Directions

• Insight into the typosquatting phenomenon

– Domain name neighbourhood

– First attempt at a statistical characterisation

• More questions than answers

– Name servers used by typosquatters

– Domain names containing common words

– Content of the website

– …




Thank you!!!


Questions?


Backup Slides


Length of a domain name

[Chart: Distrib. of lengths of domain names. x-axis: Num. chars, 0 to 60; y-axis: Num. domain names (thousands), 0 to 600; inset covering 40 to 60 chars]

• co.uk domains only

• Length always refers to the third level domain


Length of a domain name

• 3- and 4-character domains are not meaningful

• Part of the neighbourhood of 5-character domains lies in the 4-character space

[Chart: Fraction of namespace registered. x-axis: Num. chars (third level domain), 2 to 7; y-axis: 0.01% to 100.00%; series: Domains registered, Domains free]
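
The fraction of namespace registered for a given length is the number of registered labels divided by the number of possible labels. The sketch below assumes labels are built from letters, digits and hyphens, with no hyphen in the first or last position. That rule, the function names, and the example count are assumptions, not figures from the talk.

```python
def namespace_size(n: int) -> int:
    """Possible labels of length n: 36 choices at each end, 37 in between (assumed rule)."""
    if n < 1:
        raise ValueError("label length must be at least 1")
    if n == 1:
        return 36
    return 36 * 37 ** (n - 2) * 36


def fraction_registered(registered: int, n: int) -> float:
    """Share of the length-n namespace that is taken."""
    return registered / namespace_size(n)


for n in range(2, 8):
    print(n, namespace_size(n))

# Hypothetical count, for illustration only: 40,000 registered 3-character labels.
print(f"{fraction_registered(40_000, 3):.1%}")
```

The namespace grows by a factor of 37 with each extra character, which is why very short lengths can be almost fully registered while longer ones stay sparse.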


Distance between domain names

• ~100 domains (for each length) compared against the whole dataset

• Average number of domains at a given distance

[Chart: Avg. number of domains at a given distance. x-axis: distance, 0 to 60; y-axis: Number of domains (x1000), 0 to 1200; series: 3 chars, 5 chars, 8 chars, 10 chars]

Statistical characterisation


Top-100 (band A) domain names

• ~10 domains (for each length) compared against the whole dataset

• Average number of domains at a given distance

[Chart: Top-100 domain partitioned on length. x-axis: distance, 0 to 60; y-axis: Num. domain names (x1000), 0 to 1200; series: 3 chars, 5 chars, 8 chars, 10 chars]

Statistical characterisation