1 uncovering functional networks in internet traffic mark meiss september 25, 2006

1

Uncovering Functional Networks in Internet Traffic

Mark Meiss

September 25, 2006

2

Who am I?

Mark Meiss

• Ph.D. candidate in Computer Science

– Committee: Filippo Menczer, Alessandro Vespignani, Katy Börner, Minaxi Gupta, Kay Connelly

• Researcher at the Advanced Network Management Laboratory (ANML)

– http://anml.iu.edu/


4

What’s the agenda?

The subject of today’s story:

• Finding a way to improve security without compromising user privacy

• A case study in applied network science

This work is done with Filippo Menczer and Alessandro Vespignani.

5

What do people do online?

There’s what we imagine…

• surfing

• sending email

• playing games

6

What do people do online?

And there’s what is actually happening…

• file sharing

• worms & viruses

• porn

7

Not just a value judgment

These applications all affect the health of a data network.

There are legal problems, yes; but also…

• Crowding out other applications.

– (Napster was once over 70% of all IUB traffic)

• Compromised computers are used to launch further attacks.

• “Common nuisances” are on the ’Net as well.

8

The bottom line

Network administrators need to be able to identify what applications are being used on the network.

…but this can be very difficult.

9

A crash course in data networks

We’ll use a running example:

• Buddy Bradley wants to read a web page about his favorite band at Vulgar Entertainment, Inc.

20

Quick summary

• Each network conversation is identified by four pieces of information

– Client address and port number

– Server address and port number

• The server uses a well-known port number

• The client uses an ephemeral port number

21

So why is it hard to identify applications?

• Well-known ports are a convention, not a rule

– Web, e-mail, etc. do have ports assigned by the IANA

– BitTorrent, Gnutella, Napster, etc. do not

• Client and server ports share the same namespace

• In practice…

– Any application can use any pair of port numbers

• Our focus: discovering what application is running on a port with no assigned use.

22

The conventional solution

Let’s look inside all of those packets!

25

Another problem

• Packet inspection doesn’t scale

– Modern high-speed networks run at 10 gigabits per second or faster (that’s one full DVD every few seconds)

– General-purpose computers can’t even copy that data in real time
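The DVD comparison can be checked with rough arithmetic. The sketch below assumes a single-layer DVD of 4.7 GB (decimal units) and a fully saturated link; neither figure comes from the slides.

```python
# Rough arithmetic: how long does a saturated 10 Gbps link take to move one DVD?
# Assumptions (not from the talk): 4.7 GB single-layer DVD, decimal units.
LINK_BITS_PER_SECOND = 10 * 10**9   # 10 gigabits per second
DVD_BYTES = 4.7 * 10**9             # assumed single-layer DVD capacity


def seconds_per_dvd(link_bps: float, dvd_bytes: float) -> float:
    """Time in seconds to transfer dvd_bytes over a link running at link_bps."""
    return dvd_bytes * 8 / link_bps


seconds = seconds_per_dvd(LINK_BITS_PER_SECOND, DVD_BYTES)
print(f"{seconds:.2f} s per DVD")
```

Under these assumptions the result is a little under four seconds, which is consistent with "every few seconds".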

28

Introducing the “flow”

• We can summarize Buddy’s Web surfing as two flows:

– 192.168.65.33:13029 to 10.99.205.122:80 (456 bytes)

– 10.99.205.122:80 to 192.168.65.33:13029 (63,211 bytes)
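One way to picture a flow record is as a small immutable structure. This is a minimal sketch: the class and field names are illustrative assumptions, not the NetFlow record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One direction of a conversation, summarized (field names are illustrative)."""
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    num_bytes: int

    def key(self):
        """The four values that identify this direction of the conversation."""
        return (self.src_addr, self.src_port, self.dst_addr, self.dst_port)

    def reverse_key(self):
        """The key of the flow travelling in the opposite direction."""
        return (self.dst_addr, self.dst_port, self.src_addr, self.src_port)


# Buddy's request and the server's response, using the values from the slide.
request = Flow("192.168.65.33", 13029, "10.99.205.122", 80, 456)
response = Flow("10.99.205.122", 80, "192.168.65.33", 13029, 63211)

# The two flows belong to one conversation: each is the reverse of the other.
same_conversation = request.key() == response.reverse_key()
total_bytes = request.num_bytes + response.num_bytes
```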

29

Where do flows come from?

• Architectural features of Internet routers allow them to export flow data

• Routers can’t summarize all the data

– Packets are sampled to construct the flows

– Typical sampling rate is around 1:100
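A sketch of what 1:100 sampling implies for byte counts: keep one packet in every hundred and scale the sampled total back up. Deterministic every-Nth sampling and the synthetic packet sizes are assumptions for illustration; real routers may sample differently.

```python
def sample_and_estimate(packet_sizes, rate=100):
    """Keep every `rate`-th packet and scale the sampled byte count by `rate`."""
    sampled = packet_sizes[rate - 1::rate]   # deterministic 1-in-rate sampling
    return sum(sampled) * rate


# Synthetic traffic: 30,000 packets cycling through three made-up sizes.
packets = [40, 576, 1500] * 10000
estimated_bytes = sample_and_estimate(packets, rate=100)
actual_bytes = sum(packets)
```

The estimate matches the true total here only because the synthetic pattern is perfectly regular; with real traffic the scaled figure is an approximation.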

30

What can you do with a flow?

• Usual answer:

– Treat a flow as a record in a relational database

– Who talked to port 1337?

– What proportion of our traffic is on port 80?

– Who is scanning for vulnerable systems?

– Which hosts are infected with this worm?

• These are useful and valid questions.
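The relational view can be sketched with an in-memory SQLite table. The table layout and the third row (a host talking to port 1337) are hypothetical; only the first two rows come from the running example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flows "
    "(src TEXT, src_port INTEGER, dst TEXT, dst_port INTEGER, bytes INTEGER)"
)
conn.executemany(
    "INSERT INTO flows VALUES (?, ?, ?, ?, ?)",
    [
        ("192.168.65.33", 13029, "10.99.205.122", 80, 456),
        ("10.99.205.122", 80, "192.168.65.33", 13029, 63211),
        ("192.168.65.40", 40111, "10.99.205.7", 1337, 1200),  # hypothetical row
    ],
)

# Who talked to port 1337?
talkers = [row[0] for row in conn.execute(
    "SELECT DISTINCT src FROM flows WHERE dst_port = 1337")]

# What proportion of our traffic (by bytes) is on port 80?
port80_bytes, all_bytes = conn.execute(
    "SELECT SUM(CASE WHEN src_port = 80 OR dst_port = 80 THEN bytes ELSE 0 END), "
    "SUM(bytes) FROM flows"
).fetchone()
proportion_port80 = port80_bytes / all_bytes
```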

31

What can you do with a flow?

• Our approach:

– Treat a flow as a directed, weighted edge

– The resulting network describes user behavior

• Hold that thought for now…
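Treating flows as edges might look like the sketch below, which sums bytes per (source, destination) pair. Using bytes as the edge weight and the third flow are assumptions for illustration.

```python
from collections import defaultdict


def build_digraph(flows):
    """Aggregate (src, dst, bytes) flows into directed edge weights w[src][dst]."""
    weights = defaultdict(lambda: defaultdict(int))
    for src, dst, num_bytes in flows:
        weights[src][dst] += num_bytes
    return weights


flows = [
    ("192.168.65.33", "10.99.205.122", 456),
    ("10.99.205.122", "192.168.65.33", 63211),
    ("192.168.65.33", "10.99.205.122", 300),   # hypothetical second request
]
graph = build_digraph(flows)

# Repeated flows between the same pair accumulate on a single directed edge.
client_to_server = graph["192.168.65.33"]["10.99.205.122"]
server_to_client = graph["10.99.205.122"]["192.168.65.33"]
```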

32

The Internet2/Abilene network

• TCP/IP network connecting research and educational institutions in the U.S.

– Over 200 universities and corporate research labs

• Also provides transit service between Pacific Rim and European networks

33

Why study Abilene?

• Wide-area network that includes both domestic and international traffic

• Heterogeneous user base including hundreds of thousands of undergraduates

• High capacity network (10-Gbps fiber-optic links) that has never been congested

• Research partnership gives access to (anonymized) traffic data unavailable from commercial networks

34

Flow collection

Flows are exported in Cisco’s netflow-v5 format and anonymized before being written to disk.

35

Data dimensions

• Observed Abilene on April 14, 2005

– About 200 terabytes of data exchanged

– This is roughly 25,000 DVDs of information

• 600 million flow records

– Almost 28 gigabytes on disk

– 15 million unique hosts involved

37

Weighted bipartite digraph

38

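[Equations not recoverable from extraction: definitions of the in-strength s_in and out-strength s_out as sums of edge weights w, with the sums running from 1 to M and from 1 to N.]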

39

Multiple digraphs

[Figure panels: Port 80 (Web), Port 6346 (Gnutella), Port 25 (Mail), Port 19101 (???)]

40

Application correlation

• Consider the out-strength of a client in the networks for ports p and q:

$$s_i^p = \sum_j w_{ij}^p \qquad s_i^q = \sum_j w_{ij}^q$$

41


• Build a pair of vectors from the distribution of strength values:

$$\mathbf{p} = (s_1^p, \ldots, s_{|C|}^p) \qquad \mathbf{q} = (s_1^q, \ldots, s_{|C|}^q)$$
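As a sketch, the strength vector for a port can be built by summing each client's outgoing weight on that port, in a fixed client order. The tuple layout, host names, and byte counts below are hypothetical.

```python
from collections import defaultdict


def strength_vector(flows, port, clients):
    """Out-strength of each client on `port`: total bytes it sent on that port.

    `flows` holds (client, server, port, bytes) tuples; `clients` fixes the
    order of the vector's components so vectors for different ports align.
    """
    strength = defaultdict(int)
    for client, _server, flow_port, num_bytes in flows:
        if flow_port == port:
            strength[client] += num_bytes
    return [strength[c] for c in clients]


# Hypothetical flows: clients a, b, c talking to servers x, y, z.
flows = [
    ("a", "x", 80, 100),
    ("a", "y", 80, 50),
    ("b", "x", 80, 10),
    ("a", "z", 6346, 200),
    ("c", "z", 6346, 30),
]
clients = ["a", "b", "c"]
p = strength_vector(flows, 80, clients)
q = strength_vector(flows, 6346, clients)
```

A client that never used a port simply contributes a zero to that port's vector.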

42


• Examine the cosine similarity of the vectors:

$$\sigma(p, q) = \frac{\mathbf{p} \cdot \mathbf{q}}{\|\mathbf{p}\| \, \|\mathbf{q}\|}$$

• When σ = 0, applications p and q are never used together.

• When σ = 1, applications p and q are always used together, and to the same extent.
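Cosine similarity is the dot product of the two vectors divided by the product of their lengths. A minimal pure-Python sketch follows; returning 0 for an all-zero vector is an assumption made here, not something the slides specify.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length strength vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # assumption: a port with no traffic is similar to nothing
    return dot / (norm_p * norm_q)


# No client uses both ports: the vectors are orthogonal.
disjoint = cosine_similarity([5, 0, 0], [0, 3, 7])

# Every client uses both ports in the same proportion.
proportional = cosine_similarity([1, 2, 3], [2, 4, 6])
```

Because non-negative strengths give non-negative dot products, σ stays between 0 and 1 for this kind of data.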

43

Clustering applications

• We now have σ(p, q) for every pair of ports

• Convert these similarities into distances:

$$d(p, q) = \frac{1}{\sigma(p, q)} - 1$$

• If σ = 0, then d is large; if σ = 1, then d = 0

• Now apply Ward’s hierarchical clustering algorithm
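A sketch of the conversion step, assuming the distance is d = 1/σ − 1. That form matches the stated endpoints (σ = 1 gives d = 0, and d grows without bound as σ approaches 0). The ports and similarity values below are made up.

```python
import math


def similarity_to_distance(sigma):
    """Assumed conversion d = 1/sigma - 1; infinite when sigma is 0."""
    if sigma <= 0:
        return math.inf
    return 1.0 / sigma - 1.0


def distance_table(similarity):
    """Convert a {(p, q): sigma} table into a {(p, q): d} table."""
    return {pair: similarity_to_distance(s) for pair, s in similarity.items()}


# Hypothetical similarity scores between pairs of ports.
similarity = {(80, 25): 0.5, (80, 6346): 0.0, (25, 6346): 1.0}
distances = distance_table(similarity)
```

For the clustering itself, one option is SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"`. Infinite distances would have to be capped first, and Ward's method is formally defined for Euclidean distances, so results on these converted similarities should be read with that caveat.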

46

Classifying unknown applications

• To classify an unknown application, see what known applications it clusters with

• Our classification experiment

– Take 16 unknown ports

– Guess function based on similarity data

– Validate or invalidate guesses based on external evidence
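A simplified stand-in for reading the cluster tree: rank the known applications by their similarity to the unknown port and look at the top few. This is nearest-neighbor ranking, not Ward clustering, and the similarity values are invented for illustration.

```python
def closest_known(unknown_port, similarity, known_labels, top=2):
    """Return labels of the known ports most similar to `unknown_port`."""
    scored = []
    for (p, q), sigma in similarity.items():
        if unknown_port in (p, q):
            other = q if p == unknown_port else p
            if other in known_labels:
                scored.append((sigma, known_labels[other]))
    scored.sort(reverse=True)          # highest similarity first
    return [label for _, label in scored[:top]]


# Hypothetical similarity scores for an unlabeled port (388).
similarity = {(388, 21): 0.9, (388, 5500): 0.8, (388, 80): 0.1}
known_labels = {21: "FTP", 5500: "Hotline", 80: "Web"}
neighbors = closest_known(388, similarity, known_labels)
```

The guess about the unknown port's function would then be drawn from what its nearest known neighbors have in common.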

47

Example #1

• Port 388 is coupled with FTP and Hotline

– FTP is a file transfer application

– Hotline is an early file-sharing application

– Our guess: traditional file transfer application

• Actual identity: Unidata/LDM

– Used for moving large meteorological data sets

48

Example #2

• Port 19101 is coupled with instant messaging and P2P applications

– Our guess: a P2P application that relies on individual contact for file transfers

• Actual identity: Clubbox

– Korean file-sharing program

– Users trade large files on virtual hard drives

50

Overall results

• For our 16 guesses:

– 8 were unambiguously correct

– 6 were partially correct

  • These turned out to be trojans and malware

  • We learned that IRC + P2P = evil afoot

– 2 could not be confirmed or disproven

  • Ports were in transient use during data collection

51

Implications

• We can identify the type of an application without examining a single packet!

– Scalable

– Preserves user privacy

– Difficult to do with relational view of flow data

57

Broader application

• Generic view of the situation:

– Weighted network of entities derived from activity with labeled classes of interaction

– Find the sub-network for each labeled class

– Use the network distributions to calculate similarity scores for the classes

– Use the similarity scores to cluster the classes

– Classify unknown classes using these clusters

58

Thank you!

• Questions and comments…
