1
Uncovering Functional Networks in Internet Traffic
Mark Meiss
September 25, 2006
2
Who am I?
Mark Meiss
• Ph.D. candidate in Computer Science
– Committee: Filippo Menczer, Alessandro Vespignani, Katy Börner, Minaxi Gupta, Kay Connelly
• Researcher at the Advanced Network Management Laboratory (ANML)
– http://anml.iu.edu/
3
4
What’s the agenda?
The subject of today’s story:
• Finding a way to improve security without compromising user privacy
• A case study in applied network science
This work is done with Filippo Menczer and Alessandro Vespignani.
5
What do people do online?
There’s what we imagine…
surfing, sending email, playing games
6
What do people do online?
And there’s what is actually happening…
file sharing, worms & viruses, porn
7
Not just a value judgment
These applications all affect the health of a data network.
There are legal problems, yes; but also…
• Crowding out other applications.
– (Napster was once over 70% of all IUB traffic)
• Compromised computers are used to launch further attacks.
• “Common nuisances” are on the ’Net as well.
8
The bottom line
Network administrators need to be able to identify what applications are being used on the network.
…but this can be very difficult.
9
A crash course in data networks
We’ll use a running example:
• Buddy Bradley wants to read a web page about his favorite band at Vulgar Entertainment, Inc.
10
11
12
13
14
15
16
17
18
19
20
Quick summary
• Each network conversation is identified by four pieces of information
– Client address and port number
– Server address and port number
• The server uses a well-known port number
• The client uses an ephemeral port number
21
So why is it hard to identify applications?
• Well-known ports are a convention, not a rule
– Web, e-mail, etc. do have ports assigned by the IANA
– BitTorrent, Gnutella, Napster, etc. do not
• Client and server ports share the same namespace
• In practice…
– Any application can use any pair of port numbers
• Our focus: discovering what application is running on a port with no assigned use.
22
The conventional solution
Let’s look inside all of those packets!
23
24
25
Another problem
• Packet inspection doesn’t scale
– Modern high-speed networks run at 10 gigabits per second or faster (that’s one full DVD every few seconds)
– General-purpose computers can’t even copy that data in real time
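The "one full DVD every few seconds" figure can be checked with a little arithmetic. The sketch below assumes a single-layer DVD capacity of 4.7 GB, a number that is not given on the slide.

```python
# Assumption (not from the slides): a single-layer DVD holds 4.7 GB.
LINK_BITS_PER_SECOND = 10e9   # 10 gigabits per second
DVD_BYTES = 4.7e9

bytes_per_second = LINK_BITS_PER_SECOND / 8
seconds_per_dvd = DVD_BYTES / bytes_per_second

print(f"{bytes_per_second / 1e9:.2f} GB/s, one DVD every {seconds_per_dvd:.2f} s")
```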
26
27
28
Introducing the “flow”
• We can summarize Buddy’s Web surfing as two flows:
– 192.168.65.33:13029 to 10.99.205.122:80 (456 bytes)
– 10.99.205.122:80 to 192.168.65.33:13029 (63,211 bytes)
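A minimal sketch of how packets might collapse into the two flows above, keyed by the four-part address and port tuple. The names `FlowKey` and `summarize` are hypothetical, and the individual packet sizes are invented so that they add up to the byte counts on the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Source and destination endpoint of one direction of a conversation."""
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int


def summarize(packets):
    """Collapse (key, payload_bytes) pairs into per-flow byte totals."""
    totals = defaultdict(int)
    for key, size in packets:
        totals[key] += size
    return dict(totals)


request = FlowKey("192.168.65.33", 13029, "10.99.205.122", 80)
response = FlowKey("10.99.205.122", 80, "192.168.65.33", 13029)

# Hypothetical packet sizes chosen only to match the slide's totals.
packets = [(request, 456)] + [(response, 1460)] * 43 + [(response, 431)]

flows = summarize(packets)
for key, total in flows.items():
    print(f"{key.src_addr}:{key.src_port} to {key.dst_addr}:{key.dst_port} ({total} bytes)")
```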
29
Where do flows come from?
• Architectural features of Internet routers allow them to export flow data
• Routers can’t summarize all the data
– Packets are sampled to construct the flows
– Typical sampling rate is around 1:100
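To illustrate what 1:100 sampling implies, here is a rough sketch that keeps every 100th packet and scales the sampled byte count back up. Deterministic 1-in-N selection is an assumption made for simplicity; the slides do not say how the routers choose which packets to sample.

```python
SAMPLING_RATE = 100  # the "around 1:100" figure from the slide


def sample_one_in_n(packet_sizes, n=SAMPLING_RATE):
    """Keep every n-th packet (assumed deterministic sampling)."""
    return packet_sizes[n - 1::n]


def estimate_total_bytes(sampled_sizes, n=SAMPLING_RATE):
    """Scale the sampled byte count back up by the sampling rate."""
    return sum(sampled_sizes) * n


packet_sizes = [1000] * 1000          # made-up traffic: 1000 packets of 1000 bytes
sampled = sample_one_in_n(packet_sizes)
estimate = estimate_total_bytes(sampled)
print(len(sampled), estimate)
```

With this scheme, a conversation of fewer than 100 packets can be missed entirely, so flow data built from samples describes aggregate behavior better than individual short exchanges.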
30
What can you do with a flow?
• Usual answer:
– Treat a flow as a record in a relational database
– Who talked to port 1337?
– What proportion of our traffic is on port 80?
– Who is scanning for vulnerable systems?
– Which hosts are infected with this worm?
• These are useful and valid questions.
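As a sketch of this relational view, the first two questions can be written as SQL against an in-memory SQLite table. The table layout and the third row (the port 1337 traffic) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flows (src_addr TEXT, src_port INTEGER,"
    " dst_addr TEXT, dst_port INTEGER, bytes INTEGER)"
)
rows = [
    ("192.168.65.33", 13029, "10.99.205.122", 80, 456),
    ("10.99.205.122", 80, "192.168.65.33", 13029, 63211),
    ("192.168.65.40", 40112, "10.99.205.200", 1337, 1200),  # made-up flow
]
conn.executemany("INSERT INTO flows VALUES (?, ?, ?, ?, ?)", rows)

# Who talked to port 1337?
talkers = [r[0] for r in conn.execute(
    "SELECT DISTINCT src_addr FROM flows WHERE dst_port = 1337")]

# What proportion of our traffic is on port 80?
web_bytes, total_bytes = conn.execute(
    "SELECT SUM(CASE WHEN src_port = 80 OR dst_port = 80 THEN bytes ELSE 0 END),"
    " SUM(bytes) FROM flows").fetchone()
web_share = web_bytes / total_bytes

print(talkers, f"{web_share:.1%}")
```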
31
What can you do with a flow?
• Our approach:
– Treat a flow as a directed, weighted edge
– The resulting network describes user behavior
• Hold that thought for now…
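A small sketch of the edge view: each flow becomes a directed edge from source host to destination host. Using the byte count as the edge weight is an assumption here, and `build_digraph` is a hypothetical helper name.

```python
from collections import defaultdict


def build_digraph(flows):
    """Turn (src_addr, dst_addr, bytes) flows into {src: {dst: weight}}.

    Repeated flows between the same ordered pair add to the edge weight.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst, size in flows:
        graph[src][dst] += size
    return {src: dict(neighbors) for src, neighbors in graph.items()}


# Buddy's two flows from the running example.
flows = [
    ("192.168.65.33", "10.99.205.122", 456),
    ("10.99.205.122", "192.168.65.33", 63211),
]
graph = build_digraph(flows)
print(graph)
```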
32
The Internet2/Abilene network
• TCP/IP network connecting research and educational institutions in the U.S.
– Over 200 universities and corporate research labs
• Also provides transit service between Pacific Rim and European networks
33
Why study Abilene?
• Wide-area network that includes both domestic and international traffic
• Heterogeneous user base including hundreds of thousands of undergraduates
• High capacity network (10-Gbps fiber-optic links) that has never been congested
• Research partnership gives access to (anonymized) traffic data unavailable from commercial networks
34
Flow collection
Flows are exported in Cisco’s netflow-v5 format and anonymized before being written to disk.
35
Data dimensions
• Observed Abilene on April 14, 2005
– About 200 terabytes of data exchanged
– This is roughly 25,000 DVDs of information
• 600 million flow records
– Almost 28 gigabytes on disk
– 15 million unique hosts involved
37
Weighted bipartite digraph
38
\[ s_{\mathrm{in}}(C) = \sum_{i=1}^{M} w_{i,C}, \qquad s_{\mathrm{out}}(C) = \sum_{j=1}^{N} w_{C,j} \]
39
Multiple digraphs
Port 80 (Web) Port 6346 (Gnutella)
Port 25 (Mail) Port 19101 (???)
40
Application correlation
• Consider the out-strength of a client in the networks for ports p and q:
\[ s_i^p = \sum_j w_{ij}^p, \qquad s_i^q = \sum_j w_{ij}^q \]
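The per-port out-strength can be sketched as a sum of edge weights grouped by port and client. The flow tuples below are invented, the helper name is hypothetical, and using bytes as the weight is an assumption.

```python
from collections import defaultdict


def out_strengths_by_port(flows):
    """flows: (client, server, server_port, bytes) tuples.

    Returns {port: {client: out-strength}}, where a client's out-strength
    in a port's network is the total weight of its edges in that network.
    """
    strengths = defaultdict(lambda: defaultdict(int))
    for client, _server, port, size in flows:
        strengths[port][client] += size
    return {port: dict(s) for port, s in strengths.items()}


# Made-up flows: client c1 uses ports 80 and 25, client c2 only port 80.
flows = [
    ("c1", "s1", 80, 500),
    ("c1", "s2", 80, 300),
    ("c1", "s3", 25, 100),
    ("c2", "s1", 80, 200),
]
strengths = out_strengths_by_port(flows)
print(strengths)
```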
41
Application correlation
• Build a pair of vectors from the distribution of strength values:
\[ \mathbf{p} = (s_1^p, \ldots, s_{|C|}^p), \qquad \mathbf{q} = (s_1^q, \ldots, s_{|C|}^q) \]
42
Application correlation
• Examine the cosine similarity of the vectors:
\[ \sigma(p, q) = \frac{\mathbf{p} \cdot \mathbf{q}}{\|\mathbf{p}\| \, \|\mathbf{q}\|} \]
• When σ = 0, applications p and q are never used together.
• When σ = 1, applications p and q are always used together, and to the same extent.
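A minimal sketch of the similarity computation in plain Python. The strength values are invented, and returning 0 when a vector is all zeros is a convention assumed here, not something stated on the slides.

```python
import math


def strength_vector(strengths, clients):
    """Order one port's per-client strengths into a vector (0 if unused)."""
    return [strengths.get(c, 0) for c in clients]


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # assumed convention for a port with no traffic
    return dot / (norm_p * norm_q)


clients = ["c1", "c2", "c3"]
p = strength_vector({"c1": 800, "c2": 200}, clients)
q = strength_vector({"c1": 400, "c2": 100}, clients)   # same clients, same proportions
r = strength_vector({"c3": 50}, clients)               # a disjoint set of clients

print(cosine_similarity(p, q), cosine_similarity(p, r))
```

The first pair scores (numerically) 1 because the two ports are used by the same clients in the same proportions; the second scores 0 because no client uses both.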
43
Clustering applications
• We now have σ(p, q) for every pair of ports
• Convert these similarities into distances:
\[ d(p, q) = \frac{1}{\sigma(p, q)} - 1 \]
• If σ = 0, then d is large; if σ = 1, then d = 0
• Now apply Ward’s hierarchical clustering algorithm
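The two steps can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the implementation behind the talk: the distance is taken to be d = 1/σ − 1 (which matches "σ = 1 gives d = 0" and "σ = 0 gives a large d"), a small `eps` keeps d finite when σ = 0, the clustering uses the Lance-Williams form of Ward's update on squared distances, and all similarity values are invented. Ward's method is usually defined for Euclidean distances, so applying it to d simply follows the slide.

```python
from itertools import combinations


def similarity_to_distance(sigma, eps=1e-6):
    """d = 1/sigma - 1; eps (an assumption) keeps d finite at sigma == 0."""
    return 1.0 / max(sigma, eps) - 1.0


def ward_cluster(labels, dist):
    """Agglomerative clustering with Ward's Lance-Williams update.

    labels: item names; dist: {frozenset({a, b}): distance} for every pair.
    Returns the merge history as (cluster_a, cluster_b, merge_distance).
    """
    clusters = {i: (name,) for i, name in enumerate(labels)}
    d2 = {(i, j): dist[frozenset({labels[i], labels[j]})] ** 2
          for i, j in combinations(range(len(labels)), 2)}
    merges, next_id = [], len(labels)
    while len(clusters) > 1:
        (i, j), best = min(d2.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dik = d2[(min(i, k), max(i, k))]
            djk = d2[(min(j, k), max(j, k))]
            updated[k] = ((ni + nk) * dik + (nj + nk) * djk - nk * best) / (ni + nj + nk)
        merges.append((clusters[i], clusters[j], best ** 0.5))
        d2 = {pair: v for pair, v in d2.items() if i not in pair and j not in pair}
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        for k, v in updated.items():
            d2[(k, next_id)] = v
        next_id += 1
    return merges


ports = ["web", "mail", "gnutella", "bittorrent"]
# Invented similarities: two tight pairs, everything else weakly related.
sigma = {frozenset({"web", "mail"}): 0.8, frozenset({"gnutella", "bittorrent"}): 0.9}
dist = {frozenset(pair): similarity_to_distance(sigma.get(frozenset(pair), 0.1))
        for pair in combinations(ports, 2)}

for a, b, d in ward_cluster(ports, dist):
    print(a, "+", b, f"at {d:.3f}")
```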
44
46
Classifying unknown applications
• To classify an unknown application, see what known applications it clusters with
• Our classification experiment
– Take 16 unknown ports
– Guess function based on similarity data
– Validate or invalidate guesses based on external evidence
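One simple way to turn similarity data into a guess is to list the known applications most similar to the unknown port. This nearest-neighbor shortcut stands in for reading the cluster tree and is not necessarily how the guesses in the talk were made; the function name, labels, and similarity values are all hypothetical.

```python
def closest_known(unknown, similarities, known_labels, k=2):
    """Return the k known applications most similar to `unknown`, best first.

    similarities: {(a, b): sigma} with each unordered pair listed once.
    known_labels: {application: description} for applications we recognize.
    """
    scored = [
        (sim, app)
        for (a, b), sim in similarities.items()
        for app, other in ((a, b), (b, a))
        if other == unknown and app in known_labels
    ]
    scored.sort(reverse=True)
    return [(app, known_labels[app], sim) for sim, app in scored[:k]]


known_labels = {"ftp": "file transfer", "hotline": "file sharing", "web": "web"}
# Invented similarity scores for an unknown port.
similarities = {
    ("port-388", "ftp"): 0.7,
    ("hotline", "port-388"): 0.6,
    ("port-388", "web"): 0.1,
    ("ftp", "web"): 0.3,
}

print(closest_known("port-388", similarities, known_labels))
```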
47
Example #1
• Port 388 is coupled with FTP and Hotline
– FTP is a file transfer application
– Hotline is an early file-sharing application
– Our guess: traditional file transfer application
• Actual identity: Unidata/LDM
– Used for moving large meteorological data sets
48
Example #2
• Port 19101 is coupled with instant messaging and P2P applications
– Our guess: a P2P application that relies on individual contact for file transfers
• Actual identity: Clubbox
– Korean file-sharing program
– Users trade large files on virtual hard drives
49
50
Overall results
• For our 16 guesses:
– 8 were unambiguously correct
– 6 were partially correct
  • These turned out to be trojans and malware
  • We learned that IRC + P2P = evil afoot
– 2 could not be confirmed or disproven
  • Ports were in transient use during data collection
51
Implications
• We can identify the type of an application without examining a single packet!
– Scalable
– Preserves user privacy
– Difficult to do with relational view of flow data
52
53
54
55
56
57
Broader application
• Generic view of the situation:
– Weighted network of entities derived from activity with labeled classes of interaction
– Find the sub-network for each labeled class
– Use the network distributions to calculate similarity scores for the classes
– Use the similarity scores to cluster the classes
– Classify unknown classes using these clusters
58
Thank you!
• Questions and comments…