
CS 443 Advanced OS

Fabián E. Bustamante, Spring 2005

Automated Worm Fingerprinting

Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage

Presenter: Yi Qiao

2

Outline

Introduction

Background

Worm Behavior and Worm Signatures

Practical Content Sifting

Implementation and Evaluation

Limitations

Conclusions

3

Introduction

Internet worms
– Small programs that exploit a software vulnerability in a popular network service, seize control of program execution, and send copies of themselves to other susceptible hosts
– Growing threat and damage
  • Software homogeneity and the Internet’s unrestricted communication model
  • Increased speed, virulence and sophistication of new generations of worms and viruses
  • Different mechanisms and consequences
– Little advance in worm detection, characterization and containment
  • Detection: intrusion detection plus administrator legwork
  • Characterization: manual extraction of the worm signature
  • Containment: anti-virus software and network filtering
  • Inefficient, expensive, and slow
    – Takes hours or days to complete
– Effective worm containment can require a reaction time of sixty seconds!

4

Introduction

What this work has done
– Two observations
  • Some portions of the content in existing worms are invariant
  • The spreading dynamics of a worm are atypical of Internet applications
– Content sifting to identify new worms and their precise signatures
– EarlyBird, a prototype system based on the content sifting approach, for real-time worm detection and containment

5

Background

Empirical analyses of the CodeRed worm outbreak
– The operational repair rate averaged under 2 percent per day
  • Fully automated intervention is necessary to manage outbreaks

Analysis of the Slammer outbreak
– The entire Internet address space was scanned in under 10 minutes
  • Fast, automated reactions are needed

Different granularities of containment mechanisms: signature-based vs. IP-address-based
– Signature-based methods can be an order of magnitude more effective
  • They halt all spreading once a signature is identified
– Signatures must be generated quickly to offer effective containment
  • Slammer may require a signature to be in operation in under 5 minutes, or even 60 seconds

6

Existing Techniques

Worm detection
– Scan detection
  • A worm can be highly unusual in the number, frequency and distribution of addresses it scans
– Network telescopes: passive monitors for large ranges of unused yet routable address space
  • Not suited for worms that spread non-randomly (e.g., email viruses, worms spread via IM or P2P communications)
  • IP-based detection is less responsive
– Honeypots
  • Monitored idle hosts with untreated vulnerabilities, used to isolate and analyze a worm
  • Honeypots have to be infected first, and the manual analysis is slow
– Host-based behavioral detection
  • Dynamic analysis of system call patterns for anomalous activity
  • Expensive to manage and deploy
  • Hard to infer a large-scale outbreak

7

Existing Techniques

Characterization: the process of analyzing and identifying a new worm or exploit
– A priori vulnerability signatures
  • Match known exploitable vulnerabilities in deployed software
  • Can be deployed before new worm outbreaks
  • Rely on well-known vulnerabilities
– Automated signature extraction by Kephart and Arnold
  • Identifies invariant code strings through decoy program infection
  • Assumes a controlled environment and a known instance of the virus
– Kim and Karp’s Autograph system
  • Uses network-level data to infer worm signatures
  • Differences from this work
    – A prefiltering step that identifies flows with suspicious scanning activity
      » Cannot detect email-borne worms, UDP-based worms, or worms spread through P2P
    – Extensive support and active coordination between multiple sensors
    – Offline system, evaluated only through traces

8

Existing Techniques

Containment: mechanisms to slow or stop the spread of an active worm
– Host quarantine
  • Prevents an infected host from communicating with other hosts
– String-matching containment
  • Matches network traffic against particular strings and drops the associated packets
  • The approach used in this work
– Connection throttling
  • Proactively limits the rate of all outgoing connections
    – Slows but does not stop the spread of a worm
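
String-matching containment can be illustrated with a very small filter. This is only a sketch: the function name `filter_packets`, the signature bytes and the sample traffic are hypothetical, and a real deployment would match inside a router or firewall rather than over a Python list.

```python
from typing import Iterable, List


def filter_packets(payloads: Iterable[bytes], signatures: List[bytes]) -> List[bytes]:
    """Return only the payloads that contain none of the blocked signatures."""
    return [p for p in payloads if not any(sig in p for sig in signatures)]


# Hypothetical signature and traffic, for illustration only.
signatures = [b"\x90\x90EXPLOIT"]
traffic = [b"GET /index.html", b"junk\x90\x90EXPLOITjunk", b"hello"]

# Packets carrying the signature are dropped; everything else is forwarded.
forwarded = filter_packets(traffic, signatures)
```

The filter is only as good as its signature list, which is why the speed of signature generation matters so much for containment.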

9

Worm Behavior and Signatures

Content invariance
– Some or all of the worm program is invariant across every copy
  • Some worms have limited polymorphism, but key portions are still invariant

Content prevalence
– The invariant portion of a worm’s content will appear frequently on the network as it spreads or attempts to spread

Address dispersion
– The number of distinct hosts infected by a worm grows over time, and the distribution of infected addresses will be far more uniform than in typical traffic

10

Worm Behavior and Signatures

Worms must
– Generate significant traffic to spread
– Send traffic that contains common substrings
– Direct that traffic between a variety of different sources and destinations

Content sifting
– Sift out network content that is not prevalent or not widely dispersed, leaving only the worm-like content
  • Prevalence table: catches packet strings that are seen often
  • Address lists: track strings coming from enough sources and going to enough destinations
  • Substrings left after sifting can be used as signatures to filter out worms
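
The sifting idea can be sketched with exact, unbounded tables before any of the scaling tricks are applied. The class name `NaiveSifter`, its method names and the default thresholds are assumptions made for this illustration; the sketch uses plain dictionaries and sets, which is precisely what does not scale to fast links.

```python
from collections import defaultdict


class NaiveSifter:
    """Exact (memory-hungry) content sifting over whole content strings."""

    def __init__(self, prevalence_threshold: int = 3, dispersion_threshold: int = 30):
        self.prevalence_threshold = prevalence_threshold
        self.dispersion_threshold = dispersion_threshold
        self.prevalence = defaultdict(int)    # content -> times seen
        self.sources = defaultdict(set)       # content -> distinct source addresses
        self.destinations = defaultdict(set)  # content -> distinct destination addresses

    def observe(self, content: bytes, src: str, dst: str) -> bool:
        """Record one sighting; return True if the content now looks worm-like."""
        self.prevalence[content] += 1
        if self.prevalence[content] < self.prevalence_threshold:
            return False  # not prevalent yet, so no address tracking
        self.sources[content].add(src)
        self.destinations[content].add(dst)
        return (len(self.sources[content]) >= self.dispersion_threshold
                and len(self.destinations[content]) >= self.dispersion_threshold)


# Tiny thresholds so the behaviour is visible in a few packets.
sifter = NaiveSifter(prevalence_threshold=2, dispersion_threshold=2)
worm_flags = [sifter.observe(b"worm", f"10.0.0.{i}", f"10.0.1.{i}") for i in range(4)]
header_flags = [sifter.observe(b"hdr", "10.0.0.1", "10.0.0.2") for _ in range(10)]
```

Content repeated between one pair of hosts stays below the dispersion threshold however often it appears, while content spreading across many hosts is flagged.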

11

Content Sifting

12


Scale to high-speed links

Estimating content prevalence
– A table indexed by payload can use up all memory in no time
  • A 1 GByte table is exhausted in 10 seconds on a 1 Gbps link
  • Instead, index the table using a fixed-size hash of the packet payload
– Multi-stage filters with conservative update
  • Multiple hash tables
    – Hash the content with a different hash function for each table, and increment the corresponding counter in each table
  • Record the content string if all hashed counters are above a certain threshold
  • Dramatically reduces the memory requirement
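
A multistage filter with conservative update might look like the sketch below. It assumes 8-bit saturating counters and derives one hash per stage by salting BLAKE2b; both choices, along with the class name `MultistageFilter` and the small default table size, are assumptions for illustration and not the hash functions or sizes of the system described here. Conservative update means only the counters holding the current minimum are raised, so counters inflated by colliding content are not pushed higher.

```python
import hashlib


class MultistageFilter:
    """Counting filter: several hashed counter arrays, one per stage."""

    def __init__(self, stages: int = 4, bins: int = 4096, threshold: int = 3):
        self.bins = bins
        self.threshold = threshold
        self.counters = [bytearray(bins) for _ in range(stages)]  # 8-bit counters

    def _slots(self, key: bytes):
        # One hash per stage, obtained by salting BLAKE2b with the stage number.
        for stage in range(len(self.counters)):
            digest = hashlib.blake2b(key, digest_size=8, salt=bytes([stage])).digest()
            yield stage, int.from_bytes(digest, "big") % self.bins

    def add(self, key: bytes) -> bool:
        """Count one occurrence; return True once every stage reaches the threshold."""
        slots = list(self._slots(key))
        lowest = min(self.counters[s][i] for s, i in slots)
        new_value = min(lowest + 1, 255)  # saturate at the 8-bit maximum
        for s, i in slots:
            # Conservative update: raise only counters below the new minimum.
            if self.counters[s][i] < new_value:
                self.counters[s][i] = new_value
        return new_value >= self.threshold


mfilter = MultistageFilter(stages=4, bins=1024, threshold=3)
hits = [mfilter.add(b"repeated content") for _ in range(4)]
```

The filter can overestimate a count when unrelated content collides in every stage, but it never underestimates, so prevalent content is not missed.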

13


Estimating content prevalence
– Append the destination port and protocol to the content before hashing
  • Effectively excludes large amounts of prevalent content not generated by worms (potential false positives)
– Invariant content could be a string much smaller than a single packet, and can occur at different offsets
  • Detect repeating strings of a small fixed length β
  • A variant of Rabin fingerprints is used to fingerprint all possible substrings of that length
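
Fingerprinting every fixed-length substring is cheap when the hash can be rolled forward one byte at a time. The sketch below uses a polynomial rolling hash modulo a prime (Rabin-Karp style) as a stand-in; it is not a true Rabin fingerprint over GF(2), and the constants `BASE` and `MOD` and the function name are arbitrary choices for this illustration. The substring length is passed as `beta`.

```python
from typing import Iterator, Tuple

BASE = 257
MOD = (1 << 61) - 1  # a Mersenne prime; arbitrary choice for this sketch


def rolling_fingerprints(payload: bytes, beta: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, fingerprint) for every beta-byte substring of payload."""
    if beta <= 0 or len(payload) < beta:
        return
    top = pow(BASE, beta - 1, MOD)  # weight of the byte that leaves the window
    fp = 0
    for byte in payload[:beta]:
        fp = (fp * BASE + byte) % MOD
    yield 0, fp
    for i in range(beta, len(payload)):
        # Remove the oldest byte, shift, and append the newest byte.
        fp = ((fp - payload[i - beta] * top) * BASE + payload[i]) % MOD
        yield i - beta + 1, fp


fingerprints = list(rolling_fingerprints(b"abcabc", 3))
```

Each step costs a constant amount of work regardless of `beta`, and equal substrings at different offsets (here the two copies of `abc`) produce the same fingerprint.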

14


Estimating address dispersion
– Count the distinct source and destination IP addresses associated with each suspected content string
– Critical for avoiding false positives among the prevalent content strings
– Efficient solutions are needed because of the large number of suspected content strings
  • Scaled bitmap
    – Accurately estimates address dispersion using a small amount of memory
    – Hashes each content source or destination to a bitmap
    – Subsamples the range of the hash space
      » Allows the storage of the bitmap to remain constant across an enormous range of counts
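
The building block of this scheme, a single bitmap that estimates a distinct count, can be sketched as follows. Note that this is a plain direct bitmap with the standard linear-counting estimate, not the scaled bitmap itself: it omits the subsampling of the hash space. The class name `DirectBitmap`, the use of SHA-256 and the bitmap size are assumptions for illustration.

```python
import hashlib
import math


class DirectBitmap:
    """Estimate the number of distinct addresses with a fixed-size bitmap."""

    def __init__(self, bits: int = 1024):
        self.bits = bits
        self.bitmap = bytearray(bits)  # one byte per bit keeps the sketch simple

    def add(self, address: str) -> None:
        digest = hashlib.sha256(address.encode()).digest()
        self.bitmap[int.from_bytes(digest[:8], "big") % self.bits] = 1

    def estimate(self) -> float:
        zeros = self.bits - sum(self.bitmap)
        if zeros == 0:
            return float("inf")  # saturated: the case a scaled bitmap is meant to avoid
        # Linear counting: n is approximately b * ln(b / zero_bits).
        return self.bits * math.log(self.bits / zeros)


sources = DirectBitmap(bits=1024)
for i in range(100):
    sources.add(f"10.0.{i // 256}.{i % 256}")
estimate_100 = sources.estimate()

sources.add("10.0.0.5")  # an address already seen does not change the estimate
estimate_after_duplicate = sources.estimate()
```

A direct bitmap saturates once the number of distinct addresses approaches its size, which is the limitation that subsampling the hash space addresses.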

15


Estimating address dispersion
– Recycle the bitmap covering the largest fraction of the hash space when it fills up
– Clear it and map it to the largest uncovered portion of the hash space, which is half of the portion covered by the rightmost bitmap

16


CPU scaling
– Processing payload strings requires significant CPU time
  • Each packet payload contains a large number of substrings
  • The CPU can be overloaded during high traffic load
    – If β = 40, a 1000-byte packet requires processing about 960 Rabin fingerprints
  • Traffic surges make the problem even worse
– Solution
  • Dynamic sampling of substrings
    – Value sampling: only choose substrings whose fingerprint matches a certain pattern
    – Assuming a sample fraction f, a worm substring length β and a worm signature length x, the miss probability is
      $p_{miss}(x) = (1 - f)^{x - \beta + 1} \approx e^{-f(x - \beta + 1)}$
    – The sample fraction f is a tradeoff between processing overhead and the probability of missing a worm
    – x ≥ 400 for all current worms; with f = 1/64, the probability of a false negative is at most 0.36%
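
The sampling rule and the miss probability can be checked numerically. In this sketch the sampling pattern (low six bits of the fingerprint equal to zero, giving f = 1/64) is an assumed example of "matches a certain pattern", and the function names are made up for illustration. The miss probability is the chance that none of the x - β + 1 substrings of a signature is sampled.

```python
import math


def is_sampled(fingerprint: int, sample_bits: int = 6) -> bool:
    """Value sampling: keep a substring only if the low bits of its fingerprint are zero."""
    return fingerprint & ((1 << sample_bits) - 1) == 0


def miss_probability(x: int, beta: int, f: float) -> float:
    """Exact form: every one of the x - beta + 1 substrings goes unsampled."""
    return (1.0 - f) ** (x - beta + 1)


def miss_probability_approx(x: int, beta: int, f: float) -> float:
    """Exponential approximation of the same quantity."""
    return math.exp(-f * (x - beta + 1))


# Values from the slide: signature length 400, substring length 40, f = 1/64.
p_exact = miss_probability(400, 40, 1 / 64)          # roughly 0.34%
p_approx = miss_probability_approx(400, 40, 1 / 64)  # roughly 0.36%

# With uniformly spread fingerprints, one value in 64 is sampled.
sampled = sum(is_sampled(v) for v in range(6400))
```

Because the choice depends only on the fingerprint value, the same substring is either always sampled or never sampled, so repeated worm content keeps hitting the same table entries.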

17


Summary
– Content prevalence table
  • A high-pass filter for frequent content
  • Four independent hash functions: 4 counter arrays updated using the conservative update optimization
– Address dispersion table
  • Typically holds fewer values: only those strings exceeding the prevalence threshold
– Both tables need to be cleared regularly
  • Every 60 seconds for the content prevalence table, on the order of hours for the address dispersion table
– Modest memory requirements, no deployment restrictions, can be implemented in either hardware or software

18


19

Implementation and Evaluation

System design
– The EarlyBird system was built and run on the UCSD campus for over eight months
– Two major components
  • Sensors
    – Sift through traffic on configurable address space zones and report anomalous signatures
  • Aggregators
    – Coordinate updates from sensors, coalesce related signatures and activate blocking services, administrative reporting and control
– Automatically generates and deploys precise content-based signatures to block outbreaks

20

Implementation and Environment

EarlyBird sensor on a 1.6 GHz AMD Opteron 242 1U server configured with a standard Linux 2.6 kernel
– Single-threaded application that executes at user level
– 5,000 lines of code

Sifts over 1 TB of traffic per day and keeps up with over 200 Mbps of continuous traffic
– Sampling probability of 1/64
– Monitors all inbound and outbound traffic
– The router manages traffic to and from 5,000 hosts

21

Parameter tuning

Content prevalence threshold
– Uses a value of 3 (on a 60-second measurement interval)
– Over 97% of all signatures repeat two or fewer times, and 94.5% are observed only once
– An enormous number of content strings are removed from consideration by the content prevalence test

22

Parameter tuning

Address dispersion threshold
– As the dispersion threshold increases, the number of strings detected decreases dramatically
  • With a threshold of 30, only 5 or 6 prevalent strings meet the dispersion criteria; they are either worms or strings that can be post-filtered by a whitelist
  • Tradeoff between detection speed and false positives

23

Parameter tuning

Garbage collection
– The garbage collection timeout is the elapsed time before an entry in the address dispersion table is garbage collected
  • With a timeout of 100 seconds, 60 percent of all signatures are garbage collected before a subsequent update occurs, which prevents the signature from meeting the dispersion threshold and being reported
  • With a timeout of 1000 seconds, the percentage drops to 20%
  • A timeout of several hours is chosen, since the dispersion table is small
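
The effect of the timeout can be seen in a minimal sketch of the collection step. The table layout (signature mapped to the time of its last update), the function name `collect_garbage` and the sample timestamps are hypothetical.

```python
from typing import Dict


def collect_garbage(last_update: Dict[bytes, float], now: float, timeout: float) -> Dict[bytes, float]:
    """Drop entries whose most recent update is older than `timeout` seconds."""
    return {sig: t for sig, t in last_update.items() if now - t <= timeout}


# Hypothetical table: one signature updated long ago, one updated recently.
table = {b"slow-spreading": 0.0, b"fast-spreading": 450.0}

short_timeout = collect_garbage(table, now=500.0, timeout=100.0)
long_timeout = collect_garbage(table, now=500.0, timeout=1000.0)
```

With the short timeout the slowly updated entry is discarded before it can accumulate enough addresses, while the long timeout keeps it.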

24

Performance

Processing time
– Count elapsed CPU cycles for each component
– Most significant operations
  • Computing the initial Rabin fingerprint, accessing the multistage filter and creating a new address dispersion table entry
  • With the 1/64 sampling rate, the effective per-byte processing time is 0.042 microseconds
    – Can sustain a 200 Mbps load
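
As a rough sanity check, the per-byte cost can be converted into a sustainable link rate. The only input is the 0.042 microsecond figure from the slide; the variable names are made up for this calculation, and it ignores per-packet overheads.

```python
PER_BYTE_SECONDS = 0.042e-6  # effective per-byte processing time from the slide

bytes_per_second = 1 / PER_BYTE_SECONDS
megabits_per_second = bytes_per_second * 8 / 1e6  # about 190 Mbps
```

That is roughly 190 Mbps, in line with the load of about 200 Mbps quoted above.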

25

Performance

Memory consumption
– Major memory hog: the content prevalence table
  • 4 filter stages, each with 524,288 bins of 8 bits, for a total of 2 MB of memory
– Other memory usage
  • The address dispersion table
    – Between 5K and 25K entries of 28 bytes each, under 1 MB of memory
– Total memory consumption of EarlyBird
  • 4 MB
  • Can be further reduced by using a higher prevalence threshold
– An on-chip implementation is potentially feasible
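
The table sizes follow directly from the figures on the slide. In this small calculation "5K" and "25K" are taken to mean 5,000 and 25,000 entries, which is an assumption; the constant names are made up for illustration.

```python
STAGES = 4
BINS_PER_STAGE = 524_288
BITS_PER_BIN = 8

# Content prevalence table: 4 stages of one-byte counters.
prevalence_bytes = STAGES * BINS_PER_STAGE * BITS_PER_BIN // 8  # 2,097,152 bytes = 2 MB

ENTRY_BYTES = 28
dispersion_bytes_low = 5_000 * ENTRY_BYTES    # 140,000 bytes
dispersion_bytes_high = 25_000 * ENTRY_BYTES  # 700,000 bytes, still under 1 MB
```

Both tables together stay within a few megabytes, which is what makes an on-chip implementation plausible.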

26

Trace-based verification

False positives
– The prevalence over time of different signatures that meet the dispersion threshold of 10
  • The two most active signatures: the Slammer and Opaserv worms
  • A pervasive string on TCP port 445 and the Blaster worm
  • Others
    – Likely worms
    – Distributed scans and some particular protocol structures
– Two principal sources of false positives
  • Common protocol headers: can be easily whitelisted
  • Unsolicited bulk email (spam): harder to whitelist, yet its interdiction is far more benign
– One source of false positives that defies easy analysis
  • The many-to-many download profile of BitTorrent

27

Trace-based verification

False negatives
– Impossible to quantitatively demonstrate the absence of false negatives
– Every worm outbreak reported on public mailing lists was detected by EarlyBird
– No false negatives when compared with the snort-signature mailing list

28

Performance

Inter-packet signatures
– An attacker can evade detection by splitting an invariant string into pieces one byte smaller than β
– The content sifting algorithm can detect such simple evasions at the cost of per-flow state management

Live experiences of EarlyBird
– EarlyBird detected signatures for variants of CodeRed, the MyDoom mail worm, and the more recent Sasser and Kibvu.B worms
– The Sasser and Kibvu.B signatures were reported long before the public reports of the worms’ spread

29

Limitations and extensions

Variant content
– Worms with little or no invariant content
  • Instruction sequence mutation: semantically equivalent but textually distinct code
  • More complex analysis is needed for content sifting
– Compression
  • Reuse of common code sequences can lead to many false positives
– Vulnerabilities in popular implementations of encrypted session protocols such as SSH can be exploited by worms
  • These problems cannot be handled by current techniques

Network evasion
– Evade monitoring through traditional IDS evasion techniques

30

Limitations and extensions

Extensions
– Sensitivity study of parameters and an “autotune” capability for EarlyBird’s content sifting parameters in different environments
– Handling slow worms
  • Maintaining triggering data across multiple time scales
  • Hybrid system combined with host-based intrusion detection or honeypots

Containment
– Rate-limit first, before the final traffic block
  • Tradeoff between detection speed and false positives
– Maliciously triggered worm detection
  • Denial of service on legitimate traffic carrying a specific string

Coordination
– Share a given signature across deployments at different sites
  • Related issues of trust, validation and policy

31

Conclusions

An approach for real-time detection of unknown worms and automated extraction of unique content signatures

The content sifting algorithm efficiently analyses network traffic for prevalent and widely dispersed content strings
– Moderate memory and computational requirements

EarlyBird is able to detect and extract signatures for all contemporary worms as well as new worms

The underlying methodology can be used for other kinds of detection
– Bulk email (spam), peer-to-peer system activity

Feasibility of sophisticated, widespread network security
– Signature learning at gigabit speeds is viable
