ramses (regeneration and immunity services): a cognitive immune system

12/18/06

RAMSES (Regeneration And iMmunity SErviceS):

A Cognitive Immune System

Mark Cornwell, James Just, Nathan Li, Robert Schrag
Global InfoTek, Inc

R. Sekar
Stony Brook University

Self Regenerative Systems
18 December 2007


Outline

• Overview
• Efficient content-based taint identification
• Syntax and taint-aware policies
• Memory attack detection and response
• Testing
• Red Team suggestions
• Questions
• Demo


RAMSES Attack Context

• Attack target: “program” mediating access to protected resources/services
• Attack approach: use maliciously crafted input to exert unintended control over protected resource operations
• Resource or service uses:
    • Well-defined APIs to access
        • OS resources
        • Command interpreters
        • Database servers
        • Transaction servers, …
    • Internal interfaces
        • Data structures and functions within program
        • Used by program components to talk to each other

[Figure: Incoming requests (untrusted input) → Program → Outgoing requests (security-sensitive operations)]


Example 1: SquirrelMail Command Injection

Input interface:
    sendto="nobody; rm -rf *"

Program:
    $send_to_list = $_GET['sendto']
    $command = "gpg -r $send_to_list 2>&1"
    popen($command)

“Output” interface:
    popen($command), with $command="gpg -r nobody; rm -rf * 2>&1"

Attack: removes all removable files in the web server document tree


Example 2: phpBB SQL Injection

Input interface:
    topic="-1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3"

Program:
    $topic_id = $_GET['topic']
    $sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = $topic_id"
    sql_query($sql)

“Output” interface:
    sql_query($sql), with $sql="SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = -1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3"

Attack: steal another user’s password

Attack Space of Interest (CVE 2006)

Generalized Injection Attacks


Detection Approach

• Attack: use maliciously crafted input to exert unintended control over output operations
• Detect “exertion of control”
    • Based on “taint”: degree to which output depends on input
• Detect if control is intended:
    • Requires policies (or training)
    • Application-independent policies are preferable

[Figure: Input interface (untrusted input) → Program → “Output” interface (security-sensitive operations)]


RAMSES Goals and Approach

• Taint analysis: develop efficient and non-invasive alternatives
    • Analyze observed inputs and outputs
        • Needs no modifications to program
        • Language-neutral
    • Leverage learning to speed up analysis
• Attack detection: develop framework to detect a wide range of attacks, while minimizing policy development effort and FP/FNs
    • “Structure-aware policies”: leverage interplay between taint and structural changes to output requests
    • Use Address-Space Randomization (ASR) for memory corruption
        • ASR: efficient, in-band, “positive” tainting for pointer-valued data
• Immunization: filter out future attack instances
    • Output filters: drop output requests that violate taint-based policies
    • Input filters: “project” policies on outputs to those on inputs
        • Relies on learning relationships between input and output fields
        • Network-deployable

[Figure: Input interface (untrusted input) → Program → “Output” interface]


Efficient Content-Based Taint Identification


Steps

• Develop efficient algorithms for inferring flow of input data into outputs
    • Compare input and output values
    • Allow for parts of input to flow into parts of output
    • Tolerate some changes to input
        • Changes such as space removal, quoting, escaping, and case-folding are common in string-based interfaces
    • Based on approximate substring matching
• Leverage learning to speed up taint inference
    • Even the “efficient” content-matching algorithms are too expensive to run on every input/output
    • Same learning techniques can be used for detecting attacks using anomaly detection


Weighted Substring Edit Distance Algorithm

Maintain a matrix D[i][j] of minimum edit distance between p[1..i] and s[1..j]

D[i][j] = min{ D[i-1][j-1] + SubstCost(p[i], s[j]),
               D[i-1][j] + DeleteCost(p[i]),
               D[i][j-1] + InsertCost(s[j]) }

D[0][j] = 0 (no cost for omitting any prefix of s)
D[i][0] = DeleteCost(p[1]) + … + DeleteCost(p[i])

• Matches can be reconstructed from the D matrix
• Quadratic time and space complexity: uses O(|p|*|s|) memory and time
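The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the RAMSES implementation: the function name, the unit default costs, and the traceback that recovers the matched span are assumptions made for the example.

```python
from typing import Callable, Tuple


def substring_edit_distance(
    p: str,
    s: str,
    subst_cost: Callable[[str, str], float] = lambda a, b: 0.0 if a == b else 1.0,
    delete_cost: Callable[[str], float] = lambda a: 1.0,
    insert_cost: Callable[[str], float] = lambda b: 1.0,
) -> Tuple[float, int, int]:
    """Return (distance, start, end) so that s[start:end] is a closest match to p."""
    n, m = len(p), len(s)
    # D[i][j]: minimum cost of matching p[:i] against a substring of s ending at j.
    # Row 0 stays at zero: omitting any prefix of s is free.
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delete_cost(p[i - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + subst_cost(p[i - 1], s[j - 1]),
                D[i - 1][j] + delete_cost(p[i - 1]),
                D[i][j - 1] + insert_cost(s[j - 1]),
            )
    # The match may end anywhere in s, so take the cheapest cell of the last row.
    end = min(range(m + 1), key=lambda j: D[n][j])
    # Walk back through D to find where the match starts.
    i, j = n, end
    while i > 0:
        if j > 0 and D[i][j] == D[i - 1][j - 1] + subst_cost(p[i - 1], s[j - 1]):
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] + delete_cost(p[i - 1]):
            i -= 1
        else:
            j -= 1
    return D[n][end], j, end


# The tainted parameter from Example 1 is located inside the outgoing command.
print(substring_edit_distance("nobody; rm -rf *", "gpg -r nobody; rm -rf * 2>&1"))
```

Passing a different `subst_cost` (for instance one that charges nothing for a case change) is one way to tolerate the input transformations listed on the previous slide.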


Improving performance

• Quadratic complexity algorithms can be too expensive for large s, e.g., HTML outputs
    • Storage requirements are even more problematic
• Solution: use a linear-time coarse filtering algorithm
    • Approximate D by FD, defined on substrings of s of length |p|
        • Let P (and S) denote the multiset of characters in p (resp., s)
        • FD(p, s) = min(|P-S|, |S-P|)
    • Slide a window of size |p| over s, compute FD incrementally
    • Prove: D(p, r) < t ⇒ FD(p, r) < t for all substrings r of s
• Result: O(|p|^2) space and time complexity in practice
• Implementation results
    • Typically 30x improvement in speed
    • 200x to 1000x reduction in space
    • Preliminary performance measurements: ~40MB/sec
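A minimal Python sketch of the sliding-window coarse filter follows. The function name and the threshold parameter `t` are assumptions for the example. Because every window has the same length as p, |P-S| and |S-P| are equal, so the sketch tracks only the number of characters of p missing from the window.

```python
from collections import Counter
from typing import List


def coarse_filter(p: str, s: str, t: int) -> List[int]:
    """Return start offsets of the length-|p| windows of s whose FD to p is below t."""
    n = len(p)
    if n == 0 or len(s) < n:
        return []
    need = Counter(p)        # multiset P
    window = Counter(s[:n])  # multiset of the current window
    # FD: characters of p (with multiplicity) that the window lacks.
    missing = sum((need - window).values())
    hits = [0] if missing < t else []
    for start in range(1, len(s) - n + 1):
        out_ch, in_ch = s[start - 1], s[start + n - 1]
        # Dropping out_ch opens a gap unless the window had a surplus of it.
        if window[out_ch] <= need[out_ch]:
            missing += 1
        window[out_ch] -= 1
        # Adding in_ch closes a gap if the window was short of it.
        if window[in_ch] < need[in_ch]:
            missing -= 1
        window[in_ch] += 1
        if missing < t:
            hits.append(start)
    return hits


print(coarse_filter("abc", "xxcabxx", 2))
```

Only the windows that survive this filter would then be handed to the quadratic edit-distance computation.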


Efficient online operation

• Weighted edit-distance algorithms are still too expensive if applied to every input/output
    • Need to run for every input parameter and output
• Key idea:
    • Use learning to construct a classifier for outputs
        • Each class consists of similarly tainted outputs
        • Taint identified quickly, once the class is known
    • Classifying strings is difficult
        • Our technique operates on parse trees of output
        • For ease of development, generality, and tolerance to syntax errors, we use a “rough” parser
    • Classifier is a decision tree that inspects parse tree nodes in an order that leads to good decisions


Decision Tree Construction

• Examines the nodes of syntax tree in some order
• The order of examination is a function of the set of syntax trees
    • Chooses nodes that are present in all candidate syntax trees
    • Avoids tests on tainted data, as they can vary
    • Avoids tests that don’t provide significant degree of discrimination
    • “Similar-valued” fields will be collected together and generalized, instead of storing individual values
• Incorporates a notion of “suitability” for each field or subtree in the syntax tree
    • Takes into account approximations made in parsing
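The construction rules above can be illustrated with a much-simplified Python sketch. It is not the RAMSES algorithm: it works on flat token lists rather than parse trees, and the `max_branches` cutoff is a hypothetical stand-in for "avoid tests on tainted data, as they can vary".

```python
from collections import defaultdict
from typing import Dict, List, Optional, Union

Tree = Union[dict, List[int]]


def build_tree(samples: Dict[int, List[str]], max_branches: int = 4, pos: int = 0) -> Tree:
    """Build a decision tree over token positions; leaves are lists of sample ids."""
    ids = sorted(samples)
    if len(ids) <= 1:
        return ids
    shortest = min(len(tokens) for tokens in samples.values())
    for i in range(pos, shortest):  # only positions present in every sample
        groups: Dict[str, Dict[int, List[str]]] = defaultdict(dict)
        for sid, tokens in samples.items():
            groups[tokens[i]][sid] = tokens
        # Skip positions that do not discriminate or that vary too much.
        if 1 < len(groups) <= max_branches:
            cases = {v: build_tree(g, max_branches, i + 1) for v, g in groups.items()}
            return {"pos": i, "cases": cases}
    return ids


def classify(tree: Tree, tokens: List[str]) -> Optional[List[int]]:
    """Follow the tests down to a leaf; None means no known class matches."""
    while isinstance(tree, dict):
        pos = tree["pos"]
        if pos >= len(tokens) or tokens[pos] not in tree["cases"]:
            return None
        tree = tree["cases"][tokens[pos]]
    return tree


training = {
    1: "SELECT * FROM phpbb_config".split(),
    3: "SELECT * FROM phpbb_themes WHERE themes_id=1".split(),
    5: "SELECT * FROM phpbb_forums ORDER BY cat_id,forum_order".split(),
}
tree = build_tree(training)
print(classify(tree, "SELECT * FROM phpbb_themes WHERE themes_id=7".split()))
```

In this toy run the first three token positions are identical across samples and are skipped, so the only test is on the table name; a changed `themes_id` value does not affect the class.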


Example of a Decision Tree

1. SELECT * FROM phpbb_config
2. SELECT u.*,s.* FROM phpbb_sessions s,phpbb_users u WHERE s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND u.user_id=s.session_user_id
3. SELECT * FROM phpbb_themes WHERE themes_id=1
4. SELECT c.cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order
5. SELECT * FROM phpbb_forums ORDER BY cat_id,forum_order

switch (1) {
  case ROOT : switch (1.1) {
    case CMD : switch (1.1.2) {
      case c FINAL {@1.1.1:SELECT @1.1.3:. cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order }
      case u FINAL {@1.1.1:SELECT @1.1.3:. *,s.* FROM phpbb_sessions s,phpbb_users u WHERE s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND u.user_id=s.session_user_id }
      case * FINAL {@1.1.1:SELECT @1.1.3:FROM phpbb_?????? }
    }
  }
}


Implementation Status and Next Steps

• “Rough” parsers implemented for
    • HTML/XML
    • Shell-like languages (including Perl/PHP)
    • SQL
• Preliminary performance measurements
    • Construction of decision trees: ~3MB/sec
    • Classification only: ~15MB/sec
    • Significant improvements expected with some performance tuning
• Next steps
    • Develop better clustering/classification algorithms based on tree edit-distance
    • Current algorithm is based entirely on a top-down traversal, and fails to exploit similarities among subtrees


Syntax and taint-aware policies


Overview of Policies

• Leverage structure+taint to simplify/generalize policy
• Policy structure mirrors that of parse trees
    • And-Or “trees” with cycles
    • Can specify constraints on values (using regular expressions) and taint associated with a parse tree node
• Most attacks detected using one basic policy
    • Controlling “commands” vs command parameters
    • Controlling pointers vs data

[Figure: example policy tree with nodes ELEMENT, NAME = “script”, OR, PARAM, ELEM_BODY, PARAM_NAME=“src”, PARAM_VALUE]


Controlling “commands” Vs “parameters”

Observation: parameters don’t alter syntactic structure of victim’s requests

Policy: Structure of parse tree for victim’s request should not be controlled by untrusted input (“tainted data”)

Alternate formulation: tainted data shouldn’t span multiple “fields” or “tokens” in victim’s request

[Figure: parse trees of the victim’s request]

Benign request:
root
  cmd
    name: gpg    param: -r    param: [email protected]

Attack request:
root
  cmd
    name: gpg    param: -r    param: nobody
  separator: ;
  cmd
    name: rm    param: -rf    param: *


Policy prohibiting structure changes

• Define “structure change” without using a reference
    • Avoids need for training and associated FP issues
• Policy 1
    • Tainted data cannot span multiple nodes
    • For binary data, it should not span multiple fields
• Policy 2
    • Tainted data cannot straddle multiple subtrees
        • Tainted data spans two adjacent subtrees, and at least one of them is not fully tainted
        • Tainted data “overflowed” beyond the end of one subtree and resulted in a second subtree
• Both policies can be further refined to constrain the node types and children subtrees of the nodes
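As a rough illustration of Policy 1 only, the Python sketch below flags a command when a tainted value covers more than one token. The tokenizer is a deliberately crude, hypothetical stand-in for the “rough” shell parser, taint is located by exact substring search instead of approximate matching, and the benign address is made up for the example.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def rough_tokens(command: str) -> List[Span]:
    """Crude shell tokenizer (an assumption): words and ';' separators as leaf nodes."""
    return [m.span() for m in re.finditer(r";|[^\s;]+", command)]


def tokens_touched(taint: Span, tokens: List[Span]) -> int:
    """Count the tokens that overlap the tainted span."""
    t_start, t_end = taint
    return sum(1 for a, b in tokens if a < t_end and t_start < b)


def violates_policy1(command: str, tainted_value: str) -> bool:
    """True if the tainted value spans more than one token of the command."""
    start = command.find(tainted_value)
    if not tainted_value or start < 0:
        return False
    taint = (start, start + len(tainted_value))
    return tokens_touched(taint, rough_tokens(command)) > 1


# Benign parameter (hypothetical address) versus the injected one from Example 1.
print(violates_policy1("gpg -r alice@example.org 2>&1", "alice@example.org"))
print(violates_policy1("gpg -r nobody; rm -rf * 2>&1", "nobody; rm -rf *"))
```

The check needs no reference to a "normal" command: it looks only at how the tainted span lines up with token boundaries, which is why no training is involved.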


Commands vs parameters: Example 2

• Memory corruption attack overflowing stack buffer
• For binary data, we talk about message fields rather than parse trees
• Violation: tainted data spans multiple stack “fields”
• Heap overflows involve tainted data spanning across multiple heap blocks

[Figure: stack layout with labels Stack frame 1, Return Address, Stack frame 2, Return Address, Stack frame 2, …]


Attacks Detected by “No structure change” Policy

• Various forms of script or command injection
• SQL injection
• XPath injection
• Format string attacks
• HTTP response splitting
• Log injection
• Stack overflow and heap overflow


Application-specific policies

• Not all attacks have the flavor of “command injection”
• Develop application-specific policies to detect such attacks
    • Policy 3: Cross-site scripting: no tainted scripts in HTML data
    • Policy 4: Path traversal: tainted file names cannot access data outside of a certain document tree
    • …
• Other examples
    • Policy 5: No tainted CMD_NAME or CMD_SEPARATOR nodes in shell or SQL commands
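Policy 4 can be sketched in a few lines of Python. This is an illustrative check under stated assumptions, not the RAMSES policy engine: the function name and the document root are hypothetical, and the example assumes POSIX-style paths.

```python
import os


def violates_path_policy(doc_root: str, tainted_name: str) -> bool:
    """True if a tainted file name resolves to a location outside doc_root."""
    root = os.path.realpath(doc_root)
    # Joining with an absolute tainted name discards the root, which is intended here.
    target = os.path.realpath(os.path.join(root, tainted_name))
    return os.path.commonpath([root, target]) != root


# "/srv/ramses-docs" is a made-up document tree used only for the example.
print(violates_path_policy("/srv/ramses-docs", "reports/a.txt"))
print(violates_path_policy("/srv/ramses-docs", "../../etc/passwd"))
```

Resolving the path before comparing it is the important step: a purely textual prefix check would accept names that climb out of the tree with `..` components.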


Implementation status

• Four test applications
    • phpBB
    • SquirrelMail
    • PHP/XMLRPC
    • WebGoat (J2EE)
• Detects following attacks without FPs
    • Command injection (Policies 1, 2, 5)
    • SQL injection (1, 2, 5)
    • XSS (3)
    • HTTP response splitting (2)
    • Path traversal (4)
    • Memory corruption detected using ASR
• Should be able to detect many other attacks easily
    • XPath injection (1, 2), format-string (1, 2), log injection (1, 2)


Memory Attack Discussion


Memory Error Based Remote Attack

• Attacker’s goal:
    • Overwrite target of interest to take over instruction execution
• Attacker’s approach:
    • Propagate attacker-controlled input to target of interest
    • Violate certain structural constraints in the propagation process


Stack Frame Structural Violation

[Figure: stack layout from high to low addresses, with register labels EBP, FS:0, and ESP]

High
• A’s stack frame: function arguments, return address, previous stack frame, Exception Registration Record, local variables
• B’s stack frame: function arguments, return address (to A), previous stack frame, local variables
• C’s stack frame: function arguments, return address (to B), previous stack frame, Exception Registration Record, local variables
Low


Heap Block Structural Violation

• Happens when removing a free block from the doubly linked list
• Ability to write 4 bytes to any address, usually a well-known address such as a function pointer, return address, SEH, etc.

[Figure: Windows free heap block header structure, with fields Size, Previous Size, Segment Index, Flags, Unused, Tag Index, FLink, BLink]
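The 4-byte write can be seen in a toy Python model of the unlink step. Memory is modelled as a dictionary from address to 32-bit word, the field offsets (FLink at +0, BLink at +4) are assumptions for the sketch, and all addresses and values are made up.

```python
from typing import Dict


def unlink(memory: Dict[int, int], block: int) -> None:
    """Remove a free block from a doubly linked list (simplified model)."""
    flink = memory[block]      # forward link, assumed at offset 0
    blink = memory[block + 4]  # backward link, assumed at offset 4
    memory[blink] = flink      # previous->FLink = next
    memory[flink + 4] = blink  # next->BLink = previous


# An overflow has replaced the links of the free block at 0x1000:
# FLink now holds the value to write, BLink the address to write it to.
memory = {0x1000: 0xDEADBEEF, 0x1004: 0x7FFDF020}
unlink(memory, 0x1000)
print(hex(memory[0x7FFDF020]))
```

With address-space randomization the "address to write" chosen by the attacker is likely to be invalid, so the first store faults instead of succeeding.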


ASLR and Crash Analysis

• ASLR randomizes the addresses of targets of interest
• A memory attack using the original address will miss and cause a crash (exception)
• Crash analysis tracks back to the vulnerability, which enables accurate signature generation
    • Structural information is usually retrievable at runtime, thanks to enhanced debugging technology
• Crash analysis aided with JIT (Just-In-Time Tracing)
    • JIT triggered at certain events:
        • “Suspicious” network inputs, e.g. sensitive JMP address
    • Attach/detach JIT monitor at event of interest
    • Memory can be dumped at the right granularity, logging anywhere from a few KB to 2GB


Crash Root Cause Analysis

[Figure: root cause analysis tree]

• Inputs: exception record/context, faulting thread/instructions/registers, stack trace/heap/module/symbols
• Stack corruption
    • Read access violation: bad EIP (corrupted return address or SEH)
    • Read access violation: bad dereference (corrupted local variables/passed parameters)
• Heap corruption
    • Write access violation (address to write, value to write)


Stack-based Overflow Analysis

• “Target”-driven analysis
    • The goal of the attack string is to overwrite a target of interest on the stack, e.g., return address, SEH handler
• Start by matching target values from the crash dump (such as EIP, EBP, and SEH handler) against the input
    • More efficient than pattern matching in the whole address space
• If any targets are matched in the input, expand in both directions to find the LCS
    • A match usually indicates the input size needed to overflow certain targets
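The target-driven matching can be sketched in Python as below. This is an illustration under assumptions, not the RAMSES analyzer: targets are taken to be 32-bit little-endian values, "LCS" is read as the longest common substring around the seed match, and the input and stack contents are synthetic.

```python
import struct
from typing import Optional, Tuple


def find_target_in_input(target_value: int, data: bytes) -> Optional[int]:
    """Offset of the 32-bit little-endian target value in the input, or None."""
    offset = data.find(struct.pack("<I", target_value))
    return offset if offset >= 0 else None


def expand_match(data: bytes, d_off: int, memory: bytes, m_off: int, length: int) -> Tuple[int, int]:
    """Grow a seed match of `length` bytes in both directions; return the input span."""
    start, m_start = d_off, m_off
    while start > 0 and m_start > 0 and data[start - 1] == memory[m_start - 1]:
        start, m_start = start - 1, m_start - 1
    end, m_end = d_off + length, m_off + length
    while end < len(data) and m_end < len(memory) and data[end] == memory[m_end]:
        end, m_end = end + 1, m_end + 1
    return start, end


# Synthetic input and stack contents: 128 filler bytes followed by the crash EIP.
crash_eip = 0x41424344
data = b"GET /" + b"A" * 128 + struct.pack("<I", crash_eip) + b"tail"
stack = b"\x00" * 16 + b"A" * 128 + struct.pack("<I", crash_eip) + b"\x00\x00"

seed = find_target_in_input(crash_eip, data)
start, end = expand_match(data, seed, stack, 16 + 128, 4)
print(seed - start)  # bytes of input consumed before the target is reached
```

Starting from the few values that caused the crash keeps the search small; only after a seed is found does the comparison widen to the surrounding bytes.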


SEH Overflow and Analysis

• A unique approach for Windows exploits
    • SEH stands for Structured Exception Handler
    • Windows puts the EXCEPTION_REGISTRATION_RECORD chain on the stack, with the SEH in each record
• More reliable and powerful than overwriting the return address
    • More JMP addresses to use (pop/pop/ret)
    • An exception (accidental/intentional) is desired
    • Can bypass the /GS buffer check
• SEH crash analysis:
    • Catch the first exception as well as the second one (caused by ASR)
    • Locate the SEH chain head from the first dump, usually overwritten by input
    • Usually the first exception is enough; the second exception can be used for confirmation


Heap Overflow Analysis

• How to analyze a heap overflow attack?
    • Exploit happens in free-block unlink
    • Multiple ways to trigger
• Write access violation with ASR
    • Raised when the overwrite hits an invalid address
    • Overwrites a 4-byte value at an arbitrary address
    • Targets of interest include return address, SEH, PEB and UEF
• Exploit contains the pair (Address to Write, Value to Write)
    • Appears in the overflowed heap blocks
    • Usually contained in registers
    • Should be provided from input by the attacker
    • Match found in synthetic heap exploits
• The value pair needs to be at a fixed offset
    • For a given heap overflow vulnerability
    • To enable overwriting the right address with the desired value


Case Studies

Vulnerability                                                      | Exploit
IIS ISAPI Extension synthetic stack buffer overflow                | Overwrite return address
IIS ISAPI Extension synthetic stack buffer overflow                | Overwrite Structured Exception Handler
IIS w3who.dll stack buffer overflow (CVE-2004-1134)                | Overwrite Structured Exception Handler
Microsoft RPC DCOM Interface stack buffer overflow (CVE-2003-0352) | Overwrite return address and Structured Exception Handler
Synthetic heap overflow                                            | Overwrite function pointer inside PEB structure


Case Study: RPC DCOM

Step 1: Exception Analysis
    FAULTING_IP: +18759f
    ExceptionCode: c0000005 (Access violation)
    Attempt to read from address 0018759f
    PROCESS_NAME: svchost.exe
    FAULTING_THREAD: 00000290
    PRIMARY_PROBLEM_CLASS: STACK_CORRUPTION

Step 2: Target – Input correlation:
    StackBase: 0x6c0000, StackLimit: 0x6bc000, Size = 0x4000
    Begin analyze on Target Overwrite and Input Correlation:
    Analyze crash EIP:
        Find EIP pattern at socket input: Bytes size to overwrite EIP= 128
    Analyze crash EIP done!
    Analyze SEH:
        Find SEH byte at socket input: Bytes size to overwrite SEH handler= 1588
    Analyze SEH done!


Signature Generation

• Signature generation: the signature captures the vulnerability characteristics
    • Minimum size to overwrite certain target(s)
• Use contexts to reduce false positives:
    • Using the calling stack of the incoming input
        • Stack offset can uniquely identify the context
    • Using the semantic context of the incoming input:
        • Message format like HTTP URL/parameter
        • Binary message field
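One possible shape for such a signature is sketched below in Python. The class, its field names, and the `http.url` context label are hypothetical, and the sketch assumes the signature fires once a field is long enough to reach the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverflowSignature:
    """A length-plus-context signature (illustrative, not the RAMSES format)."""

    context: str            # semantic context, e.g. a message field name
    min_overflow_size: int  # bytes of input that fit before the target is reached

    def matches(self, context: str, field_value: bytes) -> bool:
        # Flag only inputs in the same context that are long enough to reach the target.
        return context == self.context and len(field_value) > self.min_overflow_size


sig = OverflowSignature(context="http.url", min_overflow_size=128)
print(sig.matches("http.url", b"A" * 200))
print(sig.matches("http.url", b"A" * 20))
```

Tying the length test to a context is what keeps long but legitimate inputs in other fields from being flagged.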


Components & Implementation

[Figure: component diagram connecting Protected Application, RAMSES Crash Monitor, RAMSES Crash Analyzer, Windows Debug Engine, and Crash Dump*, with edges labeled Crash (Exception), Generate, Uses, Provide Input History, Analyze, and Signature, numbered 1 to 5]

• RAMSES Crash Monitor
    • Catches only exceptions of interest
    • Snapshots for a given period
    • Self healer
• RAMSES Crash Analyzer
    • Fault type detection
    • Security-oriented analysis
    • Feedback
• Infrastructure:
    • Save crash dump
    • Extract relevant info
    • Search/match
    • Disassemble

* Crash Dump provides the same interface as a live process, so the Crash Analyzer does not actually have to work on a saved crash dump file.


Testing


Test Attacks & Applications

Attack                         | Vulnerability            | Target App   | App Lang | Exploited Lang | Targets
phpBB SQL Injection            | CAN-2003-0486            | phpBB        | PHP      | SQL            | Database
SquirrelMail Command Injection | CAN-2003-0990            | SquirrelMail | PHP      | cmd/shell      | Server
SquirrelMail XSS Attack        | CAN-2002-1341            | phpBB        | PHP      | JavaScript     | 3rd party clients
PHP XML-RPC                    | CAN-2005-1921            | PHP Library  | PHP      | XML            |
HTTP Splitting                 | CR LF escapes            | WebGoat      | Java     | HTTP Request   | Server
HTTP Splitting Cache Poisoning | tainted expiration field | WebGoat      | Java     | HTTP Request   | Server page cache
Path Based Access Control      | tainted file open        | WebGoat      | Java     | file path      | Server
Xpath injection                | tainted xpath string     | WebGoat      | Java     | Xpath Library  | Server
JSON injection                 | flawed architecture      | WebGoat      | Java     | JSON           | Server Application
XML inject                     | flawed architecture      | WebGoat      | Java     | XML            | Server Application

Baseline Applications
• phpBB (php)
• squirrelMail (php)
• WebGoat (java)
• hMailServer (C++)

Many “sub languages”
• SQL, XML, JavaScript, HTML, HTTP, JSON, shell, cmd, path


Possible Testbed Configurations

[Figure: four testbed diagrams, each showing an Attacker and a Protected System drawn from Web Server (IIS/Apache), Web Apps, files, SQL Database (MySQL), and Mail Server]

• Baseline testbed setup
• Can extend protected system to include Mail Server
• Protect Mail Server exposed as a service
• Protect just Mail Server in context of Web service


Traffic Generation Purpose

• Coverage of legitimate structural variation in monitored structures
    • SQL, command strings, call parameters
• Stress of log complexity for practicality
    • Multiple users, multiple sessions
• Performance measurements
    • Program performance metrics
    • Quantify performance impact


Traffic Generation to Web Sites

Approaches
• Simple record/playback (basic)
    • With minor substitutions (cookies, IPs)
    • Shell scripts, netcat, MaxQ (Jython-based)
• Custom DOM/Ajax scripting (learning)
    • Can access dynamically generated browser content after (during) client-side script eval
    • Automated site crawls of URLs
    • Automated form contents (site-specific metadata)
• COTS tools
    • Load testing and metrics



Red Team Suggestions


Suggested Red Team ROEs

• Initial telecons held in Fall
• Claim: RAMSES will defeat most generalized injection attacks on protected applications
• Red Team should target our current and planned applications rather than new ones (unless new application, sample attacks and complete traffic generator can be provided to RAMSES far enough in advance for learning and testing)
• Remote network access to the targeted application
• Attack designated application suite
• Required instrumentation yet to be determined
• Red Team exercise start 15 April or later
• ……


RAMSES Project Schedule

[Figure: Gantt chart over quarters of CY06 through CY09, with prototype milestones 1, 2, 3, a Red Team Exercise marker, and a “Today: 11 September 2007” marker]

Baseline Tasks
1. Refine RAMSES Requirements
2. Design RAMSES
3. Develop Components
4. Integrate System
5. Analyze & Test RAMSES
6. Coordinate & Rept

Prototypes

Optional Tasks
O.3 Cross-Area Exper


Next Steps


Plans

• Develop input filters from output policies
• Extend memory error analyzer
• Demonstrate RAMSES on more applications and attack types
    • Native C/C++ app (most likely app is hMail server)
    • Java
• Integrate components
• Performance and false positive testing
• Red Team exercise


Questions?


Backup


Tokenizing and Parsing

• Focus on “rough” parsing that reveals approximate structure, but not necessarily all the details
    • Accurate parsers are time-consuming to write
    • More important: may not gracefully handle errors (common in HTML) or language extensions and variations (different shells, different flavors of SQL)
• Implemented using Flex/Bison
    • Currently done for SQL and shell command languages
    • Parse into a sequence of statements, each statement consisting of a “command name” and “parameters”
    • Incorporates a notion of confidence to deal with complex language features, e.g., variable substitutions in shell
• Modest effort for adding additional languages, but substantially simplifies subsequent learning tasks
    • Don’t anticipate significant additions to this language list (other than HTML/XML)


Taint inference vs taint-tracking

• Disadvantages of learning
    • False negatives if inputs transformed before use
        • Low likelihood for most web apps
    • False positives due to coincidence
        • Mitigated using statistical information
    • Plan to evaluate these experimentally
• Benefits of learning
    • Low performance overhead
    • Some significant implicit flows handled without incurring high false positives
    • Can address multi-step attacks where tainted data is first stored in a file/database before use
        • More generally, in dealing with information flow that crosses module boundaries


Attack Coverage 2004

[Figure: pie chart of CVE Vulnerabilities (Ver. 20040901), with a callout labeled “Generalized Injection Attacks”]

• Memory errors (stack-smashing, heap overflow, integer overflow, data attacks): 27%
• Other logic errors: 22%
• Command injection: 15%
• Directory traversal: 10%
• Input validation/DoS: 9%
• Format string: 4%
• Cross-site scripting: 4%
• Tempfile: 4%
• Config errors: 3%
• SQL injection: 2%


RAMSES System Concept

• Key research problems
    • Learn taint propagation
        • Identify tainted components in output, generate filtering criteria
    • Learn input/output transformation
        • Use transformation to project output filters to input

[Figure: system diagram showing the Internet, a Network/App Firewall (e.g. mod_security), RAMSES Interceptors, and a Protected System containing Web Server (IIS/Apache), Web App (PHP/ASP), SQL Database (MySQL), OS DLLs, Application DLLs, and Network DLLs]

RAMSES Components
• Event Collector
    • Parse/decode/normalize HTTP requests, parameters, cookies, …
• Attack Detector
    • Address-space randomization
    • Taint-based policies, anomalies
• Filter Generator
    • Output filter
    • Input filter


Advantages of RAMSES Filters

• Filters easily sharable
    • Complements Application Community focus on end user applications
• Filters are human readable
• Filter generation algorithms can be enhanced to address privacy concerns wrt sharing


Filter types

Filter Criteria
• Correlative filters
    • Equality-based filter
    • Structure-based filter
    • Statistical filter
• Causal filters
    • Filtering criteria derived from attack detection criteria (policy or anomaly)

Filter Location
• Input filter
    • Easier to deploy but harder to synthesize
• Output filter (precedes sensitive operation)
    • Easier to synthesize than input filter, but deployment needs deeper instrumentation
    • May be too late for some attacks (memory corruption)

Note: All filters evaluated using a large number of benign samples and 1 attack sample
