ramses (regeneration and immunity services): a cognitive immune system
RAMSES (Regeneration And iMmunity SErviceS): A Cognitive Immune System. Self Regenerative Systems, 18 December 2007. Mark Cornwell, James Just, Nathan Li, Robert Schrag (Global InfoTek, Inc.); R. Sekar (Stony Brook University).
12/18/06
RAMSES (Regeneration And iMmunity SErviceS):
A Cognitive Immune System
Mark Cornwell, James Just, Nathan Li, Robert Schrag
Global InfoTek, Inc.
R. Sekar, Stony Brook University
Self Regenerative Systems, 18 December 2007
Outline
  Overview
  Efficient content-based taint identification
  Syntax and taint-aware policies
  Memory attack detection and response
  Testing
  Red Team suggestions
  Questions
  Demo
RAMSES Attack Context
Attack target: “program” mediating access to protected resources/services
Attack approach: use maliciously crafted input to exert unintended control over protected resource operations
Resource or service uses:
  Well-defined APIs to access
    OS resources
    Command interpreters
    Database servers
    Transaction servers, …
  Internal interfaces
    Data structures and functions within program
    Used by program components to talk to each other
Diagram: Incoming requests (untrusted input) -> Program -> Outgoing requests (security-sensitive operations)
Example 1: SquirrelMail Command Injection
Input interface:
  sendto="nobody; rm -rf *"
Program:
  $send_to_list = $_GET['sendto']
  $command = "gpg -r $send_to_list 2>&1"
  popen($command)
“Output” interface:
  $command = "gpg -r nobody; rm -rf * 2>&1"
  popen($command)
Attack: Removes all removable files in web server document tree
Example 2: phpBB SQL Injection
Input interface:
  topic="-1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3"
Program:
  $topic_id = $_GET['topic']
  $sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = $topic_id"
  sql_query($sql)
“Output” interface:
  $sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = -1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3"
  sql_query($sql)
Attack: Steal another user’s password
Attack Space of Interest (CVE 2006)
Generalized Injection Attacks
Detection Approach
Attack: use maliciously crafted input to exert unintended control over output operations
Detect “exertion of control”
  Based on “taint”: degree to which output depends on input
Detect if control is intended:
  Requires policies (or training)
  Application-independent policies are preferable
Diagram: Input interface (untrusted input) -> Program -> “Output” interface (security-sensitive operations)
RAMSES Goals and Approach
Taint analysis: develop efficient and non-invasive alternatives
  Analyze observed inputs and outputs
    Needs no modifications to program
    Language-neutral
  Leverage learning to speed up analysis
Attack detection: develop framework to detect a wide range of attacks, while minimizing policy development effort and FP/FNs
  “Structure-aware policies”: leverage interplay between taint and structural changes to output requests
  Use Address-Space Randomization (ASR) for memory corruption
    ASR: efficient, in-band, “positive” tainting for pointer-valued data
Immunization: filter out future attack instances
  Output filters: drop output requests that violate taint-based policies
  Input filters: “project” policies on outputs to those on inputs
    Relies on learning relationships between input and output fields
    Network-deployable
Diagram: Input interface (untrusted input) -> Program -> “Output” interface
Efficient Content-Based Taint Identification
Steps
Develop efficient algorithms for inferring flow of input data into outputs
  Compare input and output values
  Allow for parts of input to flow into parts of output
  Tolerate some changes to input
    Changes such as space removal, quoting, escaping, and case-folding are common in string-based interfaces
  Based on approximate substring matching
Leverage learning to speed up taint inference
  Even the “efficient” content-matching algorithms are too expensive to run on every input/output
  Same learning techniques can be used for detecting attacks using anomaly detection
Weighted Substring Edit Distance Algorithm
Maintain a matrix D[i][j] of minimum edit distance between p[1..i] and s[1..j], where any prefix of s[1..j] may be skipped at no cost
D[i][j] = min{ D[i-1][j-1] + SubstCost(p[i], s[j]),
               D[i-1][j] + DeleteCost(p[i]),
               D[i][j-1] + InsertCost(s[j]) }
D[0][j] = 0 (no cost for omitting any prefix of s)
D[i][0] = DeleteCost(p[1]) + … + DeleteCost(p[i])
Matches can be reconstructed from the D matrix
Quadratic time and space complexity: uses O(|p|*|s|) memory and time
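The recurrence on this slide can be sketched in Python as follows. This is a minimal illustration, not the RAMSES implementation: the function name `substring_edit_distance` and the default unit costs are assumptions, while the three cost hooks mirror SubstCost, DeleteCost, and InsertCost from the slide.

```python
def substring_edit_distance(p, s, subst_cost=None, delete_cost=None, insert_cost=None):
    """Return (cost, end) for the cheapest approximate occurrence of p in s.

    `end` is the index in s just past the best match. Unit costs are an
    assumption made for this sketch; the slide leaves the weights open.
    """
    subst_cost = subst_cost or (lambda a, b: 0 if a == b else 1)
    delete_cost = delete_cost or (lambda a: 1)
    insert_cost = insert_cost or (lambda b: 1)
    m, n = len(p), len(s)
    # Row 0 is all zeros: omitting any prefix of s is free.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # Column 0: all of p[:i] has to be deleted.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + delete_cost(p[i - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + subst_cost(p[i - 1], s[j - 1]),
                D[i - 1][j] + delete_cost(p[i - 1]),
                D[i][j - 1] + insert_cost(s[j - 1]),
            )
    # The best match may end anywhere in s, so scan the last row.
    end = min(range(n + 1), key=lambda j: D[m][j])
    return D[m][end], end


cost, end = substring_edit_distance("nobody; rm -rf *", "gpg -r nobody; rm -rf * 2>&1")
```

With the SquirrelMail strings from Example 1, the input parameter is found verbatim in the output, so the cost is zero. An escaped copy of the input (for instance a quote preceded by a backslash) would match with a small nonzero cost instead.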
Improving performance
Quadratic complexity algorithms can be too expensive for large s, e.g., HTML outputs
  Storage requirements are even more problematic
Solution: use a linear-time coarse filtering algorithm
  Approximate D by FD, defined on substrings of s of length |p|
    Let P (and S) denote the multiset of characters in p (resp., s)
    FD(p, s) = min(|P-S|, |S-P|)
    Slide a window of size |p| over s, compute FD incrementally
  Prove: D(p, r) < t => FD(p, r) < t for all substrings r of s
  Result: O(|p|^2) space and time complexity in practice
Implementation results
  Typically 30x improvement in speed
  200x to 1000x reduction in space
  Preliminary performance measurements: ~40MB/sec
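A minimal sketch of the coarse filter, assuming unit edit costs and an exclusive threshold `t`. The function name `coarse_filter` and the bookkeeping variables are illustrative, not taken from the RAMSES code. Because every window r has length |p|, |P-R| equals |R-P|, so a single counter is enough.

```python
from collections import Counter


def coarse_filter(p, s, t):
    """Return start offsets of windows r of s (|r| = |p|) with FD(p, r) < t."""
    m = len(p)
    if m == 0 or len(s) < m:
        return []
    # need[c] = (occurrences of c in p) - (occurrences of c in the window)
    need = Counter(p)
    # missing = |P - R|: characters of p not covered by the current window
    missing = m
    hits = []
    for j, ch in enumerate(s):
        # Character entering the window on the right.
        if need[ch] > 0:
            missing -= 1
        need[ch] -= 1
        # Character leaving the window on the left.
        if j >= m:
            out = s[j - m]
            need[out] += 1
            if need[out] > 0:
                missing += 1
        if j >= m - 1 and missing < t:
            hits.append(j - m + 1)
    return hits
```

Only the windows returned here would be handed to the quadratic algorithm, which is where the space and time savings on the slide would come from.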
Efficient online operation
Weighted edit-distance algorithms are still too expensive if applied to every input/output
  Need to run for every input parameter and output
Key idea:
  Use learning to construct a classifier for outputs
  Each class consists of similarly tainted outputs
  Taint identified quickly, once the class is known
Classifying strings is difficult
  Our technique operates on parse trees of output
  For ease of development, generality, and tolerance to syntax errors, we use a “rough” parser
Classifier is a decision tree that inspects parse tree nodes in an order that leads to good decisions
Decision Tree Construction
Examines the nodes of syntax tree in some order
  The order of examination is a function of the set of syntax trees
Chooses nodes that are present in all candidate syntax trees
Avoids tests on tainted data, as they can vary
Avoids tests that don’t provide significant degree of discrimination
  “Similar-valued” fields will be collected together and generalized, instead of storing individual values
Incorporates a notion of “suitability” for each field or subtree in the syntax tree
  Takes into account approximations made in parsing
Example of a Decision Tree
1. SELECT * FROM phpbb_config
2. SELECT u.*,s.* FROM phpbb_sessions s,phpbb_users u WHERE s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND u.user_id=s.session_user_id
3. SELECT * FROM phpbb_themes WHERE themes_id=1
4. SELECT c.cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order
5. SELECT * FROM phpbb_forums ORDER BY cat_id,forum_order

switch (1) {
  case ROOT : switch (1.1) {
    case CMD : switch (1.1.2) {
      case c FINAL {@1.1.1:SELECT @1.1.3:. cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order }
      case u FINAL {@1.1.1:SELECT @1.1.3:. *,s.* FROM phpbb_sessions s,phpbb_users u WHERE s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND u.user_id=s.session_user_id }
      case * FINAL {@1.1.1:SELECT @1.1.3:FROM phpbb_?????? }
    }
  }
}
Implementation Status and Next Steps
“Rough” parsers implemented for
  HTML/XML
  Shell-like languages (including Perl/PHP)
  SQL
Preliminary performance measurements
  Construction of decision trees: ~3MB/sec
  Classification only: ~15MB/sec
  Significant improvements expected with some performance tuning
Next steps
  Develop better clustering/classification algorithms based on tree edit-distance
  Current algorithm is based entirely on a top-down traversal, and fails to exploit similarities among subtrees
Syntax and taint-aware policies
Overview of Policies
Leverage structure+taint to simplify/generalize policy
Policy structure mirrors that of parse trees
  And-Or “trees” with cycles
  Can specify constraints on values (using regular expressions) and taint associated with a parse tree node
Most attacks detected using one basic policy
  Controlling “commands” vs command parameters
  Controlling pointers vs data
Example policy tree (diagram):
  ELEMENT
    NAME = “script”
    OR
      PARAM
        PARAM_NAME = “src”
        PARAM_VALUE
      ELEM_BODY
Controlling “commands” vs “parameters”
Observation: parameters don’t alter syntactic structure of victim’s requests
Policy: Structure of parse tree for victim’s request should not be controlled by untrusted input (“tainted data”)
Alternate formulation: tainted data shouldn’t span multiple “fields” or “tokens” in victim’s request
Parse tree of a benign request (diagram):
  root
    cmd
      name: gpg
      param: -r
      param: [email protected]
Parse tree of the attack request (diagram):
  root
    cmd
      name: gpg
      param: -r
      param: nobody
    separator: ;
    cmd
      name: rm
      param: -rf
      param: *
Policy prohibiting structure changes
Define “structure change” without using a reference
  Avoids need for training and associated FP issues
Policy 1
  Tainted data cannot span multiple nodes
  For binary data, it should not span multiple fields
Policy 2
  Tainted data cannot straddle multiple subtrees
    Tainted data spans two adjacent subtrees, and at least one of them is not fully tainted
    Tainted data “overflowed” beyond the end of one subtree and resulted in a second subtree
Both policies can be further refined to constrain the node types and children subtrees of the nodes
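Policy 1 can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the regular expression stands in for the “rough” shell parser, exact substring search stands in for taint inference, and the names `tokens` and `violates_policy_1` are hypothetical. The benign recipient `alice` is likewise made up.

```python
import re

# Rough tokenizer: a token is either a run of separator characters (; | &)
# or a run of anything that is neither whitespace nor a separator.
TOKEN = re.compile(r"\s*([;|&]+|[^\s;|&]+)")


def tokens(command):
    """Return (start, end, text) for every token in a shell-like command."""
    return [(m.start(1), m.end(1), m.group(1)) for m in TOKEN.finditer(command)]


def violates_policy_1(command, tainted):
    """True if the tainted substring overlaps more than one token."""
    start = command.find(tainted)
    if start < 0:
        return False  # no taint found in this output
    end = start + len(tainted)
    touched = [t for t in tokens(command) if t[0] < end and start < t[1]]
    return len(touched) > 1


benign = violates_policy_1("gpg -r alice 2>&1", "alice")
attack = violates_policy_1("gpg -r nobody; rm -rf * 2>&1", "nobody; rm -rf *")
```

In the benign request the taint stays inside one parameter token. In the attack it covers a parameter, a separator, and a second command, which is the structure change the policy prohibits.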
Commands vs parameters: Example 2
Memory corruption attack overflowing stack buffer
For binary data, we talk about message fields rather than parse trees
Violation: tainted data spans multiple stack “fields”
Heap overflows involve tainted data spanning across multiple heap blocks
Diagram: Stack frame 1, Return Address, Stack frame 2, Return Address, Stack frame 2, …
Attacks Detected by “No structure change” Policy
  Various forms of script or command injection
  SQL injection
  XPath injection
  Format string attacks
  HTTP response splitting
  Log injection
  Stack overflow and heap overflow
Application-specific policies
Not all attacks have the flavor of “command injection”
Develop application-specific policies to detect such attacks
  Policy 3: Cross-site scripting: no tainted scripts in HTML data
  Policy 4: Path traversal: tainted file names cannot access data outside of a certain document tree
  …
Other examples
  Policy 5: No tainted CMD_NAME or CMD_SEPARATOR nodes in shell or SQL commands
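As an illustration of Policy 4, the check below resolves a tainted file name against a document root and rejects it if the result falls outside that tree. This is a sketch under assumptions: the function name `violates_policy_4` and the example paths are hypothetical, and the slides do not say how RAMSES resolves paths.

```python
import os.path


def violates_policy_4(doc_root, tainted_name):
    """True if the tainted file name resolves outside the document tree."""
    root = os.path.realpath(doc_root)
    # join() discards root when tainted_name is absolute, so absolute
    # paths supplied by the attacker are caught by the same comparison.
    target = os.path.realpath(os.path.join(root, tainted_name))
    return os.path.commonpath([root, target]) != root


inside = violates_policy_4("/var/www/docs", "img/logo.png")
outside = violates_policy_4("/var/www/docs", "../../etc/passwd")
```

Resolving the path before comparing is what makes the check robust to `..` segments and symbolic links; comparing raw strings would miss both.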
Implementation status
Four test applications
  phpBB
  SquirrelMail
  PHP/XMLRPC
  WebGoat (J2EE)
Detects following attacks without FPs
  Command injection (Policies 1, 2, 5)
  SQL injection (1, 2, 5)
  XSS (3)
  HTTP response splitting (2)
  Path traversal (4)
  Memory corruption detected using ASR
Should be able to detect many other attacks easily
  XPath injection (1, 2), format-string (1, 2), log injection (1, 2)
Memory Attack Discussion
Memory Error Based Remote Attack
Attacker’s goal:
  Overwrite target of interest to take over instruction execution
Attacker’s approach:
  Propagate attacker-controlled input to target of interest
  Violate certain structural constraints in the propagation process
Stack Frame Structural Violation
Stack layout (diagram), from high addresses to low:
  A’s stack frame: function arguments, return address, previous stack frame, Exception Registration Record, local variables
  B’s stack frame: function arguments, return address (to A), previous stack frame, local variables
  C’s stack frame: function arguments, return address (to B), previous stack frame, Exception Registration Record, local variables
Registers marked on the diagram: EBP, FS:0, ESP
Heap Block Structural Violation
Happens when removing a free block from the doubly linked list:
  Ability to write 4 bytes into any address, usually a well-known address such as a function pointer, return address, SEH, etc.
Windows Free Heap Block Header Structure (diagram fields):
  Size, Previous Size
  SegmentIndex, Flags, Unused, Tag Index
  FLink
  BLink
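The 4-byte write can be seen in a simplified model of the unlink step. This is a sketch, not Windows heap code: memory is modeled as a Python dict from address to 32-bit value, the function name `unlink` is illustrative, and the offsets assume the 32-bit list-entry layout with the forward link at offset 0 and the back link at offset 4. The addresses are made up.

```python
def unlink(memory, flink, blink):
    """Remove a free block whose header holds the given FLink and BLink.

    Standard doubly linked list removal performs two writes:
      block.BLink->FLink = block.FLink   (forward link lives at offset 0)
      block.FLink->BLink = block.BLink   (back link lives at offset 4)
    """
    memory[blink] = flink
    memory[flink + 4] = blink


# If an overflow has replaced both link fields with attacker data, the
# first write stores an attacker-chosen value at an attacker-chosen address.
memory = {}
value_to_write = 0x41414141     # hypothetical attacker-supplied value
address_to_write = 0x0012FF00   # hypothetical target address
unlink(memory, flink=value_to_write, blink=address_to_write)
```

This is why the later heap analysis slide looks for an (address to write, value to write) pair in the input: both come straight from the corrupted header.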
ASLR and Crash Analysis
ASLR randomizes the addresses of targets of interest
  A memory attack using the original address will miss and cause a crash (exception)
Crash analysis tracks back to the vulnerability, which enables accurate signature generation
  Structural information is usually retrievable at runtime, thanks to enhanced debugging technology
Crash analysis aided with JIT (Just-In-time Tracing)
  JIT triggered at certain events:
    “Suspicious” network inputs, e.g. sensitive JMP address
  Attach/detach JIT monitor at event of interest
  Memory can be dumped at the right granularity, with log info from a few KB to 2 GB
Crash Root Cause Analysis
Inputs to root cause analysis (diagram):
  Exception record/context, faulting thread/instructions/registers
  Stack trace/heap/module/symbols
Root cause analysis outcomes (diagram):
  Stack corruption
    Read access violation: bad EIP (corrupted return address or SEH)
    Read access violation: bad dereference (corrupted local variables/passing parameters)
  Heap corruption
    Write access violation (address to write, value to write)
Stack-based Overflow Analysis
“Target”-driven analysis
  The goal of the attack string is to overwrite a target of interest on the stack, e.g., return address, SEH handler
Start by matching target values from the crash dump, like EIP, EBP and SEH handler, against the input
  More efficient than pattern matching in the whole address space
If any targets are matched in input, expand in both directions to find LCS
A match usually indicates the input size needed to overflow certain targets
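The target-driven matching can be sketched as below. The helper names `bytes_to_overwrite` and `expand_match` are hypothetical, the little-endian 32-bit encoding assumes an x86 target, and the byte strings are made up; the slides do not show the analyzer's actual interface.

```python
import struct


def bytes_to_overwrite(target_value, attack_input):
    """Input size needed to reach and overwrite a 32-bit target, or None."""
    needle = struct.pack("<I", target_value)  # little-endian, x86 assumption
    offset = attack_input.find(needle)
    if offset < 0:
        return None
    return offset + len(needle)


def expand_match(dump, dump_pos, data, data_pos, length):
    """Grow a seed match of `length` bytes in both directions.

    Returns (start, end) offsets into `data` of the common region.
    """
    lo = 0
    while (dump_pos - lo > 0 and data_pos - lo > 0
           and dump[dump_pos - lo - 1] == data[data_pos - lo - 1]):
        lo += 1
    hi = length
    while (dump_pos + hi < len(dump) and data_pos + hi < len(data)
           and dump[dump_pos + hi] == data[data_pos + hi]):
        hi += 1
    return data_pos - lo, data_pos + hi


crash_eip = 0x41424344  # hypothetical faulting EIP read from the crash dump
data = b"GET /" + b"A" * 20 + struct.pack("<I", crash_eip) + b"BBBB"
dump = b"\x00\x00" + data[5:] + b"\x00"  # stack bytes around the target
size = bytes_to_overwrite(crash_eip, data)
region = expand_match(dump, 22, data, 25, 4)
```

Searching for a handful of 4-byte targets is cheap, and the expansion step then recovers how much of the input actually landed on the stack.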
SEH Overflow and Analysis
A unique approach for Windows exploits
  SEH stands for Structured Exception Handler
  Windows puts the EXCEPTION_REGISTRATION_RECORD chain on the stack, with the SEH in the record
More reliable and powerful than overwriting the return address
  More JMP addresses to use (pop/pop/ret)
  An exception (accidental/intentional) is desired
  Can bypass /GS buffer check
SEH crash analysis:
  Catch the first exception as well as the second one (caused by ASR)
  Locate the SEH chain head from the first dump, usually overwritten by input
  Usually the first exception is enough; the second exception can be used for confirmation
Heap Overflow Analysis
How to analyze a heap overflow attack?
  Exploit happens in free block unlink
  Multiple ways to trigger
Write access violation with ASR when overwriting at an invalid address
  Overwrites a 4-byte value at an arbitrary address
  Targets of interest include return address, SEH, PEB and UEF
Exploit contains the pair: (Address to Write, Value to Write)
  Appears in the overflowed heap blocks
  Usually contained in registers
  Should be provided from input by attacker
  Match found in synthetic heap exploits
The value pair needs to be at a fixed offset
  For a given heap overflow vulnerability
  To enable overwriting the right address with the right value
Case Studies
Vulnerability | Exploit
IIS ISAPI Extension synthetic stack buffer overflow | Overwrite return address
IIS ISAPI Extension synthetic stack buffer overflow | Overwrite Structured Exception Handler
IIS w3who.dll stack buffer overflow (CVE-2004-1134) | Overwrite Structured Exception Handler
Microsoft RPC DCOM Interface stack buffer overflow (CVE-2003-0352) | Overwrite return address and Structured Exception Handler
Synthetic Heap Overflow | Overwrite function pointer inside PEB structure
Case Study: RPC DCOM
Step 1: Exception Analysis
  FAULTING_IP: +18759f
  ExceptionCode: c0000005 (Access violation)
  Attempt to read from address 0018759f
  PROCESS_NAME: svchost.exe
  FAULTING_THREAD: 00000290
  PRIMARY_PROBLEM_CLASS: STACK_CORRUPTION
Step 2: Target-Input correlation
  StackBase: 0x6c0000, StackLimit: 0x6bc000, Size = 0x4000
  Begin analyze on Target Overwrite and Input Correlation:
  Analyze crash EIP:
    Find EIP pattern at socket input: Bytes size to overwrite EIP = 128
  Analyze crash EIP done!
  Analyze SEH:
    Find SEH byte at socket input: Bytes size to overwrite SEH handler = 1588
  Analyze SEH done!
Signature Generation
Signature generation:
  Signature captures the vulnerability characteristics
  Minimum size to overwrite certain target(s)
Use contexts to reduce false positives:
  Using incoming input calling stack
    Stack offset can uniquely identify the context
  Using incoming input semantic context:
    Message format like HTTP URL/parameter
    Binary message field
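One way to picture such a signature is as a context plus a minimum size. The class below is an illustrative assumption: the name `OverflowSignature`, its fields, and the context label are hypothetical, and only the 128-byte figure comes from the RPC DCOM case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverflowSignature:
    """Vulnerability signature: a context plus the size that reaches a target."""
    context: str   # hypothetical label, e.g. a message field or call-stack id
    min_size: int  # minimum input size needed to overwrite the target

    def matches(self, context: str, field_value: bytes) -> bool:
        # The context check keeps long inputs to unrelated fields from
        # being flagged, which is how context reduces false positives.
        return context == self.context and len(field_value) >= self.min_size


# 128 bytes is the EIP overwrite size reported in the case study.
sig = OverflowSignature(context="rpc-dcom:request", min_size=128)
blocked = sig.matches("rpc-dcom:request", b"A" * 200)
```

Because the signature describes the vulnerability rather than one exploit's bytes, a variant with a different payload but the same overflow length would still match.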
Components & Implementation
Diagram components:
  Protected Application
  RAMSES Crash Monitor
    Catch interested exception only
    Snapshots for a given period
    Self healer
  RAMSES Crash Analyzer
    Fault type detection
    Security oriented analysis
    Feedback
  Windows Debug Engine
  Crash Dump*
Diagram edge labels (steps numbered 1 to 5): Crash (Exception), Generate, Uses, Uses, Provide Input History, Analyze, Signature
Infrastructure:
  Save crash dump
  Extract relevant info
  Search/match
  Disassemble
* Crash Dump provides the same interface as a live process, so the Crash Analyzer does not actually have to work on a saved crash dump file.
Testing
Test Attacks & Applications
Attack | Vulnerability | Target App | App Lang | Exploited Lang | Targets
phpBB SQL Injection | CAN-2003-0486 | phpBB | PHP | SQL | Database
SquirrelMail Command Injection | CAN-2003-0990 | SquirrelMail | PHP | cmd/shell | Server
SquirrelMail XSS Attack | CAN-2002-1341 | phpBB | PHP | JavaScript | 3rd party clients
PHP XML-RPC | CAN-2005-1921 | PHP Library | PHP | XML |
HTTP Splitting | CR LF escapes | WebGoat | Java | HTTP Request | Server
HTTP Splitting Cache Poisoning | tainted expiration field | WebGoat | Java | HTTP Request | Server page cache
Path Based Access Control | tainted file open | WebGoat | Java | file path | Server
Xpath injection | tainted xpath string | WebGoat | Java | Xpath Library | Server
JSON injection | flawed architecture | WebGoat | Java | JSON | Server Application
XML inject | flawed architecture | WebGoat | Java | XML | Server Application
Baseline Applications
  • phpBB (php)
  • squirrelMail (php)
  • WebGoat (java)
  • hMailServer (C++)
Many “sub languages”: SQL, XML, JavaScript, HTML, HTTP, JSON, shell, cmd, path
Possible Testbed Configurations
Four diagrams, each showing an Attacker, a Protected System, and a Mail Server; three of them also show a Web Server (IIS/Apache), Web Apps, files, and a SQL Database (MySQL). Captions, in slide order:
  Can extend protected system to include Mail Server
  Protect Mail Server exposed as a service
  Baseline testbed setup
  Protect just Mail Server in context of Web service
Traffic Generation Purpose
Coverage of legitimate structural variation in monitored structures
  SQL, command strings, call parameters
Stress of log complexity for practicality
  Multiple users, multiple sessions
Performance measurements
  Program performance metrics
  Quantify performance impact
Traffic Generation to Web Sites
Approaches
  Simple record/playback (basic) with minor substitutions (cookies, IPs)
    Shell scripts, netcat, MaxQ (Jython-based)
  Custom DOM/Ajax scripting (learning)
    Can access dynamically generated browser content after (or during) client-side script evaluation
    Automated site crawls of URLs
    Automated form contents (site-specific metadata)
  COTS tools
    Load testing and metrics
Red Team Suggestions
Suggested Red Team ROEs
Initial telecons held in Fall
Claim: RAMSES will defeat most generalized injection attacks on protected applications
Red Team should target our current and planned applications rather than new ones (unless new application, sample attacks and complete traffic generator can be provided to RAMSES far enough in advance for learning and testing)
Remote network access to the targeted application
Attack designated application suite
Required instrumentation yet to be determined
Red Team exercise start 15 April or later
…
RAMSES Project Schedule
Gantt chart; timeline runs by quarter from CY06 Q4 through CY09.
Baseline Tasks
  1. Refine RAMSES Requirements
  2. Design RAMSES
  3. Develop Components
  4. Integrate System
  5. Analyze & Test RAMSES
  6. Coordinate & Rept
Prototypes (chart markers 1, 2, 3)
Optional Tasks
  O.3 Cross-Area Exper
Chart annotations: Today: 11 September 2007; Red Team Exercise
Next Steps
Plans
Develop input filters from output policies
Extend memory error analyzer
Demonstrate RAMSES on more applications and attack types
  Native C/C++ app (most likely app is hMail server)
  Java
Integrate components
Performance and false positive testing
Red Team exercise
Questions?
Backup
Tokenizing and Parsing
Focus on “rough” parsing that reveals approximate structure, but not necessarily all the details
  Accurate parsers are time-consuming to write
  More important: may not gracefully handle errors (common in HTML) or language extensions and variations (different shells, different flavors of SQL)
Implemented using Flex/Bison
  Currently done for SQL and shell command languages
  Parse into a sequence of statements, each statement consisting of a “command name” and “parameters”
  Incorporates a notion of confidence to deal with complex language features, e.g., variable substitutions in shell
Modest effort for adding additional languages, but substantially simplifies subsequent learning tasks
  Don’t anticipate significant additions to this language list (other than HTML/XML)
Taint inference vs taint-tracking
Disadvantages of learning
  False negatives if inputs transformed before use
    Low likelihood for most web apps
  False positives due to coincidence
    Mitigated using statistical information
  Plan to evaluate these experimentally
Benefits of learning
  Low performance overhead
  Some significant implicit flows handled without incurring high false positives
  Can address multi-step attacks where tainted data is first stored in a file/database before use
    More generally, helps in dealing with information flow that crosses module boundaries
Attack Coverage 2004
CVE Vulnerabilities (Ver. 20040901), pie chart:
  Memory errors (stack-smashing, heap overflow, integer overflow, data attacks): 27%
  Other logic errors: 22%
  Command injection: 15%
  Directory traversal: 10%
  Input validation/DoS: 9%
  Format string: 4%
  Cross-site scripting: 4%
  Tempfile: 4%
  Config errors: 3%
  SQL injection: 2%
Chart callout: Generalized Injection Attacks
RAMSES System Concept
Key research problems
  Learn taint propagation
    Identify tainted components in output, generate filtering criteria
  Learn input/output transformation
    Use transformation to project output filters to input
Diagram components:
  Internet
  Network/App Firewall (e.g. mod_security)
  Protected System: Web Server (IIS/Apache), Web App (PHP/ASP), SQL Database (MySQL), OS DLLs, Application DLLs, Network DLLs
  RAMSES Interceptors
RAMSES Components
  Event Collector
    • parse/decode/normalize HTTP requests, parameters, cookies, …
  Attack Detector
    • Address-space randomization
    • Taint-based policies, anomalies
  Filter Generator
    • Output filter
    • Input filter
Advantages of RAMSES Filters
Filters easily sharable
  Complements Application Community focus on end user applications
Filters are human readable
Filter generation algorithms can be enhanced to address privacy concerns wrt sharing
Filter types
Filter criteria
  Correlative filters
    Equality-based filter
    Structure-based filter
    Statistical filter
  Causal filters
    Filtering criteria derived from attack detection criteria (policy or anomaly)
Filter location
  Input filter
    Easier to deploy but harder to synthesize
  Output filter (precedes sensitive operation)
    Easier to synthesize than input filter, but deployment needs deeper instrumentation
    May be too late for some attacks (memory corruption)
Note: All filters evaluated using large number of benign samples and 1 attack sample