
CS 290C: Formal Models for Web Software

Lecture 17: Analyzing Input Validation and Sanitization in Web Applications

Instructor: Tevfik Bultan

Vulnerabilities in Web Applications

• Many web applications contain well-known security vulnerabilities. Here are some examples:
  – Malicious file execution: a malicious user causes the server to execute malicious code
  – SQL injection: a malicious user executes SQL commands on the back-end database by providing specially formatted input
  – Cross-site scripting (XSS): an attacker causes a malicious script to execute in a user's browser
• These vulnerabilities are typically due to
  – errors in user input validation, or
  – lack of user input validation

String Related Vulnerabilities

String-related web application vulnerabilities as a percentage of all vulnerabilities (reported by CVE)

OWASP Top 10 in 2007:
1. Cross Site Scripting
2. Injection Flaws

OWASP Top 10 in 2010:
1. Injection Flaws
2. Cross Site Scripting

Why Is Input Validation Error-prone?

• Extensive string manipulation:
  – Web applications use extensive string manipulation
    • To construct HTML pages, to construct database queries in SQL, etc.
  – The user input comes in string form and must be validated and sanitized before it can be used
    • This requires the use of complex string manipulation functions such as string-replace
  – String manipulation is error-prone

String Related Vulnerabilities

String-related web application vulnerabilities occur when:
• a sensitive function is passed a malicious string input from the user
• this input contains an attack
• it is not properly sanitized before it reaches the sensitive function

String analysis: discover these vulnerabilities automatically

XSS Vulnerability

A PHP Example:

1: <?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5: ?>

• The echo statement in line 4 is a sensitive function
• It contains a Cross Site Scripting (XSS) vulnerability

Example attack input: <script ...

Is It Vulnerable?

• A simple taint analysis can report this segment as vulnerable using taint propagation

1: <?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5: ?>

• $www is tainted
• echo is tainted → script is vulnerable

How to Fix It?

• To fix the vulnerability we added a sanitization routine at line s
• Taint analysis will assume that $www is untainted and report that the segment is NOT vulnerable

1: <?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
s: $www = ereg_replace("[^A-Za-z0-9 .-@://]","",$www);
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5: ?>

• $www is tainted at line 2
• $www is assumed untainted after line s
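The taint propagation described on these two slides can be sketched in a few lines. This is an illustrative Python model, not the tool used in the lecture: the statement encoding and the names `propagate_taint`, `SOURCES`, and `SANITIZERS` are assumptions made for the example.

```python
# Minimal taint-propagation sketch (illustrative only).
# A program is a list of (target, operation, operands) statements.
SOURCES = {"$_GET"}            # values read from here are tainted
SANITIZERS = {"ereg_replace"}  # naive assumption: any sanitizer call untaints

def propagate_taint(program):
    """Return the set of tainted targets after running the statements in order."""
    tainted = set()
    for target, op, operands in program:
        if op in SANITIZERS:
            tainted.discard(target)      # trusts the sanitizer blindly
        elif op in SOURCES or any(o in tainted for o in operands):
            tainted.add(target)
        else:
            tainted.discard(target)      # assigned from untainted data only
    return tainted

original = [
    ("$www", "$_GET", []),
    ("$l_otherinfo", "const", []),
    ("echo", "concat", ["$l_otherinfo", "$www"]),
]
# Same program with the sanitization statement (line s) inserted before echo
patched = original[:2] + [("$www", "ereg_replace", ["$www"])] + original[2:]

print("echo" in propagate_taint(original))  # True: reported vulnerable
print("echo" in propagate_taint(patched))   # False: reported NOT vulnerable
```

The second result holds only because the sketch never inspects what the sanitizer actually does to the string.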

Is It Really Sanitized?

1: <?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
s: $www = ereg_replace("[^A-Za-z0-9 .-@://]","",$www);
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5: ?>

Input: <script …>
Output: <script …>

Sanitization Routines Can Be Erroneous

• The sanitization statement is not correct!

ereg_replace("[^A-Za-z0-9 .-@://]","",$www);

  – Removes all characters that are not in { A-Za-z0-9 .-@:/ }
  – .-@ denotes all characters between "." and "@" (including "<" and ">")
  – ".-@" should be ".\-@"

• This example is from a buggy sanitization routine used in MyEasyMarket-4.1 (line 218 in file trans.php)
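The range bug can be reproduced outside PHP. The sketch below uses Python's `re` module as a stand-in for `ereg_replace`; for these ASCII bracket expressions the range semantics are the same, but the variable names and the sample input are assumptions made for illustration.

```python
import re

buggy = r"[^A-Za-z0-9 .-@:/]"   # ".-@" is a range from "." (0x2E) to "@" (0x40)
fixed = r"[^A-Za-z0-9 .\-@:/]"  # escaped "-" is a literal hyphen

attack = "<script>alert(1)</script>"
print(re.sub(buggy, "", attack))  # <script>alert1</script>
print(re.sub(fixed, "", attack))  # scriptalert1/script

# Characters admitted by the accidental range
print("".join(chr(c) for c in range(ord("."), ord("@") + 1)))  # ./0123456789:;<=>?@
```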

String Analysis

• String analysis determines all possible values that a string expression can take during any program execution
• Using string analysis we can identify all possible input values of the sensitive functions
  – Then we can check if inputs of sensitive functions can contain attack strings
• How can we characterize attack strings?
  – Use regular expressions to specify the attack patterns
  – Attack pattern for XSS: Σ*<scriptΣ*

Vulnerabilities Can Be Tricky

• Input <!sc+rip!t ...> does not match the attack pattern
  – but it matches the vulnerability signature and it can cause an attack

1: <?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
s: $www = ereg_replace("[^A-Za-z0-9 .-@://]","",$www);
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5: ?>

Input: <!sc+rip!t …>
Output: <script …>

String Analysis

• If string analysis determines that the intersection of the attack pattern and the possible inputs of the sensitive function is empty
  – then we can conclude that the program is secure
• If the intersection is not empty, then we can again use string analysis to generate a vulnerability signature
  – it characterizes all malicious inputs

Given Σ*<scriptΣ* as an attack pattern, the vulnerability signature for $_GET["www"] is

Σ*<α*sα*cα*rα*iα*pα*tΣ*

where α ∉ { A-Za-z0-9 .-@:/ }
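The signature can be checked on the tricky input from the previous slide. This is a sketch in Python, with `re` standing in for the PHP regex engine; `ALPHA`, `signature`, `sanitize`, and the sample input are names and values chosen for illustration.

```python
import re

ALPHA = r"[^A-Za-z0-9 .-@:/]"  # characters deleted by the buggy sanitizer
attack_pattern = re.compile(r"<script", re.S)
# Deleted characters may be interleaved anywhere inside "<script"
signature = re.compile("<" + "".join(ALPHA + "*" + ch for ch in "script"), re.S)

def sanitize(s):
    """Python analogue of the ereg_replace call at line s."""
    return re.sub(ALPHA, "", s)

tricky = "<!sc+rip!t src=x>"
print(bool(attack_pattern.search(tricky)))            # False
print(bool(signature.search(tricky)))                 # True
print(sanitize(tricky))                               # <script src=x>
print(bool(attack_pattern.search(sanitize(tricky))))  # True
```

Using `search` rather than `fullmatch` plays the role of the leading and trailing Σ*.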

Automata-based String Analysis

• Finite state automata can be used to characterize sets of string values
• We use automata-based string analysis
  – Associate each string expression in the program with an automaton
  – The automaton accepts an over-approximation of all possible values that the string expression can take during program execution
• Using this automata representation, symbolically execute the program, paying attention only to string manipulation operations

Input Validation Verification Stages

[Figure: analysis pipeline.
Inputs: Application/Scripts, Attack Patterns.
Stages: Parser/Taint Analysis, Vulnerability Analysis, Signature Generation, Patch Synthesis.
Artifacts shown: (Tainted) Dependency Graphs, Reachable Attack Strings, Vulnerability Signature, Sanitization Statements.]

Combining Forward & Backward Analyses

• Convert PHP programs to dependency graphs
• Combine symbolic forward and backward reachability analyses
• Forward analysis
  – Assume that the user input can be any string
  – Propagate this information on the dependency graph
  – When a sensitive function is reached, intersect with the attack pattern
• Backward analysis
  – If the intersection is not empty, propagate the result backwards to identify which inputs can cause an attack

Dependency Graphs

Given a PHP program, first construct the dependency graph

1: <?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
4: $www = ereg_replace("[^A-Za-z0-9 .-@://]","",$www);
5: echo $l_otherinfo . ": " . $www;
6: ?>

Dependency graph (each node is labeled "expression, line"; children are the values a node depends on):

echo, 5
  str_concat, 5
    str_concat, 5
      $l_otherinfo, 3
        "URL", 3
      ": ", 5
    $www, 4
      preg_replace, 4
        [^A-Za-z0-9 .-@://], 4
        "", 4
        $www, 2
          $_GET[www], 2

Forward Analysis

• Using the dependency graph we conduct vulnerability analysis
• Automata-based forward symbolic analysis identifies the possible values of each node
  – Each node in the dependency graph is associated with a DFA
    • The DFA accepts an over-approximation of the string values that the string expression represented by that node can take at runtime
    • The DFAs for the input nodes accept Σ*
  – Intersecting the DFA for the sink nodes with the DFA for the attack pattern identifies the vulnerabilities
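The intersection check at a sink can be sketched with explicit DFAs and a product construction. This is a toy Python illustration over a three-letter alphabet; the dictionary encoding and the names `intersect_witness` and `any_string_over` are assumptions, and the lecture's tool uses a symbolic representation instead.

```python
from collections import deque

def intersect_witness(dfa_a, dfa_b, alphabet):
    """Search the product of two DFAs; return a shortest string accepted by both, or None."""
    start = (dfa_a["start"], dfa_b["start"])
    queue, seen = deque([(start, "")]), {start}
    while queue:
        (p, q), word = queue.popleft()
        if p in dfa_a["accept"] and q in dfa_b["accept"]:
            return word
        for ch in alphabet:
            nxt = (dfa_a["delta"].get((p, ch)), dfa_b["delta"].get((q, ch)))
            if None not in nxt and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + ch))
    return None  # the intersection is empty

ALPHABET = "ab<"

def any_string_over(chars):
    """DFA accepting chars* (one accepting state with self-loops)."""
    return {"start": 0, "accept": {0}, "delta": {(0, c): 0 for c in chars}}

# Attack pattern Σ*<Σ*: state 1 is reached once "<" has been seen.
attack = {"start": 0, "accept": {1},
          "delta": {**{(0, c): 0 for c in "ab"}, (0, "<"): 1,
                    **{(1, c): 1 for c in ALPHABET}}}

buggy_sink = any_string_over("ab<")  # sanitizer lets "<" through
fixed_sink = any_string_over("ab")   # sanitizer removes "<"

print(intersect_witness(buggy_sink, attack, ALPHABET))  # <
print(intersect_witness(fixed_sink, attack, ALPHABET))  # None
```

A non-`None` result is a concrete attack string that can reach the sink.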

Forward Analysis

• Forward analysis uses post-image computations of string operations:
  – postConcat(M1, M2) returns M, where M = M1.M2
  – postReplace(M1, M2, M3) returns M, where M = replace(M1, M2, M3)

Forward Analysis

Forward values on the dependency graph:

echo, 5                          Forward = URL: [A-Za-z0-9 .-@/]*
  str_concat, 5                  Forward = URL: [A-Za-z0-9 .-@/]*
    str_concat, 5                Forward = URL:
      $l_otherinfo, 3            Forward = URL
        "URL", 3                 Forward = URL
      ": ", 5                    Forward = :
    $www, 4                      Forward = [A-Za-z0-9 .-@/]*
      preg_replace, 4            Forward = [A-Za-z0-9 .-@/]*
        [^A-Za-z0-9 .-@://], 4   Forward = [^A-Za-z0-9 .-@/]
        "", 4                    Forward = ε
        $www, 2                  Forward = Σ*
          $_GET[www], 2          Forward = Σ*

Attack Pattern = Σ*<Σ*

L(Σ*<Σ*) ∩ L(URL: [A-Za-z0-9 .-@/]*) = L(URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*) ≠ Ø

Result Automaton

[Figure: automaton accepting URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*. Edge labels: U, R, L, :, Space, [A-Za-z0-9 .-;=-@/], <, [A-Za-z0-9 .-@/].]

Symbolic Automata Representation

• Compact representation:
  – Canonical form and
  – Shared BDD nodes
• Efficient MBDD manipulations:
  – Union, intersection, and emptiness checking
  – Projection and minimization

Symbolic Automata Representation

[Figure: two panels, "Explicit DFA representation" and "Symbolic DFA representation".]

Widening

• The string verification problem is undecidable
• The forward fixpoint computation is not guaranteed to converge in the presence of loops and recursion
• We want to compute a sound approximation
  – During fixpoint computation we compute an over-approximation of the least fixpoint that corresponds to the reachable states
• We use an automata-based widening operation to over-approximate the fixpoint
  – The widening operation over-approximates the union operations and accelerates the convergence of the fixpoint computation

Widening

Given a loop such as

1: <?php
2: $var = "head";
3: while (. . .) {
4:   $var = $var . "tail";
5: }
6: echo $var;
7: ?>

Our forward analysis with widening would compute that the value of the variable $var in line 6 is (head)(tail)*
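To see why widening is needed here, the sketch below enumerates the exact value sets a naive fixpoint iteration would produce for this loop and checks them against the widened result. It is a Python illustration using finite sets and a regular expression, not the automata-based widening operator itself; `naive_iterates` is a name chosen for the example.

```python
import re

def naive_iterates(steps):
    """Yield the exact value sets of $var after 0, 1, 2, ... loop iterations."""
    values = {"head"}
    for _ in range(steps):
        yield set(values)
        values = values | {v + "tail" for v in values}  # join with one more iteration

widened = re.compile(r"head(tail)*")

iterates = list(naive_iterates(5))
print([len(s) for s in iterates])  # [1, 2, 3, 4, 5]
print(all(widened.fullmatch(v) for s in iterates for v in s))  # True
```

Each iterate is strictly larger than the previous one, so the naive iteration never stabilizes, while every value it produces is already covered by (head)(tail)*.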

Backward Analysis

• A vulnerability signature is a characterization of all malicious inputs that can be used to generate attack strings
• We identify vulnerability signatures using an automata-based backward symbolic analysis starting from the sink node
• Pre-image computations on string operations:
  – preConcatPrefix(M, M2) returns M1, where M = M1.M2
  – preConcatSuffix(M, M1) returns M2, where M = M1.M2
  – preReplace(M, M2, M3) returns M1, where M = replace(M1, M2, M3)
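The concatenation pre-images can be illustrated with finite sets of strings standing in for automata. This is a simplification made for the example: the function names below mirror the slide, but their set-based definitions and the sample values are assumptions, and preReplace is omitted.

```python
def post_concat(m1, m2):
    """Post-image of concatenation: every a + b with a in m1 and b in m2."""
    return {a + b for a in m1 for b in m2}

def pre_concat_prefix(m, m2):
    """All prefixes p such that p + s is in m for some s in m2."""
    return {w[:len(w) - len(s)] for w in m for s in m2 if w.endswith(s)}

def pre_concat_suffix(m, m1):
    """All suffixes s such that p + s is in m for some p in m1."""
    return {w[len(p):] for w in m for p in m1 if w.startswith(p)}

prefix = {"URL: "}
inputs = {"abc", "a<b", "<"}
sink = post_concat(prefix, inputs)               # forward step
bad_at_sink = {w for w in sink if "<" in w}      # intersect with Σ*<Σ*
print(sorted(pre_concat_suffix(bad_at_sink, prefix)))  # ['<', 'a<b']
```

The backward step recovers exactly the inputs that lead to an attack string at the sink.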

Backward Analysis

Forward and backward values on the dependency graph:

echo, 5
  Forward  = URL: [A-Za-z0-9 .-@/]*
  Backward = URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
str_concat, 5 (outer)
  Forward  = URL: [A-Za-z0-9 .-@/]*
  Backward = URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
$www, 4
  Forward  = [A-Za-z0-9 .-@/]*
  Backward = [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
preg_replace, 4
  Forward  = [A-Za-z0-9 .-@/]*
  Backward = [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
$www, 2
  Forward  = Σ*
  Backward = [^<]*<Σ*
$_GET[www], 2
  Forward  = Σ*
  Backward = [^<]*<Σ*
"", 4
  Forward  = ε
  Backward = Do not care
[^A-Za-z0-9 .-@://], 4
  Forward  = [^A-Za-z0-9 .-@/]
  Backward = Do not care
str_concat, 5 (inner)
  Forward  = URL:
  Backward = Do not care
$l_otherinfo, 3
  Forward  = URL
  Backward = Do not care
"URL", 3
  Forward  = URL
  Backward = Do not care
": ", 5
  Forward  = :
  Backward = Do not care

Node labels shown in the figure: node 3, node 6, node 10, node 11, node 12

Vulnerability Signature = [^<]*<Σ*

Vulnerability Signature Automaton

[Figure: automaton accepting [^<]*<Σ*. Edge labels: [^<], <, Σ, Non-ASCII.]

Vulnerability Signatures

• The vulnerability signature is the backward analysis result at the input node, which includes all possible malicious inputs
• An input that does not match this signature cannot exploit the vulnerability
• After generating the vulnerability signature
  – Can we generate a patch based on the vulnerability signature?

[Figure: the vulnerability signature automaton for the running example. Edge labels: [^<], <, Σ.]

Patches from Vulnerability Signatures

• Main idea:
  – Given a vulnerability signature automaton, find a cut that separates initial and accepting states
  – Remove the characters in the cut from the user input to sanitize
• This means that if we just delete "<" from the user input, then the vulnerability can be removed

[Figure: the vulnerability signature automaton with edge labels [^<], <, Σ. The min-cut is {<}.]

Patches from Vulnerability Signatures

• Ideally, we want to modify the input (as little as possible) so that it does not match the vulnerability signature
• Given a DFA, an alphabet cut is
  – a set of characters such that, after "removing" the edges that are associated with the characters in the set, the modified DFA does not accept any non-empty string
• Finding a minimal alphabet cut of a DFA is an NP-hard problem (one can reduce the vertex cover problem to this problem)
  – We can use a min-cut algorithm instead
  – The set of characters that are associated with the edges of the min cut is an alphabet cut
    • but not necessarily the minimum alphabet cut
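The definition of an alphabet cut can be tried out on the running example. The sketch below is a brute-force search over character subsets, which is only feasible for a toy alphabet and is not the min-cut algorithm the slides refer to; the edge-list encoding and function names are assumptions.

```python
from itertools import combinations

def accepts_nonempty(edges, start, accepting, removed):
    """True if an accepting state is reachable via at least one edge whose label is not removed."""
    frontier, seen = [start], set()
    while frontier:
        state = frontier.pop()
        for (src, label, dst) in edges:
            if src == state and label not in removed:
                if dst in accepting:
                    return True
                if dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return False

def smallest_alphabet_cut(edges, start, accepting, alphabet):
    """Try subsets by increasing size; the first one that blocks all non-empty strings is minimum."""
    for size in range(len(alphabet) + 1):
        for cut in combinations(sorted(alphabet), size):
            if not accepts_nonempty(edges, start, accepting, set(cut)):
                return set(cut)

# Signature automaton for [^<]*<Σ* over the toy alphabet {a, b, <}
EDGES = [(0, "a", 0), (0, "b", 0), (0, "<", 1),
         (1, "a", 1), (1, "b", 1), (1, "<", 1)]
print(smallest_alphabet_cut(EDGES, 0, {1}, {"a", "b", "<"}))  # {'<'}
```

The exponential number of subsets is what the min-cut heuristic avoids on realistic alphabets.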

Automatically Generated Patch

The automatically generated patch will make sure that no string that matches the attack pattern reaches the sensitive function

<?php
if (preg_match('/[^<]*<.*/', $_GET["www"]))
  $_GET["www"] = preg_replace('/</', "", $_GET["www"]);
$www = $_GET["www"];
$l_otherinfo = "URL";
$www = ereg_replace("[^A-Za-z0-9 .-@://]","",$www);
echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
?>


Experiments

• Application of this approach to five vulnerable input sanitization routines from three open source web applications:

(1) MyEasyMarket-4.1: A shopping cart program

(2) BloggIT-1.0: A blog engine

(3) proManager-0.72: A project management system

• We used the following XSS attack pattern:

Σ∗<scriptΣ∗

Forward Analysis Results

• The dependency graphs of these benchmarks are simplified based on the sinks
  – Unrelated parts are removed using slicing

Input                              Results
#nodes  #edges  #sinks  #inputs    Time(s)  Mem (kb)  #states/#bdds
21      20      1       1          0.08     2599      23/219
29      29      1       1          0.53     13633     48/495
25      25      1       2          0.12     1955      125/1200
23      22      1       1          0.12     4022      133/1222
25      25      1       1          0.12     3387      125/1200

Backward Analysis Results

• We use the backward analysis to generate the vulnerability signatures
  – Backward analysis starts from the vulnerable sinks identified during forward analysis

Input                              Results
#nodes  #edges  #sinks  #inputs    Time(s)  Mem (kb)  #states/#bdds
21      20      1       1          0.46     2963      9/199
29      29      1       1          41.03    1859767   811/8389
25      25      1       2          2.35     5673      20/302, 20/302
23      22      1       1          2.33     32035     91/1127
25      25      1       1          5.02     14958     20/302

Alphabet Cuts

• We generate alphabet cuts from the vulnerability signatures using a min-cut algorithm
• Problem: When there are two user inputs the patch will block everything and delete everything
  – Overlooks the relations among input variables (e.g., the concatenation of two inputs contains <SCRIPT)

Input                              Results
#nodes  #edges  #sinks  #inputs    Alphabet cut
21      20      1       1          {<}
29      29      1       1          {S, ', "}
25      25      1       2          Σ, Σ   (vulnerability signature depends on two inputs)
23      22      1       1          {<, ', "}
25      25      1       1          {<, ', "}

Relational String Analysis

• Instead of using multiple single-track DFAs, use one multi-track DFA
  – Each track represents the values of one string variable
• Using multi-track DFAs:
  – Identifies the relations among string variables
  – Generates relational vulnerability signatures for multiple user inputs of a vulnerable application
  – Improves the precision of the path-sensitive analysis
  – Proves properties that depend on relations among string variables, e.g., $file = $usr.txt

Multi-track Automata

• Let X (the first track) and Y (the second track) be two string variables
• λ is a padding symbol
• A multi-track automaton that encodes X = Y.txt

[Figure: automaton with a self-loop labeled (a,a), (b,b), … followed by transitions (t,λ), (x,λ), (t,λ).]
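The two-track encoding can be simulated directly. The sketch below reads X = Y.txt as "X is Y followed by the characters t, x, t", matching the three padded transitions in the figure; the helper names `align` and `accepts_x_eq_y_txt` are assumptions made for this Python illustration.

```python
PAD = "λ"

def align(x, y):
    """Zip two strings into a two-track word, right-padding the shorter one with λ."""
    n = max(len(x), len(y))
    return list(zip(x.ljust(n, PAD), y.ljust(n, PAD)))

def accepts_x_eq_y_txt(word):
    """Run the two-track automaton for X = Y.txt on a list of (x, y) symbol pairs."""
    suffix, state = "txt", 0            # state k: k suffix symbols consumed
    for a, b in word:
        if state == 0 and a == b and a != PAD:
            continue                    # self-loop on (c, c)
        if state < len(suffix) and (a, b) == (suffix[state], PAD):
            state += 1                  # transitions (t,λ), (x,λ), (t,λ)
        else:
            return False
    return state == len(suffix)

print(align("abtxt", "ab"))  # [('a', 'a'), ('b', 'b'), ('t', 'λ'), ('x', 'λ'), ('t', 'λ')]
print(accepts_x_eq_y_txt(align("abtxt", "ab")))   # True
print(accepts_x_eq_y_txt(align("abtxt", "abt")))  # False
```

Because λ appears only at the end of a track, the symbols (t,t) and (t,λ) are distinct and the automaton stays deterministic.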

Relational Vulnerability Signature

• We perform forward analysis using multi-track automata to generate relational vulnerability signatures
• Each track represents one user input
  – An auxiliary track represents the values of the current node
  – We intersect the auxiliary track with the attack pattern upon termination

Relational Vulnerability Signature

• Consider a simple example having multiple user inputs

<?php
$www = $_GET["www"];
$url = $_GET["url"];
echo $url . $www;
?>

• Let the attack pattern be Σ*<Σ*

Relational Vulnerability Signature

• A multi-track automaton: ($url, $www, aux)
• Identifies the fact that the concatenation of two inputs contains <

[Figure: three-track automaton. Edge labels: (a,λ,a),(b,λ,b),… and (<,λ,<) for symbols read from $url; (λ,a,a),(λ,b,b),… and (λ,<,<) for symbols read from $www.]

Relational Vulnerability Signature

• Project away the auxiliary variable
• Find the min-cut
• This min-cut identifies the alphabet cuts {<} for the first track ($url) and {<} for the second track ($www)

[Figure: two-track automaton after projection. Edge labels: (a,λ),(b,λ),… and (<,λ) for symbols read from $url; (λ,a),(λ,b),… and (λ,<) for symbols read from $www. The min-cut is {<},{<}.]

Patch for Multiple Inputs

• Patch: If the inputs match the signature, delete the characters in the alphabet cut from each input

<?php
if (preg_match('/[^<]*<.*/', $_GET["url"].$_GET["www"]))
{
  $_GET["url"] = preg_replace('/</', "", $_GET["url"]);
  $_GET["www"] = preg_replace('/</', "", $_GET["www"]);
}
$www = $_GET["www"];
$url = $_GET["url"];
echo $url . $www;
?>
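The behavior of this patch can be checked on sample inputs. The sketch below is a Python analogue written for illustration; the function name `patch` and the sample inputs are assumptions, and the signature and cut are taken from the slide.

```python
import re

SIGNATURE = re.compile(r"[^<]*<.*", re.S)  # relational signature on $url . $www

def patch(url, www):
    """Delete the alphabet cut {<} from both inputs when their concatenation matches."""
    if SIGNATURE.match(url + www):
        url, www = url.replace("<", ""), www.replace("<", "")
    return url, www

print(patch("http://a/", "b"))  # ('http://a/', 'b')
print(patch("x<scr", "ipt>"))   # ('xscr', 'ipt>')
```

Benign inputs pass through unchanged, and only the characters in the cut are removed from matching inputs.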

Technical Issues

• To conduct relational string analysis, we need to compute the "intersection" of multi-track automata
  – Aligned multi-track automata are closed under intersection
    • λs are right justified in all tracks, e.g., abλλ instead of aλbλ
  – However, there exist unaligned multi-track automata that are not describable by aligned ones
  – We propose an alignment algorithm that constructs aligned automata which over- or under-approximate unaligned ones

Other Technical Issues

• Modeling word equations:
  – Intractability of X = cZ:
    • The number of states of the corresponding aligned multi-track DFA is exponential in the length of c
  – Irregularity of X = YZ:
    • X = YZ is not describable by an aligned multi-track automaton
• Use a conservative analysis
  – Construct multi-track automata that over- or under-approximate the word equations

Composite Analysis

• What I have talked about so far focuses only on string contents
  – It does not handle constraints on string lengths
  – It cannot handle comparisons among integer variables and string lengths
• String analysis techniques can be extended to analyze systems that have unbounded string and integer variables
• Need to use a composite static analysis approach that combines string analysis and size analysis

Size Analysis

• Size analysis: The goal of size analysis is to provide properties about string lengths
  – It can be used to discover buffer overflow vulnerabilities
• Integer analysis: At each program point, statically compute the possible states of the values of all integer variables
  – These infinite states are symbolically over-approximated as linear arithmetic constraints that can be represented as an arithmetic automaton
• Integer analysis can be used to perform size analysis by representing lengths of string variables as integer variables

An Example

• Consider the following segment:

1: <?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
4: $www = ereg_replace("[^A-Za-z0-9 ./-@://]","",$www);
5: if (strlen($www) < $limit)
6:   echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
7: ?>

• If we perform size analysis solely, after line 4, we do not know the length of $www
• If we perform string analysis solely, at line 5, we cannot check/enforce the branch condition

Composite Analysis

• We need a composite analysis that combines string analysis with size analysis
  – Challenge: How to transfer information between string automata and arithmetic automata?
• A string automaton is a single-track DFA that accepts a regular language; the lengths of the strings in that language form a semi-linear set
  – For example: {4, 6} ∪ {2 + 3k | k ≥ 0}
• The unary encoding of a semi-linear set is uniquely identified by a unary automaton
• The unary automaton can be constructed by replacing the alphabet of a string automaton with a unary alphabet
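The unary projection can be simulated by ignoring edge labels and tracking which states are reachable after each step. The sketch below is a bounded Python illustration on the (head)(tail)* automaton from the widening slide; the edge-list encoding and the name `accepted_lengths` are assumptions.

```python
def accepted_lengths(edges, start, accepting, bound):
    """Lengths n <= bound such that some string of length n is accepted.

    Replacing every edge label by a single symbol gives a unary NFA;
    tracking the set of reachable states per step simulates it.
    """
    current, lengths = {start}, []
    for n in range(bound + 1):
        if current & accepting:
            lengths.append(n)
        current = {dst for (src, _label, dst) in edges if src in current}
    return lengths

# DFA for (head)(tail)*: states 0..4 spell "head", states 4..7 loop on "tail"
EDGES = [(0, "h", 1), (1, "e", 2), (2, "a", 3), (3, "d", 4),
         (4, "t", 5), (5, "a", 6), (6, "i", 7), (7, "l", 4)]
print(accepted_lengths(EDGES, 0, {4}, 20))  # [4, 8, 12, 16, 20]
```

The output is the bounded prefix of the semi-linear set {4 + 4k | k ≥ 0}.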


Arithmetic Automata

• An arithmetic automaton is a multi-track DFA, where each track represents the value of one variable over a binary alphabet

• If the language of an arithmetic automaton satisfies a Presburger formula, the value of each variable forms a semi-linear set

• The semi-linear set is accepted by the binary automaton that projects away all other tracks from the arithmetic automaton

Connecting the Dots

• There are algorithms to convert unary automata to binary automata and vice versa
• Using these conversion algorithms we can conduct a composite analysis that subsumes size analysis and string analysis

[Figure: String Automata, Unary Length Automata, Binary Length Automata, and Arithmetic Automata, connected in that order.]

Case Study

Schoolmate 1.5.4
• Number of PHP files: 63
• Lines of code: 8181

Forward analysis results

Time        Memory  Number of XSS sensitive sinks  Number of XSS vulnerabilities
22 minutes  281 MB  898                            153

Actual vulnerabilities  False positives
105                     48

Case Study – False Positives

• Why false positives?
  – Path insensitivity: 39
    • Path to vulnerable program point is not feasible
  – Un-modeled built-in PHP functions: 6
  – Unfound user-written functions: 3
    • PHP programs have more than one execution entry point
• We can remove all these false positives by extending the analysis to a path-sensitive analysis and modeling more PHP functions
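The counts on the two case-study slides are consistent with each other, which a short check makes explicit. The variable names are chosen for this sketch; all numbers come from the slides.

```python
sinks, reported, actual, false_positives = 898, 153, 105, 48
causes = {"path insensitivity": 39,
          "un-modeled built-in functions": 6,
          "unfound user-written functions": 3}

# Reported vulnerabilities split into actual ones and false positives
assert actual + false_positives == reported
# The three causes account for every false positive
assert sum(causes.values()) == false_positives

print(f"{reported / sinks:.1%} of sensitive sinks are reported vulnerable")  # 17.0%
print(f"{false_positives / reported:.1%} of reports are false positives")   # 31.4%
```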


Case Study - Sanitization

After patching all actual vulnerabilities by adding automated sanitization routines we can run string analysis again

When string analysis is used on the automatically generated patches, it shows that the patches are correct with respect to the attack pattern

String Analysis

• String analysis based on context free grammars: [Christensen et al., SAS’03] [Minamide, WWW’05]
• String analysis based on symbolic/concolic execution: [Bjorner et al., TACAS’09], [Saxena et al., S&P’10]
• Bounded string analysis: [Kiezun et al., ISSTA’09]
• Automata based string analysis: [Xiang et al., COMPSAC’07] [Shannon et al., MUTATION’07], [Balzarotti et al., S&P’08], [Yu et al., SPIN’08, CIAA’10], [Hooimeijer et al., Usenix’11]
• Application of string analysis to web applications: [Wassermann and Su, PLDI’07, ICSE’08] [Halfond and Orso, ASE’05, ICSE’06]
• String analysis for JavaScript: [Saxena et al., S&P’10], [Alkhalaf et al., ICSE’12, ISSTA’12]

String Analysis

• Size Analysis
  – Size analysis: [Hughes et al., POPL’96] [Chin et al., ICSE’05] [Yu et al., FSE’07] [Yang et al., CAV’08]
  – Composite analysis: [Bultan et al., TOSEM’00] [Xu et al., ISSTA’08] [Gulwani et al., POPL’08] [Halbwachs et al., PLDI’08], [Yu et al., TACAS’09]
• Vulnerability Signature Generation
  – Test input/attack generation: [Wassermann et al., ISSTA’08] [Kiezun et al., ICSE’09]
  – Vulnerability signature generation: [Brumley et al., S&P’06] [Brumley et al., CSF’07] [Costa et al., SOSP’07]
  – Vulnerability signature generation and patch generation: [Yu et al., ASE’09, ICSE’11]