
Intrusion Detection and Malware Analysis
Malware analysis

Pavel Laskov
Wilhelm Schickard Institute for Computer Science

What do we want to know about malware?

Is it recognized by existing antivirus products?
What is its functionality?
How is malware distributed?
What other harmful functions does malware carry out?
What are the relationships between various classes of malware?
Do they share common techniques?
How do they evolve?

How much of malware is unknown?

Experiment:
Current instances of malware were collected from a Nepenthes honeypot.
Files were scanned with Avira AntiVir.

Results:
               detected   undetected
First scan:    76%        24%
Second scan:   85%        15%

After four weeks, 15% of malware instances were still not recognized!

Traditional malware analysis tools

Binary disassemblers (IDA Pro, OllyDbg)
(+) precise understanding of malware behavior
(−) extensive manual effort

Static flow analysers (BinDiff, BinNavi)
(+) visual representation of malware program flow
(−) prone to obfuscation

Modern malware analysis tools

Sandboxes
Behavior analysers
Clustering and classification methods
Signature generation tools

CWSandbox: key ideas

API hooking: intercept Windows API calls by overwriting the API functions’ code loaded in the process memory.
DLL injection: insertion of hook functions into the running processes.
Recording of persistent system changes collected by a hook in an XML report.
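The hook-and-record idea can be illustrated with a user-level analogue in Python. This is only a sketch of the pattern (intercept a call, log its parameters to an XML report, forward to the original), not CWSandbox's implementation; the `api` table, the `copy_file` stand-in, and the attribute names `arg0` and `result` are hypothetical.

```python
import xml.etree.ElementTree as ET


def install_hook(namespace, name, report):
    """Replace namespace[name] with a wrapper that logs each call, then forwards it."""
    original = namespace[name]

    def hook(*args, **kwargs):
        # Record the call and its parameters as one XML element.
        entry = ET.SubElement(report, name)
        for index, value in enumerate(args):
            entry.set(f"arg{index}", str(value))
        for key, value in kwargs.items():
            entry.set(key, str(value))
        # Forward to the original function so behavior is unchanged.
        result = original(*args, **kwargs)
        entry.set("result", str(result))
        return result

    namespace[name] = hook
    return original


def copy_file(srcfile, dstfile):
    """Hypothetical stand-in for an API function; always reports success."""
    return 1


api = {"copy_file": copy_file}          # hypothetical API table
report = ET.Element("calls")
install_hook(api, "copy_file", report)
api["copy_file"]("a.exe", dstfile="b.exe")
xml_text = ET.tostring(report, encoding="unicode")
```

The caller still gets the original return value, while `report` accumulates one element per intercepted call.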

leads to the hook function.
5. To complete the process, hook the LoadLibrary and LoadLibraryEx API functions, which allow the explicit binding of DLLs.

If an application loads the DLL dynamically at runtime, you can use this same procedure to overwrite the function entry points. The CWSandbox carries out these steps in the initialization phase to set up the hooking functions.

Figure 1a shows the original function code for CreateFileA, which is located in kernel32.dll. The instructions are split into two blocks: the first marks the block that we’ll overwrite to delegate control to our hook function; the second block includes the instructions that the API hook won’t touch. Figure 1b shows the situation after we installed the hook. We overwrite the first six bytes of each to-be-analyzed function with a JMP instruction to our hook code. In the hook function, we can save the called API function’s parameters or modify them if necessary. Then, we execute the bytes that we overwrote in the first phase and then JMP back to execute the rest of the API function. There’s no need to call it with a JMP instruction: the hook function can call the original API with a CALL operation and regain control when the called API function performs the RET. The hook function then analyzes the result and modifies it if necessary. Holy Father offers one of the most popular and detailed descriptions of this approach,2 and Microsoft also offers a library called Detours for this purpose (www.research.microsoft.com/sn/detours).
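The in-line overwrite can be simulated on a plain byte buffer. The sketch below is not the CWSandbox code: it assumes a 5-byte relative `JMP` (opcode `E9`) padded with one `NOP` to cover the six overwritten bytes, whereas the article only says that six bytes are replaced by a JMP. The addresses are the ones shown in Figure 1, the six prologue bytes are the standard encodings of `PUSH ebp`, `MOV ebp, esp`, and `PUSH [ebp+8]`, and the filler bytes after the prologue are placeholders.

```python
import struct


def jmp_rel32(src, dst):
    """Encode a 5-byte x86 near JMP placed at address src that lands on dst."""
    rel = dst - (src + 5)          # displacement is relative to the next instruction
    return b"\xE9" + struct.pack("<i", rel)


def jmp_target(src, encoded):
    """Decode the destination of a JMP rel32 placed at address src."""
    (rel,) = struct.unpack("<i", encoded[1:5])
    return src + 5 + rel


def install_inline_hook(code, func_addr, hook_addr, stub_addr, patch_len=6):
    """Patch the first patch_len bytes of code; return the saved stub bytes."""
    overwritten = bytes(code[:patch_len])
    # Saved stub: the overwritten instructions, then a jump to the untouched part.
    saved_stub = overwritten + jmp_rel32(stub_addr + patch_len, func_addr + patch_len)
    # Patch: jump to the hook, padded with NOPs up to the instruction boundary.
    code[:patch_len] = jmp_rel32(func_addr, hook_addr) + b"\x90" * (patch_len - 5)
    return saved_stub


FUNC, HOOK, STUB = 0x77E8C1F7, 0x2005EDB7, 0x21700000   # addresses from Figure 1
prologue = bytearray(b"\x55\x8B\xEC\xFF\x75\x08")        # PUSH ebp; MOV ebp, esp; PUSH [ebp+8]
code = bytearray(prologue + b"\xCC" * 4)                 # filler stands in for the rest
stub = install_inline_hook(code, FUNC, HOOK, STUB)
```

A real hook must cut the patch at an instruction boundary, which is why six bytes (three whole instructions) are saved here rather than five.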

To offer a complete hooking overview, we must mention system service hooking, which occurs at a lower level within the Windows operating system and isn’t considered to be API hooking. Two additional possibilities exist for rerouting API calls: we can modify an entry in the IDT such that Int 0x2e, which is used for invoking system calls, points to the hooking routine, or we can manipulate the entries in the SSDT so that the system calls can be intercepted depending on the service IDs. We don’t use these techniques because API hooking is much easier to implement and delivers more accurate results. In the future, we might extend CWSandbox to use kernel hooks because they’re more complicated to detect.

On a side note, programs that directly call the kernel to avoid using the Windows API can bypass API hooking techniques. However, this is rather uncommon because the malware author must know the target operating system, its service pack level, and other information in advance. Our results show that most malware authors design autonomous-spreading malware to attack large user bases, so they commonly use the Windows API.

DLL code injection
DLL code injection lets us implement API hooking in a modular and reusable way. However, API hooking with inline code overwriting makes it necessary to patch the application after it has been loaded into memory. To be successful, we must copy the hook functions into the target application’s address space so they can be called from within the target (this is the actual code injection) and bootstrap the API hooks in the target application’s address space using a specialized thread in the malware’s memory.

How can we insert the hook functions into the process running the malware sample? It depends on the hooking method we use. In any case, we have to manipulate the target process’s memory: changing the application’s import address table (IAT), changing the loaded DLLs’ export address table (EAT), or directly overwriting the API function code. In Windows, we can implant and install API hook functions by accessing another process’s

34 IEEE SECURITY & PRIVACY ■ MARCH/APRIL 2007

Figure 1. In-line code overwriting. (a) shows the original function code. In (b), the JMP instruction overwrites the API function’s first block (1) and transfers control to our hook function whenever the to-be-analyzed application calls the API function. (2) The hook function performs the desired operations and then calls the original API function’s saved stub. (3) The saved stub performs the overwritten instructions and branches to the original API function’s unmodified part.

(a) Kernel32.dll-CreateFileA (*without* Hook):

    77E8C1F7  PUSH ebp
    77E8C1F8  MOV ebp, esp
    77E8C1FA  PUSH SS:[ebp+8]
    77E8C1FD  CALL +$0000d265
    77E8C202  TEST eax, eax
    77E8C1FD  JNZ
    …         …
    77E8C226  RET

(b) Kernel32.dll-CreateFileA (*with* Hook):

    77E8C1F7  JMP [CreatFileA-Hook]            (1)
    77E8C1FD  CALL +$0000d265
    77E8C202  TEST eax, eax
    77E8C1FD  JNZ +$05
    …         …
    77E8C226  RET

Application.CreatFileA-Hook:                   (2)

    2005EDB7  -custom hook code-
    …         …
    2005EDF0  JMP [CreatFileA-SavedStub]

Application.CreatFileA-SavedStub:              (3)

    21700000  PUSH ebp
    21700001  MOV ebp, esp
    21700003  PUSH SS:[ebp+8]
    21700006  JMP $77E8C1FD

CWSandbox: system architecture

virtual memory and executing code in a different process’s context.

Windows kernel32.dll offers the API functions ReadProcessMemory and WriteProcessMemory, which let the CWSandbox read and write to an arbitrary process’s virtual memory, allocating new memory regions or changing the protection of an already allocated memory region using the VirtualAllocEx and VirtualProtectEx functions.

It’s possible to execute code in another process’s context in at least two ways:

• suspend one of the target application’s running threads, copy the to-be-executed code into the target’s address space, set the resumed thread’s instruction pointer to the copied code’s location, and then resume the thread; or

• copy the to-be-executed code into the target’s address space and create a new thread in the target process with the code location as the start address.

With these building blocks in place, it’s now possible to inject and execute code into another process.

The most popular technique is DLL injection, in which the CWSandbox puts all custom code into a DLL and the hook function directs the target process to load this DLL into its memory. Thus, DLL injection fulfills both requirements for API hooking: the custom hook functions are loaded into the target’s address space, and the API hooks are installed in the DLL’s initialization routine, which the Windows loader calls automatically.

The API functions LoadLibrary or LoadLibraryEx perform the explicit DLL linking; the latter allows more options, whereas the first function’s signature is very simple: the only parameter it needs is a pointer to the DLL name.

The trick is to create a new thread in the target process’s context using the CreateRemoteThread function and then set the code address of the API function LoadLibrary as the newly created thread’s starting address. When the to-be-analyzed application executes the new thread, the LoadLibrary function is called automatically inside the target’s context. Because we know kernel32.dll’s location (always loaded at the same memory address) from our starter application, and know the LoadLibrary function’s code location, we can also use these values for the target application.
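The control flow of this trick can be followed in a toy model that touches no real process. Everything in the sketch is a stand-in: `ToyProcess` and its methods are hypothetical and only mimic the roles of VirtualAllocEx plus WriteProcessMemory, LoadLibrary, and CreateRemoteThread named in the text.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProcess:
    """Hypothetical in-memory model of a target process."""
    memory: dict = field(default_factory=dict)       # address -> bytes
    loaded_dlls: list = field(default_factory=list)
    next_free: int = 0x10000

    def alloc_and_write(self, data):
        # Stand-in for VirtualAllocEx followed by WriteProcessMemory.
        addr = self.next_free
        self.memory[addr] = data
        self.next_free += 0x1000
        return addr

    def load_library(self, name_addr):
        # Stand-in for LoadLibrary: its only parameter is a pointer to the DLL name.
        name = self.memory[name_addr].rstrip(b"\0").decode("ascii")
        self.loaded_dlls.append(name)
        return len(self.loaded_dlls)

    def create_remote_thread(self, start_routine, parameter):
        # Stand-in for CreateRemoteThread: run start_routine(parameter) in the target.
        return start_routine(parameter)


def inject_dll(target, dll_name):
    """Write the DLL name into the target, then start a thread at LoadLibrary."""
    name_addr = target.alloc_and_write(dll_name.encode("ascii") + b"\0")
    return target.create_remote_thread(target.load_library, name_addr)


target = ToyProcess()
inject_dll(target, "cwmonitor.dll")
```

The point of the model is that the injector never copies code itself: it only writes a string and chooses a start address, and the loader in the target does the rest.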

CWSandbox architecture
With the three techniques we described earlier set up, we can now build the CWSandbox system that’s capable of automatically analyzing a malware sample. This system outputs a behavior-based analysis; that is, it executes the malware binary in a controlled environment so that we can observe all relevant function calls to the Windows API, and generates a high-level summarized report from the monitored API calls. The report provides data for each process and its associated actions: one subsection for all accesses to the file system and another for all network operations, for example. One of our focuses is on bot analysis, so we spent considerable effort on extracting and evaluating the network connection data.

After it analyzes the API calls’ parameters, the sandbox routes them back to their original API functions. Therefore, it doesn’t block the malware from integrating itself into the target operating system (copying itself to the Windows system directory, for example, or adding new registry keys). To enable fast automated analysis, we execute the CWSandbox in a virtual environment so that the system can easily return to a clean state after completing the analysis process. This approach has some drawbacks (namely, detectability issues and slower execution), but using CWSandbox in a native environment such as a normal commercial off-the-shelf system with an automated procedure that restores the system to a clean state can help circumvent these drawbacks.

The CWSandbox has three phases: initialization, execution, and analysis. We discuss each phase in more detail in the following sections.

Initialization phase
In the initialization phase, the sandbox, which consists of the cwsandbox.exe application and the cwmonitor.dll DLL, sets up the malware process. This DLL installs the API hooks, realizes the hook functions, and exchanges runtime information with the sandbox.

The DLL’s life cycle is also divided into three phases: initialization, execution, and finishing. The DLL’s main function is to handle the first and last phases; the hook functions handle the execution phase. DLL operations are executed during the initialization and finishing


Figure 2. CWSandbox overview. CWSandbox.exe creates a new process image for the to-be-analyzed malware binary and then injects the cwmonitor.dll into the target application’s address space. With the help of the DLL, we perform API hooking and send all observed behavior via the communication channel back to cwsandbox.exe. We use the same procedure for child or infected processes.

[Diagram. Components: cwsandbox.exe, malware application, and malware application child; each target holds an injected cwmonitor.dll. Edges are labeled “Executes” and “Communication”.]

During the initialization of a malware binary, cwmonitor.dll is injected into its memory to carry out API hooking.
The DLL intercepts all API calls and reports them to CWSandbox.
The same procedure is repeated for any child or infected process.

Case study: Malware classification

Goals:
Detect variations of known malware families.
Detect previously unknown families.
Determine essential features of specific families.

Main ideas:
Use AV scanners to assign labels to malware binaries.
Use machine learning to classify binaries among known malware families.

[Diagram. Stages: Wild, Collection, Monitoring, Classification, AV scanner, Learning. Data passed between stages: exploits, binary, report, label, model.]

Malware classification: an overview

Acquisition of training data: creation of balanced data sets for multi-class classification.
Feature extraction: embedding of reports in a feature space.
Training: optimization and model selection.
Classification: combining results of single-family classifiers.

[Pipeline diagram: Report corpus, Acquisition of training data, Feature extraction, Training, Classification, label.]

Malware classification: data acquisition

Binaries are collected from Nepenthes and from spam-traps, and are analyzed by CWSandbox.
Labels are assigned by running Avira AntiVir. Unrecognized binaries are discarded.
14 classes are considered for training: 1 backdoor, 2 trojans and 11 worms.


Malware classification: feature extraction

Operational features: the set of all strings contained between the delimiters “<” and “>”.
“Wildcarding”: removal of potentially random attributes.

<copy_file filetype="File" srcfile="c:\1ae8b19ecea1b65705595b245f2971ee.exe"
    dstfile="C:\WINDOWS\system32\urdvxc.exe"
    creationdistribution="CREATE_ALWAYS" desiredaccess="FILE_ANY_ACCESS"
    flags="SECURITY_ANONYMOUS" fileinformationclass="FileBasicInformation"/>
<set_value key="HKEY_CLASSES_ROOT\CLSID\{3534943...2312F5C0&}"
    data="lsslwhxtettntbkr"/>
<create_process commandline="C:\WINDOWS\system32\urdvxc.exe /start"
    targetpid="1396" showwindow="SW_HIDE"
    apifunction="CreateProcessA" successful="1"/>
<create_mutex name="GhostBOT0.58b" owned="1"/>
<connection transportprotocol="TCP" remoteaddr="XXX.XXX.XXX.XXX"
    remoteport="27555" protocol="IRC" connectionestablished="1" socket="1780"/>
<irc_data username="XP-2398" hostname="XP-2398" servername="0"
    realname="ADMINISTRATOR" password="r0flc0mz" nick="[P33-DEU-51371]"/>
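A minimal sketch of this feature extraction, assuming each operational feature is the whitespace-normalized text between “<” and “>”. The set `RANDOM_ATTRIBUTES` is a guess at what wildcarding might drop (the slides do not list the attributes), and `extract_features` is a hypothetical helper name.

```python
import re

# Assumption: attributes whose values vary from run to run. Not taken from the slides.
RANDOM_ATTRIBUTES = ("srcfile", "targetpid", "socket")


def extract_features(report, wildcard=True):
    """Return the set of strings found between '<' and '>' in a behavior report."""
    features = set()
    for body in re.findall(r"<([^<>]+)>", report):
        if wildcard:
            for attr in RANDOM_ATTRIBUTES:
                # Remove the attribute together with its quoted value.
                body = re.sub(rf'\s*\b{attr}="[^"]*"', "", body)
        # Collapse line breaks and indentation so layout does not affect the feature.
        features.add(" ".join(body.split()))
    return features


report = r'''<create_process commandline="C:\WINDOWS\system32\urdvxc.exe /start"
    targetpid="1396" showwindow="SW_HIDE"
    apifunction="CreateProcessA" successful="1"/>
<create_mutex name="GhostBOT0.58b" owned="1"/>'''
features = extract_features(report)
```

With wildcarding enabled, two runs that differ only in the process ID yield the same feature, so they land on the same coordinate of the feature space.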


Malware classification: training

For each known malware family, train a Support Vector Machine for separating this family from the others:

\[
\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{M} \xi_i
\quad \text{s.t.} \quad y_i\bigl((w \cdot x_i) + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, M.
\]

Determine the optimal parameter C by cross-validation.
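To make the roles of the slack variables and of C concrete, the sketch below evaluates the objective for a fixed (w, b), using the fact that at the optimum each slack equals max(0, 1 − y_i((w·x_i) + b)). It does not train anything; `svm_objective` and the toy data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def svm_objective(w, b, C, xs, ys):
    """Return (objective value, slacks) of the soft-margin SVM for fixed w and b."""
    # Smallest slack that satisfies y_i((w . x_i) + b) >= 1 - xi_i with xi_i >= 0.
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks


# Toy data: label +1 for the family of interest, -1 for all other families.
xs = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
ys = [1, 1, -1]
objective, slacks = svm_objective([1.0, 0.0], 0.0, 10.0, xs, ys)
```

Only the second point lies inside the margin, so it alone contributes slack; a larger C penalizes such margin violations more heavily, which is why C is chosen by cross-validation.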

Classification of unknown malware binaries

Maximum distance: a label is assigned to a new report based on the highest score among the 14 classifiers:

\[ d(x) = (w \cdot x) + b \]

Maximum “likelihood”: estimate the conditional probability of class “+1” as:

\[ P(y = +1 \mid d(x)) = \frac{1}{1 + \exp(A\,d(x) + B)} \]

where the parameters A and B are estimated by a logistic regression fit on an independent training data set.
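Both decision rules can be sketched in a few lines. The weights, offsets, and sigmoid parameters below are made-up illustrative values, not fitted ones, and the family names are placeholders; A is chosen negative so that a larger score d(x) maps to a higher probability.

```python
import math


def decision(w, b, x):
    """Signed score d(x) = (w . x) + b of one family's classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def platt_probability(d, A, B):
    """Estimate P(y = +1 | d) with the fitted sigmoid 1 / (1 + exp(A*d + B))."""
    return 1.0 / (1.0 + math.exp(A * d + B))


def classify_max_distance(models, x):
    """Pick the family whose classifier gives the highest score."""
    return max(models, key=lambda fam: decision(*models[fam], x))


def classify_max_likelihood(models, sigmoids, x):
    """Pick the family with the highest estimated probability."""
    return max(
        models,
        key=lambda fam: platt_probability(decision(*models[fam], x), *sigmoids[fam]),
    )


models = {"worm": ([1.0, 0.0], 0.0), "trojan": ([0.0, 1.0], -0.5)}   # family -> (w, b)
sigmoids = {"worm": (-5.0, 0.0), "trojan": (-1.0, 0.0)}               # family -> (A, B)
x = [0.2, 1.0]
by_distance = classify_max_distance(models, x)
by_likelihood = classify_max_likelihood(models, sigmoids, x)
```

With these values the two rules disagree: the raw scores favor "trojan" (0.5 versus 0.2), while the per-family sigmoids favor "worm", which shows why calibrating scores before comparing classifiers can change the assigned label.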


Results: known malware instances

Test binaries are drawn from the same 14 families recognized by AntiVir.

[Figure: (a) Accuracy per malware family, titled “Accuracy of classification” (x-axis: malware families 1 to 14; y-axis: accuracy per family, 0 to 1). (b) Confusion of malware families, titled “Confusion matrix for classification” (x-axis: predicted malware families; y-axis: true malware families; color scale 0.1 to 0.9).]

Average accuracy: 88%.
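One plausible way to relate the two panels is to read per-family accuracy off the diagonal of a confusion matrix. The sketch assumes rows are true families and columns are predicted families, and that the average is an unweighted mean over families; the slides do not state either convention, and the counts below are invented.

```python
def per_family_accuracy(confusion):
    """Fraction of each true family (row) that was predicted as that same family."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(confusion)
    ]


# Hypothetical counts for three families (rows: true family, columns: predicted).
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
accuracies = per_family_accuracy(confusion)
average_accuracy = sum(accuracies) / len(accuracies)
```

An unweighted mean treats every family equally regardless of how many test binaries it has, which matters when the test set is unbalanced.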

Results: unknown malware instances

Test binaries are drawn from the same 14 families recognized by AntiVir four weeks later.

[Figure: (c) Accuracy per malware family, titled “Accuracy of prediction” (x-axis: malware families 1, 2, 3, 4, 7, 8, 9, 10, 11, 13, 14; y-axis: accuracy per family, 0 to 1). (d) Confusion of malware families, titled “Confusion matrix for prediction” (x-axis: predicted malware families 1 to 14; y-axis: true malware families 1, 2, 3, 4, 7, 8, 9, 10, 11, 13, 14; color scale 0.1 to 0.9).]

Average accuracy: 69%.

Botzilla: analysis of malware communication

Operational goals:
Generation of signatures for malware communication
Detection of “drop-zones” and compromised machines

[Diagram. Components: malware binary, repetitive execution (sandnet), Internet, bot master, regular traffic, signature generation, signature.]

Execution of malware in a sandnet

A sandnet is a monitored network of sandbox machines with various network configurations.
Captured malware binaries are executed multiple times in a sandnet using various settings (IP addresses, operating systems, date and time).
All outgoing communication is captured with tcpdump.
Security precautions:
    Redirection of SMTP traffic to a network sink
    Blocking of DCOM and SMB requests (known vulnerabilities)

Signature generation

[Diagram. Inputs: malware traffic X+ and regular traffic X−. Stages: token extraction, signature assembly. Output: signature.]

Natural clustering: multiple executions of the same binary
Extraction of (quasi-)distinct tokens using suffix trees
Signature assembly: false positive check on regular traffic (real or simulated)
Signature filtering: elimination of redundant signatures from similar binaries
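The token-extraction idea can be sketched without suffix trees. The code below swaps in brute-force substring enumeration, which is only practical for tiny inputs, and keeps tokens that occur in every malware flow and in no regular flow. The function names and sample flows are hypothetical; the first nick is taken from the report example earlier in the slides.

```python
def substrings(data, min_len):
    """All substrings of data with at least min_len bytes (brute force)."""
    return {
        data[i:j]
        for i in range(len(data))
        for j in range(i + min_len, len(data) + 1)
    }


def extract_tokens(malware_flows, regular_flows, min_len=4):
    """Tokens shared by all malware flows and absent from all regular flows."""
    common = set.intersection(*(substrings(f, min_len) for f in malware_flows))
    # False positive check: discard tokens that also occur in regular traffic.
    distinct = {t for t in common if not any(t in r for r in regular_flows)}
    # Keep only maximal tokens, dropping those contained in a longer one.
    return {t for t in distinct if not any(t != u and t in u for u in distinct)}


malware_flows = [b"NICK [P33-DEU-51371]\r\n", b"NICK [P33-DEU-88120]\r\n"]
regular_flows = [b"NICK alice\r\n", b"GET / HTTP/1.1\r\n"]
tokens = extract_tokens(malware_flows, regular_flows)
```

Here the bare command `NICK` is rejected because it also appears in regular traffic, while the bot-specific nickname prefix survives as the signature token.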

Offline evaluation of Botzilla

Procedure:
Signatures are generated for 43 classes of malware collected from a honeypot.
Test traffic is generated by running the same malware again and merging it with normal traffic from two sources (DARPA and University of Erlangen).

Results:
                            DARPA   Erlangen
Average detection rate      0.81    0.72
Average false alarm rate    0       0.51 · 10^−6

Online evaluation of Botzilla

Procedure:
Generated signatures were deployed on a large network at the University of Erlangen (ca. 50,000 hosts) using a Vermont network monitoring sensor.
Traffic statistics: ca. 840M flows in 4 days.
Alerts were manually evaluated.

Results:
# malicious flows detected    219
# false alarms                295
Detection rate                0.43
False alarm rate              3.51 · 10^−7
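The two rates can be reproduced from the counts, under assumptions the slides do not spell out: that the false alarm rate is false alarms divided by all observed flows (taking “ca. 840M” as exactly 840,000,000), and that the reported 0.43 is the share of alerts that were truly malicious. The second reading is an observation that happens to match the number, not a stated definition.

```python
malicious_flows = 219          # alerts confirmed as malicious
false_alarms = 295             # alerts rejected on manual evaluation
total_flows = 840_000_000      # "ca. 840M flows", assumed exact here

# False alarms relative to all traffic seen by the sensor.
false_alarm_rate = false_alarms / total_flows

# Share of raised alerts that were truly malicious.
alert_precision = malicious_flows / (malicious_flows + false_alarms)
```

Under these assumptions `false_alarm_rate` is about 3.51e-07 and `alert_precision` is about 0.43, matching the table.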

Signature examples

Banload keylogger:

Storm worm:

Lessons learned

Malware analysis significantly facilitates malware understanding and the development of protection mechanisms.
The main technique of malware analysis is execution of malware in a specially instrumented environment.
Further analysis techniques require extensive machine learning and string matching instrumentation.

Recommended reading

Konrad Rieck, Thorsten Holz, Carsten Willems, Patrick Düssel, and Pavel Laskov. Learning and classification of malware behavior. In Detection of Intrusions and Malware, and Vulnerability Assessment, Proc. of 5th DIMVA Conference, pages 108–125, 2008.

Konrad Rieck, Guido Schwenk, Tobias Limmer, Thorsten Holz, and Pavel Laskov. Botzilla: Detecting the “phoning home” of malicious software. In Proc. of 25th ACM Symposium on Applied Computing (SAC), March 2010. (to appear).

Carsten Willems, Thorsten Holz, and Felix Freiling. CWSandbox: Towards automated dynamic binary analysis. IEEE Security and Privacy, 5(2):32–39, 2007.
