Bioinformatics Techniques for Metamorphic Malware Analysis and Detection: Grijesh

Upload: grijesh-chauhan

Post on 18-Oct-2015


DESCRIPTION

ABSTRACT: Modern malware that is metamorphic or polymorphic in nature mutates its code by employing code obfuscation and encryption methods to thwart detection. Thus, conventional signature based scanners fail to detect such malware. In order to address the problem of detecting known variants of metamorphic malware, we propose a method using bioinformatics techniques that are effectively used for protein and DNA matching. Instead of using exact signature matching methods, more sophisticated signature(s) are extracted using multiple sequence alignment (MSA). The results show that the proposed method is capable of identifying malware variants with minimum false alarms and misses. Also, the detection rate achieved with our proposed method is better than that of the commercial antivirus products used in the study.

Status: This work has been accepted by 8th IEEE International Conference on Innovations in Information Technology (Innovations'12).

Link: http://ieeexplore.ieee.org/xpl/login.jsp?reload=true&tp=&arnumber=6207739&url=http://ieeexplore.ieee.org/iel5/6203543/6207707/06207739.pdf?arnumber=6207739

TRANSCRIPT

  • 5/28/2018 Bioinformatics Techniques for Metamorphic Malware Analysis and Detection: Grijesh


    A

    M.Tech DISSERTATION REPORT

    on

    BioInformatics Techniques for Metamorphic Malware Analysis and Detection

    Submitted for partial fulfillment for the degree of

    Master of Technology

    (Computer Engineering)

    in

    Department of Computer Engineering

    (June-2011)

    Supervisors: Dr. Vijay Laxmi, Dr. Manoj Singh Gaur

    By: Grijesh Chauhan (2009PCP116)

    MALAVIYA NATIONAL INSTITUTE OF TECHNOLOGY JAIPUR


    Department of Computer Engineering

    Malaviya National Institute of Technology Jaipur

    Rajasthan - 302017

    CERTIFICATE

    This is to certify that the Dissertation Report on "BioInformatics Techniques for Metamorphic Malware Detection" by Grijesh Chauhan is the work completed under my supervision, and is hence approved for submission in partial fulfillment for the Master of Technology in Computer Engineering during the academic session 2009-2011.

    (Dr. Vijay Laxmi)
    Reader and Head of Department
    M.N.I.T., Jaipur
    Date:

    (Dr. M. S. Gaur)
    Professor
    M.N.I.T., Jaipur
    Date:


    Declaration

    I, Grijesh Chauhan, declare that this Dissertation titled "BioInformatics Techniques for Metamorphic Malware Analysis and Detection" and the work presented in it are my own. I confirm that:

    This work was done wholly or mainly while in candidature for an M.Tech. degree at MNIT.

    Where any part of this Dissertation has previously been submitted for a degree or any other qualification at MNIT or any other institution, this has been clearly stated.

    Where I have consulted the published work of others, this is always clearly attributed.

    Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this Dissertation is entirely my own work.

    I have acknowledged all main sources of help.

    Signed:

    Date:



    Abstract

    Modern malware that is metamorphic or polymorphic in nature mutates its code by employing code obfuscation and encryption methods to thwart detection. Conventional signature based scanners fail to detect such malware. Also, a signature based scanner requires frequent updates, and the size of its signature database increases exponentially. To address the problem of detecting known variants of metamorphic malware, we propose a method known as MetamOrphic Malware Exploration Techniques using MSA (MOMENTUM), which uses bioinformatics techniques developed for protein and DNA matching. Instead of a fixed signature, more sophisticated signature(s) are extracted using multiple sequence alignment (MSA). Experiments are conducted over an obfuscated malware data set collected from VX Heavens, tools and user agencies, and over benign samples gathered from a fresh installation of the Windows XP operating system, Cygwin etc. The data set is segregated into two parts: one for modeling signatures and the other reserved for testing. The results show that the proposed method is capable of identifying malware variants with minimum false alarms and misses.


    Acknowledgements

    I take immense pleasure in expressing my deep and sincere gratitude to my esteemed guides, Dr. Vijay Laxmi (Head of the Department, Department of Computer Engineering, Malaviya National Institute of Technology) and Dr. Manoj Singh Gaur (Professor, Department of Computer Engineering, Malaviya National Institute of Technology), for their invaluable guidance and for the precious hours they spent on my work. Their excellent cooperation and suggestions, through stimulating and beneficial discussions, provided me with an impetus to work and made the completion of this work possible.

    My sincere thanks to all faculty members of the Department of Computer Engineering, MNIT Jaipur, for their constant support and for imparting the best knowledge in the M.Tech course.

    I would like to thank all non-teaching staff members of the Department of Computer Engineering, Malaviya National Institute of Technology, Jaipur, and all those people whose favors I have received in completing this Dissertation work.

    I will always be indebted to my parents for their support and prayers in completing this work successfully. I thank my friends who have directly or indirectly contributed by giving their valuable suggestions.

    Signed:

    Date:



    Contents

    Declaration
    Abstract
    Acknowledgements
    List of Figures
    List of Tables

    1 Introduction
        1.1 Motivation
        1.2 Objective
        1.3 Related Work
        1.4 Contributions of Thesis
        1.5 Outline

    2 Malware and Types
        2.1 Types of Malware
            2.1.1 Virus
            2.1.2 Worms
            2.1.3 Trojans
            2.1.4 Backdoors
            2.1.5 Logic Bombs
            2.1.6 Adware
        2.2 Polymorphic
        2.3 Metamorphic
            2.3.1 Dead Code Insertion
            2.3.2 Reorder Instruction using Jump
            2.3.3 Equivalent Instruction Substitution
            2.3.4 Subroutine Inlining and Outlining
            2.3.5 Independent Instruction Permutation
        2.4 Detection Techniques
            2.4.1 Static Detection
            2.4.2 Dynamic Detection
            2.4.3 Heuristic Detection

    3 Bioinformatics Techniques
        3.1 Global Alignment
            3.1.1 Needleman-Wunsch Method
            3.1.2 Levenshtein distance
        3.2 Local Alignment
        3.3 Phylogenetic Tree
        3.4 Multiple Sequence Alignment Method
            3.4.1 Iterative Alignment
            3.4.2 Progressive Alignment

    4 Metamorphic Malware Exploration Technique Using MSA (MOMENTUM)
        4.1 Data acquisition
        4.2 Analysis of metamorphism in Tools/Real malware
            4.2.1 Type of obfuscation
            4.2.2 Identification of Base Malware
        4.3 Signature Modeling and Testing
            4.3.1 Single Signature
            4.3.2 Group Signature
            4.3.3 Testing

    5 Result and Inferences
        5.1 Evaluation Metrics
        5.2 Intra Family Analysis
        5.3 Inter Family Analysis
        5.4 Comparative Analysis
        5.5 Testing with Signature
        5.6 Comparative Analysis with Antiviruses

    6 Conclusions and Future Work

    A Executable Unpacking
        A.1 Symptoms of Packed Malicious Executables
        A.2 Manual Unpacking of Packed Executable
        A.3 Executable Unpacking using Ether

    Bibliography


    List of Figures

    2.1 Metamorphic malware variants using obfuscation and embedded with metamorphic engine
    2.2 Subroutine Inlining and Subroutine Outlining
    2.3 Subroutine Permutation

    3.1 Global Alignment for DNA Sequences
    3.2 Local Alignment for DNA Sequences
    3.3 Phylogenetic tree and alignment of sequences
    3.4 Multiple Aligned opcode sequences corresponding to malware samples
    3.5 Progressive Alignment

    4.1 Brief Outline of Method for Metamorphic Malware Detection
    4.2 Method for Investigation of Metamorphism
    4.3 Sum of Pair
    4.4 Signature Modeling and Testing
    4.5 Extraction of single signature
    4.6 Wildcard based representation of Group signature

    5.1 Intra Family Analysis of malware (Synthetic and Real)
    5.2 Inter Family Analysis of malware (Synthetic and Real)
    5.3 Detection rate of antiviruses compared with different type of constructed signature

    A.1 Portable Executable Unpacking Procedure
    A.2 Userspace Unpacking using Ether


    List of Tables

    2.1 Different types of Junk code instructions used by metamorphic engine
    2.2 Dictionary of equivalent instructions

    4.1 Dataset Description
    4.2 Instruction Replacements

    5.1 Comparative Analysis of Malware Samples
    5.2 Evaluation Metrics for different types of signatures


    Chapter 1

    Introduction

    The advent of the Internet has increased the appearance of malware in the digital world. The majority of transactions are performed online by naive users, which has increased the threat of stolen passwords, transaction credentials or personal information. The term malware generally refers to all software with illicit intentions. Malware is categorized into computer viruses, worms, Trojans, backdoors, rootkits etc. Malware can also be categorized by mode of propagation: mobile malware, such as worms, spyware and botnets, and static malware, such as viruses. The focus of such malicious software is to replicate by exploiting system vulnerabilities.

    Conventionally, malware scanners detect malware by matching the signatures of known samples. Signature based scanners are fast but have certain limitations: (a) they fail to detect unseen malware, (b) they lack semantic knowledge of the samples, and (c) they fail to detect obfuscated or encrypted instances. A minor change in the code of a malicious sample can thwart detection.

    Antivirus companies have evolved better methods for identifying malware, but malware writing is also becoming more sophisticated and continues to challenge scanners. Identification of polymorphic and metamorphic malware is difficult because a simple change in the byte pattern significantly changes the signature of a sample. Maintaining a signature for each malware sample (a) increases the size of the malware database and (b) leaves a window in which a system may be infected by new samples before their signatures are created. The detection process can be categorized as (a) static analysis and (b) dynamic analysis. In static analysis, malware is analyzed by checking the structure (content) of the assembly code without executing the sample. Thus, the system is not infected, and maliciousness is inferred either by constructing the control flow graph or from the frequencies of opcodes. In dynamic analysis, each malware sample is executed in a controlled environment. The impact of infection is monitored by inspecting the traces left by the malware sample (system registry, processor registers etc.). This method gives more refined output but is expensive with respect to running time.

    1.1 Motivation

    Metamorphic malware mutates its code on each replication while preserving the functionality of the code. The code is mutated with the help of a small mutation engine called a metamorphic engine. Metamorphic malware uses different obfuscation mechanisms to evade conventional signature based scanners, which rely on exact string matching techniques. The metamorphic engine is the prime element that keeps the malware hidden from antivirus products. Also, the metamorphic engine is designed to be small so as to bypass detection [8]. This indicates that the metamorphic engine performs structural transformations on the code with a limited set of replacements. A total change in the code is impossible, since the functionality of the malware variant would change and the variant might lose its maliciousness by producing unnecessary code. Malicious programs are less diverse than benign programs, since maliciousness must be preserved for infection and propagation.

    DNA and proteins mutate from one generation to another while inheriting some functional and structural similarity with their ancestors. In this work it is assumed that metamorphic malware, like a DNA/protein sequence, transforms its code through modifications in the opcode sequence. The mismatches in the opcode sequence from one generation to another may be considered as the points of mutation. Thus, exact string matching techniques would fail to detect new malware variants. At this point we shift from the general area of exact matching and exact pattern discovery to the general area of inexact, approximate matching and sequence alignment. A bioinformatics sequence alignment method, which aligns sequences based on their evolutionary relationship, is used in this work and is found to be better for signature extraction and detection of variants of malware.
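    To make the idea of aligning opcode sequences concrete, the following is a minimal sketch of pairwise global alignment in the style of the Needleman-Wunsch method covered in Chapter 3. It is not the implementation used in this thesis: the function name, the scoring values (match = 1, mismatch = -1, gap = -1) and the two opcode lists are assumptions chosen only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two opcode lists; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # prefix of a aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # prefix of b aligned against gaps only

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], out_a[::-1], out_b[::-1]


# Hypothetical opcode sequences: the variant differs by one inserted junk "nop".
base = ["push", "mov", "add", "pop", "ret"]
variant = ["push", "nop", "mov", "add", "pop", "ret"]

score, aligned_base, aligned_variant = needleman_wunsch(base, variant)
print(score)
print(aligned_base)
print(aligned_variant)
```

    In this toy case an exact string comparison of the two sequences fails, while the alignment places a single gap opposite the inserted instruction and matches every other opcode, which is the kind of mutation point described above.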


    1.2 Objective

    Motivated by bioinformatics techniques, the objective of this thesis is to detect metamorphic malware. Using the sequence alignment method, two types of signature(s) are constructed for each malware family: (a) group signature and (b) single signature. Unseen malware is tested against the extracted signature(s). The obfuscation and metamorphism in malware constructors and real malware are also explored to identify the prominent types of instructions used for mutating the malware.

    1.3 Related Work

    The authors of [14] and [15] created a rewriting engine for detecting morphed malware variants. The analysis of malware variants is based on the syntactic as well as the semantic structure of a program. Signatures of malware are represented in the form of a control flow graph (CFG). The signature matching technique is based on tree automata.

    Krugel et al. [16] proposed a method based on code analysis to identify structural similarity between malicious code (worms). The method is based on the CFG generated for a worm, which serves as a fingerprint for that worm. Their system is found to be resilient against common code transformation techniques.

    The authors of [17] proposed a novel method for analyzing malware based on a code graph. Each malware executable was inspected, and the instructions corresponding to the system call sequence were represented in the form of a topological graph. The proposed code graph system was used to differentiate malware from benign programs by checking the applicability of specific system calls.

    The authors of [9] proposed a semantic based approach for detecting variants of malware. This method is based on the functionality of the system calls executed by malware samples. The main focus is to identify all instructions, and their parameters, that are used for invoking a system call. They propose a pattern matching technique that is able to identify semantically equivalent parts of code. The method is capable of identifying programs that are related to each other and those that are totally dissimilar. Rachit et al. [13] created a malware normalizer making use of term rewriting rules. The method was applied to the virus Win32.Evol. The main objective of their work was to convert program variants into a smaller number of variants, i.e. to convert all variants into a normal form of the program.

    In "Hunting for metamorphic engines" [10], Hidden Markov Models (HMMs) were used to represent statistical properties of a set of metamorphic virus variants. The metamorphic virus data set was generated from the metamorphic engines Second Generation virus generator (G2), Next Generation Virus Construction Kit (NGVCK), Virus Creation Lab for Win32 (VCL32) and Mass Code Generator (MPCGEN). An HMM is trained on a family of metamorphic viruses and is then used to determine whether a given program is similar to the viruses the HMM represents.

    In [11], the critical API calls were extracted statically using IDA Pro [6]. Thus, the late-bound API calls that are made using GetProcAddress, LoadLibraryEx, etc. are not taken into account. In addition, this approach did not work for packed malware.

    The authors of [1] proposed a phylogeny model, of the kind used in bioinformatics for extracting information from gene, protein or nucleotide sequences. An n-gram feature extraction technique was proposed, and a fixed permutation was applied on the code to generate new sequences, called n-perms. Since new variants of malware evolve by incorporating permutations, the proposed n-perm model was developed to capture instruction and block permutations. The experiment was conducted on a limited data set consisting of 9 benign samples and 141 worms collected from VX Heavens [2]. The results showed that similar variants appeared closer together in the phylogenetic tree, where each node represented a malware variant. The work did not show how the n-perm model would behave if the instructions in a block of code were replaced by equivalent instructions, which could either expand or shrink the block (with respect to the number of instructions in it).
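    The difference between n-grams and n-perms can be illustrated with a small sketch. It assumes one common reading of an n-perm, namely an n-gram whose internal order is ignored; the function names and the opcode lists are hypothetical and are not taken from [1].

```python
def ngrams(opcodes, n):
    """Return all contiguous windows of n opcodes, order preserved."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def nperms(opcodes, n):
    """Return the same windows with order ignored (assumed reading of n-perm)."""
    return [tuple(sorted(gram)) for gram in ngrams(opcodes, n)]


def shared(features_a, features_b):
    """Count the distinct features that two samples have in common."""
    return len(set(features_a) & set(features_b))


# Hypothetical variant: the first two instructions are swapped.
original = ["push", "mov", "add", "pop"]
permuted = ["mov", "push", "add", "pop"]

shared_ngrams = shared(ngrams(original, 2), ngrams(permuted, 2))
shared_nperms = shared(nperms(original, 2), nperms(permuted, 2))
print(shared_ngrams, shared_nperms)
```

    Under these assumptions the swapped pair still yields a common feature when order is ignored, so the n-perm view retains more overlap between the two samples than the n-gram view. Replacing an instruction with an equivalent one would change the features in both views, which is the limitation noted above.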

    1.4 Contributions of Thesis

    In this thesis a novel method to detect metamorphic malware variants is proposed. The method is based on static analysis: the unpacked samples are disassembled, and the opcode sequences of the samples are used for comparison. It was proposed in [7] that there is a large difference between the opcode sequences of malicious and benign samples. Thus, opcodes can be used to create sequences representing malware samples. An evolutionary tree, also known as a phylogenetic tree, is constructed for a family of malware. A threshold within the family is computed, and unseen samples are detected using this threshold. Two types of signatures, called (a) group signature and (b) single signature, are constructed for a family. To extract the single and group signatures, multiple sequence alignment (MSA), which is primarily used in the area of bioinformatics, is applied. Our experiments show promising results and demonstrate the effectiveness of the method for detecting known samples of metamorphic malware with few false alarms. Experiments have been conducted on an obfuscated malware data set collected from VX Heavens [2] and from user agencies. Malware variants are also prepared using constructors like NGVCK, MPCGEN, G2 and PSMPC. Through our experiments we have found that the obfuscation is minimal in samples created using the constructors. Primarily, the obfuscation consists of simple instruction replacement and junk code insertion, with the code reordered using jump instructions. Also, most of the malware families generated using the constructors overlap, indicating minimal obfuscation of the code from one generation to the next.
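    The idea of a family threshold can be sketched as follows. This section does not state how the threshold is computed, so the sketch makes its own assumptions: similarity is measured with Python's difflib.SequenceMatcher rather than the alignment scores used in the thesis, the threshold is taken as the lowest pairwise similarity inside the family, and all opcode sequences are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity in [0, 1] between two opcode lists (stand-in for an alignment score)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def family_threshold(family):
    """Assumed rule: the lowest pairwise similarity observed within the family."""
    return min(similarity(a, b) for a, b in combinations(family, 2))


def matches_family(sample, family, threshold):
    """Flag a sample if it is at least as close to some member as members are to each other."""
    return max(similarity(sample, member) for member in family) >= threshold


# Hypothetical family of three variants and two unseen samples.
family = [
    ["push", "mov", "add", "pop", "ret"],
    ["push", "nop", "mov", "add", "pop", "ret"],
    ["push", "mov", "sub", "pop", "ret"],
]
unseen_variant = ["push", "mov", "nop", "add", "pop", "ret"]
benign_like = ["call", "jmp", "cmp", "jne", "xor", "lea"]

threshold = family_threshold(family)
print(matches_family(unseen_variant, family, threshold))
print(matches_family(benign_like, family, threshold))
```

    A threshold derived from the family itself avoids choosing a fixed cut-off by hand, but a family with one very distant member would lower the threshold and could raise false alarms; that trade-off is an observation about this sketch, not a result from the thesis.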

    1.5 Outline

    Chapter 2 gives an introduction to malware and the different types of malcode. The chapter discusses the infection and propagation modes used by malicious software. Polymorphic malware is then briefly introduced, followed by a detailed explanation of metamorphic malware. Later in the chapter, malware detection techniques are described.

    Chapter 3 discusses various bioinformatics techniques used in DNA/protein sequence alignment. Two types of sequence alignment methods, known as global and local alignment, are described. The phylogenetic tree, used to represent evolutionary relationships, is explained with a brief outline of its construction techniques. Towards the end of the chapter, Multiple Sequence Alignment (MSA), which is used for aligning more than two sequences, is described in detail. The iterative and progressive methods for constructing an MSA are also introduced.

    Chapter 4 describes the proposed method and its implementation, known as Metamorphic Malware Exploration Technique Using MSA (MOMENTUM). The chapter explains in detail the dataset preprocessing, which involves unpacking and classification into different families. It describes the steps involved in exploring metamorphism in synthetic and real malware data and highlights the prominent opcode sequences used by malware. Signature modeling is explained in detail, along with the testing of unseen samples against the extracted signatures to validate the hypothesis for detection.

    Chapter 5 gives details of the experiments conducted along with the analysis of results.

    Finally, conclusions and future work are discussed in Chapter 6.


    Chapter 2

    Malware and Types

    Malware can be defined as programs with unethical intentions. They contain instructions that try to find vulnerabilities of computer systems in an unauthorized manner in order to infect machines or steal valuable information from them. Once installed, some malware gives remote attackers access to the user's machine. Malicious software can be categorized as computer viruses, worms, Trojans, backdoors, adware, spyware etc. Much malicious software is distributed along with freeware or open source software with the motive of making money. It is primarily installed on computer systems while browsing sites from which games, movies, web browsers, music etc. are downloaded. A compromised machine exposes useful information about the system and the user to the attacker's machine, such as (a) credit card numbers or (b) the root password; (c) the compromised system may also be used to launch attacks on, or send spam messages to, other systems. Once a system is infected, the malware may try to delete system files, change registry entries, hide the task manager, or launch spying software that monitors the user's keystrokes.

    2.1 Types of Malware

    Malware can be classified based on its mode of infection and propagation mechanism. Modern malware is more sophisticated in terms of the complexity of its behaviour and the appearance of its code. Present day malware employs anti-debugging and anti-virtual-machine checks to stay dormant in order to evade detection. As antivirus products become more powerful, malware writing becomes more complex and more challenging for the antivirus products. A brief outline of various types of malware is given in the subsequent subsections.

    2.1.1 Virus

    A computer virus is a program that infects a system by replication. Viruses use a host program for infection and are propagated only by human intervention. A virus is activated only when the infected program is executed. Some viruses are harmful, and some are written for fun. Harmful viruses can delete system files or freeze the computer by occupying a large volume of hard disk space. Harmless computer viruses display messages to attract users, but they still replicate by creating clones of themselves. Normally, computer viruses target autorun files, executable system files and macros of document files for the purpose of replication. Computer viruses have basically four functions: (a) a search routine, which locates a program or file with a specific file extension to infect; once such a file is found, the routine marks it to avoid over-infection and to avoid searching infected files again; (b) a copy routine, which copies the malicious code to a host file; this malicious code could be prepended, appended or added at different locations of the host file; (c) an anti-detection mechanism to evade detection by antivirus products; this mechanism could be encryption, code morphing, interrupt vector table modification etc.; and (d) a payload, which is the main part of any virus and is used for self replication.

    2.1.2 Worms

    Worms are malicious programs which, like computer viruses, are self-replicating, but which use the Internet to spread. The most striking feature of a worm is that it does not require human intervention to spread. A worm exploits two fundamental kinds of vulnerability to propagate: (a) software bugs and (b) security holes. A software bug could be a buffer overflow vulnerability, which appears in programs that use functions like strcpy instead of safer functions like strncpy; it allows the attacker to write more data than the allocated memory can hold and so copy in malicious code, as with the well-known program finger. A similar type of software bug is found in programs like sendmail, which delivers messages to programs residing on the local or a remote machine. The recipient program executes, in a new shell, a script present in the body of the message. A worm attempts to scan for open ports in order to launch different types of attacks. It also spreads through email by sending spam messages to the contact list of a particular user account. In most cases the user is indirectly induced to open or download attachments, which triggers the malicious activities of the worm. Basically, once a vulnerable system is located, the worm scans the /etc/passwd file for encrypted passwords and tries to crack them by making multiple attempts. Once a username and password are obtained, any malicious code can be executed remotely by the worm using a utility like rexec.

    2.1.3 Trojans

    A Trojan horse is a non-self-replicating program that enters the computer unnoticed and is usually disguised as a legitimate application. Once a system is infected by a Trojan, it gives an attacker at a remote location unrestricted access to the user's system. Such malicious software requires a host program in which to hide. The basic components of a Trojan horse are a server program and a client program. The server offers a program that attracts the user, in the form of games, images, videos etc., in which the malicious program hides. After such an application is downloaded to the system, the machine is infected and the Trojan (client program) performs spying activity.

    2.1.4 Backdoors

    A backdoor is a program created to bypass network security checks and open a channel
    through which the attacker can control, spy on or interact with the victim machine.
    Backdoors are planted in software (open source or freeware) before its distribution.
    When the software is installed and executed, the backdoor opens the channel and
    connects to the remote machine to leak valuable information about the user and the
    computer system. Some backdoors are created for legitimate purposes, for example to
    avoid time consuming authentication while debugging a network server [18]. Sometimes
    backdoors make use of Trojans to compromise a computer system. The user machine is
    victimized when an image or video containing a backdoor is downloaded. Many backdoors
    are installed when an ActiveX control is installed in the user system while browsing
    certain sites. Most browsers prompt the user before downloading an ActiveX control to
    protect the machine from such attacks.

    2.1.5 Logic Bombs

    This category of malware can exist stand alone or be interleaved inside a legitimate
    program. Logic bombs do not replicate and have two basic components: (a) a payload,
    which is capable of performing malicious activities like formatting the hard disk or
    deleting system files, and (b) a trigger, which makes the logic bomb more dangerous
    because it stays dormant until a specific event occurs and only then delivers its
    malicious payload.

    2.1.6 Adware

    Adware forces unsolicited advertisements on the user while browsing the Internet. It
    is planted by many companies to gather browsing behaviour and to create interest in
    shopping by popping up many advertisements. Sometimes adware is very dangerous, as it
    redirects to unsolicited sites which require users to fill in information like email
    passwords and credit card or CVV numbers, and which log keystrokes to gather all
    valuable information.

    Most of the popular malware today employ encryption and obfuscation to evade
    detection. Such malware are called polymorphic and metamorphic malware; they are
    described in the subsequent sections.

    2.2 Polymorphic

    Polymorphic malware encrypt their code with a random key to avoid detection. Each
    polymorphic virus has a polymorphic engine, called the virus decryption routine
    (VDR), which generates new keys and contains a decryption module for decrypting the
    encrypted malicious body responsible for infecting applications and the system. Once
    executed, the virus is re-encrypted and added to another vulnerable host application.
    Thus, when an antivirus scans the malware for a signature it finds a different
    pattern (as the keys are different), and detection is thwarted.

    Malware scanners perform in-memory scanning of each suspicious sample for detection.
    Ultimately a malware needs to execute to infect the machine and hence must reside in
    main memory. Thus, the antivirus scans through all samples in memory and matches all
    patterns against the signatures in the repository. Another major weakness of
    polymorphic malware is its decryption algorithm. If the scanner can locate the
    decryption algorithm, it can serve as a signature for identifying the polymorphic
    malware. Malware authors therefore scramble statements or replace some registers with
    unused registers to obtain different byte patterns and avoid detection. Another
    approach is to prepare a dictionary of binary code patterns and their equivalent
    replacements. Using this table, the polymorphic engine can automatically identify
    binary patterns and map them through the dictionary to equivalent code, generating
    new malware variants.

    2.3 Metamorphic

    Metamorphic malware are very sophisticated in nature, as they completely modify their
    code on each replication to generate a new malware variant. This makes it very
    difficult for antivirus products to identify metamorphic malware using signature
    matching techniques. Metamorphic malware contain an engine, normally referred to as
    the metamorphic engine, which mutates the code from one generation to the next.
    Normally the size of the metamorphic engine is kept very small in order to avoid
    detection. A metamorphic engine alters the program by applying various obfuscation
    techniques such as (a) junk code insertion, (b) instruction permutation by reordering
    the control flow using jump instructions, (c) equivalent instruction replacement and
    (d) subroutine inlining and outlining. Figure 2.1 shows metamorphic malware embedded
    with a metamorphic engine using obfuscation transformations.

    2.3.1 Dead Code Insertion

    Figure 2.1: Metamorphic malware variants using obfuscation and embedded with
    metamorphic engine.

    In this technique some garbage code or NOP instructions are inserted into the actual
    code. This is the simplest form of obfuscation, as it does not reorder the program
    code. Garbage code is inserted to confuse the scanner by adding irrelevant byte
    patterns to the malicious samples to avoid detection. Dead code insertion is
    illustrated by the instructions marked as garbage code in the following code snippet.

    mov eax, 020H

    mov eax, eax ;Garbage Code

    mov ebx, 0ABH

    add eax, ebx

    add eax, 00H ;Garbage Code

    push eax

    pop ebx

    push eax ;Garbage Code

    pop eax ;Garbage Code

    nop ;Garbage Code

    add eax, ebx

    add eax, 00H ;Garbage Code

    mul ecx

    mov [esi], ebx
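    From the scanner's side, padding of this kind can be partly undone by normalizing the
    instruction stream before matching. The Python sketch below is only an illustration
    and not part of the method described in this work: the names strip_junk and is_zero,
    the tuple representation of instructions and the small set of patterns are all
    assumptions, and the filter deliberately ignores side effects on the flags register.

```python
def is_zero(token):
    """Return True if an operand token such as '0' or '00H' denotes zero."""
    t = token.lower()
    try:
        if t.endswith("h"):
            return int(t[:-1], 16) == 0
        return int(t, 10) == 0
    except ValueError:
        return False  # registers, memory operands and so on


def strip_junk(instructions):
    """Drop a few do-nothing patterns from (mnemonic, operands) tuples."""
    kept = []
    for mnemonic, operands in instructions:
        mnemonic = mnemonic.lower()
        operands = tuple(op.lower() for op in operands)
        if mnemonic == "nop":
            continue
        if mnemonic == "mov" and len(operands) == 2 and operands[0] == operands[1]:
            continue  # mov reg, reg
        if (mnemonic in ("add", "sub", "or", "xor") and len(operands) == 2
                and is_zero(operands[1])):
            continue  # op reg, 0
        if mnemonic == "pop" and kept and kept[-1] == ("push", operands):
            kept.pop()  # push reg immediately followed by pop reg
            continue
        kept.append((mnemonic, operands))
    return kept


# The code snippet above, written as (mnemonic, operands) tuples.
snippet = [
    ("mov", ("eax", "020H")), ("mov", ("eax", "eax")), ("mov", ("ebx", "0ABH")),
    ("add", ("eax", "ebx")), ("add", ("eax", "00H")), ("push", ("eax",)),
    ("pop", ("ebx",)), ("push", ("eax",)), ("pop", ("eax",)), ("nop", ()),
    ("add", ("eax", "ebx")), ("add", ("eax", "00H")), ("mul", ("ecx",)),
    ("mov", ("[esi]", "ebx")),
]
cleaned = strip_junk(snippet)
print([m for m, _ in cleaned])
```

    On this snippet the filter drops exactly the six instructions marked as garbage code.
    A real normalizer would need data-flow and flag analysis, which this sketch omits.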

    Some of the junk instructions used are listed in Table 2.1. The left-hand column of
    the table lists the instructions and the right-hand column gives the meaning of each
    instruction.

    Table 2.1: Different types of junk code instructions used by metamorphic engine.

    Instructions          Meaning
    NOP                   No Operation
    CLD                   No Operation
    PUSHFD POPFD          No Operation
    PUSHAD POPAD          No Operation
    MOV REG, REG          REG := REG
    ADD REG, 0            REG := REG + 0
    OR REG, 0             REG := REG | 0
    AND REG, -1           REG := REG & -1
    PUSH REG POP REG      No Operation
    XCHG REG, REG         No Operation
    XOR REG, 0            No Operation
    SUB REG, 0            No Operation
    SBB REG, 0            No Operation
    ADC REG, 0            No Operation
    SHL REG, 0            No Operation
    SHR REG, 0            No Operation
    AND REG, 1            REG := REG & 1

    2.3.2 Reorder Instruction using Jump

    In this technique the virus adds jump instructions and garbage code to each mutant.
    Win95/Zperm is an example of this technique. Since the virus body is not constant,
    string based detection is not possible. Consider the following piece of code without
    any jump instructions:

    instruction 1 ; entry point

    instruction 2

    instruction 3

    .

    .

    .

    instruction n

    In later generations the virus body is modified by the engine, which inserts jump
    instructions at random positions, as shown below.

    instruction 2

    jump 3

    instruction 4

    jump n

    instruction 1 ;entry point

    jump 2

    instruction 3

    jump 4

    .

    .

    .

    instruction n
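    Although the physical layout changes, the order in which the instructions execute
    does not, and that order can be recovered by following the jumps from the entry
    point. The Python sketch below illustrates this on placeholder strings; the function
    names permute_with_jumps and follow_jumps and the (instruction, next slot) layout are
    assumptions made for the example, not the mechanism of any particular virus or
    scanner.

```python
import random


def permute_with_jumps(instructions, seed=0):
    """Shuffle the physical order and link each slot to its logical successor."""
    rng = random.Random(seed)
    order = list(range(len(instructions)))
    rng.shuffle(order)
    slot_of = {logical: slot for slot, logical in enumerate(order)}
    # Each slot holds the instruction and the slot to jump to next (None at the end).
    layout = [(instructions[logical], slot_of.get(logical + 1)) for logical in order]
    entry = slot_of.get(0)
    return layout, entry


def follow_jumps(layout, entry):
    """Recover the execution order by following jumps from the entry point."""
    executed = []
    slot = entry
    while slot is not None:
        instruction, slot = layout[slot]
        executed.append(instruction)
    return executed


body = ["instruction %d" % i for i in range(1, 7)]
layout, entry = permute_with_jumps(body, seed=3)
print([instruction for instruction, _ in layout])  # physical order in the file
print(follow_jumps(layout, entry))                 # execution order
```

    The second list always equals the original body, which is why control flow
    normalization is a natural preprocessing step before comparing variants.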

    2.3.3 Equivalent Instruction Substitution

    Some malware like Win95/Zperm [21] and Win32.Evol [8] make use of equivalent
    instruction substitution as an obfuscation mechanism. In our proposed code morpher,
    we make use of a dictionary of instructions which can be replaced by equivalent
    instructions. Instruction replacement can either expand or shrink the code size of
    the offspring. Our morpher basically increases the size of the generated variants.
    Table 2.2 lists the instructions and their equivalent sets of instructions.

    2.3.4 Subroutine Inlining and Outlining

    Subroutine inlining is a form of program obfuscation in which some or all calls to a
    subroutine are replaced by the subroutine's code definition. Code outlining divides a
    block of code into subroutine(s) and adds a call for each newly created subroutine.
    Figure 2.2 shows an example of subroutine inlining for two subroutine calls S1() and
    S2(), and outlining of code to create a new subroutine S12().

    [Figure 2.2 panels: original code calling subroutines S1 and S2; the code after
    inlining both calls; the code after outlining a new subroutine S12.]

    Figure 2.2: Subroutine Inlining and Subroutine Outlining

    Table 2.2: Dictionary of equivalent instructions.

    Instructions             Equivalent Instructions
    ADD REG, -1              NEG REG; NOT REG
    ADD REG, 0               NOP
    ADD REG, 1               INC REG or NOT REG; NEG REG
    ADD MEM, 0               NOP
    AND REG, -1              NOP
    AND MEM, -1              NOP
    AND REG, 0               MOV REG, 0
    AND MEM, 0               MOV MEM, 0
    AND REG, REG             CMP REG, 0
    OR REG, REG              CMP REG, 0
    TEST REG, REG            CMP REG, 0
    OR REG, 0                NOP
    OR MEM, 0                NOP
    XOR REG, 0               NOP
    XOR MEM, 0               NOP
    XOR REG, -1              NOT REG
    XOR MEM, -1              NOT MEM
    XOR REG, REG             MOV REG, 0
    SUB REG, REG             MOV REG, 0
    SUB REG, IMM             ADD REG, -IMM
    SUB MEM, IMM             ADD MEM, -IMM
    JMP REG                  PUSH REG; RET
    MOV REG, REG             NOP
    MOV REG1, REG2           PUSH REG2; POP REG1 or XCHG REG1, REG2
    NOP                      PUSHFD; POPFD or PUSHAD; POPAD or PUSH REG; POP REG
    LEA REG, [IMM]           MOV REG, IMM
    LEA REG, [REG+IMM]       ADD REG, IMM
    LEA REG1, [REG2]         MOV REG1, REG2
    LEA REG1, [REG1+REG2]    ADD REG1, REG2
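    The value-level identities behind some of these dictionary entries can be checked
    mechanically. The Python sketch below models a 32-bit register with a bit mask; the
    helper names neg, not_ and add are assumptions for the example, and the check covers
    register values only, not the effect of each instruction on the flags.

```python
MASK = 0xFFFFFFFF  # model a 32-bit register


def neg(x):
    """Two's complement negation (NEG REG)."""
    return (-x) & MASK


def not_(x):
    """Bitwise complement (NOT REG)."""
    return ~x & MASK


def add(x, imm):
    """Wrap-around addition (ADD REG, imm)."""
    return (x + imm) & MASK


samples = [0, 1, 2, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 0x12345678]
for x in samples:
    assert neg(not_(x)) == add(x, 1)     # NOT REG; NEG REG  gives  REG + 1
    assert not_(neg(x)) == add(x, -1)    # NEG REG; NOT REG  gives  REG - 1
    assert (x ^ MASK) == not_(x)         # XOR REG, -1       gives  NOT REG
    assert (x ^ x) == 0                  # XOR REG, REG      gives  0
    assert add(x, -x) == 0               # SUB REG, REG      gives  0
    assert (x & MASK) == x               # AND REG, -1 leaves REG unchanged
    assert (x | 0) == x and (x ^ 0) == x # OR/XOR REG, 0 leave REG unchanged
print("all identities hold on", len(samples), "sample values")
```

    Note that the order of NOT and NEG matters: one order increments the register and
    the other decrements it.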

    Subroutine Permutation: Some metamorphic viruses make use of permutation of
    subroutines. If the virus code consists of n subroutines, it is possible to have n!
    generations; with 5 subroutines, for example, there are 5! = 120 possible orderings.
    Figure 2.3 shows a few permutations of virus code consisting of 5 subroutines.

    Figure 2.3: Subroutine Permutation

    2.3.5 Independent Instruction Permutation

    Transposition, or instruction permutation, changes the execution order of
    instructions that are not interdependent. Consider two instructions op R1, R2
    followed by op R3, R4. These two instructions can be swapped provided R1, R2, R3 and
    R4 are all different. For example, the instructions mov ecx, imm and inc eax are not
    interdependent, hence they can be swapped.

    ...

    mov ecx, imm

    inc eax

    .....

    is equivalent to

    ...

    inc eax

    mov ecx, imm
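    The independence condition can be phrased in terms of what each instruction reads
    and writes. The Python sketch below is one possible formulation, not the check used
    by any particular engine: the name can_swap, the (writes, reads) representation and
    the pseudo-register "eflags" are assumptions, and memory operands are not modelled.
    It is slightly more permissive than requiring all registers to differ, since two
    instructions that only read a common register may still be swapped.

```python
def can_swap(first, second):
    """Decide whether two adjacent instructions may be reordered.

    Each instruction is a pair (writes, reads) of sets of register names.
    """
    writes1, reads1 = first
    writes2, reads2 = second
    # Swapping is unsafe if the second instruction reads or overwrites something the
    # first one writes, or overwrites something the first one reads.
    return not (writes1 & (reads2 | writes2) or writes2 & reads1)


mov_ecx_imm = ({"ecx"}, set())               # mov ecx, imm
inc_eax = ({"eax", "eflags"}, {"eax"})       # inc eax
mov_eax_ebx = ({"eax"}, {"ebx"})             # mov eax, ebx
add_ecx_eax = ({"ecx", "eflags"}, {"ecx", "eax"})  # add ecx, eax

print(can_swap(mov_ecx_imm, inc_eax))      # independent pair
print(can_swap(mov_eax_ebx, add_ecx_eax))  # second reads eax written by first
```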

    2.4 Detection Techniques

    Malware detection deals with the different mechanisms for filtering out malicious
    programs. The detection mechanisms can be broadly classified as static, dynamic and
    heuristic methods.

    2.4.1 Static Detection

    Static analysis deals with the detection of malcode without executing it on a
    computer system. The disassembled code is scanned for malicious content by examining
    the import address table (IAT), opcode patterns or byte n-grams. Signatures in the
    form of byte patterns are extracted from each malicious sample and checked against a
    repository. Static detection using control flow graphs as signatures is also used to
    flag maliciousness.

    The main advantage of static detection is that the system is not infected by the
    malcode. The approach is also fast, since only surface scanning of the malware
    program is performed. This method fails to detect encrypted malware, as the actual
    malicious payload is revealed only during execution.

    2.4.2 Dynamic Detection

    Dynamic analysis is used to detect maliciousness by executing malware samples in a
    controlled environment. A controlled environment is used so that the host machine
    remains unaffected. Dynamic analysis is particularly useful when dealing with
    encrypted malware. Code emulation can give accurate detection, but when used alone it
    may sometimes fail because decryption can consume much of the available time. To
    thwart detection, some malware use multiple jump instructions to defeat dynamic
    scanners.

    2.4.3 Heuristic Detection

    Heuristic detection mechanisms can be used along with static or dynamic techniques.
    The scanner primarily uses heuristics for detecting unseen malware samples. Some of
    the heuristics for detection of malicious code are (a) presence of the entry point in
    the last section, (b) suspicious section names, (c) large data sections and (d) small
    import table size. Heuristic detectors are prone to many false alarms, where benign
    samples are incorrectly identified as malware.

    Chapter 3

    Bioinformatics Techniques

    Bioinformatics is the application of computer science to biological data. In
    bioinformatics, biological information is extracted to gain a better understanding of
    different biological species. Sequence alignment is an elementary method used in any
    biological study to compare two or more biological sequences (protein or DNA). The
    alignment method attempts to find regions of high similarity, over whole sequences or
    parts of them, to deduce evolutionary relationships among sequences. Metamorphic
    malware, like proteins or nucleotide sequences, have some fragments of code which are
    inherited from their base malware. These segments of code are only partially subject
    to change from one generation to the next. Malcode is transformed by a metamorphic
    engine to conceal the malicious payload so that maliciousness is not revealed.
    Fundamentally, code obfuscation is performed by the metamorphic engine to thwart
    detection.

    The structures of metamorphic variants differ, but the variants share common
    functionality. The differences between variants of the same base malware cannot be
    too large; hence, techniques used in bioinformatics can be applied for their
    detection. Opcode sequences in malware can be thought of as analogous to genes in
    DNA. The size of the metamorphic engine is usually small to hide it from detection.
    Each malware sample is represented as a sequence of mnemonics (opcode sequence)
    without considering the operands. At first the approach might appear trivial, but
    metamorphic malware variants cannot undergo total transformation. Our assumption is
    that some opcode(s) may be replaced with equivalent opcode(s), but complete change is
    impossible if functionality is to be preserved. It can be inferred that variants
    preserve some base malicious code which is transformed by the engine to produce new
    variant(s). Thus, using sequence alignment techniques, opcode sequences are arranged:

    - To determine similarity amongst malware samples.
    - To explore frequently occurring patterns in a family of malware. These patterns
      depict maliciousness.
    - To store, retrieve and compare malicious opcode sequences.
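    As an illustration of the representation described above, the Python sketch below
    reduces a textual disassembly listing to its opcode sequence by keeping only the
    mnemonic of each instruction. The function name opcode_sequence and the listing
    format (one instruction per line, ";" comments, optional "label:" prefix) are
    assumptions for the example; instruction prefixes such as rep are kept as the first
    token, and a real disassembler's output would need its own parser.

```python
def opcode_sequence(listing):
    """Return the mnemonics of a disassembly listing, ignoring operands."""
    opcodes = []
    for line in listing.splitlines():
        code = line.split(";", 1)[0].strip()    # drop a trailing comment
        tokens = code.split()
        if tokens and tokens[0].endswith(":"):  # drop a leading label such as "S1:"
            tokens = tokens[1:]
        if tokens:
            opcodes.append(tokens[0].lower())
    return opcodes


listing = """
S1: mov eax, ebx
    add eax, 12h   ; the operand is ignored
    push eax
    ret
"""
print(opcode_sequence(listing))
```

    Two variants that differ only in registers or immediate values map to the same
    opcode sequence, which is what makes the sequences comparable by alignment.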

    The basic approaches to sequence alignment can be broadly categorized as:

    1. Global Sequence Alignment

    2. Local Sequence Alignment

    The global alignment technique aligns sequences over their complete length. This
    method is particularly useful when the sequences are of more or less similar length.
    Local sequence alignment, on the other hand, attempts to compare segments of all
    possible lengths to optimize the similarity measure. Local alignment is mainly used
    when the query sequences have dissimilar sizes. Multiple sequence alignment (MSA) is
    another form of alignment technique, used to align three or more sequences. MSA is
    used to identify conserved sequence regions across a group of sequences. In this
    work progressive MSA is implemented, using the evolutionary relationship among
    sequences. In the following sections the sequence alignment methods (global, local
    and MSA) are introduced.

    3.1 Global Alignment

    Global alignment is used to align sequences end to end. Figure 3.1 shows the global
    alignment of two DNA sequences X and Y, with the matches, mismatches and gaps
    introduced by the global alignment method. Two well known methods of global
    alignment are (a) Needleman-Wunsch and (b) Levenshtein or edit distance. These
    methods are briefly discussed in the following subsections.

    Figure 3.1: Global Alignment for DNA Sequences

    3.1.1 Needleman-Wunsch Method

    The Needleman-Wunsch method [20] determines the optimal global alignment between two
    sequences X and Y. The basic steps involved in aligning opcode sequences are:

    Initialization: In this step a score matrix and a trace back matrix of size
    (M + 1) x (N + 1) are created, where M and N are the lengths of the two instances.
    Let the score and trace back matrices be S(M + 1, N + 1) and T(M + 1, N + 1).
    Initially both matrices are filled with 0; the first row and first column of the
    score matrix are then set from the gap penalty, as given by the boundary conditions
    below.

    Populate Score Matrix: The score of each cell S(i, j) is determined by the scores of
    its three neighbouring cells (top, diagonal and left). While the score matrix is
    filled, the trace back matrix is populated with the directions left (L),
    diagonal (D) and up (U). The trace back matrix records the direction of the
    neighbouring cell that yields the maximum value for the new cell S(i, j). Thus,
    S(i, j) is computed as follows:

    $$S(i, 0) = i \cdot \gamma$$

    $$S(0, j) = j \cdot \gamma$$

    $$S(i, j) = \max\big(S(i-1, j-1) + \sigma(X[i], Y[j]),\; S(i-1, j) + \gamma,\; S(i, j-1) + \gamma\big)$$

    where $\sigma(X[i], Y[j])$ is the match/mismatch score for aligning characters X[i]
    and Y[j], and $\gamma$ is the gap penalty.

    Traceback: The traceback step recovers the alignment from the trace back matrix.
    Traceback starts at the bottom-right cell T(M, N) and follows the stored directions
    back towards the top-left cell T(0, 0). Each cell with direction D depicts a match
    or mismatch, and cells with directions L or U depict a gap introduced in one of the
    sequences.
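    The three steps can be put together in a short Python sketch. It is a minimal
    illustration rather than the implementation used in this work: the function name
    needleman_wunsch and the scoring values (match = 1, mismatch = -1, gap = -1) are
    assumptions, and ties are broken in favour of the diagonal move.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Globally align two opcode sequences; return (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    trace = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):               # first column: gaps in y
        score[i][0], trace[i][0] = i * gap, "U"
    for j in range(1, n + 1):               # first row: gaps in x
        score[0][j], trace[0][j] = j * gap, "L"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, "D"),
                (score[i - 1][j] + gap, "U"),
                (score[i][j - 1] + gap, "L"),
            ]
            score[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
    aligned_x, aligned_y = [], []
    i, j = m, n
    while i > 0 or j > 0:                   # walk back to the top-left cell
        move = trace[i][j]
        if move == "D":
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "U":
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return score[m][n], aligned_x[::-1], aligned_y[::-1]


a = ["push", "mov", "add", "pop"]
b = ["push", "mov", "nop", "add", "pop"]
total, row_a, row_b = needleman_wunsch(a, b)
print(total)
print(row_a)
print(row_b)
```

    For these two sequences the inserted nop is aligned against a gap, which is the kind
    of mutation point the alignment is meant to expose.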

    3.1.2 Levenshtein distance

    The Levenshtein distance algorithm, also known as the edit distance algorithm, is an
    approximate string matching algorithm used to find the occurrence of a substring of
    a pattern in a text. Here it is used to determine the similarity between two
    sequences. The edit distance is the minimum number of operations (insertions,
    deletions or substitutions) required to transform one opcode sequence into the
    other. A common way of implementing the edit distance method is dynamic programming.
    The Levenshtein distance algorithm for two strings string1 and string2 of lengths m
    and n is shown below:

    1. Create a distance matrix consisting of m + 1 rows and n + 1 columns.

    2. Initialize the first column as [0 ... m] and the first row as [0 ... n].

    3. For each pair of symbols string1[i] and string2[j]:

       If string1[i] = string2[j], the cost is 0.

       If string1[i] != string2[j], the cost is 1.

       The value of cell distanceMatrix[i, j] is the minimum of
       distanceMatrix[i-1, j] + 1, distanceMatrix[i, j-1] + 1
       and distanceMatrix[i-1, j-1] + cost.
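    The steps translate almost line for line into Python. The sketch below is a plain
    dynamic programming version that works on strings as well as on lists of opcodes;
    the function name levenshtein is an assumption, and the code uses 0-based indexing,
    so the symbol belonging to row i is string1[i - 1].

```python
def levenshtein(string1, string2):
    """Minimum number of insertions, deletions and substitutions between two sequences."""
    m, n = len(string1), len(string2)
    distance = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        distance[i][0] = i          # delete i symbols
    for j in range(n + 1):
        distance[0][j] = j          # insert j symbols
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if string1[i - 1] == string2[j - 1] else 1
            distance[i][j] = min(
                distance[i - 1][j] + 1,         # deletion
                distance[i][j - 1] + 1,         # insertion
                distance[i - 1][j - 1] + cost,  # match or substitution
            )
    return distance[m][n]


print(levenshtein("kitten", "sitting"))
print(levenshtein(["mov", "push", "pop"], ["mov", "nop", "push", "pop"]))
```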

    3.2 Local Alignment

    Smith-Waterman [22] is a local sequence alignment method which can be used to align
    sequences of arbitrary length. The score and trace back matrices in the
    Smith-Waterman method are computed in the same way as in the Needleman-Wunsch
    method, except that zero is included in the maximum to prevent negative similarity
    scores. A cell with value zero indicates no similarity. For any two sequences X and
    Y the score matrix is populated using the equation given below:

    $$S(i, j) = \max\big(S(i-1, j-1) + \sigma(X[i], Y[j]),\; S(i-1, j) + \gamma,\; S(i, j-1) + \gamma,\; 0\big)$$

    where S is the score matrix, $\sigma$ is the score corresponding to a match or
    mismatch and $\gamma$ represents the gap penalty. The regions of high similarity are
    found by locating the maximum score in the score matrix. The aligned sequences are
    retrieved by reading the trace back matrix, following the directions starting from
    the cell having the maximum value. Figure 3.2 depicts the local alignment of DNA
    sequences.

    Figure 3.2: Local Alignment for DNA Sequences
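    A compact Python sketch of local alignment on opcode sequences follows. The function
    name smith_waterman and the scoring values (match = 2, mismatch = -1, gap = -1) are
    assumptions for the example. To stay short, the sketch does not store a separate
    trace back matrix; it re-derives each direction from the score matrix, which yields
    one optimal local alignment.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_x, aligned_y) for the best local alignment."""
    m, n = len(x), len(y)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    aligned_x, aligned_y = [], []
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:   # stop at the first zero cell
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return best, aligned_x[::-1], aligned_y[::-1]


x = ["nop", "nop", "push", "mov", "add", "ret"]
y = ["xor", "push", "mov", "add", "jmp", "jmp"]
best, local_x, local_y = smith_waterman(x, y)
print(best, local_x, local_y)
```

    Only the shared push, mov, add region is reported; the dissimilar prefix and suffix
    of both sequences are left out, unlike in a global alignment.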

    3.3 Multiple Sequence Alignment Method

    The multiple sequence alignment (MSA) method is used to align more than two
    sequences at a time. An MSA can be built up by repeatedly applying global/local
    alignment to two sequences and later aligning the resulting alignments with further
    sequences. In the proposed methodology (MOMENTUM), MSA in particular is used to
    determine related functional and structural aspects of opcode sequences in terms of
    signature(s).

    Given a set of k malware samples with opcode sequences M1, M2, ..., Mk, gaps are
    inserted while aligning the opcode sequences so that all opcode sequences have the
    same length. In this way similar opcode subsequences are conserved and the number of
    gaps is minimized. Figure 3.3 depicts the MSA of five malware sequences. Two common
    methods of implementing MSA are:

    1. Iterative method

    2. Progressive alignment method

    The iterative method repeatedly realigns the initial sequences while adding new
    sequences to the growing MSA. The second, progressive alignment, is the most widely
    used method of building an MSA and uses a heuristic based progressive technique.

    Figure 3.3: Multiple aligned opcode sequences corresponding to malware samples.

    3.3.1 Iterative Method

    The iterative alignment method builds an initial alignment of the sequences and then
    refines it; it is primarily used to improve the overall alignment score. A tree is
    created which depicts the order in which nodes are aligned. The tree is read in a
    bottom up fashion, repeatedly aligning sequences until the root node is visited,
    which gives the complete alignment for a family. The main advantage of the iterative
    alignment method is that it is fast and scales to a large number of sequences. Its
    limitation is that a misalignment is preserved and propagated to all sequences.

    3.3.2 Progressive Alignment

    Progressive alignment (also known as the hierarchical or tree method) builds up a
    final MSA by combining pairwise alignments, beginning with the most similar pair and
    progressing to the most distantly related.

    The progressive alignment method identifies the most similar instances and aligns
    them first. Successively less similar instances are then added to the initial
    alignment. This process is repeated until the combined result of aligning the opcode
    sequences of a malware family is obtained. ClustalW [23] is a progressive alignment
    technique which is based on a dynamic programming (DP) approach. Figure 3.4 shows
    the aligned sequences obtained using the progressive alignment method.

    Figure 3.4: Progressive Alignment

    The basic progressive alignment approach involves three steps:

    1. Compute the distance matrix using pairwise alignment for all pairs of malware
       sequences in a family.

    2. Construct a phylogenetic tree using the distance matrix as heuristic. A
       phylogenetic tree illustrates the evolutionary relationship among various
       biological species. Figure 3.5 depicts a phylogenetic tree for five different
       sequences. In this figure each set of closely related sequences has a common
       root node. The Neighbour-Joining (NJ) [24] method is used to construct the tree.
       The phylogenetic tree, used as a guide tree, defines the order in which the
       sequences are aligned in the next step.

    3. Construct the MSA by traversing the guide tree bottom up, aligning opcode
       sequences according to their evolutionary relationship, with similar ones
       aligned first followed by the less similar instances.

    Figure 3.5: Phylogenetic tree.
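    How a distance matrix turns into an alignment order can be sketched in Python. Note
    that the sketch swaps in simple average-linkage clustering for Neighbour-Joining in
    order to stay short, so it illustrates the idea of a guide tree rather than the NJ
    method itself. The function names guide_tree and merge_order, the sample names and
    the distance values are all hypothetical.

```python
from itertools import combinations


def leaves(tree):
    """List the sample names below a node of the nested-tuple tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def guide_tree(names, distance):
    """Join the two closest clusters repeatedly (average linkage, not NJ).

    `distance` maps frozenset({a, b}) to the pairwise distance of samples a and b.
    """
    def cluster_distance(a, b):
        pairs = [(p, q) for p in leaves(a) for q in leaves(b)]
        return sum(distance[frozenset(pair)] for pair in pairs) / len(pairs)

    clusters = list(names)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [(a, b)]
    return clusters[0]


def merge_order(tree):
    """Bottom-up (post-order) list of the merges to perform when building the MSA."""
    if isinstance(tree, str):
        return []
    return merge_order(tree[0]) + merge_order(tree[1]) + [tree]


names = ["M1", "M2", "M3", "M4"]
distance = {
    frozenset({"M1", "M2"}): 2, frozenset({"M1", "M3"}): 6,
    frozenset({"M1", "M4"}): 7, frozenset({"M2", "M3"}): 5,
    frozenset({"M2", "M4"}): 8, frozenset({"M3", "M4"}): 3,
}
tree = guide_tree(names, distance)
print(tree)
print(merge_order(tree))
```

    With these distances M1 and M2 are aligned first, then M3 and M4, and finally the
    two resulting alignments are aligned with each other.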

    Chapter 4

    Metamorphic Malware

    Exploration Technique Using

    MSA (MOMENTUM)

    Metamorphic malware have the ability to modify and replicate themselves. They are
    equipped with a metamorphic engine which generates variants using code obfuscation
    techniques. The opcode sequence which represents maliciousness is transformed by the
    metamorphic engine to obscure the infection mechanism. Sequence alignment methods
    can be used to determine the conserved regions of opcodes which are similar across
    opcode sequences. Also, the mismatches can be analyzed to determine the semantic
    equivalence of instructions. In this chapter, we discuss the applicability of
    various sequence alignment methods in the different phases of the proposed
    Metamorphic Malware Exploration Technique Using MSA (MOMENTUM) for detection and
    classification of malware executables. Figure 4.1 briefly outlines the implemented
    method.

    4.1 Data acquisition

    Experiments are conducted on malware and benign samples in Portable Executable (PE)
    [25] format. The malware samples are collected from varied sources, which include
    synthetic malware created using virus kits like NGVCK, MPCGEN, G2 and PSMPC, and
    real malware collected from VX Heavens and user agencies. The gathered malware
    samples are scanned using 14 antiviruses (trial period) and classified into
    different families. Benign samples are collected from the System 32 folder of a
    fresh installation of the Windows XP operating system. Some benign samples are
    collected from different sites and include games, browsers, media players etc. Each
    benign sample is also scanned using the antiviruses.

    Figure 4.1: Brief Outline of Method for Metamorphic Malware Detection

    Since most of the collected malware samples are packed, the samples are unpacked
    using signature based unpackers like PEiD and GUNPacker [3] and dynamic unpackers
    like EtherUnpack. The details of unpacking are discussed in Appendix A. Table 4.1
    gives the description of the data set used in the experiment.

    Table 4.1: Dataset Description

    TYPE           SOURCE                          NO. FAMILIES   NO. SAMPLES
    Synthetic      NGVCK, G2, PSMPC, MPCGEN        46             1051
    Real Malware   User Agencies, Vx Heavens       57             1330
    Benign         System 32, Cygwin, games etc.   1              1064

    4.2 Analysis of metamorphism in Tools/Real malware

    In the proposed work the metamorphism amongst the malware samples generated with
    various constructors is analyzed. A similar experiment is conducted on real malware
    samples collected from Vx Heavens and user agencies. Initially, pairwise alignments
    are computed for all opcode sequences of the malware samples using global and local
    alignment methods. Two types of analysis are performed: (a) intra family and
    (b) inter family analysis. From the intra family pairwise alignment we obtain the
    distances between samples, a base file and the opcode sequence alignments between
    the malware samples. The average distance of samples in a family is computed, which
    is useful for investigating the degree of metamorphism in a family of malware. With
    the opcode sequence alignments we can determine the types of instructions
    contributing to obfuscation. Inter family pairwise alignment between the base
    malware is performed to determine whether different malware families overlap.
    Figure 4.2 depicts the method of identification of metamorphism in synthetic and
    real malware. It is also observed that in most cases mov, push and pop instructions
    are used.

    Figure 4.2: Method for Investigation of Metamorphism.

    4.2.1 Type of obfuscation

    Metamorphic engines make use of instruction substitution or permutation as a way of
    obfuscation. The affected opcodes appear as a mismatch or gap in the alignment and
    mark a point of mutation. In the malware families studied, both single and multiple
    instruction replacements are usually observed. These replacements are incorporated
    by the metamorphic engine while maintaining the functionality of the variants of a
    family, in order to evade detection. Table 4.2 lists some of the instructions used
    for obfuscation in the collected malware samples.

    4.2.2 Indentification of Base Malware

    The Sum of Pair (SOP) alignment method computes the pairwise alignment

    between every pair of opcode sequence. At a time three sequences could

    be aligned by constructing a cube like structure. This method is imposesconstraint on the system with respect to the memory and space utilization.

  • 5/28/2018 Bioinformatics Techniques for Metamorphic Malware Analysis and Detection: Grijesh

    39/60

    Chapter 4. Metamorphic Malware Exploration Technique Using MSA(MOMENTUM) 30

    Table 4.2: Replacement of opcodes for malware generators (NGVCK, G2, PSMPC,
    MPCGEN). For all generators, mov, push, pop and jump instructions are replaced.

    NGVCK        G2             PSMPC       MPCGEN
    add mov      int call       jnz loop    mov pop
    push mov     mov pop        -           cmp mov
    mov pop      lea mov        -           int mov
    call mov     xor cwd        -           mov lea
    mov sub      mov movsb      -           jmp int
    push add     rep movsb      -           call add
    mov xor      xor mov        -           add movsw
    and mov      cwd mov        -           lea jmp
    mov jz       int inc        -           movsw mov
    mov cmp      movsb movsw    -           push pop

    To align three sequences the running time complexity is $(2^3 - 1)n^3$ or
    $O(n^3)$. In general, for k sequences the running time complexity is
    $O((2^k - 1)n^k)$ or $O(2^k n^k)$. Thus, it can be inferred that the alignment
    between two sequences can be extended to k sequences, but the running time
    increases exponentially.
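    As a rough illustration of this growth, take an assumed sequence length of
    n = 1000 opcodes (a value chosen for the example, not one reported in the
    experiments). The number of cell updates for k = 2, 3 and 4 sequences is
    then approximately:

```latex
\[
(2^2 - 1)\,n^2 = 3 \times 10^{6}, \qquad
(2^3 - 1)\,n^3 = 7 \times 10^{9}, \qquad
(2^4 - 1)\,n^4 = 1.5 \times 10^{13}
\]
```

    Each additional sequence multiplies the work by a factor of a little over
    2n, which is why exact alignment quickly becomes impractical beyond a few
    sequences.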

    A method known as the Star Sequence Alignment method is used to align
    multiple sequences. In this method a malware sample $M_c$ is selected as the
    central or base file. Then, the optimal alignment of all instances $M_i$ with
    $M_c$ is computed, and each new sample is aligned with the base file by
    inserting gaps to finally form the multiple aligned sequence. Figure 4.3
    depicts the pairwise alignment of the samples and the selection of the
    central file using the Sum of Pairs method.
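    The selection of the base file can be sketched as follows. This is a
    minimal illustration, not the thesis implementation: it assumes that the
    pairwise comparison is a plain Levenshtein distance over opcode mnemonics
    and that the base file is the sample with the smallest sum of distances to
    all other samples. The names `levenshtein`, `select_base` and the toy
    family are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two opcode sequences (lists of mnemonics)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def select_base(samples):
    """Return (name, sum_of_distances) of the sample closest to all others."""
    totals = {name: 0 for name in samples}
    for p, q in combinations(samples, 2):
        d = levenshtein(samples[p], samples[q])
        totals[p] += d
        totals[q] += d
    base = min(totals, key=totals.get)
    return base, totals[base]


# Toy family (assumed data): M2 differs from each other sample by one edit.
family = {
    "M1": ["push", "mov", "sub", "call", "pop"],
    "M2": ["push", "mov", "add", "call", "pop"],
    "M3": ["push", "mov", "add", "nop", "call", "pop"],
}
print(select_base(family))
```

    In a full star alignment, every other sample would then be aligned against
    the selected base and the gaps merged into one multiple alignment.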

    4.3 Signature Modeling and Testing

    In this phase of the method signature(s) are extracted from the data set.
    The data set is initially partitioned into a train and a test set. Signatures
    for each family are extracted from the MSA of that family. Figure 4.4
    depicts the phases involved in modeling the signatures.

    Figure 4.3: Malware samples arranged in a star-like fashion, with $M_2$ as
    the base sample, $M_1$ the closest and $M_5$ the farthest sample from the
    base. The closest sample is the most similar to the base malware sample.

    Figure 4.4: Signature Modelling and Testing

    4.3.1 Single Signature

    The opcode sequences of each malware family are aligned using MSA. From
    each row of the aligned MSA sequences, an opcode that appears in 60% of the
    samples in that row is preserved. The combination of all such opcodes from
    all rows of an MSA is considered the single signature for a family.
    Figure 4.5 shows a single signature extracted from the MSA of opcode
    sequences.

    Figure 4.5: Extraction of single signature.
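    The extraction of a single signature can be sketched as below. This is an
    illustrative reading of the description, under these assumptions: the MSA
    is given as equal-length lists of mnemonics with "-" marking a gap, each
    aligned position (what the text calls a row) is inspected across all
    samples, and the 60% support is passed as an integer percentage. The
    function name and the toy alignment are hypothetical.

```python
from collections import Counter

GAP = "-"


def single_signature(msa, min_support_pct=60):
    """Keep the dominant opcode of every aligned position with enough support."""
    n = len(msa)
    signature = []
    for position in zip(*msa):  # one aligned position across all samples
        opcode, count = Counter(position).most_common(1)[0]
        # Integer comparison avoids floating point issues at exactly 60%.
        if opcode != GAP and count * 100 >= min_support_pct * n:
            signature.append(opcode)
    return signature


# Toy alignment of five variants (assumed data).
msa = [
    ["push", "mov", "add", "-",   "call", "pop"],
    ["push", "mov", "sub", "-",   "call", "pop"],
    ["push", "mov", "add", "nop", "call", "pop"],
    ["push", "lea", "add", "-",   "call", "pop"],
    ["push", "mov", "xor", "nop", "jmp",  "pop"],
]
print(single_signature(msa))
```

    Positions dominated by gaps, or where no opcode reaches the support level,
    contribute nothing to the signature.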

    4.3.2 Group Signature

    Each malware family is subdivided into a number of smaller groups based on
    a phylogenetic tree. All samples which are close based on the distance are
    grouped to form a subgroup. A subgroup may contain two or more samples;
    their opcode sequences are aligned using MSA and a single signature for each
    subgroup is extracted. Thus, for k subgroups we obtain k signatures. An MSA
    of the k signatures is then created and a wildcard based signature is
    retained. This signature is also referred to as the group signature. The main
    advantage of representing the group signature with wildcards is that it saves
    time during the testing phase; otherwise a test sample needs to be checked
    against i prominent signatures from the k subgroup signatures, where i < k.
    Figure 4.6 shows the wildcard representation of the group signature, where
    $M_t$ is the malware test sample.

    Figure 4.6: Wildcard based representation of Group signature.
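    One possible way to build and apply such a wildcard signature is sketched
    below. The thesis does not spell out the exact rule, so this is an
    assumption: positions where all aligned subgroup signatures agree keep
    their opcode, every other position becomes the wildcard "*", and a test
    window matches when all non-wildcard positions agree. All names and data
    are hypothetical.

```python
WILDCARD = "*"


def group_signature(aligned_signatures):
    """Collapse aligned subgroup signatures into one wildcard signature."""
    merged = []
    for position in zip(*aligned_signatures):
        agree = len(set(position)) == 1
        merged.append(position[0] if agree else WILDCARD)
    return merged


def matches(signature, window):
    """True if an opcode window of equal length fits the wildcard signature."""
    return len(signature) == len(window) and all(
        s == WILDCARD or s == w for s, w in zip(signature, window)
    )


# Three aligned subgroup signatures of one family (assumed data).
subgroups = [
    ["push", "mov", "add", "call", "pop"],
    ["push", "mov", "sub", "call", "pop"],
    ["push", "lea", "sub", "call", "pop"],
]
sig = group_signature(subgroups)
print(sig)
print(matches(sig, ["push", "mov", "xor", "call", "pop"]))
print(matches(sig, ["push", "mov", "xor", "jmp", "pop"]))
```

    A single pass against the wildcard signature replaces separate checks
    against each subgroup signature, which is the time saving the text refers
    to.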

    4.3.3 Testing

    The last module of MOMENTUM determines the family to which the unseen
    samples (malware/benign) belong. This is determined by aligning the test
    samples against the single and group signatures of each family. An unseen
    sample is said to belong to a family if aligning it with the signature(s)
    gives a high score or a low distance.

    The threshold of each malware family is determined, and samples in the test
    set are detected using three types of signature. For computing the threshold
    corresponding to a family, both malware and benign samples in the training
    set are considered. Each variant and each benign sample is matched with the
    signature(s) and a score is determined. A higher score represents a better
    match with a signature. The threshold th for a family is determined as follows.

    $th = \frac{B_{max} + M_{min}}{2}$

    where $B_{min}$, $B_{max}$ denote the minimum and maximum scores of benign
    samples with the signature(s). Similarly, $M_{min}$, $M_{max}$ represent the
    lowest and highest scores of malware samples with the signature(s). A test
    sample t is considered benign if the score obtained by aligning it with the
    signature is less than the threshold th; otherwise the sample is flagged as
    malware.
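    The threshold rule above can be written as a short sketch. The scores
    used here are made-up integers for illustration only; the function names
    are hypothetical.

```python
def family_threshold(benign_scores, malware_scores):
    """th = (B_max + M_min) / 2 computed from the training scores of a family."""
    return (max(benign_scores) + min(malware_scores)) / 2


def classify(score, th):
    """Benign if the score is below the threshold, otherwise malware."""
    return "benign" if score < th else "malware"


# Assumed training scores against one family signature.
benign = [21, 35, 40]
malware = [62, 75, 90]

th = family_threshold(benign, malware)
print(th, classify(48, th), classify(55, th))
```

    Placing the threshold midway between the best scoring benign sample and
    the worst scoring malware sample only separates the two classes cleanly
    when $B_{max} < M_{min}$ on the training set.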

    Chapter 5

    Result and Inferences

    The experiments are performed on an Intel Core i7 870 processor with 8GB
    RAM. Tools such as the IDA Pro disassembler, GUNPacker and Ether are
    installed on the machines and used for (a) packed executable analysis and
    (b) disassembling code. The data set consists of synthetic and real malware
    families. Malware samples are collected from the VX Heavens repository and
    user agencies, and some have been constructed using malware constructors
    like NGVCK (Next Generation Virus Construction Kit), G2, PSMPC, and MPCGEN.
    The different phases of the experiments are as follows.

    1. Dataset preparation: Collected samples of malware and benign
       executables are scanned using 14 antiviruses. Using the scan reports of
       the antiviruses, malware executables are separated into different
       families. The entire data set is divided into two parts, one for training
       and the other for testing. Executables are disassembled using the IDA Pro
       disassembler to obtain their assembly code, and mnemonics are extracted
       from each assembly representation of the malicious/benign files.

    2. Validation of obfuscation: From each representative malware family
       a central or base file is selected. Sequence alignment techniques are
       applied within the family to obtain alignments for each pair of samples.
       Alignments depict points of match and mutation. The total number of
       mutations in the malware dataset is estimated.

    3. Metamorphism in Malware Tools: Inter family pairwise analysis
       is performed amongst all base samples selected for each family. If the
       distance between any two base malware is very small, the families
       are considered to overlap.

    4. Signature Modelling: Two types of signature are extracted from the
       MSA of each malware family. These signatures are referred to as (a)
       single and (b) group. A training model is prepared with the malware and
       benign samples in the dataset, and the threshold for each malware family
       is determined. Unseen samples (of the test set) are tested using the
       threshold determined during training, and the evaluation metrics are
       computed.
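    The mnemonic extraction mentioned in phase 1 can be sketched as below.
    The listing format is an assumption made for this example (one
    instruction per line in the shape "segment:address mnemonic operands");
    real disassembler output varies and would need a more careful parser.
    The regular expression, the directive list and the sample listing are
    all hypothetical.

```python
import re

# Assumed line shape: "<segment>:<hex address> <mnemonic> [operands]"
LINE = re.compile(r"^\S+:[0-9A-Fa-f]+\s+([a-z]+)(?:\s|$)")

# Data and assembler directives that are not opcodes (partial, assumed list).
DIRECTIVES = {"db", "dw", "dd", "dq", "align", "assume", "public", "extrn"}


def extract_mnemonics(listing):
    """Return the opcode mnemonic sequence of a disassembly listing."""
    opcodes = []
    for line in listing.splitlines():
        match = LINE.match(line.strip())
        if match and match.group(1) not in DIRECTIVES:
            opcodes.append(match.group(1))
    return opcodes


sample = """\
.text:00401000 sub_401000 proc near
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 call    sub_401020
.text:00401008                 pop     ebp
.text:00401009                 retn
.data:00402000                 db 0
"""
print(extract_mnemonics(sample))
```

    Operands are discarded on purpose, since the alignment in this work
    operates on mnemonic sequences only.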

    5.1 Evaluation Metrics

    Experimental results are evaluated using metrics such as TPR, TNR, FPR and
    FNR. These metrics are computed from True Positives (TP), True Negatives
    (TN), False Positives (FP) and False Negatives (FN). TP is the number of
    malware samples correctly classified as malware, TN is the number of
    correctly classified benign instances, FP is the number of benign samples
    incorrectly classified as malware and FN is the number of malicious samples
    classified as benign. The performance of any detector/scanner can be
    measured primarily by checking the True Positive Rate (TPR) and True
    Negative Rate (TNR), which are also known as sensitivity and specificity
    respectively.

    1. True Positive Rate (TPR):
       $TPR = \frac{TP}{TP + FN}$

    2. False Positive Rate (FPR):
       $FPR = \frac{FP}{FP + TN}$

    3. True Negative Rate (TNR):
       $TNR = \frac{TN}{TN + FP}$

    4. False Negative Rate (FNR):
       $FNR = \frac{FN}{FN + TP}$

    For a protection system, high values of TPR and TNR along with low FPR and
    FNR are required. This ascertains that the scanner is capable of correctly
    identifying samples as malware or benign.
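    The four rates can be computed from the confusion counts as in the
    following sketch. The counts in the example are invented for
    illustration and are not results from this work.

```python
def rates(tp, tn, fp, fn):
    """Compute TPR, FNR, TNR and FPR from the four confusion counts."""
    return {
        "TPR": tp / (tp + fn),
        "FNR": fn / (fn + tp),
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
    }


# Assumed counts: 100 malware and 100 benign test samples.
m = rates(tp=90, tn=80, fp=20, fn=10)
print(m)
```

    Since TPR + FNR = 1 and TNR + FPR = 1, reporting one rate of each pair is
    sufficient; the redundancy is also a quick consistency check on a results
    table.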

    5.2 Intra Family Analysis

    Figure 5.1 shows the intra family analysis for malware constructors.

    Figure 5.1: Intra Family Analysis of malware (Synthetic and Real).

    From the graph we can observe the following:

    • Non-zero values indicate the presence of metamorphism in synthetic data.
    • Levenshtein distance is high due to junk code insertion.
    • In spite of high values of global distance, local distances are low in
      most of the samples. This indicates the presence of similar regions in
      the code.
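    The contrast between global and local comparison can be reproduced on a
    toy example. The sketch below assumes a unit-cost Levenshtein distance as
    the global measure and a Smith-Waterman score (match +1, mismatch -1,
    gap -1) as the local measure; these scoring values and the sample
    sequences are assumptions, not the parameters used in the experiments.

```python
def levenshtein(a, b):
    """Global edit distance between two opcode lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score of the best matching region of two opcode lists."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


core = ["push", "mov", "add", "call", "pop"]
junk = ["nop", "xchg", "nop", "nop"]
variant = junk + core + junk  # the same core surrounded by junk code

print(levenshtein(core, variant))            # every junk opcode counts as an edit
print(local_alignment_score(core, variant))  # the whole core still aligns
```

    A high local score corresponds to a low local distance: the junk code
    inflates the global measure while the shared region remains intact.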

    5.3 Inter Family Analysis

    Inter family analysis is performed by comparing the base samples of different

    families. Figure 5.2 shows inter family analysis of malware families.

    Figure 5.2: Inter Family Analysis of malware (Synthetic and Real).

    • Distance is less than the intra family distance. This indicates that most
      malware share some base code and could be detected using a common
      signature.
    • Levenshtein distance is relatively high in comparison to local and
      Needleman-Wunsch alignments because of the variable functionality of the
      code, which results in an increase in the number of gaps in the alignment.

    5.4 Comparative Analysis

    This section shows comparative analysis among different types of samples

    based on various parameters (a) alignment per samples (b) average sum of

    distance and (c) degree of obfuscation (refer Table 5.1).

    Table 5.1: Comparative Analysis of Malware Samples

    Virus Type    Replacement    Avg. SoD    Obfuscation    Alignment
    NGVCK         47             1.03        Average        Simple
    G2            3              1.45        Low            Simple
    MPCGEN        31             0.61        Average        Simple
    PSMPC         1              1.35        Low            Weak
    Vx Heavens    122            8.3         Large          Complex

    • Viruses generated using tools belong to the same family.
    • Families of real malware are distinct.
    • In PSMPC, loop and jump instructions contribute to obfuscation; this
      increases the distance between samples.

    • NGVCK viruses overlap with real malware (Savior).
    • mov, add, sub, push and pop have been replaced most of the time with
      equivalent instructions.
    • Obfuscation is primarily single instruction replacement rather than
      multiple instruction replacement. This is validated by observing the
      global and local alignments of samples. The types of mismatch in global
      and local alignment are the same, suggesting less complex obfuscation.

    5.5 Testing with Signature

    Malware samples are separated into a number of families using the scanners.
    For each malware family two types of signature (single and group) are
    extracted. The single signature is the maximum preserving opcode sequence
    in the multiple aligned sequence of a family of malware. Each row of the MSA
    depicts a match, mismatch or gap corresponding to the opcode sequences. The
    group signature is the wildcard representation of the signatures of the
    subfamilies in a family. Table 5.2 shows the values of the evaluation metrics
    for the different types of signature.

    Table 5.2: Evaluation Metrics for different types of signatures.

    Types of Signatures    TPR     FNR      TNR     FPR
    Single                 0.95    0.046    0.48    0.52
    Group                  0.73    0.27     0.99    0.01

    It is observed that with the single signature the detection rate is
    approximately 95% with an FPR of 52% (Table 5.2). This indicates that most
    of the malware samples are detected but many benign samples are incorrectly
    classified as malware. Since the single signature is constructed by
    extracting the maximum preserving (55%) opcodes in each MSA row, opcodes
    responsible for mutations are lost in the signature (they appear to be less
    dominant). Thus, most of the benign samples in the test set score well with
    the signature and are detected as malware.

    In the case of the group signature a detection rate of 73% is obtained with
    a very low false positive rate (FPR = 0.01). This indicates that malware
    samples in the

    test set are detected by the wildcard representation of the signature. The
    group signature is the wildcard representation of the signatures of the
    subfamilies of a family. The opcode sequences present in this signature are
    absent in benign samples; thus, benign samples can be discriminated from
    the malware samples.

    5.6 Comparative Analysis with Antiviruses

    The entire dataset was scanned using 14 antiviruses and the detection rate
    was computed from their scan reports. Figure 5.3 depicts the detection rates
    obtained from the antiviruses and MOMENTUM. The top five detection rates
    were obtained with Avast, Avira, AVG, GData and Kaspersky (arranged in
    ascending order of detection rate). It was observed that the detection rate
    of MOMENTUM is close to that of the top three commercial antivirus products.
    Some of the malicious files (37 malware in total) were not detected by any
    of the antiviruses.

    Figure 5.3: Detection rate of antiviruses compared with the different types
    of constructed signature.

    Out of the 37 malware executables undetected by the antiviruses, our
    implementation (MOMENTUM) detected 30 with the single signature and 20 with
    the group signature (wildcard signature). The effectiveness of the method
    suggests that bioinformatics sequence alignment methods could be used
    effectively to detect malware. Also, these methods could be used for
    generating malware signatures and for assisting scanners in detection.

    Chapter 6

    Conclusions and Future Work

    Malicious software (malware) is a major threat to computer systems. Malware
    detection mechanisms are gaining prominence amongst researchers and have
    become an active topic of research. The number of malware has increased at
    an alarming rate because malware writers are deploying obfuscation methods.
    Non-signature based detection methods are important as malware writers are
    producing metamorphic or polymorphic malware. Thus, a strong signature based
    method is required to detect this modern malware.

    In this thesis the problem of detecting metamorphic malware is addressed
    using MSA methods. Signature(s) (single and group) for a malware family
    are extracted and tested using unseen samples. Metamorphism amongst
    malware constructors and real malware is explored. It was found in this
    investigation that the malware constructors used minimal obfuscation, mainly
    single and multiple instruction replacement. Primarily the obfuscation found
    was code reordering.

    The detection rate of the implemented method (MOMENTUM) is also compared
    with that of antiviruses. It was observed that the unseen samples were
    detected using signatures with low false positives. Also, the detection
    rate of the implemented method is comparable with that of antiviruses like
    Avast, Avira and AVG. Some of the malware executables undetected by all
    antiviruses were detected by MOMENTUM. In continuation of the present work,
    a suitable scoring scheme could be devised that could identify unseen
    samples. This could be initiated by assigning some weights to mnemonic

    pairs that are responsible for mutation. Also, the operands of instructions

    could be considered to improve detection rates.

    Appendix A

    Executable Unpacking

    A packer is a program used to compress or encrypt an executable, thereby
    reducing its size and protecting it from reverse engineering. Most packers
    are dependent on a specific file format like Portable Executable (PE) or
    Dynamic Link Library (DLL). The packed executable is restored to its
    original form once it is loaded in memory. Malware authors use packers to
    avoid detection by antivirus products, as the malicious code is hidden from
    the scanners. Basically, we can think of a packer as software which places
    an executable inside another executable. Thus, the outer executable is
    responsible for unpacking the original executable which is hidden by the
    packer.

    The basic function of packers is to encrypt the code, resources and import
    table. Executable packers insert a random number of jump instructions in
    order to confuse disassemblers. Advanced packers also encrypt the Portable
    Executable (PE) sections so that the antivirus virtually fails to scan the
    actual malicious code. Static analysis of packed code is not possible as the
    malicious payload is unpacked only at runtime. Thus, an antivirus using a
    sandbox environment has the capability of unpacking the executable by
    executing each suspicious sample. However, unpacking executables is
    computationally expensive. If malware is analyzed for detection without
    being unpacked, we basically scan the packer code instead of the malicious
    executable code. Unpacking could be performed using a generic unpacker like
    GUNPacker [3]. The basic problems with such signature based unpackers are
    (a) packer signatures need to be updated periodically and (b) difficulty in
    the detection of multiple layer packed executables.

    Another way of unpacking software is to use Ether Unpack [4]. The main
    problem with Ether is that it requires a dedicated operating system and
    hardware. Initially the sample to be unpacked is executed in the guest
    operating system (Windows XP SP2) and Ether tries to locate all memory
    writes that are performed by the executing process. Whenever a memory write
    operation is performed the process dump is stored under the images
    directory. Ether considers each memory write operation as a candidate
    Original Entry Point (OEP). Figure A.1 depicts the process of unpacking
    executables (malware/benign) using signature based unpackers and Ether
    Unpack.

    Figure A.1: Portable Executable Unpacking Procedure

    A.1 Symptoms of Packed Malicious Executables

    Packed PE files can be detected using signature based, heuristics based or
    dynamic unpackers. Native and packed malicious code differ in several ways,
    which are listed below.

    (i.) Nonstandard section names: Most compilers and linkers follow a
         convention for naming sections. Executable packers use nonstandard
         section names such as .upx0, .upx1, etc. in the packed code.

    (ii.) Small Code Section: Packed code contains a small code section and a
         populated data section. Disassemblers therefore expose the code of the
         stub instead of the actual code.

    (iii.) Missing String Table: The string table or symbol table is used by
         most compilers to store the addresses of symbols instead of maintaining
         multiple strings in the table. The packer normally encrypts the strings
         and inserts a garbage address for each string in the string table.

    (iv.) Small Import Table size: A native executable has populated entries
         in the Import Address Table (IAT), one for each API. Packed PE samples
         have a small import table with a few imports of common APIs like
         GetProcAddress or LoadLibrary.

    (v.) Execution of Code starts at last section: The PE file is divided
         into logical structures called sections, such as data, code, reloc,
         etc. Some malware packers hide the original entry point and add a new
         section, possibly at the end of all the sections.

    (vi.) Section Characteristics: The characteristics are the flags of each
         section describing the permissions allotted to that section. The code
         section normally has its characteristics flag set as executable but
         lacks write permission. Sections produced by malware packers either
         have both execute and write permission or have their permissions left
         as 0.
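    Some of these symptoms lend themselves to a simple heuristic check, as
    sketched below. The sketch works on an already parsed description of the
    file rather than on a real PE parser; the dictionary layout, the list of
    standard section names and the import count cutoff of 3 are assumptions
    made for this example. The two flag constants are the standard PE section
    characteristics for execute and write permission.

```python
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_WRITE = 0x80000000

# Assumed (partial) set of conventional section names.
STANDARD_NAMES = {".text", ".data", ".rdata", ".bss", ".idata",
                  ".edata", ".rsrc", ".reloc", ".tls"}


def packing_symptoms(sections, import_count, entry_section):
    """List the packing symptoms shown by a parsed PE description."""
    symptoms = []
    if any(s["name"] not in STANDARD_NAMES for s in sections):
        symptoms.append("nonstandard section name")
    if import_count <= 3:  # assumed cutoff for a "small" import table
        symptoms.append("small import table")
    if sections and entry_section == sections[-1]["name"]:
        symptoms.append("entry point in last section")
    for s in sections:
        flags = s["characteristics"]
        writable_code = (flags & IMAGE_SCN_MEM_EXECUTE) and (flags & IMAGE_SCN_MEM_WRITE)
        if writable_code or flags == 0:
            symptoms.append("suspicious section flags")
            break
    return symptoms


# Hypothetical descriptions of a packed and a native executable.
packed = [{"name": ".upx0", "characteristics": 0xE0000080},
          {"name": ".upx1", "characteristics": 0xE0000040}]
native = [{"name": ".text", "characteristics": 0x60000020},
          {"name": ".data", "characteristics": 0xC0000040}]

print(packing_symptoms(packed, import_count=2, entry_section=".upx1"))
print(packing_symptoms(native, import_count=40, entry_section=".text"))
```

    Such checks are only indicators; a legitimate executable may show one or
    two of them, so they are best combined before flagging a file as packed.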

    A.2 Manual Unpacking of Packed Executable

    Packed PE files can be identified using signature based tools, which try to
    match the packer signature of an executable with the known signatures of
    packers stored in a repository. Another way to find out whether an
    executable has been packed using a known packer is to perform entropy
    analysis of the suspicious file. The entropy of the complete file, or of a
    few bytes from the beginning of the file, could indicate whether a file is
    packed or not. Following are the steps adopted to manually unpack malicious
    code (refer Figure

    (i.) The preliminary step is to identify the type of packer used to pack an
         executable. Once the packer is known, we need to locate the original
         entry point of the executable by executing the suspicious sample.

    (ii.) The executable is loaded in OllyDebugger, a break point is set and
         the program is allowed to execute until it stops. At this point the
         memory dump is retrieved. The memory dump contains both the unpacked
         code and the unpacking stub code.

    (iii.) The entry point of the dumped executable still points to the
         starting address of the packer. Since it is required that the unpacked
         data should be executed first, followed by the unpacker code, the
         entry point is calculated as
         RVA Entry Point = OEP - Base Address

    (iv.) Finally the import table is reconstructed by specifying the proper RVA Entry

    point. This total would reconstruct the import address ta