research article malware analysis using …cavazos/cisc850-spring2017/...research article malware...

16
Research Article Malware Analysis Using Visualized Image Matrices KyoungSoo Han, 1 BooJoong Kang, 2 and Eul Gyu Im 3 1 Department of Computer and Soſtware, Hanyang University, Seoul 133-791, Republic of Korea 2 Department of Electronics and Computer Engineering, Hanyang University, Seoul 133-791, Republic of Korea 3 Division of Computer Science and Engineering, Hanyang University, Seoul 133-791, Republic of Korea Correspondence should be addressed to Eul Gyu Im; [email protected] Received 14 March 2014; Accepted 19 May 2014; Published 16 July 2014 Academic Editor: Fei Yu Copyright © 2014 KyoungSoo Han et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. is paper proposes a novel malware visual analysis method that contains not only a visualization method to convert binary files into images, but also a similarity calculation method between these images. e proposed method generates RGB-colored pixels on image matrices using the opcode sequences extracted from malware samples and calculates the similarities for the image matrices. Particularly, our proposed methods are available for packed malware samples by applying them to the execution traces extracted through dynamic analysis. When the images are generated, we can reduce the overheads by extracting the opcode sequences only from the blocks that include the instructions related to staple behaviors such as functions and application programming interface (API) calls. In addition, we propose a technique that generates a representative image for each malware family in order to reduce the number of comparisons for the classification of unknown samples and the colored pixel information in the image matrices is used to calculate the similarities between the images. Our experimental results show that the image matrices of malware can effectively be used to classify malware families both statically and dynamically with accuracy of 0.9896 and 0.9732, respectively. 1. Introduction Malware authors have been generating new malware and malware variants through various means, such as reusing modules or using automated malware generation tools. As some modules for malicious behavior are reused in malware variants, malware variants of the same family may have similar binary patterns, and these patterns can be used to detect malware and to classify malware families. Moreover, most antivirus programs focus on malware signatures, that is, string patterns, to detect malware [1]. However, various detec- tion avoidance techniques such as obfuscation or packing techniques are applied to malware variants to avoid detection by signature-based antivirus programs and to make analysis difficult for security analysts [2, 3]. With the help of malware generation techniques, the amount of malware is increasing every year. Although security analysts and researchers have been studying various analysis techniques to deal with malware variants, they cannot be analyzed completely because the malware in which avoidance techniques are applied is expo- nentially increasing. erefore, new malware analysis tech- niques are required to reduce the burden on security analysts. Recently, several malware visualization techniques have been proposed to help security analysts to analyze malware. In this paper, we propose a novel method to analyze malware visually to classify malware families. e proposed method converts the opcode sequences extracted from the malware into images called image matrices and calculates the similarities between each image. In addition, we apply the proposed method to the execution traces extracted through dynamic analysis, so that malware employing detection avoidance techniques such as obfuscation and packing can be analyzed. To reduce the computational overheads, we extract the opcode sequences only from the blocks that are related to staple behaviors, such as functions and application programming interface (API) calls, by using a major block selection technique [2]. Representative images of individual malware families are generated and are used to classify the unknown sample rapidly. Using these image matrices, we Hindawi Publishing Corporation e Scientific World Journal Volume 2014, Article ID 132713, 15 pages http://dx.doi.org/10.1155/2014/132713

Upload: nguyenbao

Post on 11-Mar-2019

217 views

Category:

Documents


0 download

TRANSCRIPT

Research ArticleMalware Analysis Using Visualized Image Matrices

KyoungSoo Han1 BooJoong Kang2 and Eul Gyu Im3

1 Department of Computer and Software Hanyang University Seoul 133-791 Republic of Korea2Department of Electronics and Computer Engineering Hanyang University Seoul 133-791 Republic of Korea3 Division of Computer Science and Engineering Hanyang University Seoul 133-791 Republic of Korea

Correspondence should be addressed to Eul Gyu Im imeghanyangackr

Received 14 March 2014 Accepted 19 May 2014 Published 16 July 2014

Academic Editor Fei Yu

Copyright copy 2014 KyoungSoo Han et alThis is an open access article distributed under theCreativeCommonsAttribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

This paper proposes a novel malware visual analysis method that contains not only a visualization method to convert binary filesinto images but also a similarity calculationmethod between these imagesThe proposedmethod generates RGB-colored pixels onimage matrices using the opcode sequences extracted frommalware samples and calculates the similarities for the image matricesParticularly our proposed methods are available for packed malware samples by applying them to the execution traces extractedthrough dynamic analysis When the images are generated we can reduce the overheads by extracting the opcode sequences onlyfrom the blocks that include the instructions related to staple behaviors such as functions and application programming interface(API) calls In addition we propose a technique that generates a representative image for eachmalware family in order to reduce thenumber of comparisons for the classification of unknown samples and the colored pixel information in the image matrices is usedto calculate the similarities between the images Our experimental results show that the image matrices of malware can effectivelybe used to classify malware families both statically and dynamically with accuracy of 09896 and 09732 respectively

1 Introduction

Malware authors have been generating new malware andmalware variants through various means such as reusingmodules or using automated malware generation tools Assome modules for malicious behavior are reused in malwarevariants malware variants of the same family may havesimilar binary patterns and these patterns can be used todetect malware and to classify malware families Moreovermost antivirus programs focus onmalware signatures that isstring patterns to detectmalware [1]However various detec-tion avoidance techniques such as obfuscation or packingtechniques are applied to malware variants to avoid detectionby signature-based antivirus programs and to make analysisdifficult for security analysts [2 3] With the help of malwaregeneration techniques the amount of malware is increasingevery year

Although security analysts and researchers have beenstudying various analysis techniques to deal with malwarevariants they cannot be analyzed completely because the

malware in which avoidance techniques are applied is expo-nentially increasing Therefore new malware analysis tech-niques are required to reduce the burden on security analystsRecently several malware visualization techniques have beenproposed to help security analysts to analyze malware

In this paper we propose a novel method to analyzemalware visually to classify malware families The proposedmethod converts the opcode sequences extracted from themalware into images called image matrices and calculates thesimilarities between each image In addition we apply theproposed method to the execution traces extracted throughdynamic analysis so that malware employing detectionavoidance techniques such as obfuscation and packing canbe analyzed To reduce the computational overheads weextract the opcode sequences only from the blocks that arerelated to staple behaviors such as functions and applicationprogramming interface (API) calls by using a major blockselection technique [2] Representative images of individualmalware families are generated and are used to classify theunknown sample rapidly Using these image matrices we

Hindawi Publishing Corporatione Scientific World JournalVolume 2014 Article ID 132713 15 pageshttpdxdoiorg1011552014132713

2 The Scientific World Journal

obtain the similarities between the images after the RGB-colored pixel information of the images is vectorized and thepixel similarities are calculated

This paper is composed as follows In Section 2 malwareanalysis-related studies are described In Section 3 malwareanalysis methods using visualized opcode sequences andthe methods to calculate similarity are proposed and theexperimental results are presented in Section 4 Finally inSection 5 conclusions and future directions are provided

2 Related Work

In general malware analysis methods to detect and classifymalware can be categorized as either static or dynamic anal-yses [4] In the static analysis of malware various methodssuch as control flow graph (CFG) analysis [5ndash7] call graphanalysis [8 9] byte level analysis [10] instruction-basedanalysis [11ndash14] and similarity-based analysis [15 16] havebeen proposed

CFGs are generated by dividing the instructions extractedthrough disassembling into blocks and by connecting thedirected edges between the blocks Some malware analysismethods using these CFGs as signatures have been proposedCesare and Xiang [5] proposed a method that defines CFGsas signatures in string form that consist of a list of graphedges for the ordered nodes and thatmeasures the similaritiesamong signatures by using theDice coefficient algorithm [17]Bonfante et al [6] proposed amethod that converts the CFGsinto tree-based finite state machines through syntactic anal-ysis and semantic analysis and then uses them as signaturesBriones and Gomez [7] proposed an automated classificationsystem based on CFGs The CFGs are summarized as threetuples including the number of basic blocks the number ofedges and the number of subcalls and then two functionscan be compared However if the complex information issummarized into a small size high false alarms may occur

There is much research aimed at detectingmalware basedon information such as system-calls functions and API callswhich is used for malware execution in operating systemsShang et al [8] proposed a method that generates function-call graphs which represent the caller and callee relationshipsbetween functions as signatures ofmalware samples and theythen compute the similarities by using those function-callgraph signatures Kinable and Kostakis [9] classifiedmalwareusing the call graph clustering technique Their proposedmethod generated the call graphs against the functionsincluded in the malware samples and they performed theclustering based on the structural similarity scores of the callgraphs calculated through the graph edit distance algorithm

Statistical information regarding the instructionsextracted through disassembling can be used in the staticanalysis of malware Rad and Masrom [11] proposed amethod based on the instruction frequencies in order toclassify metamorphic malware Since instruction frequenciesare mostly not changed even though the obfuscationtechniques are applied to the malware the instructionfrequencies can become the features of malware Thereforetheir proposed method calculated a distance by using theinstruction frequencies extracted from eachmalware sample

and they then classified metamorphic malware by using thedistance value Bilar [12] showed that there were differentinstruction frequencies in different malware Particularlythey showed that rare instructions in malware could becomebetter predictors to classify malware than other instructioncould Han et al [13] proposed a method using instructionfrequencies The proposed method generated instructionsequences that were sorted according to the instructionfrequencies and they showed that the distances betweeninstruction sequences from the same malware family hadlow distance values Santos et al [14] proposed a malwareclassification method using n-gram instruction frequenciesin which n-gram instructions included n-instructions Inthe proposed method they generated the vectors for eachn-instruction sequence and used some of the vectors assignatures

In addition dynamic analysis methods including taint-ing behavior-based methods and API call monitoring havebeen proposed Egele et al [18] proposed a method usingtainting techniques which tracks the behaviors related tothe flow of information that are processed by any browserhelper object (BHO) If the BHO leaks sensitive informationto the outside the BHO is classified as malware Fredriksonet al [19] proposed a method that automatically extractsthe characteristics of behaviors by using graph mining tech-niques Their proposed method made clusters by identifyingcore CFGs for each similar malicious behavior in a malwarefamily and these were then generalized as a significant behav-ior Furthermore methods based on dynamic monitoringtechniques using an emulator have been proposed Vinod etal [20] traced malware API calls via dynamic monitoringwithin an emulator andmeasured their frequencies to extractcritical APIs Miao et al [21] developed a tool called the ldquoAPICapturerdquo that extracts themajor characteristics automaticallysuch as system-call arguments return values and errorconditions by monitoring malware behavior in an emulator

Even though there are many static and dynamic analysesmethods available new techniques that can complementexisting techniques are still needed to improve malwareanalysis performance and conveniences of analysis by secu-rity analysts Recently several visualization methods havebeen proposed to help security analysts to observe thefeatures and behaviors of malware [22] To visualize malwarebehavior Trinius et al [23] proposed amethod that visualizedthe percentages of API calls as well as malware behaviorinto each of two images called a ldquotreemaprdquo and ldquothreadgraphrdquo respectively Saxe et al [24] developed a systemthat generated two types of images One image showedthe system-call sequences extracted from malware system-call behavior logs and the other image showed similaritiesand differences between selected samples Conti et al [25]proposed a visualizing system that shows the images for thebyte information of malware samples such as byte valuesbyte presence and duplicated sequences of bytes containedwithin a sample Anderson et al [26] proposed a methodto show the similarities between malware samples in animage named a ldquoheatmaprdquo Nataraj et al [27] converted thebyte information into gray-scale images and classified themalware using image processing After generating images

The Scientific World Journal 3

Disassemble(static analysis)

PIN(dynamic analysis)

Malware samples

Major block selection

Repetitionfiltering

Parsing

Opcodesequences

Basic blocks blocks

Filteredbasic blocks

Dynamic traces

Major

Representativeimage matrix

generation

Similaritycalculation

Similarities

Step 1Extraction of opcode sequences

Step 2Image matrix generation

Step 3Similarity calculation

Figure 1 Overview of the proposed method

using byte values they applied an abstract representationtechnique for the scene image that is GIST [28 29] tocompute texture features Moreover they proved that thebinary texture analysis techniques using image processingcould classify malware more quickly than existing malwareclassification methods could [30] However since the textureanalysis method has large computational overheads theproposed method has problems in processing a large amountof malware [31]

In this paper we propose a novel analysis method usingimage matrices to represent malware visually so that the fea-tures of themalware can be easily detected and the similaritiesbetween different malware samples can be calculated fasterthan with other visualization methods

3 Our Proposed Method

31 Overview Our proposed visualized malware analysismethod consists of three steps as shown in Figure 1 In Step 1opcode sequences are extracted frommalware binary samplesor dynamic execution traces Then image matrices in whichthe opcode sequences are recorded as RGB-colored pixels aregenerated in Step 2 In Step 3 the similarities between theimage matrices are calculated In the following sections eachstep is explained in detail

32 Extraction of Opcode Sequences Figure 2 shows theprocess to extract opcode sequences from malware binarysamples for Step 1 through static analysis or dynamic analysis

321 Basic Block Extraction To extract opcode sequencesfrom malware binary samples the binary sample files arefirst disassembled and divided into basic blocks using dis-assembling tools such as IDA Pro [32] or OllyDbg [33]However if obfuscation or packing techniques are applied inmalware samples static analysis using a disassembler is notfeasible [34 35] Therefore some malware samples (in whichobfuscation or packing techniques are applied) need to beexecuted in a dynamic analysis environment [36]

In dynamic analysis as shown in Figure 3 some repeatedinstruction sequences are included in the dynamic executiontraces because a program may have some loops or repeatedcalls and these repeated sequences can increase the size

of not only the execution traces but also the processingoverheads Kang et al [37] proposed a repetition filteringmethod for dynamic execution traces Our filtered basicblocks are extracted from the dynamic execution tracesafter the repetition filtering method is applied Finally ifbasic blocks are extracted frommalware samples or dynamicexecution traces then major blocks are selected from thebasic blocks by our proposed technique which is explainedin the next section

322 Major Block Selection The malware analysis methodproposed in this paper does not target all of the basicblocks from the binary disassembling results or dynamicexecution traces If all the basic blocks are used for analysisthen some blocks for binary file execution in an operatingsystem are included in the basic blocks Moreover manymeaningless blocks may be included in the basic blocksextracted from malware samples As a result the number ofbasic blocks that have to be analyzed by the security analystsis increased and distinguishing malware features becomesdifficult In addition the number of comparisons betweenthe basic blocks from two malware samples is also increaseddramatically On the contrary if the number of unnecessaryblocks can be reduced as much as possible in the malwareanalysis the analysis time cost for not only the individualmalware sample but also a large number of malware samplescan be reduced Therefore we selected some blocks relatingto suspicious behaviors and functions from among the entireset of basic blocks

As shown in Figure 4 the blocks selected as major blocksare those that include the CALL instruction which is usedto invoke APIs library functions and other user-definedfunctions This is because not only user-defined functionsbut also various system calls are used to implement thebehaviors and functions of most programs If blocks thatinclude these function invocation instructions are used inmalware analysis malware features can be extracted [2]Through a major block selection technique the image matrixgenerating time is reduced by recording only those selectedblocks in the image matrix

323 Opcode Sequence Extraction To extract malware fea-tures as shown in Figure 5 the opcode sequences in the

4 The Scientific World Journal

Static analysis

(disassemble)

Parsing Opcode sequences

Basic blocks

Repetitionfiltering

Major block

Major block

selection

DynamicDynamic analysis execution (execution) trace

Figure 2 Opcode sequence extraction procedure

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

0042A68B JBE

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS[EDI]AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

MOV ALBYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A696 JMP

0042A5FE ADD EBX EBX

0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A69C

Exploit 0042A5FE

Figure 3 An example of repeated instruction sequences

The Scientific World Journal 5

Major blocks

Basic blocks

MOV MOVMOVLEA MOVMOVMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR

PUSH ebpLEA PUSHPUSH ebxCALLADD POP ebpMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR eax eax

eax eax

JZ

PUSH ebpLEA PUSHPUSH ebxCALLADD esp 8

esp 8

POP ebpMOV eax 1

eax 1 ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

0FFFFFFFFh

0FFFFFFFFh

sub 4019EA

sub 4019EA

dword ptr [edi + ecx lowast 4 + 4]

dword ptr [edi + ecx lowast 4 + 4]

JZ short loc 40192F

ebx [ebp + arg 4]

ebx [ebp + arg 4]

[ebp + var 8] eaxeax [ebp + arg 8][ebp + var 4] eaxeax [ebp + var 8][ebx-4] eaxesi [ebx + 0Ch]edi [ebx+8

short loc 40192F

Figure 4 Major block selection

individual major blocks are used as malware informationFrom each opcode only the first three characters are usedto generate information for the block The reasons for usinga three-character opcode are as follows From the entire setof opcodes used in the Intel x86 assembly language 414 ofthem have three characters and the appearance frequenciesof these opcodes within the binary files are higher thanfor other opcodes On the other hand 288 of opcodeshave four characters 178 have five characters and 52have over six characters respectively thus their appearancefrequencies are relatively low In addition since themeaningsof the individual opcodes are maintained even though theyare reduced to three characters the different opcodes canbe distinguished For example four-character opcodes suchas PUSH are reduced to three characters PUS and two-character opcodes such as OR are expanded by adding ablank character to a three-character opcode Then thesethree-character opcodes are concatenated together and thecharacter string is used to represent the block as an opcodesequence which is used to generate image pixels in an imagematrix in the next step

33 Generation of the Image Matrix Figure 6 shows theprocedure for Step 2 that converts the opcode sequences into

pixels in an image matrix A hash function is used to decidethe119883-119884 coordinates and RGB colors of the pixels

To visualize a binary file as an image matrix both thelength and the width of an image matrix are initialized to 2nwhere 119899 is selected by the users To reduce the probability ofcollisions of the hash function n should be large enough Inour experiments we selected 119899 as 8 to minimize collisions

The coordinate-defining module and the RGB color-defining module are used to generate image matrices Firstthe coordinate-defining module defines the (119909 119910) coordi-nates of pixels on the image matrix of each code blockSecond the RGB color-defining module defines the colorvalues of pixels on the image matrix RGB colors are definedby calculating values of 8 bits each for the red green and bluecolors

SimHash [38] is applied to opcode sequences extracted inStep 1 in order to define both the coordinates and the colorvalues of the pixels SimHash is a local-sensitive hash functionused in the similar sentence detection system which assumesthat if the input values are similar then the output values willalso be similar That is since SimHash tokenizes the inputstrings and generates hash values for each token if a fewtokens are different in two input strings then the generatedhash values are not completely different but are similar

6 The Scientific World Journal

PUSH

PUSHPUSH

PUSHPUSHLEA

ADD

LEA

POP

POPPOP

MOV

MOVORJZ

CALL

CALL

ebp

ebp

ebp

ebp

esi

esi

eax eax

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebx

esp 8

Major blocks Opcode sequences

PUSLEAPUSPUSCALADDPOPMOV

PUSPUSLEACALPOPPOPMOV OR JZ

0FFFFFFFFh

dword ptr [edi + ecx lowast 4 + 4]

ebx [ebp + arg 4]

short loc 40192F

eax 1

sub 4019EA

Figure 5 Opcode sequences used as malware information

37 57 D6 C3 2BX Y R BG

PUSPUSLEACALPOPPOPMOV OR JZ

Malware information

Image matrix

SimHashcoordinate define

SimHashRGB color define

Figure 6 Generating images using opcode sequences

Therefore if the character strings of the opcode sequences aresimilar then the outputs will be similar and they will maponto similar coordinates in an image matrix

Once the coordinates and RGB colors of the individualpixels have been defined RGB-colored images are recordedon the individual coordinates of image matrices To providehuman analysts with amore convenient visual analysis pixelsaround the defined coordinates are recorded simultaneouslyAs shown in Figure 7 nine pixels from (119909 minus 1 119910 minus 1) to (119909 +1 119910+1) around an (119909 119910) coordinate for a block are recorded

If the images overlap each other because the coordinatesdefined formultiple opcode sequences are adjacent as shownin Figure 8 the sums of RGB colors become new pixel colorsIf the result of a color summing exceeds 255 (0xFF) the resultwill be set to 255 For example if RGB

1is (255 0 0) and RGB

2

is (0 176 50) the new color will become (255 176 50)The number of pixels recorded on an image matrix varies

according to themajor blocks and the number of overlappingpixels will increase as the number of images increases If thereare too many overlapping images then the size of the imagematrix should be increased

x minus 1

y minus 1

x minus 1

y

x + 1

y

y minus 1

x

y + 1

x

y minus 1

x + 1

y + 1

x + 1

y + 1

x minus 1

x y

Figure 7 Nine pixels for one opcode sequence

34 Representative Image Matrix Extraction Since manymalware variants exist in eachmalware family as the numberof malware samples increases the total amount and timeof the similarity calculation increase too Therefore we

The Scientific World Journal 7

111

11

11

1

223

2

2

2

2

2

2

R3 = min(R1 + R2 FF) = min (FF + 00 FF) = FF

G3 = min(G1 + G2 FF) = min (FF + B0 FF) = B0

B3 = min(B1 + B2 FF) = min (FF + 50 FF) = 50

there4 RGB3 = (FF B0 50)

Figure 8 Method of recording overlapping pixels

extracted a representative imagematrix of eachmalware fam-ily to reduce the costs of malware similarity calculationsThatis when a new malware sample is found the amount of timeto calculate the similarity is reduced by comparing the imagematrix of the new malware with the image matrices thatrepresent individualmalware families instead of comparing itwith all of the imagematrices of the existingmalware samples

As shown in Figure 9 to extract a representative imagematrix for eachmalware family imagematrices are generatedfor samples in malware families Then the representativeimage matrix is extracted by recording only the commonpixels that have same coordinates and RGB colors from theimage matrices of individual malware samples in the samefamily Figure 9 shows an example of the generation ofrepresentative images of malware families

35 Similarity Calculation Using Image Matrices The advan-tage of the similarity calculation using the imagematrices is afaster performance than with exact matching using the stringtype of opcode sequences even though there are some extrafalse positives due to hash collision When using the stringthe time complexity is defined as 119874(1198992) due to the processof finding pairs of exactly matched strings However if theimage matrices are used to calculate similarities since thecoordinates and colors of the opcode sequences are definedthrough SimHash the process of finding the pairs is skippedTherefore the time complexity of the similarity calculationsusing the image matrices is defined as O(n) because onlythe color information of the pixels recorded on the samecoordinates in both image matrices is used to calculate thesimilarities between the image matrices

Pixel similarity calculations are carried out first for pixelsin each image matrix The most important consideration ina similarity calculation in this case is that only those RGB

color pixels recorded in the individual imagematrices shouldbe used Image matrices have RGB-colored pixels on squareimageswith black backgrounds If black pixels are also used insimilarity calculations the similarities between samples fromdifferent malware families can be calculated as very highTherefore when the similarities of the image matrices arecalculated the following cases are considered for pixels onthe same coordinates in the two image matrices as shownin Figure 10 In this case the vector angular-based distancemeasurement algorithm is used to calculate the similaritiesbetween color pixels This algorithm calculates similarityvalues by expressing the color pixels constituting each imageas 3D vectors as shown in (1) and then using the angleinformation and size information

(a) Case 1 if all of the pixels in the areas of both imagematrices are black the pixel similarity calculation willnot be carried out and the next pixel will be selected

(b) Case 2 if one pixel in a selected area is black and thecorresponding pixel in the other image is colored thepixel similarity will be defined as 0

(c) Case 3 if both pixels are not black but colored thecolor pixel similarity will be calculated using the vec-tor angular-based distance measurement algorithm[39] as follows

120575 (119909119894 119909119895) = [1 minus

2

120587cosminus1(

119909119894sdot 119909119895

10038161003816100381610038161199091198941003816100381610038161003816

10038161003816100381610038161003816119909119895

10038161003816100381610038161003816

)][1 minus

10038161003816100381610038161003816119909119894minus 119909119895

10038161003816100381610038161003816

radic3 sdot 2552]

(1)

The similarity values of the image matrix when consideringindividual cases are calculated using the results from thepixel similarity calculations as shown in (2) That is the sumof pixel similarity values calculated in case 3 is divided by thenumber of pixels calculated in cases 2 and 3 to calculate theaverage

Sim (119860 119861) =sum of pixel similarity values in case 3

of pixels in case 2 and case 3 (2)

4 Experimental Results

41 Experimental Data and Environment Using the visualanalysis tools implemented in this paper and the malwaresamples shown in Table 1 imagematrices were generated andsimilarity calculations were performed First set A consists of290 malware samples from 16 families in which the detectionavoidance techniques such as obfuscation and packing arenot applied These malware samples are used to extract thebasic blocks through static analysis using a disassemblerSecond set B consists of 560 malware samples from 14 fam-ilies in which the packed and nonpacked malware samplescoexist We used these malware samples to generate dynamicexecution traces through the PIN tool in a dynamic analysisenvironment and the filtered basic blocks are extracted fromthe dynamic execution traces through the repetition filteringtechnique as explained in Section 321

8 The Scientific World Journal

Image matrix of samplea Image matrix of samplen Representative image matrix of sample family

Common pixels of sample family

Pixels only in sampleaPixels only in samplen

middot middot middot

Figure 9 Representative image matrix extraction

Sample A

BlackBlack

Sample A998400

(a) Case 1 black and black

Black Color

Sample A Sample A998400

(b) Case 2 black and color

Color Color

Sample A Sample A998400

(c) Case 3 color and color

Figure 10 Three cases considered in pixel similarity calculations

For the experiments we constructed an experimentalenvironment consisting of the analysis servermalware serverand monitoring machine as shown Figure 11 We set upVMware vSphere ESXi 51 in the analysis server which hasan Intel Xeon E5-1607 processor and 24GB of main memoryand we installed two Windows operating systems (OSs) asguest OSs In the first Windows OS the dynamic executiontraces were extracted through the PIN In the otherWindowsOS the image matrices were generated and similarities werecalculated through our visual analysis tool Malware samplesthat are provided for the analysis server and the dynamicexecution traces extracted from the analysis server are storedin the malware server The monitoring machine controls theanalysis server through the PowerCLI tool that is the remotecommand line interface

42 Experiments with Static Analysis For the experimentsin this section we disassembled the malware samples withinset A and extracted major blocks from the basic blocks Wethen generated the imagematrices using opcode sequences ofthosemajor blocks and analyzed the similarities among them

421 Image Matrix Generation In this paper we set the sizesof the generated image matrices to 256 times 256 pixels for theexperiments As shown in Table 2 the reasons for using thisimage matrix size can be briefly summarized as the middleground between file size similarity calculation time and

classification accuracy The accuracy was calculated by using(3)

Accuracy = of correctly classified malware samples

of total malware samples

(3)

Figure 12 shows examples of the image matrices generatedfrom the malware samples of individual families within setA Only three image matrices for each malware family andone representative image matrix extracted by recording onlythose pixels commonly existing in all image matrices wereincluded Since the number of opcode sequences used asmalware information varied the number of pixels recordedon the image matrices differed In the case of malwaremany of the same or similar RGB-colored pixels are foundamong the image matrices of malware samples classified asthe same family However even if pixels are recorded onthe same coordinates of different image matrices the pixelsimilarities have different values if the RGB color informationof the relevant pixels is different Our results show that imagematrices of variants included in the samemalware family canbe shown to be similar and that clear differences exist amongmalware samples from different families

Figure 13 shows the image matrix differences before andafter the application of the major block selection techniqueThe image progression indicates that the number of pixelsrecorded in the image matrices decreases because of theselection of major blocks from among the basic blocks The

The Scientific World Journal 9

Windows OS Windows OS

VM ware vSphere ESXi 51

Visual analysistool

Malware server

Malware

Analysis server

Commands MalwareExecution tracePo

wer

CLI

Monitoring machine

PIN

Intel xeon E5-1607 processor24 GB memory

Figure 11 Experimental environment

IstBaraa IstBarab IstBarac IstBar representative

(a) Trojan-DownloaderWin32IstBar family

Semisofta Semisoftb Semisoftc Semisoft representative

(b) VirusWin32HLLPSemisoft family

Debormc Debormj Debormk Deborm representative

(c) WormWin32Deborm family

Figure 12 Image matrices of malware family samples

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

2 The Scientific World Journal

obtain the similarities between the images after the RGB-colored pixel information of the images is vectorized and thepixel similarities are calculated

This paper is composed as follows In Section 2 malwareanalysis-related studies are described In Section 3 malwareanalysis methods using visualized opcode sequences andthe methods to calculate similarity are proposed and theexperimental results are presented in Section 4 Finally inSection 5 conclusions and future directions are provided

2 Related Work

In general malware analysis methods to detect and classifymalware can be categorized as either static or dynamic anal-yses [4] In the static analysis of malware various methodssuch as control flow graph (CFG) analysis [5ndash7] call graphanalysis [8 9] byte level analysis [10] instruction-basedanalysis [11ndash14] and similarity-based analysis [15 16] havebeen proposed

CFGs are generated by dividing the instructions extractedthrough disassembling into blocks and by connecting thedirected edges between the blocks Some malware analysismethods using these CFGs as signatures have been proposedCesare and Xiang [5] proposed a method that defines CFGsas signatures in string form that consist of a list of graphedges for the ordered nodes and thatmeasures the similaritiesamong signatures by using theDice coefficient algorithm [17]Bonfante et al [6] proposed amethod that converts the CFGsinto tree-based finite state machines through syntactic anal-ysis and semantic analysis and then uses them as signaturesBriones and Gomez [7] proposed an automated classificationsystem based on CFGs The CFGs are summarized as threetuples including the number of basic blocks the number ofedges and the number of subcalls and then two functionscan be compared However if the complex information issummarized into a small size high false alarms may occur

There is much research aimed at detectingmalware basedon information such as system-calls functions and API callswhich is used for malware execution in operating systemsShang et al [8] proposed a method that generates function-call graphs which represent the caller and callee relationshipsbetween functions as signatures ofmalware samples and theythen compute the similarities by using those function-callgraph signatures Kinable and Kostakis [9] classifiedmalwareusing the call graph clustering technique Their proposedmethod generated the call graphs against the functionsincluded in the malware samples and they performed theclustering based on the structural similarity scores of the callgraphs calculated through the graph edit distance algorithm

Statistical information regarding the instructionsextracted through disassembling can be used in the staticanalysis of malware Rad and Masrom [11] proposed amethod based on the instruction frequencies in order toclassify metamorphic malware Since instruction frequenciesare mostly not changed even though the obfuscationtechniques are applied to the malware the instructionfrequencies can become the features of malware Thereforetheir proposed method calculated a distance by using theinstruction frequencies extracted from eachmalware sample

and they then classified metamorphic malware by using thedistance value Bilar [12] showed that there were differentinstruction frequencies in different malware Particularlythey showed that rare instructions in malware could becomebetter predictors to classify malware than other instructioncould Han et al [13] proposed a method using instructionfrequencies The proposed method generated instructionsequences that were sorted according to the instructionfrequencies and they showed that the distances betweeninstruction sequences from the same malware family hadlow distance values Santos et al [14] proposed a malwareclassification method using n-gram instruction frequenciesin which n-gram instructions included n-instructions Inthe proposed method they generated the vectors for eachn-instruction sequence and used some of the vectors assignatures

In addition dynamic analysis methods including taint-ing behavior-based methods and API call monitoring havebeen proposed Egele et al [18] proposed a method usingtainting techniques which tracks the behaviors related tothe flow of information that are processed by any browserhelper object (BHO) If the BHO leaks sensitive informationto the outside the BHO is classified as malware Fredriksonet al [19] proposed a method that automatically extractsthe characteristics of behaviors by using graph mining tech-niques Their proposed method made clusters by identifyingcore CFGs for each similar malicious behavior in a malwarefamily and these were then generalized as a significant behav-ior Furthermore methods based on dynamic monitoringtechniques using an emulator have been proposed Vinod etal [20] traced malware API calls via dynamic monitoringwithin an emulator andmeasured their frequencies to extractcritical APIs Miao et al [21] developed a tool called the ldquoAPICapturerdquo that extracts themajor characteristics automaticallysuch as system-call arguments return values and errorconditions by monitoring malware behavior in an emulator

Even though there are many static and dynamic analysesmethods available new techniques that can complementexisting techniques are still needed to improve malwareanalysis performance and conveniences of analysis by secu-rity analysts Recently several visualization methods havebeen proposed to help security analysts to observe thefeatures and behaviors of malware [22] To visualize malwarebehavior Trinius et al [23] proposed amethod that visualizedthe percentages of API calls as well as malware behaviorinto each of two images called a ldquotreemaprdquo and ldquothreadgraphrdquo respectively Saxe et al [24] developed a systemthat generated two types of images One image showedthe system-call sequences extracted from malware system-call behavior logs and the other image showed similaritiesand differences between selected samples Conti et al [25]proposed a visualizing system that shows the images for thebyte information of malware samples such as byte valuesbyte presence and duplicated sequences of bytes containedwithin a sample Anderson et al [26] proposed a methodto show the similarities between malware samples in animage named a ldquoheatmaprdquo Nataraj et al [27] converted thebyte information into gray-scale images and classified themalware using image processing After generating images

The Scientific World Journal 3

Disassemble(static analysis)

PIN(dynamic analysis)

Malware samples

Major block selection

Repetitionfiltering

Parsing

Opcodesequences

Basic blocks blocks

Filteredbasic blocks

Dynamic traces

Major

Representativeimage matrix

generation

Similaritycalculation

Similarities

Step 1Extraction of opcode sequences

Step 2Image matrix generation

Step 3Similarity calculation

Figure 1 Overview of the proposed method

using byte values they applied an abstract representationtechnique for the scene image that is GIST [28 29] tocompute texture features Moreover they proved that thebinary texture analysis techniques using image processingcould classify malware more quickly than existing malwareclassification methods could [30] However since the textureanalysis method has large computational overheads theproposed method has problems in processing a large amountof malware [31]

In this paper we propose a novel analysis method usingimage matrices to represent malware visually so that the fea-tures of themalware can be easily detected and the similaritiesbetween different malware samples can be calculated fasterthan with other visualization methods

3 Our Proposed Method

31 Overview Our proposed visualized malware analysismethod consists of three steps as shown in Figure 1 In Step 1opcode sequences are extracted frommalware binary samplesor dynamic execution traces Then image matrices in whichthe opcode sequences are recorded as RGB-colored pixels aregenerated in Step 2 In Step 3 the similarities between theimage matrices are calculated In the following sections eachstep is explained in detail

32 Extraction of Opcode Sequences Figure 2 shows theprocess to extract opcode sequences from malware binarysamples for Step 1 through static analysis or dynamic analysis

321 Basic Block Extraction To extract opcode sequencesfrom malware binary samples the binary sample files arefirst disassembled and divided into basic blocks using dis-assembling tools such as IDA Pro [32] or OllyDbg [33]However if obfuscation or packing techniques are applied inmalware samples static analysis using a disassembler is notfeasible [34 35] Therefore some malware samples (in whichobfuscation or packing techniques are applied) need to beexecuted in a dynamic analysis environment [36]

In dynamic analysis as shown in Figure 3 some repeatedinstruction sequences are included in the dynamic executiontraces because a program may have some loops or repeatedcalls and these repeated sequences can increase the size

of not only the execution traces but also the processingoverheads Kang et al [37] proposed a repetition filteringmethod for dynamic execution traces Our filtered basicblocks are extracted from the dynamic execution tracesafter the repetition filtering method is applied Finally ifbasic blocks are extracted frommalware samples or dynamicexecution traces then major blocks are selected from thebasic blocks by our proposed technique which is explainedin the next section

322 Major Block Selection The malware analysis methodproposed in this paper does not target all of the basicblocks from the binary disassembling results or dynamicexecution traces If all the basic blocks are used for analysisthen some blocks for binary file execution in an operatingsystem are included in the basic blocks Moreover manymeaningless blocks may be included in the basic blocksextracted from malware samples As a result the number ofbasic blocks that have to be analyzed by the security analystsis increased and distinguishing malware features becomesdifficult In addition the number of comparisons betweenthe basic blocks from two malware samples is also increaseddramatically On the contrary if the number of unnecessaryblocks can be reduced as much as possible in the malwareanalysis the analysis time cost for not only the individualmalware sample but also a large number of malware samplescan be reduced Therefore we selected some blocks relatingto suspicious behaviors and functions from among the entireset of basic blocks

As shown in Figure 4 the blocks selected as major blocksare those that include the CALL instruction which is usedto invoke APIs library functions and other user-definedfunctions This is because not only user-defined functionsbut also various system calls are used to implement thebehaviors and functions of most programs If blocks thatinclude these function invocation instructions are used inmalware analysis malware features can be extracted [2]Through a major block selection technique the image matrixgenerating time is reduced by recording only those selectedblocks in the image matrix

323 Opcode Sequence Extraction To extract malware fea-tures as shown in Figure 5 the opcode sequences in the

4 The Scientific World Journal

Static analysis

(disassemble)

Parsing Opcode sequences

Basic blocks

Repetitionfiltering

Major block

Major block

selection

DynamicDynamic analysis execution (execution) trace

Figure 2 Opcode sequence extraction procedure

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

0042A68B JBE

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS[EDI]AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

MOV ALBYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A696 JMP

0042A5FE ADD EBX EBX

0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A69C

Exploit 0042A5FE

Figure 3 An example of repeated instruction sequences

The Scientific World Journal 5

Major blocks

Basic blocks

MOV MOVMOVLEA MOVMOVMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR

PUSH ebpLEA PUSHPUSH ebxCALLADD POP ebpMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR eax eax

eax eax

JZ

PUSH ebpLEA PUSHPUSH ebxCALLADD esp 8

esp 8

POP ebpMOV eax 1

eax 1 ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

0FFFFFFFFh

0FFFFFFFFh

sub 4019EA

sub 4019EA

dword ptr [edi + ecx lowast 4 + 4]

dword ptr [edi + ecx lowast 4 + 4]

JZ short loc 40192F

ebx [ebp + arg 4]

ebx [ebp + arg 4]

[ebp + var 8] eaxeax [ebp + arg 8][ebp + var 4] eaxeax [ebp + var 8][ebx-4] eaxesi [ebx + 0Ch]edi [ebx+8

short loc 40192F

Figure 4 Major block selection

individual major blocks are used as malware informationFrom each opcode only the first three characters are usedto generate information for the block The reasons for usinga three-character opcode are as follows From the entire setof opcodes used in the Intel x86 assembly language 414 ofthem have three characters and the appearance frequenciesof these opcodes within the binary files are higher thanfor other opcodes On the other hand 288 of opcodeshave four characters 178 have five characters and 52have over six characters respectively thus their appearancefrequencies are relatively low In addition since themeaningsof the individual opcodes are maintained even though theyare reduced to three characters the different opcodes canbe distinguished For example four-character opcodes suchas PUSH are reduced to three characters PUS and two-character opcodes such as OR are expanded by adding ablank character to a three-character opcode Then thesethree-character opcodes are concatenated together and thecharacter string is used to represent the block as an opcodesequence which is used to generate image pixels in an imagematrix in the next step

33 Generation of the Image Matrix Figure 6 shows theprocedure for Step 2 that converts the opcode sequences into

pixels in an image matrix A hash function is used to decidethe119883-119884 coordinates and RGB colors of the pixels

To visualize a binary file as an image matrix both thelength and the width of an image matrix are initialized to 2nwhere 119899 is selected by the users To reduce the probability ofcollisions of the hash function n should be large enough Inour experiments we selected 119899 as 8 to minimize collisions

The coordinate-defining module and the RGB color-defining module are used to generate image matrices Firstthe coordinate-defining module defines the (119909 119910) coordi-nates of pixels on the image matrix of each code blockSecond the RGB color-defining module defines the colorvalues of pixels on the image matrix RGB colors are definedby calculating values of 8 bits each for the red green and bluecolors

SimHash [38] is applied to opcode sequences extracted inStep 1 in order to define both the coordinates and the colorvalues of the pixels SimHash is a local-sensitive hash functionused in the similar sentence detection system which assumesthat if the input values are similar then the output values willalso be similar That is since SimHash tokenizes the inputstrings and generates hash values for each token if a fewtokens are different in two input strings then the generatedhash values are not completely different but are similar

6 The Scientific World Journal

PUSH

PUSHPUSH

PUSHPUSHLEA

ADD

LEA

POP

POPPOP

MOV

MOVORJZ

CALL

CALL

ebp

ebp

ebp

ebp

esi

esi

eax eax

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebx

esp 8

Major blocks Opcode sequences

PUSLEAPUSPUSCALADDPOPMOV

PUSPUSLEACALPOPPOPMOV OR JZ

0FFFFFFFFh

dword ptr [edi + ecx lowast 4 + 4]

ebx [ebp + arg 4]

short loc 40192F

eax 1

sub 4019EA

Figure 5 Opcode sequences used as malware information

37 57 D6 C3 2BX Y R BG

PUSPUSLEACALPOPPOPMOV OR JZ

Malware information

Image matrix

SimHashcoordinate define

SimHashRGB color define

Figure 6 Generating images using opcode sequences

Therefore if the character strings of the opcode sequences aresimilar then the outputs will be similar and they will maponto similar coordinates in an image matrix

Once the coordinates and RGB colors of the individualpixels have been defined RGB-colored images are recordedon the individual coordinates of image matrices To providehuman analysts with amore convenient visual analysis pixelsaround the defined coordinates are recorded simultaneouslyAs shown in Figure 7 nine pixels from (119909 minus 1 119910 minus 1) to (119909 +1 119910+1) around an (119909 119910) coordinate for a block are recorded

If the images overlap each other because the coordinatesdefined formultiple opcode sequences are adjacent as shownin Figure 8 the sums of RGB colors become new pixel colorsIf the result of a color summing exceeds 255 (0xFF) the resultwill be set to 255 For example if RGB

1is (255 0 0) and RGB

2

is (0 176 50) the new color will become (255 176 50)The number of pixels recorded on an image matrix varies

according to themajor blocks and the number of overlappingpixels will increase as the number of images increases If thereare too many overlapping images then the size of the imagematrix should be increased

x minus 1

y minus 1

x minus 1

y

x + 1

y

y minus 1

x

y + 1

x

y minus 1

x + 1

y + 1

x + 1

y + 1

x minus 1

x y

Figure 7 Nine pixels for one opcode sequence

34 Representative Image Matrix Extraction Since manymalware variants exist in eachmalware family as the numberof malware samples increases the total amount and timeof the similarity calculation increase too Therefore we

The Scientific World Journal 7

111

11

11

1

223

2

2

2

2

2

2

R3 = min(R1 + R2 FF) = min (FF + 00 FF) = FF

G3 = min(G1 + G2 FF) = min (FF + B0 FF) = B0

B3 = min(B1 + B2 FF) = min (FF + 50 FF) = 50

there4 RGB3 = (FF B0 50)

Figure 8 Method of recording overlapping pixels

extracted a representative imagematrix of eachmalware fam-ily to reduce the costs of malware similarity calculationsThatis when a new malware sample is found the amount of timeto calculate the similarity is reduced by comparing the imagematrix of the new malware with the image matrices thatrepresent individualmalware families instead of comparing itwith all of the imagematrices of the existingmalware samples

As shown in Figure 9 to extract a representative imagematrix for eachmalware family imagematrices are generatedfor samples in malware families Then the representativeimage matrix is extracted by recording only the commonpixels that have same coordinates and RGB colors from theimage matrices of individual malware samples in the samefamily Figure 9 shows an example of the generation ofrepresentative images of malware families

35 Similarity Calculation Using Image Matrices The advan-tage of the similarity calculation using the imagematrices is afaster performance than with exact matching using the stringtype of opcode sequences even though there are some extrafalse positives due to hash collision When using the stringthe time complexity is defined as 119874(1198992) due to the processof finding pairs of exactly matched strings However if theimage matrices are used to calculate similarities since thecoordinates and colors of the opcode sequences are definedthrough SimHash the process of finding the pairs is skippedTherefore the time complexity of the similarity calculationsusing the image matrices is defined as O(n) because onlythe color information of the pixels recorded on the samecoordinates in both image matrices is used to calculate thesimilarities between the image matrices

Pixel similarity calculations are carried out first for pixelsin each image matrix The most important consideration ina similarity calculation in this case is that only those RGB

color pixels recorded in the individual imagematrices shouldbe used Image matrices have RGB-colored pixels on squareimageswith black backgrounds If black pixels are also used insimilarity calculations the similarities between samples fromdifferent malware families can be calculated as very highTherefore when the similarities of the image matrices arecalculated the following cases are considered for pixels onthe same coordinates in the two image matrices as shownin Figure 10 In this case the vector angular-based distancemeasurement algorithm is used to calculate the similaritiesbetween color pixels This algorithm calculates similarityvalues by expressing the color pixels constituting each imageas 3D vectors as shown in (1) and then using the angleinformation and size information

(a) Case 1 if all of the pixels in the areas of both imagematrices are black the pixel similarity calculation willnot be carried out and the next pixel will be selected

(b) Case 2 if one pixel in a selected area is black and thecorresponding pixel in the other image is colored thepixel similarity will be defined as 0

(c) Case 3 if both pixels are not black but colored thecolor pixel similarity will be calculated using the vec-tor angular-based distance measurement algorithm[39] as follows

120575 (119909119894 119909119895) = [1 minus

2

120587cosminus1(

119909119894sdot 119909119895

10038161003816100381610038161199091198941003816100381610038161003816

10038161003816100381610038161003816119909119895

10038161003816100381610038161003816

)][1 minus

10038161003816100381610038161003816119909119894minus 119909119895

10038161003816100381610038161003816

radic3 sdot 2552]

(1)

The similarity values of the image matrix when consideringindividual cases are calculated using the results from thepixel similarity calculations as shown in (2) That is the sumof pixel similarity values calculated in case 3 is divided by thenumber of pixels calculated in cases 2 and 3 to calculate theaverage

Sim (119860 119861) =sum of pixel similarity values in case 3

of pixels in case 2 and case 3 (2)

4 Experimental Results

41 Experimental Data and Environment Using the visualanalysis tools implemented in this paper and the malwaresamples shown in Table 1 imagematrices were generated andsimilarity calculations were performed First set A consists of290 malware samples from 16 families in which the detectionavoidance techniques such as obfuscation and packing arenot applied These malware samples are used to extract thebasic blocks through static analysis using a disassemblerSecond set B consists of 560 malware samples from 14 fam-ilies in which the packed and nonpacked malware samplescoexist We used these malware samples to generate dynamicexecution traces through the PIN tool in a dynamic analysisenvironment and the filtered basic blocks are extracted fromthe dynamic execution traces through the repetition filteringtechnique as explained in Section 321

8 The Scientific World Journal

Image matrix of samplea Image matrix of samplen Representative image matrix of sample family

Common pixels of sample family

Pixels only in sampleaPixels only in samplen

middot middot middot

Figure 9 Representative image matrix extraction

Sample A

BlackBlack

Sample A998400

(a) Case 1 black and black

Black Color

Sample A Sample A998400

(b) Case 2 black and color

Color Color

Sample A Sample A998400

(c) Case 3 color and color

Figure 10 Three cases considered in pixel similarity calculations

For the experiments we constructed an experimentalenvironment consisting of the analysis servermalware serverand monitoring machine as shown Figure 11 We set upVMware vSphere ESXi 51 in the analysis server which hasan Intel Xeon E5-1607 processor and 24GB of main memoryand we installed two Windows operating systems (OSs) asguest OSs In the first Windows OS the dynamic executiontraces were extracted through the PIN In the otherWindowsOS the image matrices were generated and similarities werecalculated through our visual analysis tool Malware samplesthat are provided for the analysis server and the dynamicexecution traces extracted from the analysis server are storedin the malware server The monitoring machine controls theanalysis server through the PowerCLI tool that is the remotecommand line interface

42 Experiments with Static Analysis For the experimentsin this section we disassembled the malware samples withinset A and extracted major blocks from the basic blocks Wethen generated the imagematrices using opcode sequences ofthosemajor blocks and analyzed the similarities among them

421 Image Matrix Generation In this paper we set the sizesof the generated image matrices to 256 times 256 pixels for theexperiments As shown in Table 2 the reasons for using thisimage matrix size can be briefly summarized as the middleground between file size similarity calculation time and

classification accuracy The accuracy was calculated by using(3)

Accuracy = of correctly classified malware samples

of total malware samples

(3)

Figure 12 shows examples of the image matrices generatedfrom the malware samples of individual families within setA Only three image matrices for each malware family andone representative image matrix extracted by recording onlythose pixels commonly existing in all image matrices wereincluded Since the number of opcode sequences used asmalware information varied the number of pixels recordedon the image matrices differed In the case of malwaremany of the same or similar RGB-colored pixels are foundamong the image matrices of malware samples classified asthe same family However even if pixels are recorded onthe same coordinates of different image matrices the pixelsimilarities have different values if the RGB color informationof the relevant pixels is different Our results show that imagematrices of variants included in the samemalware family canbe shown to be similar and that clear differences exist amongmalware samples from different families

Figure 13 shows the image matrix differences before andafter the application of the major block selection techniqueThe image progression indicates that the number of pixelsrecorded in the image matrices decreases because of theselection of major blocks from among the basic blocks The

The Scientific World Journal 9

Windows OS Windows OS

VM ware vSphere ESXi 51

Visual analysistool

Malware server

Malware

Analysis server

Commands MalwareExecution tracePo

wer

CLI

Monitoring machine

PIN

Intel xeon E5-1607 processor24 GB memory

Figure 11 Experimental environment

IstBaraa IstBarab IstBarac IstBar representative

(a) Trojan-DownloaderWin32IstBar family

Semisofta Semisoftb Semisoftc Semisoft representative

(b) VirusWin32HLLPSemisoft family

Debormc Debormj Debormk Deborm representative

(c) WormWin32Deborm family

Figure 12 Image matrices of malware family samples

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

The Scientific World Journal 3

Disassemble(static analysis)

PIN(dynamic analysis)

Malware samples

Major block selection

Repetitionfiltering

Parsing

Opcodesequences

Basic blocks blocks

Filteredbasic blocks

Dynamic traces

Major

Representativeimage matrix

generation

Similaritycalculation

Similarities

Step 1Extraction of opcode sequences

Step 2Image matrix generation

Step 3Similarity calculation

Figure 1 Overview of the proposed method

using byte values they applied an abstract representationtechnique for the scene image that is GIST [28 29] tocompute texture features Moreover they proved that thebinary texture analysis techniques using image processingcould classify malware more quickly than existing malwareclassification methods could [30] However since the textureanalysis method has large computational overheads theproposed method has problems in processing a large amountof malware [31]

In this paper we propose a novel analysis method usingimage matrices to represent malware visually so that the fea-tures of themalware can be easily detected and the similaritiesbetween different malware samples can be calculated fasterthan with other visualization methods

3 Our Proposed Method

31 Overview Our proposed visualized malware analysismethod consists of three steps as shown in Figure 1 In Step 1opcode sequences are extracted frommalware binary samplesor dynamic execution traces Then image matrices in whichthe opcode sequences are recorded as RGB-colored pixels aregenerated in Step 2 In Step 3 the similarities between theimage matrices are calculated In the following sections eachstep is explained in detail

32 Extraction of Opcode Sequences Figure 2 shows theprocess to extract opcode sequences from malware binarysamples for Step 1 through static analysis or dynamic analysis

321 Basic Block Extraction To extract opcode sequencesfrom malware binary samples the binary sample files arefirst disassembled and divided into basic blocks using dis-assembling tools such as IDA Pro [32] or OllyDbg [33]However if obfuscation or packing techniques are applied inmalware samples static analysis using a disassembler is notfeasible [34 35] Therefore some malware samples (in whichobfuscation or packing techniques are applied) need to beexecuted in a dynamic analysis environment [36]

In dynamic analysis as shown in Figure 3 some repeatedinstruction sequences are included in the dynamic executiontraces because a program may have some loops or repeatedcalls and these repeated sequences can increase the size

of not only the execution traces but also the processingoverheads Kang et al [37] proposed a repetition filteringmethod for dynamic execution traces Our filtered basicblocks are extracted from the dynamic execution tracesafter the repetition filtering method is applied Finally ifbasic blocks are extracted frommalware samples or dynamicexecution traces then major blocks are selected from thebasic blocks by our proposed technique which is explainedin the next section

322 Major Block Selection The malware analysis methodproposed in this paper does not target all of the basicblocks from the binary disassembling results or dynamicexecution traces If all the basic blocks are used for analysisthen some blocks for binary file execution in an operatingsystem are included in the basic blocks Moreover manymeaningless blocks may be included in the basic blocksextracted from malware samples As a result the number ofbasic blocks that have to be analyzed by the security analystsis increased and distinguishing malware features becomesdifficult In addition the number of comparisons betweenthe basic blocks from two malware samples is also increaseddramatically On the contrary if the number of unnecessaryblocks can be reduced as much as possible in the malwareanalysis the analysis time cost for not only the individualmalware sample but also a large number of malware samplescan be reduced Therefore we selected some blocks relatingto suspicious behaviors and functions from among the entireset of basic blocks

As shown in Figure 4 the blocks selected as major blocksare those that include the CALL instruction which is usedto invoke APIs library functions and other user-definedfunctions This is because not only user-defined functionsbut also various system calls are used to implement thebehaviors and functions of most programs If blocks thatinclude these function invocation instructions are used inmalware analysis malware features can be extracted [2]Through a major block selection technique the image matrixgenerating time is reduced by recording only those selectedblocks in the image matrix

323 Opcode Sequence Extraction To extract malware fea-tures as shown in Figure 5 the opcode sequences in the

4 The Scientific World Journal

Static analysis

(disassemble)

Parsing Opcode sequences

Basic blocks

Repetitionfiltering

Major block

Major block

selection

DynamicDynamic analysis execution (execution) trace

Figure 2 Opcode sequence extraction procedure

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

0042A68B JBE

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS[EDI]AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

MOV ALBYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A696 JMP

0042A5FE ADD EBX EBX

0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A69C

Exploit 0042A5FE

Figure 3 An example of repeated instruction sequences

The Scientific World Journal 5

Major blocks

Basic blocks

MOV MOVMOVLEA MOVMOVMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR

PUSH ebpLEA PUSHPUSH ebxCALLADD POP ebpMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR eax eax

eax eax

JZ

PUSH ebpLEA PUSHPUSH ebxCALLADD esp 8

esp 8

POP ebpMOV eax 1

eax 1 ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

0FFFFFFFFh

0FFFFFFFFh

sub 4019EA

sub 4019EA

dword ptr [edi + ecx lowast 4 + 4]

dword ptr [edi + ecx lowast 4 + 4]

JZ short loc 40192F

ebx [ebp + arg 4]

ebx [ebp + arg 4]

[ebp + var 8] eaxeax [ebp + arg 8][ebp + var 4] eaxeax [ebp + var 8][ebx-4] eaxesi [ebx + 0Ch]edi [ebx+8

short loc 40192F

Figure 4 Major block selection

individual major blocks are used as malware informationFrom each opcode only the first three characters are usedto generate information for the block The reasons for usinga three-character opcode are as follows From the entire setof opcodes used in the Intel x86 assembly language 414 ofthem have three characters and the appearance frequenciesof these opcodes within the binary files are higher thanfor other opcodes On the other hand 288 of opcodeshave four characters 178 have five characters and 52have over six characters respectively thus their appearancefrequencies are relatively low In addition since themeaningsof the individual opcodes are maintained even though theyare reduced to three characters the different opcodes canbe distinguished For example four-character opcodes suchas PUSH are reduced to three characters PUS and two-character opcodes such as OR are expanded by adding ablank character to a three-character opcode Then thesethree-character opcodes are concatenated together and thecharacter string is used to represent the block as an opcodesequence which is used to generate image pixels in an imagematrix in the next step

33 Generation of the Image Matrix Figure 6 shows theprocedure for Step 2 that converts the opcode sequences into

pixels in an image matrix A hash function is used to decidethe119883-119884 coordinates and RGB colors of the pixels

To visualize a binary file as an image matrix both thelength and the width of an image matrix are initialized to 2nwhere 119899 is selected by the users To reduce the probability ofcollisions of the hash function n should be large enough Inour experiments we selected 119899 as 8 to minimize collisions

The coordinate-defining module and the RGB color-defining module are used to generate image matrices Firstthe coordinate-defining module defines the (119909 119910) coordi-nates of pixels on the image matrix of each code blockSecond the RGB color-defining module defines the colorvalues of pixels on the image matrix RGB colors are definedby calculating values of 8 bits each for the red green and bluecolors

SimHash [38] is applied to opcode sequences extracted inStep 1 in order to define both the coordinates and the colorvalues of the pixels SimHash is a local-sensitive hash functionused in the similar sentence detection system which assumesthat if the input values are similar then the output values willalso be similar That is since SimHash tokenizes the inputstrings and generates hash values for each token if a fewtokens are different in two input strings then the generatedhash values are not completely different but are similar

6 The Scientific World Journal

PUSH

PUSHPUSH

PUSHPUSHLEA

ADD

LEA

POP

POPPOP

MOV

MOVORJZ

CALL

CALL

ebp

ebp

ebp

ebp

esi

esi

eax eax

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebx

esp 8

Major blocks Opcode sequences

PUSLEAPUSPUSCALADDPOPMOV

PUSPUSLEACALPOPPOPMOV OR JZ

0FFFFFFFFh

dword ptr [edi + ecx lowast 4 + 4]

ebx [ebp + arg 4]

short loc 40192F

eax 1

sub 4019EA

Figure 5 Opcode sequences used as malware information

37 57 D6 C3 2BX Y R BG

PUSPUSLEACALPOPPOPMOV OR JZ

Malware information

Image matrix

SimHashcoordinate define

SimHashRGB color define

Figure 6 Generating images using opcode sequences

Therefore if the character strings of the opcode sequences aresimilar then the outputs will be similar and they will maponto similar coordinates in an image matrix

Once the coordinates and RGB colors of the individualpixels have been defined RGB-colored images are recordedon the individual coordinates of image matrices To providehuman analysts with amore convenient visual analysis pixelsaround the defined coordinates are recorded simultaneouslyAs shown in Figure 7 nine pixels from (119909 minus 1 119910 minus 1) to (119909 +1 119910+1) around an (119909 119910) coordinate for a block are recorded

If the images overlap each other because the coordinatesdefined formultiple opcode sequences are adjacent as shownin Figure 8 the sums of RGB colors become new pixel colorsIf the result of a color summing exceeds 255 (0xFF) the resultwill be set to 255 For example if RGB

1is (255 0 0) and RGB

2

is (0 176 50) the new color will become (255 176 50)The number of pixels recorded on an image matrix varies

according to themajor blocks and the number of overlappingpixels will increase as the number of images increases If thereare too many overlapping images then the size of the imagematrix should be increased

x minus 1

y minus 1

x minus 1

y

x + 1

y

y minus 1

x

y + 1

x

y minus 1

x + 1

y + 1

x + 1

y + 1

x minus 1

x y

Figure 7 Nine pixels for one opcode sequence

34 Representative Image Matrix Extraction Since manymalware variants exist in eachmalware family as the numberof malware samples increases the total amount and timeof the similarity calculation increase too Therefore we

The Scientific World Journal 7

111

11

11

1

223

2

2

2

2

2

2

R3 = min(R1 + R2 FF) = min (FF + 00 FF) = FF

G3 = min(G1 + G2 FF) = min (FF + B0 FF) = B0

B3 = min(B1 + B2 FF) = min (FF + 50 FF) = 50

there4 RGB3 = (FF B0 50)

Figure 8 Method of recording overlapping pixels

extracted a representative imagematrix of eachmalware fam-ily to reduce the costs of malware similarity calculationsThatis when a new malware sample is found the amount of timeto calculate the similarity is reduced by comparing the imagematrix of the new malware with the image matrices thatrepresent individualmalware families instead of comparing itwith all of the imagematrices of the existingmalware samples

As shown in Figure 9 to extract a representative imagematrix for eachmalware family imagematrices are generatedfor samples in malware families Then the representativeimage matrix is extracted by recording only the commonpixels that have same coordinates and RGB colors from theimage matrices of individual malware samples in the samefamily Figure 9 shows an example of the generation ofrepresentative images of malware families

35 Similarity Calculation Using Image Matrices The advan-tage of the similarity calculation using the imagematrices is afaster performance than with exact matching using the stringtype of opcode sequences even though there are some extrafalse positives due to hash collision When using the stringthe time complexity is defined as 119874(1198992) due to the processof finding pairs of exactly matched strings However if theimage matrices are used to calculate similarities since thecoordinates and colors of the opcode sequences are definedthrough SimHash the process of finding the pairs is skippedTherefore the time complexity of the similarity calculationsusing the image matrices is defined as O(n) because onlythe color information of the pixels recorded on the samecoordinates in both image matrices is used to calculate thesimilarities between the image matrices

Pixel similarity calculations are carried out first for pixelsin each image matrix The most important consideration ina similarity calculation in this case is that only those RGB

color pixels recorded in the individual imagematrices shouldbe used Image matrices have RGB-colored pixels on squareimageswith black backgrounds If black pixels are also used insimilarity calculations the similarities between samples fromdifferent malware families can be calculated as very highTherefore when the similarities of the image matrices arecalculated the following cases are considered for pixels onthe same coordinates in the two image matrices as shownin Figure 10 In this case the vector angular-based distancemeasurement algorithm is used to calculate the similaritiesbetween color pixels This algorithm calculates similarityvalues by expressing the color pixels constituting each imageas 3D vectors as shown in (1) and then using the angleinformation and size information

(a) Case 1 if all of the pixels in the areas of both imagematrices are black the pixel similarity calculation willnot be carried out and the next pixel will be selected

(b) Case 2 if one pixel in a selected area is black and thecorresponding pixel in the other image is colored thepixel similarity will be defined as 0

(c) Case 3 if both pixels are not black but colored thecolor pixel similarity will be calculated using the vec-tor angular-based distance measurement algorithm[39] as follows

120575 (119909119894 119909119895) = [1 minus

2

120587cosminus1(

119909119894sdot 119909119895

10038161003816100381610038161199091198941003816100381610038161003816

10038161003816100381610038161003816119909119895

10038161003816100381610038161003816

)][1 minus

10038161003816100381610038161003816119909119894minus 119909119895

10038161003816100381610038161003816

radic3 sdot 2552]

(1)

The similarity values of the image matrix when consideringindividual cases are calculated using the results from thepixel similarity calculations as shown in (2) That is the sumof pixel similarity values calculated in case 3 is divided by thenumber of pixels calculated in cases 2 and 3 to calculate theaverage

Sim (119860 119861) =sum of pixel similarity values in case 3

of pixels in case 2 and case 3 (2)

4 Experimental Results

41 Experimental Data and Environment Using the visualanalysis tools implemented in this paper and the malwaresamples shown in Table 1 imagematrices were generated andsimilarity calculations were performed First set A consists of290 malware samples from 16 families in which the detectionavoidance techniques such as obfuscation and packing arenot applied These malware samples are used to extract thebasic blocks through static analysis using a disassemblerSecond set B consists of 560 malware samples from 14 fam-ilies in which the packed and nonpacked malware samplescoexist We used these malware samples to generate dynamicexecution traces through the PIN tool in a dynamic analysisenvironment and the filtered basic blocks are extracted fromthe dynamic execution traces through the repetition filteringtechnique as explained in Section 321

8 The Scientific World Journal

Image matrix of samplea Image matrix of samplen Representative image matrix of sample family

Common pixels of sample family

Pixels only in sampleaPixels only in samplen

middot middot middot

Figure 9 Representative image matrix extraction

Sample A

BlackBlack

Sample A998400

(a) Case 1 black and black

Black Color

Sample A Sample A998400

(b) Case 2 black and color

Color Color

Sample A Sample A998400

(c) Case 3 color and color

Figure 10 Three cases considered in pixel similarity calculations

For the experiments we constructed an experimentalenvironment consisting of the analysis servermalware serverand monitoring machine as shown Figure 11 We set upVMware vSphere ESXi 51 in the analysis server which hasan Intel Xeon E5-1607 processor and 24GB of main memoryand we installed two Windows operating systems (OSs) asguest OSs In the first Windows OS the dynamic executiontraces were extracted through the PIN In the otherWindowsOS the image matrices were generated and similarities werecalculated through our visual analysis tool Malware samplesthat are provided for the analysis server and the dynamicexecution traces extracted from the analysis server are storedin the malware server The monitoring machine controls theanalysis server through the PowerCLI tool that is the remotecommand line interface

42 Experiments with Static Analysis For the experimentsin this section we disassembled the malware samples withinset A and extracted major blocks from the basic blocks Wethen generated the imagematrices using opcode sequences ofthosemajor blocks and analyzed the similarities among them

421 Image Matrix Generation In this paper we set the sizesof the generated image matrices to 256 times 256 pixels for theexperiments As shown in Table 2 the reasons for using thisimage matrix size can be briefly summarized as the middleground between file size similarity calculation time and

classification accuracy The accuracy was calculated by using(3)

Accuracy = of correctly classified malware samples

of total malware samples

(3)

Figure 12 shows examples of the image matrices generatedfrom the malware samples of individual families within setA Only three image matrices for each malware family andone representative image matrix extracted by recording onlythose pixels commonly existing in all image matrices wereincluded Since the number of opcode sequences used asmalware information varied the number of pixels recordedon the image matrices differed In the case of malwaremany of the same or similar RGB-colored pixels are foundamong the image matrices of malware samples classified asthe same family However even if pixels are recorded onthe same coordinates of different image matrices the pixelsimilarities have different values if the RGB color informationof the relevant pixels is different Our results show that imagematrices of variants included in the samemalware family canbe shown to be similar and that clear differences exist amongmalware samples from different families

Figure 13 shows the image matrix differences before andafter the application of the major block selection techniqueThe image progression indicates that the number of pixelsrecorded in the image matrices decreases because of theselection of major blocks from among the basic blocks The

The Scientific World Journal 9

Windows OS Windows OS

VM ware vSphere ESXi 51

Visual analysistool

Malware server

Malware

Analysis server

Commands MalwareExecution tracePo

wer

CLI

Monitoring machine

PIN

Intel xeon E5-1607 processor24 GB memory

Figure 11 Experimental environment

IstBaraa IstBarab IstBarac IstBar representative

(a) Trojan-DownloaderWin32IstBar family

Semisofta Semisoftb Semisoftc Semisoft representative

(b) VirusWin32HLLPSemisoft family

Debormc Debormj Debormk Deborm representative

(c) WormWin32Deborm family

Figure 12 Image matrices of malware family samples

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

4 The Scientific World Journal

Static analysis

(disassemble)

Parsing Opcode sequences

Basic blocks

Repetitionfiltering

Major block

Major block

selection

DynamicDynamic analysis execution (execution) trace

Figure 2 Opcode sequence extraction procedure

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

Main

0042A68B JBE

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS[EDI]AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A68D MOV AL BYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

MOV ALBYTE PTR DS [EDX]

0042A68F INC EDX

0042A690 MOV BYTE PTR DS [EDI] AL

0042A692 INC EDI

0042A693 DEC ECX

0042A694 JNZ

0042A696 JMP

0042A5FE ADD EBX EBX

0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A68D

SHORT Exploit 0042A69C

Exploit 0042A5FE

Figure 3 An example of repeated instruction sequences

The Scientific World Journal 5

Major blocks

Basic blocks

MOV MOVMOVLEA MOVMOVMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR

PUSH ebpLEA PUSHPUSH ebxCALLADD POP ebpMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR eax eax

eax eax

JZ

PUSH ebpLEA PUSHPUSH ebxCALLADD esp 8

esp 8

POP ebpMOV eax 1

eax 1 ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

0FFFFFFFFh

0FFFFFFFFh

sub 4019EA

sub 4019EA

dword ptr [edi + ecx lowast 4 + 4]

dword ptr [edi + ecx lowast 4 + 4]

JZ short loc 40192F

ebx [ebp + arg 4]

ebx [ebp + arg 4]

[ebp + var 8] eaxeax [ebp + arg 8][ebp + var 4] eaxeax [ebp + var 8][ebx-4] eaxesi [ebx + 0Ch]edi [ebx+8

short loc 40192F

Figure 4 Major block selection

individual major blocks are used as malware informationFrom each opcode only the first three characters are usedto generate information for the block The reasons for usinga three-character opcode are as follows From the entire setof opcodes used in the Intel x86 assembly language 414 ofthem have three characters and the appearance frequenciesof these opcodes within the binary files are higher thanfor other opcodes On the other hand 288 of opcodeshave four characters 178 have five characters and 52have over six characters respectively thus their appearancefrequencies are relatively low In addition since themeaningsof the individual opcodes are maintained even though theyare reduced to three characters the different opcodes canbe distinguished For example four-character opcodes suchas PUSH are reduced to three characters PUS and two-character opcodes such as OR are expanded by adding ablank character to a three-character opcode Then thesethree-character opcodes are concatenated together and thecharacter string is used to represent the block as an opcodesequence which is used to generate image pixels in an imagematrix in the next step

33 Generation of the Image Matrix Figure 6 shows theprocedure for Step 2 that converts the opcode sequences into

pixels in an image matrix A hash function is used to decidethe119883-119884 coordinates and RGB colors of the pixels

To visualize a binary file as an image matrix both thelength and the width of an image matrix are initialized to 2nwhere 119899 is selected by the users To reduce the probability ofcollisions of the hash function n should be large enough Inour experiments we selected 119899 as 8 to minimize collisions

The coordinate-defining module and the RGB color-defining module are used to generate image matrices Firstthe coordinate-defining module defines the (119909 119910) coordi-nates of pixels on the image matrix of each code blockSecond the RGB color-defining module defines the colorvalues of pixels on the image matrix RGB colors are definedby calculating values of 8 bits each for the red green and bluecolors

SimHash [38] is applied to opcode sequences extracted inStep 1 in order to define both the coordinates and the colorvalues of the pixels SimHash is a local-sensitive hash functionused in the similar sentence detection system which assumesthat if the input values are similar then the output values willalso be similar That is since SimHash tokenizes the inputstrings and generates hash values for each token if a fewtokens are different in two input strings then the generatedhash values are not completely different but are similar

6 The Scientific World Journal

PUSH

PUSHPUSH

PUSHPUSHLEA

ADD

LEA

POP

POPPOP

MOV

MOVORJZ

CALL

CALL

ebp

ebp

ebp

ebp

esi

esi

eax eax

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebx

esp 8

Major blocks Opcode sequences

PUSLEAPUSPUSCALADDPOPMOV

PUSPUSLEACALPOPPOPMOV OR JZ

0FFFFFFFFh

dword ptr [edi + ecx lowast 4 + 4]

ebx [ebp + arg 4]

short loc 40192F

eax 1

sub 4019EA

Figure 5 Opcode sequences used as malware information

37 57 D6 C3 2BX Y R BG

PUSPUSLEACALPOPPOPMOV OR JZ

Malware information

Image matrix

SimHashcoordinate define

SimHashRGB color define

Figure 6 Generating images using opcode sequences

Therefore if the character strings of the opcode sequences aresimilar then the outputs will be similar and they will maponto similar coordinates in an image matrix

Once the coordinates and RGB colors of the individualpixels have been defined RGB-colored images are recordedon the individual coordinates of image matrices To providehuman analysts with amore convenient visual analysis pixelsaround the defined coordinates are recorded simultaneouslyAs shown in Figure 7 nine pixels from (119909 minus 1 119910 minus 1) to (119909 +1 119910+1) around an (119909 119910) coordinate for a block are recorded

If the images overlap each other because the coordinatesdefined formultiple opcode sequences are adjacent as shownin Figure 8 the sums of RGB colors become new pixel colorsIf the result of a color summing exceeds 255 (0xFF) the resultwill be set to 255 For example if RGB

1is (255 0 0) and RGB

2

is (0 176 50) the new color will become (255 176 50)The number of pixels recorded on an image matrix varies

according to themajor blocks and the number of overlappingpixels will increase as the number of images increases If thereare too many overlapping images then the size of the imagematrix should be increased

x minus 1

y minus 1

x minus 1

y

x + 1

y

y minus 1

x

y + 1

x

y minus 1

x + 1

y + 1

x + 1

y + 1

x minus 1

x y

Figure 7 Nine pixels for one opcode sequence

34 Representative Image Matrix Extraction Since manymalware variants exist in eachmalware family as the numberof malware samples increases the total amount and timeof the similarity calculation increase too Therefore we

The Scientific World Journal 7

111

11

11

1

223

2

2

2

2

2

2

R3 = min(R1 + R2 FF) = min (FF + 00 FF) = FF

G3 = min(G1 + G2 FF) = min (FF + B0 FF) = B0

B3 = min(B1 + B2 FF) = min (FF + 50 FF) = 50

there4 RGB3 = (FF B0 50)

Figure 8 Method of recording overlapping pixels

extracted a representative imagematrix of eachmalware fam-ily to reduce the costs of malware similarity calculationsThatis when a new malware sample is found the amount of timeto calculate the similarity is reduced by comparing the imagematrix of the new malware with the image matrices thatrepresent individualmalware families instead of comparing itwith all of the imagematrices of the existingmalware samples

As shown in Figure 9 to extract a representative imagematrix for eachmalware family imagematrices are generatedfor samples in malware families Then the representativeimage matrix is extracted by recording only the commonpixels that have same coordinates and RGB colors from theimage matrices of individual malware samples in the samefamily Figure 9 shows an example of the generation ofrepresentative images of malware families

35 Similarity Calculation Using Image Matrices The advan-tage of the similarity calculation using the imagematrices is afaster performance than with exact matching using the stringtype of opcode sequences even though there are some extrafalse positives due to hash collision When using the stringthe time complexity is defined as 119874(1198992) due to the processof finding pairs of exactly matched strings However if theimage matrices are used to calculate similarities since thecoordinates and colors of the opcode sequences are definedthrough SimHash the process of finding the pairs is skippedTherefore the time complexity of the similarity calculationsusing the image matrices is defined as O(n) because onlythe color information of the pixels recorded on the samecoordinates in both image matrices is used to calculate thesimilarities between the image matrices

Pixel similarity calculations are carried out first for pixelsin each image matrix The most important consideration ina similarity calculation in this case is that only those RGB

color pixels recorded in the individual imagematrices shouldbe used Image matrices have RGB-colored pixels on squareimageswith black backgrounds If black pixels are also used insimilarity calculations the similarities between samples fromdifferent malware families can be calculated as very highTherefore when the similarities of the image matrices arecalculated the following cases are considered for pixels onthe same coordinates in the two image matrices as shownin Figure 10 In this case the vector angular-based distancemeasurement algorithm is used to calculate the similaritiesbetween color pixels This algorithm calculates similarityvalues by expressing the color pixels constituting each imageas 3D vectors as shown in (1) and then using the angleinformation and size information

(a) Case 1 if all of the pixels in the areas of both imagematrices are black the pixel similarity calculation willnot be carried out and the next pixel will be selected

(b) Case 2 if one pixel in a selected area is black and thecorresponding pixel in the other image is colored thepixel similarity will be defined as 0

(c) Case 3 if both pixels are not black but colored thecolor pixel similarity will be calculated using the vec-tor angular-based distance measurement algorithm[39] as follows

120575 (119909119894 119909119895) = [1 minus

2

120587cosminus1(

119909119894sdot 119909119895

10038161003816100381610038161199091198941003816100381610038161003816

10038161003816100381610038161003816119909119895

10038161003816100381610038161003816

)][1 minus

10038161003816100381610038161003816119909119894minus 119909119895

10038161003816100381610038161003816

radic3 sdot 2552]

(1)

The similarity values of the image matrix when consideringindividual cases are calculated using the results from thepixel similarity calculations as shown in (2) That is the sumof pixel similarity values calculated in case 3 is divided by thenumber of pixels calculated in cases 2 and 3 to calculate theaverage

Sim (119860 119861) =sum of pixel similarity values in case 3

of pixels in case 2 and case 3 (2)

4 Experimental Results

41 Experimental Data and Environment Using the visualanalysis tools implemented in this paper and the malwaresamples shown in Table 1 imagematrices were generated andsimilarity calculations were performed First set A consists of290 malware samples from 16 families in which the detectionavoidance techniques such as obfuscation and packing arenot applied These malware samples are used to extract thebasic blocks through static analysis using a disassemblerSecond set B consists of 560 malware samples from 14 fam-ilies in which the packed and nonpacked malware samplescoexist We used these malware samples to generate dynamicexecution traces through the PIN tool in a dynamic analysisenvironment and the filtered basic blocks are extracted fromthe dynamic execution traces through the repetition filteringtechnique as explained in Section 321

8 The Scientific World Journal

Image matrix of samplea Image matrix of samplen Representative image matrix of sample family

Common pixels of sample family

Pixels only in sampleaPixels only in samplen

middot middot middot

Figure 9 Representative image matrix extraction

Sample A

BlackBlack

Sample A998400

(a) Case 1 black and black

Black Color

Sample A Sample A998400

(b) Case 2 black and color

Color Color

Sample A Sample A998400

(c) Case 3 color and color

Figure 10 Three cases considered in pixel similarity calculations

For the experiments we constructed an experimentalenvironment consisting of the analysis servermalware serverand monitoring machine as shown Figure 11 We set upVMware vSphere ESXi 51 in the analysis server which hasan Intel Xeon E5-1607 processor and 24GB of main memoryand we installed two Windows operating systems (OSs) asguest OSs In the first Windows OS the dynamic executiontraces were extracted through the PIN In the otherWindowsOS the image matrices were generated and similarities werecalculated through our visual analysis tool Malware samplesthat are provided for the analysis server and the dynamicexecution traces extracted from the analysis server are storedin the malware server The monitoring machine controls theanalysis server through the PowerCLI tool that is the remotecommand line interface

42 Experiments with Static Analysis For the experimentsin this section we disassembled the malware samples withinset A and extracted major blocks from the basic blocks Wethen generated the imagematrices using opcode sequences ofthosemajor blocks and analyzed the similarities among them

421 Image Matrix Generation In this paper we set the sizesof the generated image matrices to 256 times 256 pixels for theexperiments As shown in Table 2 the reasons for using thisimage matrix size can be briefly summarized as the middleground between file size similarity calculation time and

classification accuracy The accuracy was calculated by using(3)

Accuracy = of correctly classified malware samples

of total malware samples

(3)

Figure 12 shows examples of the image matrices generatedfrom the malware samples of individual families within setA Only three image matrices for each malware family andone representative image matrix extracted by recording onlythose pixels commonly existing in all image matrices wereincluded Since the number of opcode sequences used asmalware information varied the number of pixels recordedon the image matrices differed In the case of malwaremany of the same or similar RGB-colored pixels are foundamong the image matrices of malware samples classified asthe same family However even if pixels are recorded onthe same coordinates of different image matrices the pixelsimilarities have different values if the RGB color informationof the relevant pixels is different Our results show that imagematrices of variants included in the samemalware family canbe shown to be similar and that clear differences exist amongmalware samples from different families

Figure 13 shows the image matrix differences before andafter the application of the major block selection techniqueThe image progression indicates that the number of pixelsrecorded in the image matrices decreases because of theselection of major blocks from among the basic blocks The

The Scientific World Journal 9

Windows OS Windows OS

VM ware vSphere ESXi 51

Visual analysistool

Malware server

Malware

Analysis server

Commands MalwareExecution tracePo

wer

CLI

Monitoring machine

PIN

Intel xeon E5-1607 processor24 GB memory

Figure 11 Experimental environment

IstBaraa IstBarab IstBarac IstBar representative

(a) Trojan-DownloaderWin32IstBar family

Semisofta Semisoftb Semisoftc Semisoft representative

(b) VirusWin32HLLPSemisoft family

Debormc Debormj Debormk Deborm representative

(c) WormWin32Deborm family

Figure 12 Image matrices of malware family samples

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

The Scientific World Journal 5

Major blocks

Basic blocks

MOV MOVMOVLEA MOVMOVMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR

PUSH ebpLEA PUSHPUSH ebxCALLADD POP ebpMOV

PUSH esiPUSH ebpLEA CALLPOP ebpPOP esiMOV OR eax eax

eax eax

JZ

PUSH ebpLEA PUSHPUSH ebxCALLADD esp 8

esp 8

POP ebpMOV eax 1

eax 1 ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebp [ebx + 10 h]

0FFFFFFFFh

0FFFFFFFFh

sub 4019EA

sub 4019EA

dword ptr [edi + ecx lowast 4 + 4]

dword ptr [edi + ecx lowast 4 + 4]

JZ short loc 40192F

ebx [ebp + arg 4]

ebx [ebp + arg 4]

[ebp + var 8] eaxeax [ebp + arg 8][ebp + var 4] eaxeax [ebp + var 8][ebx-4] eaxesi [ebx + 0Ch]edi [ebx+8

short loc 40192F

Figure 4 Major block selection

individual major blocks are used as malware informationFrom each opcode only the first three characters are usedto generate information for the block The reasons for usinga three-character opcode are as follows From the entire setof opcodes used in the Intel x86 assembly language 414 ofthem have three characters and the appearance frequenciesof these opcodes within the binary files are higher thanfor other opcodes On the other hand 288 of opcodeshave four characters 178 have five characters and 52have over six characters respectively thus their appearancefrequencies are relatively low In addition since themeaningsof the individual opcodes are maintained even though theyare reduced to three characters the different opcodes canbe distinguished For example four-character opcodes suchas PUSH are reduced to three characters PUS and two-character opcodes such as OR are expanded by adding ablank character to a three-character opcode Then thesethree-character opcodes are concatenated together and thecharacter string is used to represent the block as an opcodesequence which is used to generate image pixels in an imagematrix in the next step

33 Generation of the Image Matrix Figure 6 shows theprocedure for Step 2 that converts the opcode sequences into

pixels in an image matrix A hash function is used to decidethe119883-119884 coordinates and RGB colors of the pixels

To visualize a binary file as an image matrix both thelength and the width of an image matrix are initialized to 2nwhere 119899 is selected by the users To reduce the probability ofcollisions of the hash function n should be large enough Inour experiments we selected 119899 as 8 to minimize collisions

The coordinate-defining module and the RGB color-defining module are used to generate image matrices Firstthe coordinate-defining module defines the (119909 119910) coordi-nates of pixels on the image matrix of each code blockSecond the RGB color-defining module defines the colorvalues of pixels on the image matrix RGB colors are definedby calculating values of 8 bits each for the red green and bluecolors

SimHash [38] is applied to opcode sequences extracted inStep 1 in order to define both the coordinates and the colorvalues of the pixels SimHash is a local-sensitive hash functionused in the similar sentence detection system which assumesthat if the input values are similar then the output values willalso be similar That is since SimHash tokenizes the inputstrings and generates hash values for each token if a fewtokens are different in two input strings then the generatedhash values are not completely different but are similar

6 The Scientific World Journal

PUSH

PUSHPUSH

PUSHPUSHLEA

ADD

LEA

POP

POPPOP

MOV

MOVORJZ

CALL

CALL

ebp

ebp

ebp

ebp

esi

esi

eax eax

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebx

esp 8

Major blocks Opcode sequences

PUSLEAPUSPUSCALADDPOPMOV

PUSPUSLEACALPOPPOPMOV OR JZ

0FFFFFFFFh

dword ptr [edi + ecx lowast 4 + 4]

ebx [ebp + arg 4]

short loc 40192F

eax 1

sub 4019EA

Figure 5 Opcode sequences used as malware information

37 57 D6 C3 2BX Y R BG

PUSPUSLEACALPOPPOPMOV OR JZ

Malware information

Image matrix

SimHashcoordinate define

SimHashRGB color define

Figure 6 Generating images using opcode sequences

Therefore if the character strings of the opcode sequences aresimilar then the outputs will be similar and they will maponto similar coordinates in an image matrix

Once the coordinates and RGB colors of the individualpixels have been defined RGB-colored images are recordedon the individual coordinates of image matrices To providehuman analysts with amore convenient visual analysis pixelsaround the defined coordinates are recorded simultaneouslyAs shown in Figure 7 nine pixels from (119909 minus 1 119910 minus 1) to (119909 +1 119910+1) around an (119909 119910) coordinate for a block are recorded

If the images overlap each other because the coordinatesdefined formultiple opcode sequences are adjacent as shownin Figure 8 the sums of RGB colors become new pixel colorsIf the result of a color summing exceeds 255 (0xFF) the resultwill be set to 255 For example if RGB

1is (255 0 0) and RGB

2

is (0 176 50) the new color will become (255 176 50)The number of pixels recorded on an image matrix varies

according to themajor blocks and the number of overlappingpixels will increase as the number of images increases If thereare too many overlapping images then the size of the imagematrix should be increased

x minus 1

y minus 1

x minus 1

y

x + 1

y

y minus 1

x

y + 1

x

y minus 1

x + 1

y + 1

x + 1

y + 1

x minus 1

x y

Figure 7 Nine pixels for one opcode sequence

34 Representative Image Matrix Extraction Since manymalware variants exist in eachmalware family as the numberof malware samples increases the total amount and timeof the similarity calculation increase too Therefore we

The Scientific World Journal 7

111

11

11

1

223

2

2

2

2

2

2

R3 = min(R1 + R2 FF) = min (FF + 00 FF) = FF

G3 = min(G1 + G2 FF) = min (FF + B0 FF) = B0

B3 = min(B1 + B2 FF) = min (FF + 50 FF) = 50

there4 RGB3 = (FF B0 50)

Figure 8 Method of recording overlapping pixels

extracted a representative imagematrix of eachmalware fam-ily to reduce the costs of malware similarity calculationsThatis when a new malware sample is found the amount of timeto calculate the similarity is reduced by comparing the imagematrix of the new malware with the image matrices thatrepresent individualmalware families instead of comparing itwith all of the imagematrices of the existingmalware samples

As shown in Figure 9 to extract a representative imagematrix for eachmalware family imagematrices are generatedfor samples in malware families Then the representativeimage matrix is extracted by recording only the commonpixels that have same coordinates and RGB colors from theimage matrices of individual malware samples in the samefamily Figure 9 shows an example of the generation ofrepresentative images of malware families

35 Similarity Calculation Using Image Matrices The advan-tage of the similarity calculation using the imagematrices is afaster performance than with exact matching using the stringtype of opcode sequences even though there are some extrafalse positives due to hash collision When using the stringthe time complexity is defined as 119874(1198992) due to the processof finding pairs of exactly matched strings However if theimage matrices are used to calculate similarities since thecoordinates and colors of the opcode sequences are definedthrough SimHash the process of finding the pairs is skippedTherefore the time complexity of the similarity calculationsusing the image matrices is defined as O(n) because onlythe color information of the pixels recorded on the samecoordinates in both image matrices is used to calculate thesimilarities between the image matrices

Pixel similarity calculations are carried out first for pixelsin each image matrix The most important consideration ina similarity calculation in this case is that only those RGB

color pixels recorded in the individual imagematrices shouldbe used Image matrices have RGB-colored pixels on squareimageswith black backgrounds If black pixels are also used insimilarity calculations the similarities between samples fromdifferent malware families can be calculated as very highTherefore when the similarities of the image matrices arecalculated the following cases are considered for pixels onthe same coordinates in the two image matrices as shownin Figure 10 In this case the vector angular-based distancemeasurement algorithm is used to calculate the similaritiesbetween color pixels This algorithm calculates similarityvalues by expressing the color pixels constituting each imageas 3D vectors as shown in (1) and then using the angleinformation and size information

(a) Case 1 if all of the pixels in the areas of both imagematrices are black the pixel similarity calculation willnot be carried out and the next pixel will be selected

(b) Case 2 if one pixel in a selected area is black and thecorresponding pixel in the other image is colored thepixel similarity will be defined as 0

(c) Case 3 if both pixels are not black but colored thecolor pixel similarity will be calculated using the vec-tor angular-based distance measurement algorithm[39] as follows

120575 (119909119894 119909119895) = [1 minus

2

120587cosminus1(

119909119894sdot 119909119895

10038161003816100381610038161199091198941003816100381610038161003816

10038161003816100381610038161003816119909119895

10038161003816100381610038161003816

)][1 minus

10038161003816100381610038161003816119909119894minus 119909119895

10038161003816100381610038161003816

radic3 sdot 2552]

(1)

The similarity values of the image matrix when consideringindividual cases are calculated using the results from thepixel similarity calculations as shown in (2) That is the sumof pixel similarity values calculated in case 3 is divided by thenumber of pixels calculated in cases 2 and 3 to calculate theaverage

Sim (119860 119861) =sum of pixel similarity values in case 3

of pixels in case 2 and case 3 (2)

4 Experimental Results

41 Experimental Data and Environment Using the visualanalysis tools implemented in this paper and the malwaresamples shown in Table 1 imagematrices were generated andsimilarity calculations were performed First set A consists of290 malware samples from 16 families in which the detectionavoidance techniques such as obfuscation and packing arenot applied These malware samples are used to extract thebasic blocks through static analysis using a disassemblerSecond set B consists of 560 malware samples from 14 fam-ilies in which the packed and nonpacked malware samplescoexist We used these malware samples to generate dynamicexecution traces through the PIN tool in a dynamic analysisenvironment and the filtered basic blocks are extracted fromthe dynamic execution traces through the repetition filteringtechnique as explained in Section 321

8 The Scientific World Journal

Image matrix of samplea Image matrix of samplen Representative image matrix of sample family

Common pixels of sample family

Pixels only in sampleaPixels only in samplen

middot middot middot

Figure 9 Representative image matrix extraction

Sample A

BlackBlack

Sample A998400

(a) Case 1 black and black

Black Color

Sample A Sample A998400

(b) Case 2 black and color

Color Color

Sample A Sample A998400

(c) Case 3 color and color

Figure 10 Three cases considered in pixel similarity calculations

For the experiments we constructed an experimentalenvironment consisting of the analysis servermalware serverand monitoring machine as shown Figure 11 We set upVMware vSphere ESXi 51 in the analysis server which hasan Intel Xeon E5-1607 processor and 24GB of main memoryand we installed two Windows operating systems (OSs) asguest OSs In the first Windows OS the dynamic executiontraces were extracted through the PIN In the otherWindowsOS the image matrices were generated and similarities werecalculated through our visual analysis tool Malware samplesthat are provided for the analysis server and the dynamicexecution traces extracted from the analysis server are storedin the malware server The monitoring machine controls theanalysis server through the PowerCLI tool that is the remotecommand line interface

42 Experiments with Static Analysis For the experimentsin this section we disassembled the malware samples withinset A and extracted major blocks from the basic blocks Wethen generated the imagematrices using opcode sequences ofthosemajor blocks and analyzed the similarities among them

421 Image Matrix Generation In this paper we set the sizesof the generated image matrices to 256 times 256 pixels for theexperiments As shown in Table 2 the reasons for using thisimage matrix size can be briefly summarized as the middleground between file size similarity calculation time and

classification accuracy The accuracy was calculated by using(3)

Accuracy = of correctly classified malware samples

of total malware samples

(3)

Figure 12 shows examples of the image matrices generatedfrom the malware samples of individual families within setA Only three image matrices for each malware family andone representative image matrix extracted by recording onlythose pixels commonly existing in all image matrices wereincluded Since the number of opcode sequences used asmalware information varied the number of pixels recordedon the image matrices differed In the case of malwaremany of the same or similar RGB-colored pixels are foundamong the image matrices of malware samples classified asthe same family However even if pixels are recorded onthe same coordinates of different image matrices the pixelsimilarities have different values if the RGB color informationof the relevant pixels is different Our results show that imagematrices of variants included in the samemalware family canbe shown to be similar and that clear differences exist amongmalware samples from different families

Figure 13 shows the image matrix differences before andafter the application of the major block selection techniqueThe image progression indicates that the number of pixelsrecorded in the image matrices decreases because of theselection of major blocks from among the basic blocks The

The Scientific World Journal 9

Windows OS Windows OS

VM ware vSphere ESXi 51

Visual analysistool

Malware server

Malware

Analysis server

Commands MalwareExecution tracePo

wer

CLI

Monitoring machine

PIN

Intel xeon E5-1607 processor24 GB memory

Figure 11 Experimental environment

IstBaraa IstBarab IstBarac IstBar representative

(a) Trojan-DownloaderWin32IstBar family

Semisofta Semisoftb Semisoftc Semisoft representative

(b) VirusWin32HLLPSemisoft family

Debormc Debormj Debormk Deborm representative

(c) WormWin32Deborm family

Figure 12 Image matrices of malware family samples

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

6 The Scientific World Journal

PUSH

PUSHPUSH

PUSHPUSHLEA

ADD

LEA

POP

POPPOP

MOV

MOVORJZ

CALL

CALL

ebp

ebp

ebp

ebp

esi

esi

eax eax

ebp [ebx + 10 h]

ebp [ebx + 10 h]

ebx

esp 8

Major blocks Opcode sequences

PUSLEAPUSPUSCALADDPOPMOV

PUSPUSLEACALPOPPOPMOV OR JZ

0FFFFFFFFh

dword ptr [edi + ecx lowast 4 + 4]

ebx [ebp + arg 4]

short loc 40192F

eax 1

sub 4019EA

Figure 5 Opcode sequences used as malware information

37 57 D6 C3 2BX Y R BG

PUSPUSLEACALPOPPOPMOV OR JZ

Malware information

Image matrix

SimHashcoordinate define

SimHashRGB color define

Figure 6 Generating images using opcode sequences

Therefore if the character strings of the opcode sequences aresimilar then the outputs will be similar and they will maponto similar coordinates in an image matrix

Once the coordinates and RGB colors of the individualpixels have been defined RGB-colored images are recordedon the individual coordinates of image matrices To providehuman analysts with amore convenient visual analysis pixelsaround the defined coordinates are recorded simultaneouslyAs shown in Figure 7 nine pixels from (119909 minus 1 119910 minus 1) to (119909 +1 119910+1) around an (119909 119910) coordinate for a block are recorded

If the images overlap each other because the coordinatesdefined formultiple opcode sequences are adjacent as shownin Figure 8 the sums of RGB colors become new pixel colorsIf the result of a color summing exceeds 255 (0xFF) the resultwill be set to 255 For example if RGB

1is (255 0 0) and RGB

2

is (0 176 50) the new color will become (255 176 50)The number of pixels recorded on an image matrix varies

according to themajor blocks and the number of overlappingpixels will increase as the number of images increases If thereare too many overlapping images then the size of the imagematrix should be increased

x minus 1

y minus 1

x minus 1

y

x + 1

y

y minus 1

x

y + 1

x

y minus 1

x + 1

y + 1

x + 1

y + 1

x minus 1

x y

Figure 7 Nine pixels for one opcode sequence

34 Representative Image Matrix Extraction Since manymalware variants exist in eachmalware family as the numberof malware samples increases the total amount and timeof the similarity calculation increase too Therefore we

The Scientific World Journal 7

111

11

11

1

223

2

2

2

2

2

2

R3 = min(R1 + R2 FF) = min (FF + 00 FF) = FF

G3 = min(G1 + G2 FF) = min (FF + B0 FF) = B0

B3 = min(B1 + B2 FF) = min (FF + 50 FF) = 50

there4 RGB3 = (FF B0 50)

Figure 8 Method of recording overlapping pixels

extracted a representative imagematrix of eachmalware fam-ily to reduce the costs of malware similarity calculationsThatis when a new malware sample is found the amount of timeto calculate the similarity is reduced by comparing the imagematrix of the new malware with the image matrices thatrepresent individualmalware families instead of comparing itwith all of the imagematrices of the existingmalware samples

As shown in Figure 9 to extract a representative imagematrix for eachmalware family imagematrices are generatedfor samples in malware families Then the representativeimage matrix is extracted by recording only the commonpixels that have same coordinates and RGB colors from theimage matrices of individual malware samples in the samefamily Figure 9 shows an example of the generation ofrepresentative images of malware families

35 Similarity Calculation Using Image Matrices The advan-tage of the similarity calculation using the imagematrices is afaster performance than with exact matching using the stringtype of opcode sequences even though there are some extrafalse positives due to hash collision When using the stringthe time complexity is defined as 119874(1198992) due to the processof finding pairs of exactly matched strings However if theimage matrices are used to calculate similarities since thecoordinates and colors of the opcode sequences are definedthrough SimHash the process of finding the pairs is skippedTherefore the time complexity of the similarity calculationsusing the image matrices is defined as O(n) because onlythe color information of the pixels recorded on the samecoordinates in both image matrices is used to calculate thesimilarities between the image matrices

Pixel similarity calculations are carried out first for pixelsin each image matrix The most important consideration ina similarity calculation in this case is that only those RGB

color pixels recorded in the individual imagematrices shouldbe used Image matrices have RGB-colored pixels on squareimageswith black backgrounds If black pixels are also used insimilarity calculations the similarities between samples fromdifferent malware families can be calculated as very highTherefore when the similarities of the image matrices arecalculated the following cases are considered for pixels onthe same coordinates in the two image matrices as shownin Figure 10 In this case the vector angular-based distancemeasurement algorithm is used to calculate the similaritiesbetween color pixels This algorithm calculates similarityvalues by expressing the color pixels constituting each imageas 3D vectors as shown in (1) and then using the angleinformation and size information

(a) Case 1 if all of the pixels in the areas of both imagematrices are black the pixel similarity calculation willnot be carried out and the next pixel will be selected

(b) Case 2 if one pixel in a selected area is black and thecorresponding pixel in the other image is colored thepixel similarity will be defined as 0

(c) Case 3 if both pixels are not black but colored thecolor pixel similarity will be calculated using the vec-tor angular-based distance measurement algorithm[39] as follows

120575 (119909119894 119909119895) = [1 minus

2

120587cosminus1(

119909119894sdot 119909119895

10038161003816100381610038161199091198941003816100381610038161003816

10038161003816100381610038161003816119909119895

10038161003816100381610038161003816

)][1 minus

10038161003816100381610038161003816119909119894minus 119909119895

10038161003816100381610038161003816

radic3 sdot 2552]

(1)

The similarity values of the image matrix when consideringindividual cases are calculated using the results from thepixel similarity calculations as shown in (2) That is the sumof pixel similarity values calculated in case 3 is divided by thenumber of pixels calculated in cases 2 and 3 to calculate theaverage

Sim (119860 119861) =sum of pixel similarity values in case 3

of pixels in case 2 and case 3 (2)

4 Experimental Results

41 Experimental Data and Environment Using the visualanalysis tools implemented in this paper and the malwaresamples shown in Table 1 imagematrices were generated andsimilarity calculations were performed First set A consists of290 malware samples from 16 families in which the detectionavoidance techniques such as obfuscation and packing arenot applied These malware samples are used to extract thebasic blocks through static analysis using a disassemblerSecond set B consists of 560 malware samples from 14 fam-ilies in which the packed and nonpacked malware samplescoexist We used these malware samples to generate dynamicexecution traces through the PIN tool in a dynamic analysisenvironment and the filtered basic blocks are extracted fromthe dynamic execution traces through the repetition filteringtechnique as explained in Section 321

8 The Scientific World Journal

Image matrix of samplea Image matrix of samplen Representative image matrix of sample family

Common pixels of sample family

Pixels only in sampleaPixels only in samplen

middot middot middot

Figure 9 Representative image matrix extraction

Sample A

BlackBlack

Sample A998400

(a) Case 1 black and black

Black Color

Sample A Sample A998400

(b) Case 2 black and color

Color Color

Sample A Sample A998400

(c) Case 3 color and color

Figure 10 Three cases considered in pixel similarity calculations

For the experiments we constructed an experimentalenvironment consisting of the analysis servermalware serverand monitoring machine as shown Figure 11 We set upVMware vSphere ESXi 51 in the analysis server which hasan Intel Xeon E5-1607 processor and 24GB of main memoryand we installed two Windows operating systems (OSs) asguest OSs In the first Windows OS the dynamic executiontraces were extracted through the PIN In the otherWindowsOS the image matrices were generated and similarities werecalculated through our visual analysis tool Malware samplesthat are provided for the analysis server and the dynamicexecution traces extracted from the analysis server are storedin the malware server The monitoring machine controls theanalysis server through the PowerCLI tool that is the remotecommand line interface

42 Experiments with Static Analysis For the experimentsin this section we disassembled the malware samples withinset A and extracted major blocks from the basic blocks Wethen generated the imagematrices using opcode sequences ofthosemajor blocks and analyzed the similarities among them

421 Image Matrix Generation In this paper we set the sizesof the generated image matrices to 256 times 256 pixels for theexperiments As shown in Table 2 the reasons for using thisimage matrix size can be briefly summarized as the middleground between file size similarity calculation time and

classification accuracy The accuracy was calculated by using(3)

Accuracy = of correctly classified malware samples

of total malware samples

(3)

Figure 12 shows examples of the image matrices generatedfrom the malware samples of individual families within setA Only three image matrices for each malware family andone representative image matrix extracted by recording onlythose pixels commonly existing in all image matrices wereincluded Since the number of opcode sequences used asmalware information varied the number of pixels recordedon the image matrices differed In the case of malwaremany of the same or similar RGB-colored pixels are foundamong the image matrices of malware samples classified asthe same family However even if pixels are recorded onthe same coordinates of different image matrices the pixelsimilarities have different values if the RGB color informationof the relevant pixels is different Our results show that imagematrices of variants included in the samemalware family canbe shown to be similar and that clear differences exist amongmalware samples from different families

Figure 13 shows the image matrix differences before andafter the application of the major block selection techniqueThe image progression indicates that the number of pixelsrecorded in the image matrices decreases because of theselection of major blocks from among the basic blocks The

The Scientific World Journal 9

Windows OS Windows OS

VM ware vSphere ESXi 51

Visual analysistool

Malware server

Malware

Analysis server

Commands MalwareExecution tracePo

wer

CLI

Monitoring machine

PIN

Intel xeon E5-1607 processor24 GB memory

Figure 11 Experimental environment

IstBaraa IstBarab IstBarac IstBar representative

(a) Trojan-DownloaderWin32IstBar family

Semisofta Semisoftb Semisoftc Semisoft representative

(b) VirusWin32HLLPSemisoft family

Debormc Debormj Debormk Deborm representative

(c) WormWin32Deborm family

Figure 12 Image matrices of malware family samples

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

The Scientific World Journal 7

111

11

11

1

223

2

2

2

2

2

2

R3 = min(R1 + R2 FF) = min (FF + 00 FF) = FF

G3 = min(G1 + G2 FF) = min (FF + B0 FF) = B0

B3 = min(B1 + B2 FF) = min (FF + 50 FF) = 50

there4 RGB3 = (FF B0 50)

Figure 8 Method of recording overlapping pixels

extracted a representative imagematrix of eachmalware fam-ily to reduce the costs of malware similarity calculationsThatis when a new malware sample is found the amount of timeto calculate the similarity is reduced by comparing the imagematrix of the new malware with the image matrices thatrepresent individualmalware families instead of comparing itwith all of the imagematrices of the existingmalware samples

As shown in Figure 9 to extract a representative imagematrix for eachmalware family imagematrices are generatedfor samples in malware families Then the representativeimage matrix is extracted by recording only the commonpixels that have same coordinates and RGB colors from theimage matrices of individual malware samples in the samefamily Figure 9 shows an example of the generation ofrepresentative images of malware families

35 Similarity Calculation Using Image Matrices The advan-tage of the similarity calculation using the imagematrices is afaster performance than with exact matching using the stringtype of opcode sequences even though there are some extrafalse positives due to hash collision When using the stringthe time complexity is defined as 119874(1198992) due to the processof finding pairs of exactly matched strings However if theimage matrices are used to calculate similarities since thecoordinates and colors of the opcode sequences are definedthrough SimHash the process of finding the pairs is skippedTherefore the time complexity of the similarity calculationsusing the image matrices is defined as O(n) because onlythe color information of the pixels recorded on the samecoordinates in both image matrices is used to calculate thesimilarities between the image matrices

Pixel similarity calculations are carried out first for pixelsin each image matrix The most important consideration ina similarity calculation in this case is that only those RGB

color pixels recorded in the individual imagematrices shouldbe used Image matrices have RGB-colored pixels on squareimageswith black backgrounds If black pixels are also used insimilarity calculations the similarities between samples fromdifferent malware families can be calculated as very highTherefore when the similarities of the image matrices arecalculated the following cases are considered for pixels onthe same coordinates in the two image matrices as shownin Figure 10 In this case the vector angular-based distancemeasurement algorithm is used to calculate the similaritiesbetween color pixels This algorithm calculates similarityvalues by expressing the color pixels constituting each imageas 3D vectors as shown in (1) and then using the angleinformation and size information

(a) Case 1 if all of the pixels in the areas of both imagematrices are black the pixel similarity calculation willnot be carried out and the next pixel will be selected

(b) Case 2 if one pixel in a selected area is black and thecorresponding pixel in the other image is colored thepixel similarity will be defined as 0

(c) Case 3 if both pixels are not black but colored thecolor pixel similarity will be calculated using the vec-tor angular-based distance measurement algorithm[39] as follows

120575 (119909119894 119909119895) = [1 minus

2

120587cosminus1(

119909119894sdot 119909119895

10038161003816100381610038161199091198941003816100381610038161003816

10038161003816100381610038161003816119909119895

10038161003816100381610038161003816

)][1 minus

10038161003816100381610038161003816119909119894minus 119909119895

10038161003816100381610038161003816

radic3 sdot 2552]

(1)

The similarity values of the image matrix when consideringindividual cases are calculated using the results from thepixel similarity calculations as shown in (2) That is the sumof pixel similarity values calculated in case 3 is divided by thenumber of pixels calculated in cases 2 and 3 to calculate theaverage

Sim (119860 119861) =sum of pixel similarity values in case 3

of pixels in case 2 and case 3 (2)

4 Experimental Results

41 Experimental Data and Environment Using the visualanalysis tools implemented in this paper and the malwaresamples shown in Table 1 imagematrices were generated andsimilarity calculations were performed First set A consists of290 malware samples from 16 families in which the detectionavoidance techniques such as obfuscation and packing arenot applied These malware samples are used to extract thebasic blocks through static analysis using a disassemblerSecond set B consists of 560 malware samples from 14 fam-ilies in which the packed and nonpacked malware samplescoexist We used these malware samples to generate dynamicexecution traces through the PIN tool in a dynamic analysisenvironment and the filtered basic blocks are extracted fromthe dynamic execution traces through the repetition filteringtechnique as explained in Section 321

8 The Scientific World Journal

Image matrix of samplea Image matrix of samplen Representative image matrix of sample family

Common pixels of sample family

Pixels only in sampleaPixels only in samplen

middot middot middot

Figure 9 Representative image matrix extraction

Sample A

BlackBlack

Sample A998400

(a) Case 1 black and black

Black Color

Sample A Sample A998400

(b) Case 2 black and color

Color Color

Sample A Sample A998400

(c) Case 3 color and color

Figure 10 Three cases considered in pixel similarity calculations

For the experiments we constructed an experimentalenvironment consisting of the analysis servermalware serverand monitoring machine as shown Figure 11 We set upVMware vSphere ESXi 51 in the analysis server which hasan Intel Xeon E5-1607 processor and 24GB of main memoryand we installed two Windows operating systems (OSs) asguest OSs In the first Windows OS the dynamic executiontraces were extracted through the PIN In the otherWindowsOS the image matrices were generated and similarities werecalculated through our visual analysis tool Malware samplesthat are provided for the analysis server and the dynamicexecution traces extracted from the analysis server are storedin the malware server The monitoring machine controls theanalysis server through the PowerCLI tool that is the remotecommand line interface

42 Experiments with Static Analysis For the experimentsin this section we disassembled the malware samples withinset A and extracted major blocks from the basic blocks Wethen generated the imagematrices using opcode sequences ofthosemajor blocks and analyzed the similarities among them

421 Image Matrix Generation In this paper we set the sizesof the generated image matrices to 256 times 256 pixels for theexperiments As shown in Table 2 the reasons for using thisimage matrix size can be briefly summarized as the middleground between file size similarity calculation time and

classification accuracy The accuracy was calculated by using(3)

Accuracy = of correctly classified malware samples

of total malware samples

(3)

Figure 12 shows examples of the image matrices generatedfrom the malware samples of individual families within setA Only three image matrices for each malware family andone representative image matrix extracted by recording onlythose pixels commonly existing in all image matrices wereincluded Since the number of opcode sequences used asmalware information varied the number of pixels recordedon the image matrices differed In the case of malwaremany of the same or similar RGB-colored pixels are foundamong the image matrices of malware samples classified asthe same family However even if pixels are recorded onthe same coordinates of different image matrices the pixelsimilarities have different values if the RGB color informationof the relevant pixels is different Our results show that imagematrices of variants included in the samemalware family canbe shown to be similar and that clear differences exist amongmalware samples from different families

Figure 13 shows the image matrix differences before andafter the application of the major block selection techniqueThe image progression indicates that the number of pixelsrecorded in the image matrices decreases because of theselection of major blocks from among the basic blocks The

The Scientific World Journal 9

Windows OS Windows OS

VM ware vSphere ESXi 51

Visual analysistool

Malware server

Malware

Analysis server

Commands MalwareExecution tracePo

wer

CLI

Monitoring machine

PIN

Intel xeon E5-1607 processor24 GB memory

Figure 11 Experimental environment

IstBaraa IstBarab IstBarac IstBar representative

(a) Trojan-DownloaderWin32IstBar family

Semisofta Semisoftb Semisoftc Semisoft representative

(b) VirusWin32HLLPSemisoft family

Debormc Debormj Debormk Deborm representative

(c) WormWin32Deborm family

Figure 12 Image matrices of malware family samples

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

8 The Scientific World Journal

Image matrix of samplea Image matrix of samplen Representative image matrix of sample family

Common pixels of sample family

Pixels only in sampleaPixels only in samplen

middot middot middot

Figure 9 Representative image matrix extraction

Sample A

BlackBlack

Sample A998400

(a) Case 1 black and black

Black Color

Sample A Sample A998400

(b) Case 2 black and color

Color Color

Sample A Sample A998400

(c) Case 3 color and color

Figure 10 Three cases considered in pixel similarity calculations

For the experiments we constructed an experimentalenvironment consisting of the analysis servermalware serverand monitoring machine as shown Figure 11 We set upVMware vSphere ESXi 51 in the analysis server which hasan Intel Xeon E5-1607 processor and 24GB of main memoryand we installed two Windows operating systems (OSs) asguest OSs In the first Windows OS the dynamic executiontraces were extracted through the PIN In the otherWindowsOS the image matrices were generated and similarities werecalculated through our visual analysis tool Malware samplesthat are provided for the analysis server and the dynamicexecution traces extracted from the analysis server are storedin the malware server The monitoring machine controls theanalysis server through the PowerCLI tool that is the remotecommand line interface

42 Experiments with Static Analysis For the experimentsin this section we disassembled the malware samples withinset A and extracted major blocks from the basic blocks Wethen generated the imagematrices using opcode sequences ofthosemajor blocks and analyzed the similarities among them

421 Image Matrix Generation In this paper we set the sizesof the generated image matrices to 256 times 256 pixels for theexperiments As shown in Table 2 the reasons for using thisimage matrix size can be briefly summarized as the middleground between file size similarity calculation time and

classification accuracy The accuracy was calculated by using(3)

Accuracy = of correctly classified malware samples

of total malware samples

(3)

Figure 12 shows examples of the image matrices generatedfrom the malware samples of individual families within setA Only three image matrices for each malware family andone representative image matrix extracted by recording onlythose pixels commonly existing in all image matrices wereincluded Since the number of opcode sequences used asmalware information varied the number of pixels recordedon the image matrices differed In the case of malwaremany of the same or similar RGB-colored pixels are foundamong the image matrices of malware samples classified asthe same family However even if pixels are recorded onthe same coordinates of different image matrices the pixelsimilarities have different values if the RGB color informationof the relevant pixels is different Our results show that imagematrices of variants included in the samemalware family canbe shown to be similar and that clear differences exist amongmalware samples from different families

Figure 13 shows the image matrix differences before andafter the application of the major block selection techniqueThe image progression indicates that the number of pixelsrecorded in the image matrices decreases because of theselection of major blocks from among the basic blocks The

The Scientific World Journal 9

Windows OS Windows OS

VM ware vSphere ESXi 51

Visual analysistool

Malware server

Malware

Analysis server

Commands MalwareExecution tracePo

wer

CLI

Monitoring machine

PIN

Intel xeon E5-1607 processor24 GB memory

Figure 11 Experimental environment

IstBaraa IstBarab IstBarac IstBar representative

(a) Trojan-DownloaderWin32IstBar family

Semisofta Semisoftb Semisoftc Semisoft representative

(b) VirusWin32HLLPSemisoft family

Debormc Debormj Debormk Deborm representative

(c) WormWin32Deborm family

Figure 12 Image matrices of malware family samples

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

The Scientific World Journal 9

Windows OS Windows OS

VM ware vSphere ESXi 51

Visual analysistool

Malware server

Malware

Analysis server

Commands MalwareExecution tracePo

wer

CLI

Monitoring machine

PIN

Intel xeon E5-1607 processor24 GB memory

Figure 11 Experimental environment

IstBaraa IstBarab IstBarac IstBar representative

(a) Trojan-DownloaderWin32IstBar family

Semisofta Semisoftb Semisoftc Semisoft representative

(b) VirusWin32HLLPSemisoft family

Debormc Debormj Debormk Deborm representative

(c) WormWin32Deborm family

Figure 12 Image matrices of malware family samples

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

10 The Scientific World Journal

Table 1 Malware samples

Set Type Family Number of variants

A

Email-Worm Klez 9Trojan-DDos Boxed 27

Trojan-Downloader

IstBar 41Ladder 5Lemmy 26Mediket 43

OneClickNetSearch 11Trojan-Dropper Tab 8

Virus

Eva 6Evol 3

Fosforo 4Gpcode 35Halen 7

Semisoft 14Zepp 11

Worm Deborm 40

B

Backdoor

Agobot 40Bifrose 40IRCBot 40SdBot 40

Trojan Dialer 40StartPage 40

Trojan-DownloaderBanload 40Dyfuca 40Swizzor 40

Trojan-Spy Bancos 40Banker 40

Email-Worm Bagle 40IM-Worm Kelvir 40P2P-Worm SpyBot 40

Table 2 The selection of image matrix size

Size(resolution)

File size(KB)

Similaritycalculation time

(ms avg)

Classificationaccuracy (avg)

128 times 128 48 53 09595256 times 256 192 182 09814512 times 512 768 664 09929

similarity changes after the application of the major blockselection is described in the next subsection

422 Major Block Selection Similarity calculations of theimage matrices after the application of major block selectionare shown in Figures 14 and 15 When the major block selec-tion technique was applied the similarity changes rangedfrom a minimum of 0002 (the Tab family) to a maximumof 0147 (the Lemmy family) among the malware samples inthe same families The results of the similarity calculationsfor different families showed that the changes ranged froma minimum of 0001 (the Eva family) to a maximum of

Table 3 Arbitrarily selected malware samples as unknown

Number Malware sample Most similar family (similarity)1 Klezj Klez (0181)2 Boxedg Boxed (0190)3 IstBargvf IstBar (0302)4 Ladderf Ladder (0339)5 Lemmyz Lemmy (0341)6 Mediketec Mediket (0325)7 OneClickNetSearchk OneClickNetSearch (0268)8 Tabgd Tab (0348)9 Evag Eva (0317)10 Evolc Evol (0329)11 Fosforod Fosforo (0341)12 Gpcodex Gpcode (0306)13 Halen2619 Halen (0352)14 Semisoftn Semisoft (0281)15 Zeppd Zepp (0265)16 Debormai Deborm (0339)17 Agobot02a Mediket (0042)18 SdBot04a Boxed (0054)

0053 (the Klez family) As a result while the range of thesimilarity values among the malware samples in the samefamily is more than 06 the range of the similarity valuesamong malware samples from different families is below01 Therefore although the similarity values change due toapplying the major block selection technique we can reducethe image matrix generation time and can find obviousdifferences in the similarity values

423 Representative ImageMatrix Extraction For this exper-iment we selected an arbitrary malware sample not includedin data set A as the unknown sample We then analyzedthe similarity calculation time and the similarity values ofan image matrix for the unknown sample with 290 imagematrices of all malware samples and with 16 representativeimage matrices extracted from individual families

Figure 16 shows the results of the similarity calculationsof an unknown sample both with all of the image matrices ofmalware samples and with the representative image matricesof individual families When all of the image matrices wereused the Tab family was found to have an average similarityvalue of 0781 while all the other families had values smallerthan 005 When representative image matrices of individualfamilies were used the average similarity of the Tab familyhad a value of 0348 while the other families had values ofless than 003 Therefore the unknown sample is expected tobe a variant of the Tab family In fact the diagnostic nameof the unknown sample used for this experiment was Trojan-Dropper Win32Tabgd

Table 3 shows the list of the malware samples selectedas unknown samples for this experiment and it includesthe results of the similarity calculations using represen-tative image matrices These malware samples except forAgobot02a and Sdbot04a were detected as variants of each

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

The Scientific World Journal 11

Every basic blocks Only major blocks

(a) Email-WormWin32Kleza

Every basic blocks Only major blocks

(b) Trojan-DDosWin32Boxeda

Every basic blocks Only major blocks

(c) Trojan-DownloaderWin32Lemmye

Every basic blocks Only major blocks

(d) VirusWin32HLLPZeppa

Figure 13 Comparison of image matrices with and without major block selection

0

02

04

06

08

1

Sim

ilarit

y

Similarities with same family

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 14 Image matrix similarity calculations of malware samplesin each family following major block selection

corresponding family in set A with a range of similarityvalues among the representative image matrices from 0181to 0352 Since the Agobot02a and the Sdbot04a samplesare not variants of the malware families in set A theirsimilarity values compared to the existing individual familyrepresentative image matrices were very low

424 Feasibility in Malware Classification Figure 17 showsthe changes in similarity values obtained by applying all the

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

Entire basic blocksMajor blocks

Figure 15 Image matrix similarity calculations of malware samplesfrom different families following major block selection

proposed methods together that is the major block selectionand representative image extraction techniques Whereas thesimilarities betweenmalware samples from the same familieshad values between 019 and 036 the similarities betweenmalware samples from different families were less than 005The classification accuracy which was obtained by usingthe image matrices that were generated through the staticanalysis was 09896 That is only three malware samplesin set A were misclassified into the other malware families

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

12 The Scientific World Journal

0010203040506070809

Similarities for unknown sample

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

mAverage similarity with family image matricesSimilarity with representative image matrix

Figure 16 Similarity values of the unknown sample compared tothe representative image matrices of individual families

000005010015020025030035040045050

Representative image matrix similarity

Sim

ilarit

y

Kle

zBo

xed

IstB

arLa

dder

Lem

my

Med

iket

One

Clic

kNet

Sear

ch Tab

Eva

Evol

Fosfo

roG

pcod

eH

alen

Sem

inso

ftZe

ppD

ebor

m

With same familyWith different families

Figure 17 Results of similarity calculations using the three pro-posed techniques

Our result was a little better than the average classificationaccuracy of 09757 using the binary texture analysis in [30]Therefore we conclude that our methods are feasible formalware classification because similarities within the samefamilies will be relatively high compared to the similaritiesbetween malware samples from different families

43 Execution Trace-Based Experiments For the executiontrace-based experiments the malware samples within set Bin Table 1 were executed in dynamic analysis environmentsusing the PIN tool Dynamic execution traces were then gen-erated and the repetition-filtered basic blocks were extractedfrom those execution traces After filtering the major blocks

50

0

100

150

200

250

Trac

e siz

e (M

B)

Trace size (total)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 18 Changes in execution trace size following the applicationof repetition filtering and major block selection

relating to suspicious behaviors and functions were selectedOur proposed techniques were applied to these executiontraces to generate image matrices and to analyze similarities

Figure 18 shows the decrease in size of the execution tracesresulting from the application of the repetition filtering andmajor block selection techniques If the sizes of the executiontraces were first reduced through the repetition filteringtechnique and then the major block selection method wasapplied the sizes of the execution traces were reducedby 765 on average (693 minimum 836 maximum)compared to the original execution traces

Figure 19 shows changes in the generated image matricesresulting from the application of the repetition filteringmethod and the major block selection Decreases in thenumber of recorded pixels in the image matrices can berecognized when the three image matrices are compared

Figures 20 and 21 show changes in the similarity valueswith the application of the repetition filtering technique andthe major block selection Although changes in the values arenot large some malware families are distinguishable if thethreshold of similarity values is set properly

In these experiments the average similarity values ofmalware samples from the same families was approximately065 and those from different families were approximately036 Compared to the results of the static analysis describedpreviously the results from the execution trace-based exper-iments show relatively small differences The reason for theseresults is that similar system dynamic link libraries (DLLs)were invoked when the malware samples of each family wereexecuted in the dynamic analysis environment to extractthe dynamic execution traces As a result similar opcodesequences due to the DLL calls and the executing of DLLsfrom the dynamic execution traces were recorded in theimage matrices of individual families so the similarity valuesincreased Nevertheless the classification accuracy obtainedthrough the similarity calculations using the image matrices

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

The Scientific World Journal 13

Using original trace Using filtered basic blocks Using major blocks

Backdoor Win32 Agobotac

Figure 19 Changes in the generated image matrices from the application of repetition filtering and major block selection

02

00

04

06

08

10

Sim

ilarit

y

Similarities with same family (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 20 Changes in similarity between samples in the samefamilies after repetition filtering and major block selection

that were generated based on the execution traces was 09732because only 15 malware samples in set B were misclassifiedand this result was similar to the accuracy in [30]

5 Conclusions and Future Work

In this paper we proposed a novelmethod to analyzemalwaresamples visually by generating image matrices To gener-ate the image matrices opcode sequences were extractedthrough static analysis and dynamic analysis In additionwe calculated the similarities between the malware variantsusing vectorized values of the RGB-colored pixels in theimage matrices The similarity calculation method using theimage matrices has a faster performance than exact matchingusing the string type of opcode sequences or basic blocksOur proposed method was implemented as a visual analysistool The experimental results showed that malware variants

0

02

04

06

08

1

Sim

ilarit

y

Similarities with different families (avg)

Original

Ago

bot

Bifro

se

IRCB

ot

SdBo

t

Dia

ler

Star

tPag

e

Banl

oad

Dyf

uca

Swiz

zor

Banc

os

Bank

er

Bagl

e

Kelv

ir

SpyB

ot

Filtered basic blocksMajor blocks

Figure 21 Changes in similarity between samples from differentfamilies after repetition filtering and major block selection

included in the same family were similar when convertedinto image matrices and the similarities between malwarevariantswere shown to be higherWith our proposedmethodsecurity analysts can analyze malware samples visually andcan distinguish similar malware samples for further analysisOur future studies include faster malware detection andclassification using the parallelization techniques and real-time processing based on GPGPU

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research project was supported by Ministry of CultureSports and Tourism (MCST) and from Korea Copyright

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

14 The Scientific World Journal

Commission in 2013 This paper was revised from an earlierversion presented at the Research in Adaptive and Conver-gent Systems 2013 (RACS101584013) [40]

References

[1] M Christodorescu and S Jha ldquoTesting malware detectorsrdquo inProceedings of the ACM SIGSOFT International Symposium onSoftware Testing and Analysis (ISSTA rsquo04) pp 34ndash44 July 2004

[2] B Kang T Kim H Kwon Y Choi and E G Im ldquoMal-ware classification method via binary content comparisonrdquoin Proceedings of the ACM Research in Applied ComputationSymposium (RACS 12) pp 316ndash321 San Antonio Tex USAOctober 2012

[3] A Moser C Kruegel and E Kirda ldquoLimits of static analysis formalware detectionrdquo in Proceedings of the 23rd Annual ComputerSecurity Applications Conference (ACSAC rsquo07) pp 421ndash430Miami Beach Fla USA 2007

[4] M D Ernst ldquoStatic and dynamic analysis synergy and dualityrdquoin Proceedings of the ICSE Workshop on Dynamic Analysis(WODA rsquo03) pp 24ndash27 Citeseer 2003

[5] S Cesare and Y Xiang ldquoA fast flowgraph based classificationsystem for packed and polymorphic malware on the endhostrdquoin Proceedings of the 24th IEEE International Conference onAdvanced Information Networking and Applications (AINA rsquo10)pp 721ndash728 Perth Australia April 2010

[6] G Bonfante M Kaczmarek and J-Y Marion ldquoArchitectureof a morphological malware detectorrdquo Journal in ComputerVirology vol 5 pp 263ndash270 2009

[7] I Briones andA Gomez ldquoGraphs entropy and grid computingautomatic comparison of malwarerdquo in Proceedings of the VirusBulletin Conference (VB rsquo08) pp 1ndash12 Ottawa CanadaOctober 2008 httppandalabspandasecuritycomblogsimagesPandaLabs20081007IsmaelBriones-VB2008pdf

[8] S ShangN Zheng J XuM Xu andH Zhang ldquoDetectingmal-ware variants via function-call graph similarityrdquo in Proceedingsof the 5th International Conference on Malicious and UnwantedSoftware (MALWARE rsquo10) pp 113ndash120 Nancy France 2010

[9] J Kinable andO Kostakis ldquoMalware classification based on callgraph clusteringrdquo Journal in Computer Virology vol 7 no 4 pp233ndash245 2011

[10] S M Tabish M Z Shafiq and M Farooq ldquoMalware detectionusing statistical analysis of byte-level file contentrdquo inProceedingsof the ACM SIGKDD Workshop on CyberSecurity and Intelli-gence Informatics (CSI-KDD rsquo09) pp 23ndash31 ACM June 2009

[11] B B Rad andMMasrom ldquoMetamorphic virus variants classifi-cation using opcode frequency histogramrdquo in Proceedings of the14th WSEAS International Conference on Computers pp 147ndash155 Corfu Island Greece July 2010

[12] D Bilar ldquoOpcodes as predictor for malwarerdquo InternationalJournal of Electronic Security and Digital Forensics vol 1 pp156ndash168 2007

[13] K S Han S-R Kim and E G Im ldquoInstruction frequency-based malware classification methodrdquo Information vol 15 no7 pp 2973ndash2984 2012

[14] I Santos F Brezo J Nieves et al ldquoIdea Opcode-sequence-based malware detectionrdquo in Engineering Secure Software andSystems pp 35ndash43 Springer 2010

[15] A H Sung J Xu P Chavez and SMukkamala ldquoStatic analyzerof vicious executables (SAVE)rdquo in Proceedings of the 20th

Annual Computer Security Applications Conference (ACSAC04) pp 326ndash334 December 2004

[16] A Walenstein M Venable M Hayes C Thompson andA Lakhotia ldquoExploiting similarity between variants to defeatmalwarerdquo in Proceedings of the BlackHat DC Conference 2007

[17] G Chowdhury Introduction to Modern Information RetrievalFacet publishing 2010

[18] M Egele C Kruegel E Kirda H Yin and D X SongldquoDynamic spyware analysisrdquo in Proceedings of the UsenixAnnual Technical Conference pp 233ndash246 2007

[19] M Fredrikson S Jha M Christodorescu R Sailer and XYan ldquoSynthesizing near-optimal malware specifications fromsuspicious behaviorsrdquo in Proceeding of the 31st IEEE Symposiumon Security and Privacy (SP 10) pp 45ndash60Oakland Calif USAMay 2010

[20] P Vinod H Jain Y K Golecha M S Gaur and V LaxmildquoMedusa metamorphic malware dynamic analysis using sig-nature from APIrdquo in Proceedings of the 3rd InternationalConference on Security of Information and Networks (SIN rsquo10)pp 263ndash269 ACM September 2010

[21] Q G Miao Y Wang Y Cao X G Zhang and Z L LiuldquoAPICapturemdasha tool for monitoring the behavior of malwarerdquoin Proceedings of the 3rd International Conference on AdvancedComputer Theory and Engineering (ICACTE rsquo10) pp V4-390ndashV4-394 August 2010

[22] K S Han J H Lim B Kang and E G Im ldquoMalware analysisusing visualized images and entropy graphsrdquo InternationalJournal of Information Security 2014

[23] P Trinius T Holz J Gobel and F C Freiling ldquoVisual analysisof malware behavior using treemaps and thread graphsrdquo inProceedings of the 6th International Workshop on Visualizationfor Cyber Security (VizSec rsquo09) pp 33ndash38 Atlantic City NJ USAOctober 2009

[24] J Saxe D Mentis and C Greamo ldquoVisualization of sharedsystem call sequence relationships in large malware corporardquo inProceedings of the 9th International Symposium on Visualizationfor Cyber Security (VizSec rsquo12) pp 33ndash40 ACM October 2012

[25] G Conti E Dean M Sinda and B Sangster ldquoVisual reverseengineering of binary and data filesrdquo in Visualization forComputer Security pp 1ndash17 Springer Berlin Germany 2008

[26] B Anderson C Storlie and T Lane ldquoImproving malwareclassification bridging the staticdynamic gaprdquo in Proceedingsof the 5th ACM Workshop on Security and Artificial Intelligence(AISec 12) pp 3ndash14 Raleigh NC USA October 2012

[27] L Nataraj S Karthikeyan G Jacob and B Manjunath ldquoMal-ware images visualization and automatic classificationrdquo inProceedings of the 8th International Symposium on Visualizationfor Cyber Security p 4 ACM 2011

[28] A Oliva and A Torralba ldquoModeling the shape of the scenea holistic representation of the spatial enveloperdquo InternationalJournal of Computer Vision vol 42 no 3 pp 145ndash175 2001

[29] A Torralba K P Murphy W T Freeman and M A RubinldquoContext-based vision system for place and object recognitionrdquoin Proceedings of the 9th IEEE International Conference onComputer Vision pp 273ndash280 IEEE October 2003

[30] L Nataraj V Yegneswaran P Porras and J Zhang ldquoA compar-ative assessment of malware classification using binary textureanalysis and dynamic analysisrdquo in Proceedings of the 4th ACMWorkshop on Security and Artificial Intelligence (AISec rsquo11) pp21ndash30 2011

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

The Scientific World Journal 15

[31] Y Kang and A Sugimoto ldquoImage categorization and semanticsegmentation using scale-optimized textonsrdquo Journal of ITConvergence Practice vol 2 pp 2ndash14 2014

[32] C EagleThe IDA Pro Book The Unofficial Guide to the WorldrsquosMost Popular Disassembler No Starch Press 2008

[33] O Yuschuk ldquoOllydbgrdquo 2007 httpwwwollydbgde[34] Y Wei Z Zheng and N Ansari ldquoRevealing packed malwarerdquo

IEEE Security and Privacy vol 6 no 5 pp 65ndash69 2008[35] P Royal M Halpin D Dagon R Edmonds and W

Lee ldquoPolyUnpack automating the hidden-code extraction ofunpack-executing malwarerdquo in Proceeding of the 22nd AnnualComputer Security Applications Conference (ACSAC 06) pp289ndash300 Miami Beach Fla USA December 2006

[36] S Berkowits ldquoPinmdashA Dynamic Binary InstrumentationToolrdquo 2012 httpssoftwareintelcomen-usarticlespin-a-dynamic-binary-instrumentation-tool

[37] B Kang K S Han B Kang and E G Im ldquoMalware cate-gorization using dynamic mnemonic frequency analysis withredundancy filteringrdquo 2013

[38] M S Charikar ldquoSimilarity estimation techniques from round-ing algorithmsrdquo in Proceedings of the 34th Annual ACM Sym-posium onTheory of Computing pp 380ndash388 ACM New YorkNY USA 2002

[39] D Androutsos K N Plataniotis and A N VenetsanopoulosldquoNovel vector-based approach to color image retrieval using avector angular-based distance measurerdquo Computer Vision andImage Understanding vol 75 no 1 pp 46ndash58 1999

[40] K S Han J H Lim and E G Im ldquoMalware analysis methodusing visualization of binary filesrdquo in Proceedings of the 2013Research inAdaptive andConvergent Systems pp 317ndash321 ACM2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of