
A Survey of Binary Code Similarity

Irfan Ul Haq
IMDEA Software Institute & Universidad Politécnica de Madrid
[email protected]

Juan Caballero
IMDEA Software Institute
[email protected]

Abstract—Binary code similarity approaches compare two or more pieces of binary code to identify their similarities and differences. The ability to compare binary code enables many real-world applications in scenarios where source code may not be available, such as patch analysis, bug search, and malware detection and analysis. Over the past 20 years numerous binary code similarity approaches have been proposed, but the research area has not yet been systematically analyzed. This paper presents a first survey of binary code similarity. It analyzes 61 binary code similarity approaches, which are systematized on four aspects: (1) the applications they enable, (2) their approach characteristics, (3) how the approaches are implemented, and (4) the benchmarks and methodologies used to evaluate them. In addition, the survey discusses the scope and origins of the area, its evolution over the past two decades, and the challenges that lie ahead.

Keywords—Code diffing, Code search, Cross-architecture, Program executables

I. INTRODUCTION

Binary code similarity approaches compare two or more pieces of binary code, e.g., basic blocks, functions, or whole programs, to identify their similarities and differences. Comparing binary code is fundamental in scenarios where the program source code is not available, which happens with commercial off-the-shelf (COTS) programs, legacy programs, and malware. Binary code similarity has a wide list of real-world applications such as bug search [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], malware clustering [16], [17], [18], malware detection [19], [20], [21], malware lineage [22], [23], [24], patch generation [25], patch analysis [26], [27], [28], [29], [8], [30], [31], porting information across program versions [32], [26], [27], and software theft detection [33].

Identifying binary code similarity is challenging because much of the program semantics is lost during the compilation process, including function names, variable names, source comments, and data structure definitions. Additionally, even when the program source code does not change, the binary code may change if the source is recompiled, due to secondary changes introduced by the compilation process. For example, the resulting binary code can change significantly when using different compilers, changing compiler optimizations, or selecting different target operating systems and CPU architectures. Furthermore, obfuscation transformations can be applied to both the source code and the generated binary code, hiding the original code.

Given its applications and challenges, over the past 20 years numerous binary code similarity approaches have been proposed. However, as far as we know, no systematic survey of this research area exists. Previous surveys deal with binary code obfuscation techniques in packer tools [34], binary code type inference [35], and dynamic malware analysis techniques [36]. Those topics are related because binary code similarity may need to tackle obfuscation, binary code type inference may leverage similar binary analysis platforms, and malware is often a target of binary code similarity approaches. But binary code similarity is well separated from those topics, as evidenced by the lack of overlap between the papers analyzed in those surveys and the papers analyzed here. Other surveys have explored similarity detection on arbitrary binary data, i.e., not specific to code, such as hashing for similarity search [37] and similarity metrics on numerical and binary feature vectors [38], [39]. In contrast, this survey focuses on approaches that compare binary code, i.e., that disassemble the executable byte stream.

This paper presents a first survey of binary code similarity. It first identifies 61 binary code similarity approaches through a systematic selection process that examines over a hundred papers published in research venues from different computer science areas such as computer security, software engineering, programming languages, and machine learning. Then, it systematizes four aspects of those 61 approaches: (1) the applications they enable, (2) their approach characteristics, (3) how the approaches have been implemented, and (4) the benchmarks and methodologies used to evaluate them. In addition, it discusses the scope and origin of binary code similarity, its evolution over the past two decades, and the challenges that lie ahead.

Binary code similarity approaches widely vary in their approach, implementation, and evaluation. This survey systematizes each of those aspects and summarizes the results in easy-to-access tables that compare the 61 approaches across multiple dimensions, allowing beginners and experts to quickly understand their similarities and differences. For example, the approach systematization includes, among others, the number of input pieces of binary code being compared (e.g., one-to-one, one-to-many, many-to-many); the granularity of the pieces of binary code analyzed (e.g., basic blocks, functions, programs); whether the comparison happens at the syntactical representation, the graph structure, or the code semantics; the type of analysis used (e.g., static, dynamic, symbolic); and the techniques used for scalability (e.g., hashing, embedding, indexing). The implementation systematization includes the binary analysis platforms used to build the approach, the programming languages used to code it, the supported architectures for the input pieces of binary code being compared, and whether the approach is publicly released. The evaluation systematization covers the datasets on which the approaches are evaluated and the evaluation methodology, including the type of evaluation (e.g., accuracy, comparison with prior works, performance) and how the robustness of the approach is evaluated in the face of common code transformations such as compiler and compilation option changes, different architectures, and obfuscation.

Beyond the systematization, this survey also discusses how binary code similarity has evolved from binary code diffing to binary code search and how the focus has moved from a single architecture to cross-architecture approaches. It shows that the field remains vibrant, as many new approaches are still being proposed. It discusses technical challenges that remain open, but concludes that the future of the area is bright, with important application scenarios under way such as those related to binary code search engines and the Internet-of-Things.

Paper structure. This paper is organized as follows. Section II provides an overview of binary code similarity. Section III details the scope of the survey and the paper selection process. Section IV summarizes applications of binary code similarity and Section V the evolution of the field over the last two decades. Section VI systematizes the characteristics of the 61 binary code similarity approaches, Section VII their implementation, and Section VIII their evaluation. Finally, we discuss future research directions in Section IX and conclude in Section X.

II. OVERVIEW

In this section, we first provide background on the compilation process (§II-A). Then, we present an overview of the binary code similarity problem (§II-B).

A. Compilation Process

Binary code refers to the machine code that is produced by the compilation process and that can be run directly by a CPU. The standard compilation process takes as input the source code files of a program. It compiles them with a chosen compiler and optimization level and for a specific platform (defined by the architecture, word size, and OS), producing object files. Those object files are then linked into a binary program, either a stand-alone executable or a library.

Binary code similarity approaches typically deal with an extended compilation process, illustrated in Figure 1, which adds two optional steps to the standard compilation process: source code and binary code transformations. Both types of transformations are typically semantics-preserving (i.e., they do not change the program functionality) and are most commonly used for obfuscation, i.e., to hamper reverse engineering of the distributed binary programs. Source code transformations happen pre-compilation. Thus, their input and output are both source code. They can be applied regardless of the target platform, but may be specific to the programming language used to write the program. On the other hand, binary code transformations happen post-compilation. Thus, their input and output are binary code. They are independent of the programming language used, but may be specific to the target platform.

Obfuscation is a fundamental step in malware, but it can also be applied to benign programs, e.g., to protect their intellectual property. There exist off-the-shelf obfuscation tools that use source code transformations (e.g., Tigress [40]), as well as binary code transformations (e.g., packers [41]). Packing is a binary code transformation widely used by malware. Once a new version of a malware family is ready, the malware authors pack the resulting executable to hide its functionality and thus bypass detection by commercial malware detectors. The packing process takes as input an executable and produces another executable with the same functionality, but with the original code hidden (e.g., encrypted as data and unpacked at runtime). The packing process is typically applied many times to the same input executable, creating polymorphic variants of exactly the same source code, which look different to malware detectors. Nowadays, the majority of malware is packed, and malware often uses custom packers for which off-the-shelf unpackers are not available [41].

A main challenge in binary code similarity is that the compilation process can produce different binary code representations for the same source code. An author can modify any of the grey boxes in Figure 1 to produce a different, but semantically equivalent, binary program from the same source code. Some of these modifications may be due to the standard compilation process. For example, to improve program efficiency an author may vary the compiler's optimization level, or change the compiler altogether. Both changes will transform the produced binary code, despite the source code remaining unchanged. An author may also change the target platform to obtain a version of the program suitable for a different architecture. In this case, the produced binary code may radically differ if the new target architecture uses a different instruction set. An author may also deliberately apply obfuscation transformations to produce polymorphic variants of the same source code. The produced variants will typically have the same functionality defined by the original source code. A desirable goal for binary code similarity approaches is to identify the similarity of binary code that corresponds to the same source code having undergone different transformations. The robustness of a binary code similarity approach captures the compilation and obfuscation transformations that it can handle, i.e., the transformations despite which it can still detect similarity.

B. Binary Code Similarity Overview

Binary code similarity approaches compare pieces of binary code. The three main characteristics of binary code similarity approaches are: (1) the type of the comparison (identical, similar, equivalent), (2) the granularity of the pieces of binary code being compared (e.g., instructions, basic blocks, functions), and (3) the number of input pieces being compared (one-to-one, one-to-many, many-to-many). We detail these three characteristics next. For simplicity, we describe the comparison type and comparison granularity for two inputs and then generalize to multiple inputs.

[Fig. 1. The extended compilation process. Dotted boxes are optional code transformations typically used for obfuscation. For a given source code, changing any of the grey boxes may produce a different binary program. The figure shows: source code, optionally passed through source code transformations, is compiled (compiler: GCC, ICC, Clang, MSVC; optimization: O1, O2, O3, Os, Ox; platform: architecture, word size, OS) into object files, which the linker combines into a binary program; optional binary code transformations produce transformed binary programs.]

Comparison type. Two (or more) pieces of binary code are identical if they have the same syntax, i.e., the same representation. The binary code can be represented in different ways, such as a hexadecimal string of raw bytes, a sequence of disassembled instructions, or a control-flow graph. Determining if several pieces of binary code are identical is a Boolean decision (either they are identical or not) that is easy to check: simply apply a cryptographic hash (e.g., SHA256) to the contents of each piece. If the hash is the same, the pieces are identical. However, such a straightforward approach fails to detect similarity in many cases. For example, compiling the same program source code twice, using the same compilation parameters (i.e., same compiler version, same optimization level, same target platform), produces two executables with different file hashes. This happens because the executable may include metadata that differs across compilations, such as the compilation date, which is automatically computed and included in the header of the generated executable.
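As a concrete illustration of the identical-comparison check just described, here is a minimal Python sketch (the file paths are hypothetical build outputs) that hashes two files and reports whether they are byte-identical:

```python
import hashlib

def sha256_of(path):
    """Return the SHA256 hex digest of a file's raw bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical outputs of compiling the same source twice with identical
# settings; embedded metadata such as a timestamp can still make the
# digests differ even when the code sections are the same.
if sha256_of("build1/prog") == sha256_of("build2/prog"):
    print("byte-identical")
else:
    print("not identical (though possibly equivalent or similar)")
```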

Two pieces of binary code are equivalent if they have the same semantics, i.e., if they offer exactly the same functionality. Equivalence does not care about the syntax of the binary code. Clearly, two identical pieces of binary code will have the same semantics, but different pieces of binary code may have the same semantics as well. For example, mov eax,0 and xor eax,eax are semantically equivalent x86 instructions because both set the value of register EAX to zero. Similarly, the same source code compiled for two different target architectures should produce equivalent executables, whose syntax may be completely different if the architectures use different instruction sets. Proving that two arbitrary programs are functionally equivalent is an undecidable problem that reduces to solving the halting problem [42]. In practice, determining binary code equivalence is a very expensive process that can only be performed for small pieces of binary code.

Two pieces of binary code can be considered similar if their syntax, structure, or semantics are similar. Syntactic similarity compares the code representation. For example, clone detection approaches consider that a target piece of binary code is a clone of some source binary code if their syntax is similar. Structural similarity compares graph representations of binary code (e.g., control-flow graphs, callgraphs). It sits between syntactic and semantic similarity. The intuition is that the control flow of the binary code captures, to some extent, its semantics, e.g., the decisions taken on the data. Furthermore, the graph representation captures multiple syntactic representations of the same functionality. However, it is possible to modify the graph structure without affecting the semantics, e.g., by inlining functions. Semantic similarity compares the code functionality. A simple approach to semantic similarity compares the interaction between the program and its environment through OS APIs or system calls. But two programs with similar system calls can perform significantly different processing on their output, so more fine-grained semantic similarity approaches focus on a syntax-independent comparison of the code.

Generally speaking, the more robust an approach is, i.e., the more transformations it can capture, the more expensive it also is. Syntactic similarity approaches are cheapest to compute, but least robust. They are sensitive to simple changes in the binary code, e.g., register reallocation, instruction reordering, or replacing instructions with semantically equivalent ones. Structural similarity sits in the middle. It is robust against multiple syntactical transformations, but sensitive to transformations that change the code structure, such as code inlining or removal of unused function parameters. Semantic similarity is robust against semantics-preserving transformations, despite changes to the code syntax and structure, but it is very expensive to compute for large pieces of binary code.

[Fig. 2. Similarity at a finer granularity can be used to infer a different type of similarity at a coarser granularity. The figure maps the comparison types (identical, equivalent, similar) at a finer granularity to those at a coarser granularity.]

Comparison granularity. Binary code similarity approaches can be applied at different granularities. Common granularities are instructions; basic blocks; functions; and whole programs. To perform a comparison at a coarser granularity, some approaches use a different comparison at a finer granularity and then combine the finer granularity results. For example, to compare whether two programs are similar, an approach could determine the fraction of identical functions between both programs. Thus, we differentiate between the input granularity, i.e., the granularity of the input pieces of binary code that the approach compares, and the approach granularities, i.e., the granularities of the different comparisons in the approach.
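To make the fraction-of-identical-functions example concrete, here is a minimal sketch, assuming each function has already been extracted as its (normalized) instruction bytes:

```python
import hashlib

def program_similarity(funcs_a, funcs_b):
    """Toy coarser-granularity comparison: two programs are similar to
    the degree that the functions of A are byte-identical to some
    function of B. Each function is represented by a hash of its bytes."""
    hashes_b = {hashlib.sha256(f).hexdigest() for f in funcs_b}
    if not funcs_a:
        return 0.0
    matched = sum(1 for f in funcs_a
                  if hashlib.sha256(f).hexdigest() in hashes_b)
    return matched / len(funcs_a)

# Two of A's three functions also appear, byte for byte, in B.
a = [b"\x55\x89\xe5", b"\x31\xc0\xc3", b"\x90\xc3"]
b = [b"\x55\x89\xe5", b"\x31\xc0\xc3", b"\xcc"]
print(program_similarity(a, b))  # 0.666...
```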

Applying a specific comparison at a finer granularity may restrict the type of comparison that can be performed at a coarser granularity, as illustrated in Figure 2. The figure shows that computing whether two pieces of binary code are identical at a finer granularity (e.g., basic block) can be used to compute that the coarser granularity pieces that encompass them (e.g., their functions) are identical, equivalent, or similar. However, similarity at a finer granularity cannot be used to infer that the coarser granularity code is equivalent or identical. For example, when comparing two functions, just because all their basic blocks are similar, it cannot be concluded that the functions are identical or equivalent. On the other hand, similarity is the most general type of comparison, and any finer granularity comparison type can be used to infer it.

Number of inputs. Binary code similarity approaches can compare two or more pieces of binary code. Those that compare more than two pieces can further be split into comparing one piece to the rest or comparing each piece to all other pieces. Thus, we identify three types of approaches based on the number of inputs and how they are compared: one-to-one (OO), one-to-many (OM), and many-to-many (MM). The source of the input pieces is application-dependent. They may come from the same program version (e.g., two functions of the same executable), from two versions of the same program, or from two different programs.

One-to-one approaches compare an original piece of binary code (also called source, old, plaintiff, or reference) to a target piece of binary code (also called new, patched, or upgrade). Most OO approaches perform binary code diffing, i.e., they diff two consecutive, or close, versions of the same program to identify what was added, removed, or modified in the target (patched) version. The granularity of binary code diffing is most often functions, and the diffing tries to obtain a mapping between a function in the original program version and another function in the target program version. Added functions are target functions that cannot be mapped to an original function; removed functions are original functions that cannot be mapped to a target function; and modified functions are mapped functions that are not identical.
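The following toy sketch illustrates this mapping-based view of binary code diffing. It assumes the function mapping is already given by shared identifiers (e.g., symbol names, when available); real diffing tools must instead compute the mapping, e.g., via callgraph or CFG matching:

```python
def diff_functions(orig, target):
    """Toy function-granularity diff. Inputs map a function identifier
    to a hash of the function's (normalized) instructions."""
    added    = [f for f in target if f not in orig]       # only in target
    removed  = [f for f in orig if f not in target]       # only in original
    modified = [f for f in orig
                if f in target and orig[f] != target[f]]  # mapped, changed
    return added, removed, modified

old = {"parse": "a1", "render": "b2", "auth": "c3"}
new = {"parse": "a1", "render": "ff", "log": "d4"}
print(diff_functions(old, new))  # (['log'], ['auth'], ['render'])
```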

One-to-many approaches compare a query piece of binary code to many target pieces of binary code. Most OM approaches perform binary code search, i.e., they search whether the query piece is similar to any of the target pieces and return the top k most similar target pieces of binary code. The target pieces may come from multiple versions of the same program (different from the version the query piece comes from), from different programs compiled for the same architecture, or from programs compiled for different architectures.

In contrast to OO and OM approaches, many-to-many approaches do not distinguish between source and target pieces. All input pieces are considered equal and compared against each other. These approaches typically perform binary code clustering, i.e., they output groups of similar pieces of binary code called clusters.

III. SCOPE & PAPER SELECTION

To keep our survey of the state of the art focused and manageable, it is important to define what is, and what is not, within scope. Overall, the main restriction is that we focus on works that compare binary code. This restriction, in turn, introduces the following four constraints:

1) We exclude approaches that require access to the source code, namely source-to-source (e.g., [43]) and source-to-binary (e.g., [44]) similarity approaches.

2) We exclude approaches that operate on bytecode (e.g., [45], [46]).

3) We exclude behavioral approaches that compare similarity exclusively on the interaction of a program with its environment through system calls or OS API calls (e.g., [47], [48], [49], [50]).

4) We exclude approaches that consider binary code as a sequence of raw bytes with no structure, such as file hashes (e.g., [51]), fuzzy hashes (e.g., [52], [53]), and signature-based approaches (e.g., [54], [55]). Approaches need to disassemble raw bytes into instructions to be considered. While we do not include the papers describing byte-level approaches, we do examine the use of some of those techniques (e.g., fuzzy hashing) by the analyzed approaches.

In addition, we introduce the following constraints to keep the scope of the survey manageable:

5) We limit the survey to papers published in peer-reviewed venues and technical reports from academic institutions. Thus, we do not analyze tools, but rather the research works describing their approach (e.g., [26], [27] for BINDIFF).

6) We exclude papers that do not propose a new binary code similarity approach or technique, but simply apply off-the-shelf binary code similarity tools as a step towards their goal.

Paper selection. To identify candidate papers, we first systematically examined all papers published in the last 20 years in 14 top venues for computer security and software engineering: IEEE S&P, ACM CCS, USENIX Security, NDSS, ACSAC, RAID, ESORICS, ASIACCS, DIMVA, ICSE, FSE, ISSTA, ASE, and MSR. Not all relevant binary code similarity approaches have been published in those venues, which is especially true for early approaches. To identify candidate papers in other venues, we extensively queried specialized search engines such as Google Scholar using terms related to binary code similarity and its applications, e.g., code search, binary diffing, bug search. We also carefully examined the references of the candidate papers for any further papers we may have missed. This exploration identified over a hundred candidate papers. We then read each candidate paper to determine if it proposed a binary code similarity approach that satisfied the above scope constraints.

In the end, we identified the 61 binary code similarity research works in Table I, whose approaches are systematized. The first three columns of Table I capture the name of the approach, the year of publication, and the venue where the work was published. The research works are sorted by publication date, creating a timeline of the development of the field. Papers are identified by their system name, if available, otherwise by the initials of each author's last name and the year of publication. The 61 papers have been published in 37 venues. Binary code similarity is quite multidisciplinary; while most papers appear in computer security venues (36 papers in 20 venues), there are works in software engineering (13 papers in 8 venues), systems (6 papers in 4 venues), and machine learning (2 papers in 2 venues). The venues with the most binary code similarity papers are: DIMVA (6), ASE (4), CCS (3), USENIX Security (3), and PLDI (3).

IV. APPLICATIONS

This section motivates the importance of binary code similarity by describing the applications it enables. Of the 61 papers analyzed, 36 demonstrate an application, i.e., present a quantitative evaluation, or case studies, of at least one application. The other 23 papers present generic binary code similarity capabilities that can be used for multiple applications, such as binary diffing tools (e.g., [84], [85], [86]), binary code search platforms (e.g., [72], [62], [71]), and binary clone detection approaches (e.g., [74], [73], [56], [65]). Table II summarizes the eight applications identified.


TABLE I. Comparison among binary code similarity approaches. For Boolean columns, ✓ means supported and ✗ unsupported. Input comparison can be one-to-one (OO), one-to-many (OM), or many-to-many (MM). Input granularity and approach granularities can be instruction (I), basic block (B), function (F), or program (P); an asterisk (*) denotes a set of related items and T denotes a trace. Approach comparison can be similar (S), identical (I), or equivalent (E). Structural similarity can use CFG (C), ICFG (I), callgraph (G), and other custom graphs (O). Machine learning can be supervised (S) or unsupervised (U). In normalization, ✗ means no normalization, � operand removal, • operand normalization, ◦ mnemonic normalization, and ? code elimination.

Approach | Year | Venue | Input comp. | Approach comp. | Input gran. | Approach gran. | Syntactic | Semantic | Structural | Feature-based | ML | LSH | Cross-arch | Static | Dynamic | Dataflow | Normalization
EXEDIFF [25] | 1999 | WCSSS | OO | I | P | I | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | •
BMAT [32] | 1999 | FDO2 | OO | S,I | P | F,B | ✓ | ✗ | C | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | �•◦
F2004 [26] | 2004 | DIMVA | OO | S | P | F | ✗ | ✗ | C,G | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗
DR2005 [27] | 2005 | SSTIC | OO | S,I | P | F,B,I | ✓ | ✗ | C,G | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | �•
KKMRV2005 [19] | 2005 | RAID | MM | S | P | B* | ✗ | ✓ | I | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | �•
BMM2006 [20] | 2006 | DIMVA | OO | S | P | B* | ✗ | ✓ | I | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | �•
BINHUNT [28] | 2008 | ICISC | OO | S,E | P | F,B | ✗ | ✓ | C,G | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗
SWPQS2006 [56] | 2009 | ISSTA | MM | S,I | I* | I* | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | •
SMIT [16] | 2009 | CCS | OM | S,I | P | F | ✓ | ✗ | G | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗
IDEA [57] | 2010 | ESSoS | MM | S | P | I* | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | •
MBC [58] | 2012 | RACS | MM | S | P | B | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | •
IBINHUNT [59] | 2012 | ICISC | OO | S,E | P | B | ✗ | ✓ | I | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗
BEAGLE [22] | 2012 | ACSAC | MM | S | P | B* | ✗ | ✓ | C | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | •
BINHASH [60] | 2012 | ICMLA | MM | E | F | B | ✗ | ✓ | ✗ | ✓ | U | ✓ | ✗ | ✓ | ✗ | ✓ | •
BINJUICE [42] | 2013 | PPREW | OO | S,E | P | F,B | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗
BINSLAYER [61] | 2013 | PPREW | OO | S | P | F,B | ✗ | ✗ | C,G | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | �•
RENDEZVOUS [62] | 2013 | MSR | OM | S | F | F | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | �•
MUTANTX-S [17] | 2013 | USENIX ATC | MM | S | P | I* | ✓ | ✗ | ✗ | ✓ | U | ✗ | ✗ | ✓ | ✗ | ✗ | �•
EXPOSE [63] | 2013 | COMPSAC | OM | S,E | P | F,I* | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | •
ILINE [23] | 2013 | USENIX Sec | MM | S | P | B,I* | ✓ | ✗ | ✗ | ✓ | U | ✗ | ✗ | ✓ | ✓ | ✗ | •◦?
LKI2013 [64] | 2013 | RACS | OO | S | P | F,I* | ✗ | ✗ | C,G | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | �•
TRACY [1] | 2014 | PLDI | OM | S,E | F | I* | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ?
BINCLONE [65] | 2014 | SERE | MM | S,I | I* | I* | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | •
RMKNHLLP2014 [66] | 2014 | DIMVA | MM | S | F* | F | ✗ | ✗ | ✗ | ✓ | U | ✗ | ✗ | ✓ | ✗ | ✓ | ✗
CXZ2014 [21] | 2014 | TDSC | OM | S | P | F | ✗ | ✗ | C | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗
BLEX [67] | 2014 | USENIX Sec | OO | S | F | F | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗
COP [33], [68] | 2014 | ESEC/FSE | OO | S,E | P | F,B | ✗ | ✓ | C | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗
TEDEM [2] | 2014 | ACSAC | OM | S | B* | B | ✗ | ✓ | C | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗
SIGMA [69] | 2015 | DFRWS | OO | S | F | F | ✗ | ✗ | O | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | �•
MXW2015 [24] | 2015 | IFIP SEC | OO | E | P | B | ✗ | ✓ | I | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | •?
MULTI-MH [3] | 2015 | S&P | OM | S | B* | B | ✗ | ✓ | C | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗
QSM2015 [70] | 2015 | SANER | OO | I | F | I* | ✗ | ✗ | O | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | •?
DISCOVRE [4] | 2016 | NDSS | OM | S | F | B | ✗ | ✗ | C | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗
MOCKINGBIRD [29] | 2016 | SANER | OM | S | F | F | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗
ESH [5] | 2016 | PLDI | OM | E | F | I* | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗
TPM [71] | 2016 | TrustCom | OO | S | P | F | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗
BINDNN [72] | 2016 | SecureComm | OM | S | F | F | ✗ | ✗ | ✗ | ✗ | S | ✗ | ✓ | ✓ | ✗ | ✗ | •
GENIUS [6] | 2016 | CCS | OM | S | F | B | ✗ | ✗ | C | ✓ | U | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
BINGO [7] | 2016 | FSE | OM | S | F | B*,I* | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ?
KLKI2016 [18] | 2016 | JSCOMPUT | OO | S | P | F | ✗ | ✗ | G | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗
KAM1N0 [73] | 2016 | SIGKDD | OM | S | B* | B | ✓ | ✗ | C | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | •
BINSEQUENCE [8] | 2017 | ASIACCS | OM | S | F | B,I | ✓ | ✗ | C | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | •
XMATCH [9] | 2017 | ASIACCS | OM | S | F | I* | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗
CACOMPARE [74] | 2017 | ICPC | OM | S | F | F | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
SPAIN [30] | 2017 | ICSE | OO | S,I | P | F,B | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | •
BINSIGN [75] | 2017 | IFIP SEC | OM | S | F | F | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | •
GITZ [10] | 2017 | PLDI | OM | E | F | I* | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗
BINSHAPE [76] | 2017 | DIMVA | OM | S | F | F | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | •
BINSIM [77] | 2017 | USENIX Sec | OO | S | T | I* | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗
KS2017 [31] | 2017 | ASE | OM | S | T | I* | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗
IMF-SIM [78] | 2017 | ASE | OO | S | F | F | ✗ | ✓ | ✗ | ✓ | S | ✗ | ✗ | ✗ | ✓ | ✓ | ✗
GEMINI [12] | 2017 | CCS | OM | S | F | F | ✗ | ✗ | C | ✓ | S | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
FOSSIL [79] | 2018 | TOPS | OM | S | F | F,B* | ✗ | ✓ | C | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | •
FIRMUP [13] | 2018 | ASPLOS | OM | E | F | I* | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | •
BINARM [14] | 2018 | DIMVA | OM | S | F | F | ✗ | ✗ | C | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | •
αDIFF [15] | 2018 | ASE | OO | S | P | F | ✗ | ✗ | ✗ | ✗ | S | ✗ | ✓ | ✓ | ✗ | ✗ | ✗
VULSEEKER [11] | 2018 | ASE | OM | S | F | F | ✗ | ✗ | C | ✓ | S | ✗ | ✓ | ✓ | ✗ | ✓ | ✗
RLZ2019 [80] | 2019 | BAR | OM | S | B | B | ✗ | ✗ | ✗ | ✗ | S | ✗ | ✓ | ✓ | ✗ | ✗ | •
INNEREYE [81] | 2019 | NDSS | OM | S | B* | B | ✗ | ✗ | ✗ | ✗ | S | ✓ | ✓ | ✓ | ✗ | ✗ | •
ASM2VEC [82] | 2019 | S&P | OM | S | F | I* | ✗ | ✗ | ✗ | ✗ | S | ✗ | ✗ | ✓ | ✗ | ✗ | ✗
SAFE [83] | 2019 | DIMVA | OM | S | F | F | ✗ | ✗ | ✗ | ✗ | S | ✗ | ✓ | ✓ | ✗ | ✗ | ✗


TABLE II. Applications evaluated in the analyzed binary code similarity research works.

Application | Research works
Bug search | TRACY [1], TEDEM [2], MULTI-MH [3], DISCOVRE [4], ESH [5], GENIUS [6], BINGO [7], BINSEQUENCE [8], XMATCH [9], GITZ [10], GEMINI [12], FIRMUP [13], BINARM [14], αDIFF [15], VULSEEKER [11]
Malware clustering | SMIT [16], MUTANTX-S [17], KLKI2016 [18]
Malware detection | KKMRV2005 [19], BMM2006 [20], CXZ2014 [21]
Malware lineage | BEAGLE [22], ILINE [23], MXW2015 [24]
Patch analysis | F2004 [26], DR2005 [27], BINHUNT [28], MOCKINGBIRD [29], BINSEQUENCE [8], SPAIN [30], KS2017 [31]
Patch generation | EXEDIFF [25]
Porting information | BMAT [32], F2004 [26], DR2005 [27]
Software theft detection | COP [33]

Most of the 36 papers demonstrate a single application, although a few (e.g., F2004, BINSEQUENCE) demonstrate multiple. One property of an application is whether it compares different versions of the same program (patch analysis, patch generation, porting information, malware lineage), different programs (malware clustering, malware detection, software theft detection), or can be applied to both cases (bug search). Next, we detail those applications.

1) Bug search – Arguably the most popular application of binary code similarity is finding a known bug in a large repository of target pieces of binary code [2], [3], [4], [6], [11], [12], [9], [8], [7], [15], [5], [10], [13], [14], [1]. Due to code reuse, the same code may appear in multiple programs, or even in multiple parts of the same program. Thus, when a bug is found, it is important to identify similar code that may have reused the buggy code and contain the same, or a similar, bug. Bug search approaches take as input a query buggy piece of binary code and search for similar pieces of binary code in a repository. A variant of this problem is cross-platform bug search, where the target pieces of binary code in the repository may be compiled for different platforms (e.g., x86, ARM, MIPS) [3], [4], [6], [12], [9], [7], [15], [10], [13].

2) Malware detection – Binary code similarity can be used to detect malware by comparing a given executable to a set of previously known malware samples. If the similarity is high, then the input sample is likely a variant of a previously known malware family. Many malware detection approaches are purely behavioral, comparing system or API call behaviors (e.g., [47], [48]). However, as described in Section III, we focus on approaches that use binary code similarity [19], [20], [21].

3) Malware clustering – An evolution of malware detection is clustering similar, known malicious, executables into families. Each family cluster should contain executables from the same malicious program, which can be different versions of the malicious program, as well as polymorphic (e.g., packed) variants of a version. Similar to malware detection, we focus on approaches that compare binary code [16] and exclude purely behavioral approaches based on system calls and network traffic (e.g., [50], [87], [88]).

4) Malware lineage – Given a set of executables known to belong to the same program, lineage approaches build a graph where nodes are program versions and edges capture the evolution of the program across versions. Lineage approaches are most useful with malware because version information is typically not available [22], [23], [24], [89]. Since the input samples should belong to the same family, malware lineage often builds on the results of malware clustering.

5) Patch generation and analysis – The earliest binary code similarity application, and one of the most popular, is to diff two consecutive, or close, versions of the same program to identify what was patched in the newer version. This is most useful with proprietary programs where the vendor does not disclose patch details. The diffing produces small binary code patches that can be efficiently shipped to update the program [25]. It can also be used to automatically identify security patches that fix vulnerabilities [30], analyze those security patches [26], [27], [28], [29], [31], [8], and generate an exploit for the old vulnerable version [90].

6) Porting information – Binary code similarity can be used for porting information between two close versions of the same program. Since close versions typically share a large amount of code, the analysis done for one version may be largely reusable for a newer version. For example, early binary code similarity approaches ported profiling data [32], [91] and analysis results obtained during malware reverse engineering [26], [27].

7) Software theft detection – Binary code similarity can be used to identify unauthorized reuse of code from a plaintiff's program, such as the source code being stolen, its binary code being reused, a patented algorithm being reimplemented without license, or the license being violated (e.g., GPL code in a commercial application). Early approaches for detecting such infringements used software birthmarks, i.e., signatures that capture inherent functionality of the plaintiff program [92], [93]. However, as described in Section III, we exclude signature-based approaches and focus on approaches using binary code similarity [33].

V. BINARY CODE SIMILARITY EVOLUTION

This section describes the origins of binary code similarity and its evolution over the last 20 years, highlighting some noteworthy approaches.

The origins. The origins of binary code similarity are in the problem of generating a patch (or delta) that captures the differences between two consecutive (or close) versions of the same program. Text diffing tools had existed since the 1970s (the popular UNIX diff was released in 1974) and had been integrated into early source code versioning systems such as SCCS [94] (1975) and RCS [95] (1985). The increasing popularity of low-bandwidth communication networks (e.g., wireless) and the limited resources of some devices raised interest in techniques that would increase efficiency by transmitting a small patch that captured the differences between two versions of a binary, i.e., non-text, file, instead of transmitting the whole file. In 1991, Reichenberger proposed a diffing technique for generating patches between arbitrary binary files without any knowledge about the file structure [96]. The approach worked at the byte level, instead of the line-level granularity of text diffing, and efficiently identified the byte sequences that should appear in the patch because they only appeared in the updated version and thus were not available in the original file to be patched. Several tools for generating and applying binary patches soon appeared, such as RTPATCH [97], BDIFF95 [98], and XDELTA [99]. Those tools worked at the byte level and could diff any type of file.

The first binary code similarity approaches are from 1999. That year, Baker et al. proposed an approach for compressing differences of executable code and built a prototype diffing tool called EXEDIFF [25], which generated patches for DEC Alpha executables. Their intuition was that many of the changes when diffing two executable versions of the same program represent secondary changes due to the compilation process, as opposed to direct changes in the source code. One example they mentioned is register allocation, which may change at recompilation. Another example they mentioned is that code added in the newer version displaces parts of the old code, so the compilation process has to adjust pointer values in the displaced code to point to the correct addresses. Their idea was to reconstruct secondary changes at patch time, so that they would not need to be included in the patch, reducing the patch size. EXEDIFF is the earliest approach we have identified that focused on computing similarity between binary code, taking advantage of the code structure by disassembling the raw bytes into instructions.

Also in 1999, Wang et al. presented BMAT [32], a tool that aligned two versions of a Windows DLL executable to propagate profile information from the older (extensively profiled) version to a newer version, thus reducing the need for re-profiling. Their approach is the first to compare functions and basic blocks (EXEDIFF compared two sequences of instructions). It first matched functions in the two executables and then matched similar blocks within each matched function. It used a hashing technique to compare blocks. The hashing removed relocation information to handle pointer changes, but was order-sensitive.

The first decade. After the initial works of EXEDIFF and BMAT, we identify only 7 binary code similarity approaches in the next decade (2000-2009). However, some of these are highly influential, as they extend binary code similarity from purely syntactical to also include semantics; they widen the scope from binary code diffing (OO) to also include binary code clustering (MM) and binary code search (OM); and they apply binary code similarity to malware.

In 2004, Thomas Dullien (alias Halvar Flake) proposed a graph-based binary code diffing approach that focused on the structural properties of the code by heuristically constructing a callgraph isomorphism that aligns functions in two versions of the same binary program [26]. This is the first approach to handle instruction reordering inside a function introduced by some compiler optimizations. A follow-up work [27] extended the approach to also match basic blocks inside matched functions (as done in BMAT) and introduced the Small Primes Product (SPP) hash to identify similar basic blocks despite instruction reordering. These two works are the basis for the popular BINDIFF binary code diffing plugin for the IDA disassembler [84].

In 2005, Kruegel et al. proposed a graph coloring technique to detect polymorphic malware variants. This is the first approach to perform semantic similarity and MM comparison. They categorized instructions with similar functionality into 14 semantic classes. Then, they colored the inter-procedural control-flow graph (ICFG) using those classes. The graph coloring is robust against syntactical obfuscations such as junk insertion, instruction reordering, and instruction replacement. In 2008, Gao et al. proposed BINHUNT [28] to identify semantic differences between two versions of the same program. This is the first approach that checks code equivalence. It uses symbolic execution and a constraint solver to check whether two basic blocks provide the same functionality.

In 2009, Xin et al. presented SMIT [16], an approach that, given a malware sample, finds similar malware in a repository. SMIT is the first OM and binary code search approach. It indexes malware callgraphs in a database and uses graph edit distance to find malware with similar callgraphs.

The last decade. The last decade (2010-2019) has seen a huge increase in the popularity of binary code similarity, with 52 approaches identified. The focus in this decade has been on binary code search approaches, with an emphasis since 2015 on its cross-architecture version (16 approaches), and in recent years on machine learning approaches. In 2013, Wei et al. proposed RENDEZVOUS, a binary code search engine that, given the binary code of a query function, finds other functions in a repository with similar syntax and structural properties. Reducing the search granularity from whole programs (SMIT) to smaller pieces of binary code such as functions enables an array of applications such as clone detection and bug search.

Most binary code search approaches target the bug search application. This application was first addressed on source code in 2012 by Jang et al. [100]. In 2014, David et al. proposed TRACY [1], the first binary code search approach focused on bug search. TRACY used the concept of tracelets, an execution path in a CFG, to find functions similar to a vulnerable function. In 2015, Pewny et al. presented MULTI-MH [3], the first cross-architecture binary code search approach. MULTI-MH indexed functions by their input-output semantics. Given a function compiled for one CPU architecture (e.g., x86), MULTI-MH can find similar functions compiled for other architectures (e.g., MIPS). This problem quickly gained traction due to the popularity of embedded devices. In 2016, Lageman et al. [72] trained a neural network to decide if two functions were compiled from the same source code. The use of deep learning has picked up in the last two years, e.g., αDIFF (2018), INNEREYE (2019), and ASM2VEC (2019).

VI. APPROACHES

In this section, we systematize the approaches of binary code similarity, describing the Approach Characteristics columns in Table I. We recommend that the reader print Table I on a separate page to have it at hand while reading this section.

A. Comparison Type

Columns: Input Comparison; Approach Comparison

This section discusses the type of comparison between the approach inputs, as well as the finer granularity comparisons that may be used by the approach.

Page 8: A Survey of Binary Code SimilarityA Survey of Binary Code Similarity Irfan Ul Haq IMDEA Software Institute & Universidad Polit´ecnica de Madrid irfanul.haq@imdea.org Juan Caballero

Input comparison. All 61 works analyzed compare their inputs to identify similarity. That is, no approach identifies identical inputs (since a hash suffices for that) or input equivalence, since equivalence is an undecidable problem [101] that can only be solved efficiently for small pieces of binary code. Thus, we classify binary code similarity approaches based on their input comparison as: one-to-one (OO, 21 approaches), one-to-many (OM, 30 approaches), and many-to-many (MM, 10 approaches). The dominance of OM approaches is due to the high interest in binary code search in the last decade.

It is always possible to build an OM or MM approach from an OO approach. For example, a simple implementation of an OM approach is to compare the given query piece of binary code with each of the n targets using an OO approach that returns the similarity between both inputs. Then, simply rank the n targets by decreasing similarity and return the top k entries, or the entries above a similarity threshold. However, most OM approaches avoid this simple implementation since it is inefficient. The two main solutions to improve performance are extracting a feature vector from each input and storing the target pieces of binary code in a repository with indices. Obtaining a feature vector for each input makes it possible to perform the feature extraction only once per input. This offers significant benefits when the feature extraction is expensive, e.g., in the case of BLEX, whose feature extraction requires executing a piece of binary code multiple times with different inputs. Once the feature vectors have been extracted, a similarity metric between two feature vectors is used. This similarity metric is typically cheap to compute, as feature vectors are often numerical or Boolean. A common source of confusion is that some approaches propose similarity metrics, while others propose distance metrics. It is important to keep in mind that when the metrics are normalized between zero and one, the distance is simply one minus the similarity. The other solution used by OM approaches is adding indices on a subset of the features in the feature vector. The indices can be used to reduce the number of comparisons by applying the similarity metric only between the feature vector of the input piece of binary code and the feature vectors of selected targets more likely to be similar.
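A minimal sketch of the feature-vector solution just described, assuming features have already been extracted into one numerical vector per piece of binary code; cosine similarity is used here for illustration, but any normalized metric works, with distance equal to one minus similarity:

```python
import numpy as np

def top_k_similar(query, targets, k=5):
    """Rank target feature vectors by cosine similarity to the query;
    feature extraction is done once per input, ahead of time."""
    q = query / np.linalg.norm(query)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = t @ q                    # one cosine similarity per target
    order = np.argsort(-sims)[:k]   # indices of the k most similar
    return order, sims[order]

rng = np.random.default_rng(0)
repo = rng.random((1000, 64))       # 1000 pre-extracted feature vectors
idx, scores = top_k_similar(repo[42], repo, k=3)
print(idx, scores)                  # entry 42 ranks first (similarity 1.0)
```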

Approach comparison. Most approaches use a single type of comparison: similarity (42 approaches), equivalence (5), and identical (2). Note that even if only one type of comparison is used in the approach, it may differ from the input comparison. For example, EXEDIFF looks for identical instructions in the process of diffing two programs. There are 12 approaches that use multiple comparison types at different granularities. Of those, six use identical comparisons at finer granularities to quickly identify the same pieces of binary code (BMAT, DR2005, SWPQS2006, BINCLONE) or to reduce expensive comparisons such as graph isomorphism (SMIT, SPAIN). The other six use equivalence comparisons at finer granularities to capture semantic similarity.

B. Granularity

Columns: Input Granularity; Approach Granularities

We separate the input granularity from the granularities of the pieces of binary code compared in the approach (i.e., the approach granularities), since it is common to use finer granularities (e.g., functions) to compare a coarser input granularity (e.g., whole programs).

We have identified 8 comparison granularities: instruction (I), set of related instructions (I*), basic block (B), set of related basic blocks (B*), function (F), set of related functions (F*), trace (T), and whole program (P). Instruction, basic block, function, and whole program are standard granularities that require no explanation. Related instructions (I*) are either consecutive (e.g., n-grams) or share a property (e.g., are data-dependent). They may belong to different basic blocks, and even to different functions. For example, TRACY groups instructions that collectively impact an output variable. Related basic blocks (B*) share structural properties (e.g., graphlets in a CFG) or belong to the same execution path. Basic blocks in a set may belong to the same or multiple functions. Related functions (F*) implement a program component such as a library, a class, or a module. Trace granularity compares the execution traces of two binary programs on the same input.

The most common input granularity is function (26 approaches), followed by whole program (25) and related basic blocks (4). Whole program is the preferred input granularity for OO approaches (16/21), since most binary code diffing approaches try to establish a one-to-one mapping between all functions in the input programs, and also for MM approaches (7/10), which tend to cluster input programs. On the other hand, function is the preferred granularity for binary code search approaches (21/30). Another four binary code search approaches use B* to identify code reuse that covers only a subset of a function or crosses function boundaries.

The most common approach granularity is function (30 approaches), followed by basic block (20). The majority of approaches (47/61) use different input and approach granularities, i.e., they use finer approach granularities to compare a coarser input granularity. Most approaches with the same input and approach granularity perform function searches (12/14). The 11 approaches that perform equivalence comparisons do so at fine granularities due to its low efficiency: six have B granularity, five I*, and one F. Some approaches accumulate features at a fine granularity that are never directly compared and thus do not show in the approach granularities column. For instance, GEMINI accumulates basic block features to generate a numerical vector at function granularity. Thus, only functions are compared.

C. Syntactic Similarity

Column: Syntactic similarity

Syntactic approaches capture similarity of the code representation; more specifically, they compare sequences of instructions. Most commonly, the instructions in a sequence are consecutive in the virtual address space and belong to the same function. The instructions in the sequence may first be normalized, e.g., considering only the mnemonic, only the opcode, or normalizing the operands into classes. We detail instruction normalization in Section VI-J and simply refer to instruction sequences in the rest of this subsection.

The instruction sequences may be of fixed or variable length. Fixed-size sequences are obtained by sliding a window over the instruction stream, e.g., over the linearly-ordered instructions in a function. This process is characterized by the window size, i.e., the number of instructions in the sequence, and the stride, i.e., the number of instructions to slide the start of the window to produce the next sequence. When the stride is smaller than the window size, consecutive sequences overlap. When the stride is one, the resulting sequence is called an n-gram. For example, given the sequence of instruction mnemonics {mov, push, add}, two 2-grams will be extracted: {mov, push} and {push, add}. There are 7 works that use n-grams: IDEA, MBC, RENDEZVOUS, MUTANTX-S, EXPOSE, ILINE, and KAM1N0. Fixed-size sequences are also used by SWPQS2006 with a configurable stride larger than one. RENDEZVOUS, in addition to n-grams, also uses n-perms: unordered n-grams that capture instruction reordering within the sequence. An n-perm may capture multiple n-grams, e.g., the 2-perm {mov, push} captures the 2-grams {mov, push} and {push, mov}.
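A minimal sketch of n-gram and n-perm extraction over a mnemonic sequence (window size n, configurable stride), matching the examples above:

```python
def ngrams(mnemonics, n, stride=1):
    """Fixed-size windows over a linear mnemonic sequence; stride 1
    yields overlapping n-grams, larger strides yield sparser windows."""
    return [tuple(mnemonics[i:i + n])
            for i in range(0, len(mnemonics) - n + 1, stride)]

def nperms(mnemonics, n):
    """Order-insensitive n-grams: sorting each window makes
    {mov, push} match both (mov, push) and (push, mov)."""
    return [tuple(sorted(g)) for g in ngrams(mnemonics, n)]

print(ngrams(["mov", "push", "add"], 2))  # [('mov', 'push'), ('push', 'add')]
print(nperms(["push", "mov", "add"], 2))  # [('mov', 'push'), ('add', 'push')]
```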

The most common methods to compare instruction sequences are hashing, embedding, and alignment. Hashing is used by 6 approaches (BMAT, DR2005, SWPQS2006, SMIT, BINCLONE, SPAIN) to obtain a fixed-length value out of a variable-length instruction sequence. If the hash values are the same, the sequences are similar. Five approaches generate an embedding from n-gram sequences (IDEA, MBC, MUTANTX-S, EXPOSE, KAM1N0). Three approaches (EXEDIFF, TRACY, BINSEQUENCE) align two sequences to produce a mapping between them by inserting gaps in either sequence to account for inserted, removed, and modified instructions. These approaches define a similarity score when instructions are aligned, and a gap score when an instruction aligns with a gap. Other less common comparison methods are using vectors of Boolean features (ILINE) and encoding sequences as strings for indexing (RENDEZVOUS).
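As a concrete illustration of alignment-based comparison, the following sketch scores two mnemonic sequences with a classic Needleman-Wunsch dynamic program; the match, mismatch, and gap scores are arbitrary choices, not those used by EXEDIFF, TRACY, or BINSEQUENCE:

```python
def align_score(a, b, match=2, mismatch=-1, gap=-1):
    """Global alignment score of two instruction sequences (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):               # aligning a prefix against nothing
        m[i][0] = i * gap               # costs one gap per instruction
    for j in range(cols):
        m[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = m[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            m[i][j] = max(diag, m[i-1][j] + gap, m[i][j-1] + gap)
    return m[-1][-1]

print(align_score(["mov", "push", "add"], ["mov", "add"]))  # 3
```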

D. Semantic Similarity
Column: Semantic similarity

Semantic similarity captures whether the code being compared has similar effects, as opposed to syntactic similarity, which captures similarity in the code representation. The semantics of a piece of binary code can be described by the changes it produces in the process state, i.e., updates to the content of registers and memory. We identify 26 approaches computing semantic similarity. Most of them capture semantics at basic block granularity because basic blocks are straight-line code without control flow. Three methods are used to capture semantics: instruction classification, input-output pairs, and symbolic formulas.

Instruction classification. The first approach to introduce semantics into binary code similarity was KKMRV2005, which classified instructions into 14 classes (e.g., arithmetic, logic, data transfer) and used a 14-bit value to capture the classes of the instructions in a basic block. This semantic color bitvector captures the effects of the basic block. This approach was later adopted by BMM2006, BEAGLE, FOSSIL, and SIGMA. Instruction classification can be used to compute semantic similarity, but cannot determine if two pieces of binary code are, or are not, equivalent.
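The following sketch illustrates the idea with four illustrative instruction classes (KKMRV2005's actual 14 classes differ); each basic block is summarized by OR-ing one bit per class present:

```python
# Illustrative instruction classes; KKMRV2005's actual 14 classes differ.
CLASSES = {
    "arithmetic": {"add", "sub", "mul", "div", "inc", "dec"},
    "logic":      {"and", "or", "xor", "not"},
    "transfer":   {"mov", "push", "pop", "lea"},
    "control":    {"jmp", "je", "jne", "call", "ret"},
}
BITS = {name: 1 << i for i, name in enumerate(CLASSES)}

def color(block_mnemonics):
    """OR together one bit per instruction class present in the basic block."""
    bv = 0
    for m in block_mnemonics:
        for name, members in CLASSES.items():
            if m in members:
                bv |= BITS[name]
    return bv

print(bin(color(["mov", "add", "je"])))  # 0b1101: control, transfer, arithmetic
```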

Input-output pairs. Intuitively, two pieces of binary code are functionally equivalent if, given the same input, they produce the same output, for all possible inputs. Such equivalence is independent of the code representation and compares the final state after the code is executed, ignoring intermediate

program states. This approach was proposed by Jiang et al. on source code [102] and later used by numerous binary code similarity approaches: BINHASH, MXW2015, BLEX, MULTI-MH, BINGO, CACOMPARE, SPAIN, KS2017, and IMF-SIM. It involves executing both pieces of binary code with the same input and comparing their output, repeating the process many times. If the output differs for any input, then the two pieces of binary code are not equivalent. Unfortunately, to determine that both pieces are equivalent, the approach would require testing all possible inputs, which is not realistic for any non-trivial piece of binary code. Thus, in practice, this approach can only determine that two pieces of binary code are likely equivalent, with a confidence proportional to the fraction of inputs that have been tested, or that they are not equivalent (with certainty). The tested inputs are typically selected randomly, although it is possible to use other selection rules, e.g., taking values from the program data section (CACOMPARE). It is generally a dynamic approach, but some approaches (e.g., MULTI-MH) evaluate concrete inputs on statically-extracted symbolic formulas.
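A minimal sketch of the input-output comparison, assuming the two pieces of binary code are exposed as callable functions over 32-bit inputs (in practice they would be executed or emulated):

```python
import random

def likely_equivalent(f, g, trials=1000, seed=0):
    """Compare outputs on random 32-bit inputs; one mismatch proves inequivalence."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.getrandbits(32)
        if f(x) != g(x):
            return False            # definitely not equivalent
    return True                     # likely equivalent (only sampled inputs)

# Two computations with the same input-output behavior on 32-bit values:
double_add = lambda x: (x + x) & 0xFFFFFFFF
double_shl = lambda x: (x << 1) & 0xFFFFFFFF
print(likely_equivalent(double_add, double_shl))  # True
```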

Symbolic formula. A symbolic formula is an assignment statement in which the left side is an output variable and the right side is a logical expression of input variables and literals that captures how to derive the output variable. For instance, the instruction add %eax,%ebx can be represented with the symbolic formula EBX2 = EAX + EBX1, where EBX1 and EBX2 are symbols representing the values of the EBX register before and after executing the instruction, respectively. Eleven approaches use symbolic formulas: BINHUNT, IBINHUNT, BINHASH, EXPOSE, TRACY, RMKNHLLP2014, TEDEM, COP, MULTI-MH, ESH, and XMATCH. Eight of them extract symbolic formulas at basic block granularity, XMATCH and EXPOSE extract formulas for the return values of a function, and BINSIM extracts symbolic formulas from an execution trace that capture how the arguments of a system call were derived. Three methods are used to compare symbolic formulas: using a theorem prover to check for equivalence, comparing the semantic hash of the formulas to check for equivalence, and computing the similarity of the graph representation of the formulas.

Theorem prover – BINHUNT introduced the idea of using theorem provers such as STP [103] or Z3 [104] to check if two symbolic formulas are equivalent, i.e., whether the output variable always contains the same value after the execution of both formulas, assuming that the input variables share the same values. The main limitation of this approach is that it is computationally expensive: it can only perform pair-wise equivalence queries, the solving time quickly increases with formula sizes, and the solver may fail to return an answer for some queries. Note that a piece of binary code may have multiple outputs (registers and variables in memory), each represented by its own symbolic formula. These approaches need to try all pair-wise comparisons and check if there exists a permutation of variables such that all matched variables contain the same value.

Semantic hashes – An alternative to using a theorem prover is to check if two symbolic formulas have the same hash, after normalizing the formulas (e.g., using common register names) and simplifying them (e.g., applying constant propagation). The intuition is that if the two symbolic formulas have the


same hash, they should be equivalent. Three approaches use semantic hashes: BINHASH, BINJUICE, and GITZ. Semantic hashes are efficient, but are limited in that it is possible for two equivalent formulas to have different hashes even after normalization and simplification. For example, reordering of symbolic terms in one of the formulas (e.g., due to instruction reordering) results in a different hash.

Graph distance – XMATCH and TEDEM represent the symbolic formula of a basic block as a tree, and compute their similarity by applying graph/tree edit distance. Computing the graph/tree edit distance is more expensive than comparing semantic hashes, but the graph representation has the advantage over semantic hashes that it can handle term reordering.
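As an illustration of the theorem-prover method, the following sketch (assuming the z3-solver Python package; the register symbols are our own) checks whether two syntactically different formulas always produce the same output:

```python
# Requires the z3-solver package; register names are illustrative.
from z3 import BitVec, Solver, Not, unsat

eax, ebx = BitVec("eax1", 32), BitVec("ebx1", 32)
formula_a = eax + ebx          # ebx2 from: add %eax,%ebx
formula_b = ebx - (-eax)       # ebx2 from a syntactically different computation

s = Solver()
s.add(Not(formula_a == formula_b))   # search for an input where outputs differ
print("equivalent" if s.check() == unsat else "not equivalent")  # equivalent
```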

E. Structural Similarity
Column: Structural similarity

Structural similarity computes similarity on graph representations of binary code. It sits between syntactic and semantic similarity, since a graph may capture multiple syntactic representations of the same code and may be annotated with semantic information. Structural similarity can be computed on different graphs. The three most common are the intra-procedural control flow graph (CFG), the inter-procedural control flow graph (ICFG), and the callgraph (CG). All three are directed graphs. In the CFG and ICFG, nodes are basic blocks and an edge indicates a control flow transition (e.g., branch, jump). Basic blocks in a CFG belong to a single function; each function has its own CFG. Basic blocks in the ICFG belong to any program function; there is one ICFG per program. In the CG, nodes are functions and edges capture caller-callee relationships.

The intuition behind structural approaches is that the CG, ICFG, and CFGs are fairly stable representations whose structure varies little for similar code. Approaches that operate on the CG or ICFG have a whole program input granularity, while those that operate on CFGs may have function granularity, or use function similarity as a step towards whole program similarity. Structural similarity approaches may use labeled graphs. For example, F2004 and DR2005 use node labels in the CG to capture structural information about a function's CFG (e.g., number of instructions and edges). Other approaches label basic blocks in the CFG/ICFG with a feature vector that captures the semantics of the basic block (e.g., KKMRV2005, BINJUICE) or its embedding (GENIUS, GEMINI, see §VI-F). Edge labels can be used to capture the type of control flow transfer (BMM2006) or to aggregate the semantic labels of source and destination nodes (FOSSIL).

Structural similarity is used by 27 approaches. Fourteen approaches operate only on CFGs; five on both CFGs and the CG; four only on the ICFG; and two only on the CG. There are also three approaches that use non-standard graphs: SIGMA proposes a semantic integrated graph that combines information from the CFG, CG, and register flow graph, while QSM2015 and LIBV use the execution dependence graph [105]. The remainder of this subsection discusses different approaches used to compute graph similarity.

(Sub)Graph isomorphism – Most structural similarity approaches check for variations of graph isomorphism. An isomorphism of two graphs G and H is an edge-preserving bijection f between their node sets such that if any two nodes u, v are adjacent in G, then f(u) and f(v) are also adjacent in H. Graph isomorphism requires that the node set cardinality is the same in both graphs, which is too strict for binary code similarity. Thus, approaches instead check for subgraph isomorphism, which determines if G contains a subgraph isomorphic to H and is a known NP-complete problem. Other approaches check for the maximum common subgraph isomorphism (MCS), which finds the largest subgraph isomorphic to two graphs and is also NP-complete. Given the high complexity of both problems, approaches try to reduce the number of graph pairs that need to be compared, as well as the size of the compared graphs. For example, DR2005 avoids comparing CFGs with the same hash (match) and CFGs with very different numbers of nodes and edges (unlikely match). IBINHUNT reduces the number of nodes to consider by assigning taint labels to basic blocks. Only nodes with the same taint label are considered in the subgraph isomorphism. For candidate graph pairs that pass the filtering, approximate algorithms are used that can be grouped into greedy and backtracking.

Greedy – These approaches perform neighborhood exploration. An initial set of matching nodes is first identified. Then, the matching is recursively expanded by checking only the neighbors (i.e., parents or children) of already matched nodes. BMAT, F2004, DR2005, LKI2013, TEDEM, MULTI-MH, KLKI2016, KAM1N0, BINSEQUENCE, and BINARM use this approach. A limitation of greedy algorithms is that early errors propagate, significantly reducing the accuracy.

Backtracking – Backtracking algorithms fix a wrong matching by revisiting the solution; if the new matching does not improve the overall matching, it is reverted (BMM2006, BINHUNT, IBINHUNT, MXW2015, QSM2015, DISCOVRE). Backtracking is more expensive, but can improve accuracy by avoiding locally optimal matchings.
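The following sketch captures the skeleton of a greedy neighborhood-exploration matcher over adjacency-dict graphs; the seed selection, node similarity function, and threshold are placeholders, and the surveyed approaches differ in all three:

```python
def greedy_match(g1, g2, seeds, node_sim, threshold=0.8):
    """Expand an initial node matching by inspecting only the neighbors of
    already-matched node pairs (no backtracking: early errors persist).
    g1, g2: adjacency dicts mapping node -> set of neighbor nodes."""
    matched = dict(seeds)               # node in g1 -> node in g2
    worklist = list(seeds.items())
    used = set(matched.values())
    while worklist:
        u, v = worklist.pop()
        for nu in g1[u]:
            if nu in matched:
                continue
            cands = [nv for nv in g2[v] if nv not in used]
            if not cands:
                continue
            best = max(cands, key=lambda nv: node_sim(nu, nv))
            if node_sim(nu, best) >= threshold:
                matched[nu] = best      # commit greedily, never revisited
                used.add(best)
                worklist.append((nu, best))
    return matched

g1 = {"a": {"b"}, "b": {"a"}}
g2 = {"x": {"y"}, "y": {"x"}}
sim = lambda n, m: 1.0                        # toy similarity: all pairs match
print(greedy_match(g1, g2, {"a": "x"}, sim))  # {'a': 'x', 'b': 'y'}
```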

Optimization. An alternative used by four approaches (SMIT, BINSLAYER, CXZ2014, GENIUS) is to model graph similarity as an optimization problem. Given two CFGs and a cost function between two basic blocks, they find a bijective mapping between the two CFGs with minimum cost, as sketched below. Such bipartite matching ignores graph structure, i.e., it does not use edge information. To address this, SMIT and BINSLAYER assign lower cost to connected basic blocks. To perform the matching, SMIT, BINSLAYER, and CXZ2014 use the O(n³) Hungarian algorithm, while GENIUS uses a genetic algorithm.

K-subgraph matching. KKMRV2005 proposed to divide a graph into k-subgraphs, where each subgraph contains only k connected nodes. Then, a fingerprint is generated for each k-subgraph, and the similarity of two graphs corresponds to the maximum number of k-subgraphs matched. Four other approaches later leveraged this approach: BEAGLE, CXZ2014, RENDEZVOUS, and FOSSIL.
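A minimal sketch of the bipartite-matching formulation, assuming SciPy is available; linear_sum_assignment implements an O(n³) Hungarian-style solver, and the cost values here are made up:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: cost[i][j] = distance between block i of CFG1 and block j
# of CFG2 (e.g., derived from per-block feature vectors; values are made up).
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.3]])
rows, cols = linear_sum_assignment(cost)   # minimum-cost bijective mapping
print([(int(r), int(c)) for r, c in zip(rows, cols)],
      float(cost[rows, cols].sum()))       # [(0, 0), (1, 1), (2, 2)] ~0.6
```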

Path similarity. There are three approaches (COP, SIGMA, BINSEQUENCE) that convert function similarity into a path similarity comparison. First, they extract a set of execution paths from a CFG, then define a path similarity metric between execution paths, and finally combine the path similarities into a function similarity.

Graph embedding. Another method, used by GENIUS, VULSEEKER, and GEMINI and detailed in §VI-F, is to extract a


real-valued feature vector from each graph and then compute the similarity of the feature vectors.

F. Feature-Based Similarity
Column: Feature-based, Machine learning

A common method (28 approaches) to compute similarity is to represent a piece of binary code as a vector or a set of features, such that similar pieces of binary code have similar feature vectors or feature sets. A feature captures a syntactic, semantic, or structural property of the binary code. Features can be Boolean, numeric, or categorical. Categorical features have discrete values, e.g., the mnemonic of an instruction. A feature vector typically has all numeric or all Boolean features; the latter is called a bitvector. Categorical features are typically first encoded into Boolean features using one-hot encoding or into real-valued features using an embedding. Of the 28 approaches, 21 use numeric feature vectors, six use feature sets, and BINCLONE uses bitvectors. Once the features have been extracted, a similarity metric between feature vectors or feature sets is used to compute the similarity. Common similarity metrics are the Jaccard index for feature sets, the dot product for bitvectors, and the Euclidean or cosine distance for numeric vectors.
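For reference, the two most common metrics can be computed in a few lines; this sketch is generic and not tied to any surveyed approach:

```python
import math

def jaccard(s1, s2):
    """Similarity of two feature sets: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)

def cosine(v1, v2):
    """Cosine similarity of two numeric feature vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

print(jaccard({"add", "mov", "call"}, {"add", "mov", "ret"}))  # 0.5
print(cosine([3, 0, 1], [2, 0, 2]))                            # ~0.894
```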

Figure 3 shows two alternative methods for feature-based similarity. The top method comprises two steps: feature selection and feature encoding. Feature selection is a manual process where an analyst uses domain knowledge to identify representative features. The alternative method, shown below it, is to automatically learn real-valued feature vectors, called embeddings, from training data. Embeddings are used by eight recent approaches (GENIUS, GEMINI, αDIFF, VULSEEKER, RLZ2019, INNEREYE, ASM2VEC, SAFE). Embeddings are used in natural language processing (NLP) to encode categorical features using real-valued numbers, which helps deep learning algorithms by reducing the dimensionality and increasing the density of feature vectors compared to one-hot encoding. Embeddings enable automatic feature extraction and efficient similarity computation. But, the features in an embedding do not provide information about what has been learnt.

Binary code similarity embeddings can be classified by the properties they capture and their granularity. The first binary code similarity approach using embeddings was GENIUS, later followed by VULSEEKER and GEMINI. All three approaches build a graph embedding for the ACFG of a function, i.e., a CFG with nodes annotated with selected basic block features. While GENIUS uses clustering and graph edit distance to compute the embedding, VULSEEKER and GEMINI improve efficiency by training a neural network that avoids expensive graph operations. Later approaches (αDIFF, RLZ2019, INNEREYE, ASM2VEC, SAFE) avoid manually selected features by focusing on instruction, or raw byte, co-occurrence. In NLP, it is common to extract a word embedding that captures word co-occurrence (e.g., word2vec) and then build a sentence embedding on top of it. RLZ2019 and INNEREYE use an analogous approach by considering instructions as words and basic blocks as sentences. SAFE uses a similar approach, but builds function embeddings instead of basic block embeddings. Also related is ASM2VEC, which obtains a function embedding by combining path embeddings that capture instruction co-occurrence along different execution paths in the function. Instead of using instruction co-occurrence, αDIFF

computes a function embedding directly from the sequence of raw bytes of a function using a convolutional network.
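A minimal sketch of the NLP analogy (assuming the gensim package; the normalized-instruction corpus and the averaging composition are our simplifications, not the exact pipelines of RLZ2019 or INNEREYE):

```python
# Instructions as words, basic blocks as sentences (word2vec-style training).
from gensim.models import Word2Vec
import numpy as np

blocks = [  # toy corpus of normalized basic blocks
    ["mov REG,REG", "add REG,IMM", "jmp IMM"],
    ["mov REG,REG", "sub REG,IMM", "jmp IMM"],
    ["push REG", "mov REG,MEM", "ret"],
]
model = Word2Vec(sentences=blocks, vector_size=32, window=2, min_count=1, sg=1)

def block_embedding(block):
    """One simple composition: average the instruction vectors."""
    return np.mean([model.wv[ins] for ins in block], axis=0)

print(block_embedding(blocks[0]).shape)  # (32,)
```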

Machine learning. We identify three uses of machine learning in binary code similarity approaches: (1) to generate an embedding, as explained above; (2) to cluster similar pieces of binary code using unsupervised learning (BINHASH, MUTANTX-S, ILINE, RMKNHLLP2014, KLKI2016, GENIUS); and (3) to classify, with a probability, whether the pieces of binary code were compiled from the same source code (BINDNN, IMF-SIM). BINDNN is the first use of neural networks for binary code similarity. Instead of generating an embedding, BINDNN directly uses a neural network classifier to determine if two functions are compiled from the same source code. Surprisingly, BINDNN is not cited by later binary code similarity approaches, including those using neural networks to build embeddings.

G. Hashing
Column: Locality sensitive hashing

A hash is a function that maps data of arbitrary size to a fixed-size value. Hash values are compact to store, efficient to compute, and efficient to compare, which makes them great for indexing. Hashes operate at the raw-byte level. They are not specifically designed for binary code, but rather for arbitrary binary data. However, three classes of hashes have been used for binary code similarity: cryptographic hashes, locality-sensitive hashes, and executable file hashes. Cryptographic hashes capture identical, rather than similar, inputs. They are used by some binary code similarity approaches to quickly identify duplicates at fine granularity (e.g., basic block). Locality-sensitive hashes produce similar hash values for similar inputs, as opposed to cryptographic hashes, where a small difference in the input generates a completely different hash value. Executable file hashes take as input an executable file, but the hash computation only considers parts of the executable such as the import table or selected fields of the executable's header. Their goal is to output the same hash value for polymorphic variants of the same malware.

Locality-sensitive hashing (LSH). LSH produces similar hash values for similar inputs, efficiently approximating a nearest neighbor search. LSH algorithms typically apply multiple hash functions on the given input and produce the same hash value for similar inputs, i.e., they increase the collision probability for similar inputs. LSH is used in binary code similarity to boost performance. For example, it is used by OM approaches for indexing pieces of binary code, enabling efficient binary code search (KAM1N0, GEMINI, INNEREYE). Of the 11 approaches that use LSH, seven use MinHash [106] (BINHASH, MULTI-MH, GENIUS, BINSEQUENCE, BINSHAPE, CACOMPARE, BINSIGN), two do not specify the algorithm used (GEMINI, INNEREYE), SWPQS2006 uses the algorithm by Gionis et al. [107], and KAM1N0 proposes its own Adaptive Locality Sensitive Hashing (ALSH).
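The following sketch shows the MinHash idea over feature sets; the salted-MD5 hash family is an illustrative choice, not what the surveyed approaches use:

```python
import hashlib

def minhash(features, num_hashes=64):
    """Signature = per-hash-function minimum over the feature set. Similar sets
    (high Jaccard index) produce signatures that agree in many positions."""
    sig = []
    for i in range(num_hashes):
        h = lambda f: int.from_bytes(
            hashlib.md5(f"{i}:{f}".encode()).digest()[:8], "big")
        sig.append(min(h(f) for f in features))
    return sig

def signature_sim(s1, s2):
    """Fraction of agreeing positions estimates the Jaccard index."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

s1 = minhash({"add", "mov", "call", "ret"})
s2 = minhash({"add", "mov", "call", "jmp"})
print(signature_sim(s1, s2))  # close to the true Jaccard index of 0.6
```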

Fuzzy hashing is a popular type of LSH used to compute the similarity of arbitrary files. For example, the VirusTotal file analysis service [108] reports ssdeep [52] hashes for submitted files. Other fuzzy hashes include tlsh [53], sdhash [109], and mrsh-v2 [110]. None of the 61 approaches use them, but we briefly discuss them because they are often applied to executable files. When applied on executables, fuzzy hashes


[Figure 3 depicts the two methods: (top) Binary Code → Feature Selection → Feature Encoding → bitvector / feature set / numeric vector; (bottom) Binary Code → Embedding → numeric vector.]

Fig. 3. Two alternative methods for feature-based similarity.

may capture similarity of the binary code, but also similarity of the data present in the executables. This issue has been recently examined by Pagani et al. [111]. Their results show that when applied only on the code of the .text section, fuzzy hashes work significantly worse than when used for whole program similarity. In fact, they show that a byte change at the right position, extra nop instructions, and instruction swapping can degrade similarity significantly (in some cases bringing it down to zero). They observe that when compiling the same source code with different optimizations, the data sections of the executables remain the same, which seems to be a key reason fuzzy hashes work better on whole executables.

Executable file hashes. This class of hashes is specifically designed for executable files. They are designed to output the same hash value for polymorphic variants of the same malware. The hash computation only considers parts of the executable that are less likely to change when simply repacking, or re-signing, the same executable. They are not used by any of the 61 approaches, but we briefly describe three popular hashes for completeness. peHash [51] hashes selected fields of a PE executable that are less susceptible to changes during compilation and packing, e.g., the initial stack size and heap size. ImpHash hashes the import table of an executable. Since the functionality of packed variants is the same, their imported functions should be the same as well. Unfortunately, it gives false positives with unrelated executables packed with the same packer, if the packer reconstructs the original import table at runtime. Authentihash is the hash of a PE executable ignoring its Windows Authenticode code signing data. It enables identifying the same executable signed by different publishers. It hashes the whole PE executable except three pieces: the Authenticode data, pointers to the Authenticode data, and the file checksum.

H. Supported Architectures
Column: Cross-architecture

A cross-architecture approach can compare pieces of binary code for different CPU architectures, e.g., x86, ARM, and MIPS. This differs from architecture-independent approaches (e.g., F2004) that support different architectures but cannot cross-compare among them, i.e., they can compare two x86 inputs and two MIPS inputs, but cannot compare an x86 input with a MIPS input. There are 16 cross-architecture approaches, all proposed since 2015. A common application is, given a buggy piece of binary code, to search for similar pieces of binary code, compiled for other architectures, which may also contain the bug. For example, to search for programs in firmware images where a version of OpenSSL vulnerable to

Heartbleed has been statically compiled.

The code syntax for different architectures may significantly differ, as they may use separate instruction sets with different instruction mnemonics, sets of registers, and default calling conventions. Thus, cross-architecture approaches compute semantic similarity. Cross-architecture approaches employ one of two techniques. Seven approaches lift the binary code to an architecture-independent intermediate representation (IR): MULTI-MH, MOCKINGBIRD, BINGO, XMATCH, CACOMPARE, GITZ, and FIRMUP. Then, identical analysis can be performed on the IR, regardless of the original architecture. The advantage is that the analysis only depends on the IR, and the IR design can be outsourced to a separate group. Section VII details the specific architectures supported by each approach and the IRs they use. The alternative technique, used by nine approaches, is feature-based similarity (discussed in §VI-F). These approaches use a separate module for each architecture to obtain a feature vector that captures the semantics of the binary code (DISCOVRE, BINDNN, GENIUS, GEMINI, αDIFF, VULSEEKER, RLZ2019, INNEREYE, SAFE).
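For instance, a piece of binary code can be lifted to the VEX IR (the IR used by several of the surveyed approaches) with the pyvex library; a minimal sketch, assuming the pyvex and archinfo packages are installed:

```python
# Lift raw bytes to the architecture-independent VEX IR.
import pyvex, archinfo

raw = b"\x01\xd8"                            # x86: add %ebx,%eax
irsb = pyvex.lift(raw, 0x400000, archinfo.ArchX86())
irsb.pp()                                    # print the IR statements
```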

I. Type of Analysis

Column: Static analysis; Dynamic analysis; Dataflow analysis

Binary code similarity approaches can use static analysis, dynamic analysis, or both. Static analysis examines the disassembled binary code, without executing it. Instead, dynamic analysis examines code executions by running the code on selected inputs. Dynamic analysis can be performed online, as the code executes in a controlled environment, or offline on traces of the execution. Fifty-one approaches use only static analysis, four use only dynamic analysis, and six combine both. The dominance of static analysis for binary code similarity is due to most applications requiring all the input code to be compared. This is easier with static analysis, as it provides complete code coverage. Dynamic analysis examines one execution at a time and can only determine similarity of the code run in that execution. To increase the coverage of dynamic analysis, three approaches (IBINHUNT, BLEX, IMF-SIM) run the code on multiple inputs covering different execution paths and combine the results. However, it is infeasible to run any non-trivial piece of binary code on inputs that cover all possible execution paths.

One advantage of dynamic analysis is simplicity, as memory addresses, operand values, and control-flow targets are known at runtime, which sidesteps static analysis challenges such as memory aliasing and indirect jumps. Another advantage is that it can handle some obfuscations and does not


require disassembly of the code, which is also difficult with obfuscated code [112]. Section VIII details which approaches have evaluated their robustness on obfuscated code. Overall, dynamic analysis has been used in binary code similarity for malware unpacking (SMIT, BEAGLE, MUTANTX-S, CXZ2014, BINSIM), for operating on trace granularity (BINSIM, KS2017), and for collecting runtime values for semantic similarity (BLEX, BINSIM, IMF-SIM).

Dataflow analysis is a common type of analysis that examines how values propagate through the code. It comprises data sources to be tracked (e.g., registers or memory locations holding specific variables); propagation rules defining how values are propagated by different instructions or IR statements; and sinks, i.e., program points where the values reaching them are checked. Of the 19 approaches that use dataflow analysis, 16 use symbolic execution to extract a set of symbolic formulas to compute semantic similarity (§VI-D). SPAIN uses taint analysis to summarize the patterns of vulnerabilities and their security patches. IBINHUNT uses both taint analysis and symbolic execution: it first uses taint analysis as a filter to find pieces of binary code that process the same user input, restricting the expensive subgraph isomorphism computation to those with the same taint label, and then computes basic block similarity using symbolic formulas. IMF-SIM uses backward taint analysis to infer pointer arguments of a function from dereference errors.

J. Normalization
Column: Normalization

Syntactic similarity approaches often normalize instructions, so that two instructions that are normalized to the same form are considered similar despite some syntactic differences, e.g., different registers being used. Overall, there are 33 approaches that use instruction normalization. They apply the following three types of instruction normalization:

• Operand removal – A normalization used by nine approaches is to abstract an instruction only by its mnemonic or opcode, ignoring all operands. For example, add %eax,%ebx and add [%ecx],%edx would both be represented by add and considered similar, despite using different operands.

• Operand normalization – A normalization used by 17 approaches is to replace instruction operands with symbols that capture the operand type, such as REG for register, MEM for memory, and IMM for immediate values (see the sketch after this list). For example, add %eax,%ebx and add %ecx,%edx would both be represented as add REG,REG, matching the instructions despite the different register allocations used by the compiler. Operand normalization abstracts less than operand removal. For example, add [%ecx],%edx would be represented as add MEM,REG and thus be considered different from the above. Some approaches also use different symbols for general purpose registers and segment registers, and for operands of different sizes (e.g., RegGen8 and RegGen32 in BINCLONE).

• Mnemonic normalization – A normalization used by three approaches (BMAT, EXPOSE, and ILINE) is to represent multiple mnemonics by the same symbol. For example, both BMAT and ILINE represent all conditional jumps (e.g., je, jne) with the same symbol to account for the compiler modifying jump conditions.
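The sketch below illustrates operand removal and operand normalization for simple AT&T-syntax instructions; the regular expressions are illustrative and far from covering real x86 syntax:

```python
import re

def normalize(ins, keep_operands=True):
    """Operand normalization (or removal) for simple AT&T-syntax instructions."""
    mnemonic, _, ops = ins.partition(" ")
    if not keep_operands or not ops:       # operand removal: mnemonic only
        return mnemonic
    out = []
    for op in ops.split(","):
        op = op.strip()
        if re.fullmatch(r"%\w+", op):                      # e.g., %eax
            out.append("REG")
        elif re.fullmatch(r"\$?-?(0x)?[0-9a-fA-F]+", op):  # e.g., $4, 0x10
            out.append("IMM")
        else:                              # anything else treated as memory
            out.append("MEM")
    return f"{mnemonic} {','.join(out)}"

print(normalize("add %eax,%ebx"))          # add REG,REG
print(normalize("add [%ecx],%edx"))        # add MEM,REG
print(normalize("add %eax,%ebx", False))   # add
```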

Another type of normalization is to ignore code that should not affect the semantics. For example, some instruction sets contain no-op instructions that do not change the process state, and compilers often use no-op-equivalent instructions for padding, such as instructions that move a register to itself, e.g., mov %eax,%eax. No-op-equivalent instructions do not matter for semantic and structural similarity approaches, but since they change the syntax, they may affect syntactic similarity approaches. Three approaches remove no-op instructions (ILINE, MXW2015, QSM2015). A few approaches also remove unreachable dead code (BMM2006, FIRMUP), which may be introduced by obfuscations; function prologue and epilogue instructions (EXPOSE), which may make small unrelated functions look similar; and functions added by the compiler to load the program, which are not present in the source code (BINGO, SIGMA).

VII. IMPLEMENTATIONS

This section systematizes the implementation of the 61 approaches. For each approach, Table III shows the static and dynamic platforms it builds on, the programming language used to implement the approach, whether the implementation supports distributing the analysis, the supported target program architectures and operating systems, and how the approach is released. A dash (–) indicates that we could not find the information (i.e., unknown), while a cross (✗) means unsupported/unused.

Building a binary code similarity approach from scratch requires significant effort. Thus, all approaches build on top of previously available binary analysis platforms or tools, which provide functionality such as disassembly and control flow graphs for static analysis, or instruction-level monitoring for dynamic analysis. However, the implementation of an approach may only use a subset of the functionality offered by the underlying platform. The most popular static platform is IDA (42 approaches), followed by BAP (4), DynInst (3), and McSema (2). IDA's main functionalities are disassembly and building control flow graphs. Its popularity comes from supporting a large number of architectures. Some binary analysis platforms already support using IDA as their disassembler, so it is not uncommon to combine IDA with another platform (6 approaches). Among dynamic approaches, PIN is the most popular with 5 approaches, followed by TEMU and ANUBIS with 3 approaches each. Previous work has analyzed the binary analysis platforms used by binary code type inference approaches [35]. Since most platforms overlap, we refer the reader to that work for platform details, but provide an extended version of their table in the Appendix (Table V) with six extra platforms.

A few approaches build on top of previous binary code similarity approaches. One case is that both approaches have overlapping authors. For example, RMKNHLLP2014 is based on BINJUICE, and MXW2015 extends IBINHUNT, which already shared components with BINHUNT. The other case is using a previously released approach. For example, BINSLAYER and SPAIN use BINDIFF as a first-stage filter to find matched functions.


TABLE III. COMPARISON OF THE IMPLEMENTATIONS OF BINARY CODE SIMILARITY APPROACHES. SYMBOL – MEANS INFORMATION IS UNKNOWN.

[Table III lists, for each of the 61 approaches: the static and dynamic platforms it builds on; its programming language; whether the analysis can be distributed; whether it uses an IR; firmware support; the targeted CISC (x86, x86-64) and RISC (ARM, MIPS, other) architectures; the targeted OSes (Windows, Linux, MacOS); and the release form (open source or free binary).]


The most popular programming language is Python (20 approaches), followed by C++ (14). One reason for this is that the majority of approaches use IDA for disassembly, and IDA supports both C++ and Python plugins. Moreover, two approaches (KAM1N0, BINSIGN) have used distributed platforms, such as Hadoop, to distribute their analysis.

Binary code analysis can operate directly on a particular instruction set (e.g., x86, ARM) or convert the instruction set into an intermediate representation (IR). Using an IR has two main advantages. First, it is easier to reuse the analysis built on top of an IR. For example, supporting a new architecture only requires adding a new front-end to translate into the IR, while the analysis code can be reused. Second, complex instruction sets (e.g., x86/x86-64) can be converted into a smaller set of IR statements and expressions, which makes explicit any side effects such as implicit operands or conditional flags. There are 15 approaches that use an IR, and most use the IR provided by the underlying platform. Of the 15 approaches, 5 use VINE provided by BITBLAZE (BINHUNT, IBINHUNT, COP, MXW2015, BINSIM), another 5 use VEX provided with VALGRIND (MULTI-MH, MOCKINGBIRD, CACOMPARE, GITZ, FIRMUP), and two use LLVM-IR (ESH, XMATCH). The remaining three approaches use SAGE III (SWPQS2006), METASM (TEDEM), and REIL (BINGO). It is surprising that only 6 of the 16 cross-architecture approaches use an IR (MULTI-MH, XMATCH, MOCKINGBIRD, CACOMPARE, GITZ, FIRMUP). The remaining approaches provide separate analysis modules for each architecture that typically output a feature vector with a common format.

The most supported architectures for the target programs to be analyzed are x86/x86-64 (59 approaches), followed by ARM (16) and MIPS (10). The only two approaches that do not support x86/x86-64 are EXEDIFF, which targets Digital Alpha, and BINARM, which targets ARM. There are 16 cross-architecture approaches and 9 approaches that support firmware. The first cross-architecture approach was MULTI-MH, which added support for ARM in 2015. Since then, ARM support has become very prevalent due to the popularity of mobile and IoT devices. It is worth noting that even though there are 42 approaches that use IDA, which supports more than 60 processor families, most approaches built on top of IDA only analyze x86/x86-64 programs. The most supported OS is Linux (41 approaches), followed by Windows (35). Only 4 approaches support MacOS. Early approaches that focused on x86/x86-64 often used IDA to obtain support for both PE/Windows and ELF/Linux executables. More recently, all approaches that support ARM target Linux, which is used by Android and also by many IoT devices.

Of the 61 approaches, only 12 are open source (BINSLAYER, TRACY, QSM2015, ESH, GENIUS, KAM1N0, GEMINI, ASM2VEC, VULSEEKER, RLZ2019, INNEREYE, SAFE). DR2005 was implemented in the BINDIFF commercial tool, which is now available as a free binary. The remaining approaches have not been released in any form, although the platforms they build on may be open source. Moreover, 4 of the 12 open-source approaches (ESH, GENIUS, RLZ2019, INNEREYE) have partially released their source code and machine learning models.

VIII. EVALUATIONS

This section systematizes the evaluation of the 61 binary code similarity approaches. For each approach, Table IV summarizes the datasets used (§VIII-A) and the evaluation methodology (§VIII-B).

A. Datasets

The left side of Table IV describes the datasets used by each approach. It first shows the total number of executables used in the evaluation and their split into benign and malicious executables. Executables may come from different programs or correspond to multiple versions of the same program, e.g., with varying compilers and compilation options. Then, for approaches that have function granularity, it captures the total number of functions evaluated, and for approaches that analyze firmware, the number of images from which the executables are obtained. A dash (–) in a column means that we could not find the number in the paper. For example, SWPQS2006 evaluates on system library files in Windows XP, but the total number of executables is not indicated.

Most approaches use custom datasets; the one popular benchmark is Coreutils, used by 18 approaches. In addition, approaches that evaluate on firmware use two openly available firmwares (ReadyNAS [113] and DD-WRT [114]). Over half of the approaches evaluate on less than 100 executables, 7 on less than 1K, 8 on less than 10K, and only 8 on over 10K. Of the 61 approaches, 37 have evaluated only on benign programs, 8 only on malware, and 16 on both. This indicates that binary code similarity is also popular for malware analysis. However, of the 24 approaches evaluated on malware, only five use packed malware samples (SMIT, BEAGLE, MUTANTX-S, CXZ2014, BINSIM). These approaches first unpack the malware using a custom unpacker (SMIT) or a generic (write-and-execute) unpacker (BEAGLE, MUTANTX-S, CXZ2014, BINSIM), and then compute binary code similarity. The rest have access to the malware's source code or to unpacked samples.

For binary code search approaches that use function granularity, the number of functions in the repository better captures the dataset size. The largest dataset is by GENIUS, which evaluates on 420M functions extracted from 8,126 firmwares. Prior approaches had evaluated on at most 0.5M functions, which demonstrates the scalability gains from its embedding approach. Five other approaches have evaluated on over 1M functions: FOSSIL (1.5M), BINSEQUENCE (3.2M), BINARM (3.2M), FIRMUP (40M), and GEMINI (420M), which uses the same dataset as GENIUS. In addition, INNEREYE has evaluated on a repository of 1.2M basic blocks.

B. Methodology

The right side of Table IV describes four aspects of the evaluation methodology used by each approach: robustness, accuracy, performance, and comparison with prior approaches.

Robustness. The first 8 columns on the right side of Table IV capture how authors evaluate the robustness of their approaches, i.e., their ability to capture similarity despite transformations applied to the input programs. First, it shows whether they use each of the four compilers we have observed being employed to compile the programs in the dataset: GCC, ICC, Visual Studio (MSVC), and Clang. Then, it captures


TABLE IV. COMPARISON OF THE EVALUATIONS OF BINARY CODE SIMILARITY APPROACHES. ON THE LEFT IT SUMMARIZES THE DATASETS USED AND ON THE RIGHT THE EVALUATION METHODOLOGY. A DASH (–) MEANS WE COULD NOT FIND THE INFORMATION. FOR BOOLEAN COLUMNS, ✗ INDICATES NO SUPPORT.

[Table IV lists, for each of the 61 approaches: the dataset size (executables, split into benign and malware; functions; firmwares) and the evaluation methodology (compilers used: GCC, ICC, MSVC, Clang; cross-optimization, cross-compiler, cross-architecture, and obfuscation evaluations; accuracy evaluation; number of prior approaches compared against; and performance evaluation).]


whether the authors evaluate similarity between programs compiled with different compilation options (cross-optimization), between programs compiled with different compilers (cross-compiler), and between programs compiled for different architectures (cross-architecture). Finally, it captures whether the authors evaluate similarity when obfuscation transformations are applied to the input programs.

There are 34 approaches that evaluate robustness (at least one ✓ in the last four robustness columns) and 27 that do not. Many early works did not evaluate robustness. This evaluation has become increasingly popular as approaches mature. The most popular robustness evaluation is cross-optimization (23 approaches), followed by cross-compiler (19), cross-architecture (16), and obfuscation (13). There are 9 approaches that have evaluated cross-optimization, cross-compiler, and cross-architecture. Approaches that evaluate cross-compiler also typically evaluate cross-optimization, as it is a simpler case. Similarly, approaches that evaluate cross-architecture typically also evaluate cross-compiler, as cross-architecture programs may be produced using different compilers. Note that it is possible for approaches to compile programs with multiple compilers, but not perform a cross-compiler evaluation, i.e., not compare similarity between programs compiled with different compilers.

There are 13 approaches that have evaluated on obfuscated programs. Of those, two use only source code transformations (COP, ASM2VEC), three use only binary code transformations (KKMRV2005, TPM, BINSHAPE), five use packed malware (SMIT, BEAGLE, MUTANTX-S, CXZ2014, BINSIM), and three evaluate both source code and binary code transformations (BINSIM, IMF-SIM, FOSSIL).

Accuracy evaluation and comparison. There are 49 approaches that perform a quantitative evaluation of their accuracy using some ground truth (✓), and 12 that perform a qualitative accuracy evaluation through case studies (✗). Quantitative evaluation most often uses standard accuracy metrics such as true positives, false positives, precision, and recall. However, two approaches propose novel application-specific accuracy metrics (ILINE, KS2017).

There are 33 approaches that compare with prior approaches. All of them compare accuracy, and six also compare runtime. The top target for comparison is BINDIFF (13 approaches compare with it), followed by TRACY (5), DISCOVRE (4), and MULTI-MH (4). Comparing accuracy across binary code similarity approaches is challenging for multiple reasons. First, only a small fraction of the proposed approaches have publicly released their code (Section VII). Since most approaches are not publicly available, comparison is often performed by re-implementing previous approaches, which may require significant effort. One advantage of re-implementation is that approaches can be compared on new datasets. The alternative to re-implementation is to evaluate the new approach on the same dataset used by a prior approach and compare with the reported results. This method is only used by 6 approaches (GENIUS, BINGO, XMATCH, CACOMPARE, GEMINI, and BINARM), likely because most datasets are custom and not publicly available. Fortunately, we observe that public code release has become more common in recent approaches. Second, the input comparison and input granularity may differ among approaches, making it nearly

impossible to perform a fair comparison. For instance, it is hard to compare in a fair manner an approach that identifies program similarity using callgraphs (e.g., SMIT) with an approach comparing basic blocks (e.g., INNEREYE). Third, even when the input comparison and input granularity are the same, the evaluation metrics and methodology may differ, significantly impacting the measured accuracy.

The latter challenge is best illustrated by binary code search approaches operating at function granularity. These approaches find the most similar functions in a repository to a given function. They return multiple entries ranked in descending similarity order and count a true positive if one of the top-k most similar entries is a true match. Unfortunately, the values of k vary across approaches and significantly impact the accuracy, e.g., a 98% precision on top 10 is significantly worse than a 98% precision on top 3. Thus, it becomes hard to compare accuracy numbers obtained with different k values, and it becomes tempting to raise k until a sufficiently high accuracy number is achieved. Furthermore, many approaches do not describe the similarity threshold used to determine that no similar entry exists in the repository. This means that they always find some similar entry in the repository, even if the similarity may be very low.
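Concretely, the top-k accuracy these approaches report can be computed as in the generic sketch below (approaches differ in ranking and ground truth details), which makes explicit why raising k can only inflate the number:

```python
def topk_accuracy(results, ground_truth, k):
    """Fraction of queries whose true match appears among the top-k returned
    entries; a larger k can only increase this fraction."""
    hits = sum(ground_truth[q] in ranked[:k] for q, ranked in results.items())
    return hits / len(results)

# Toy repository search results: query -> functions ranked by similarity.
results = {"f1": ["a", "b", "c"], "f2": ["d", "e", "f"], "f3": ["g", "h", "i"]}
truth = {"f1": "a", "f2": "e", "f3": "i"}
print(topk_accuracy(results, truth, 1))  # 0.33...
print(topk_accuracy(results, truth, 3))  # 1.0
```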

Performance. It is common (49/61) to measure the runtime performance of an approach. Runtime is typically measured end-to-end, but a few approaches report it for each approach component (e.g., BINSIM). Four approaches report their asymptotic complexity (BMM2006, SWPQS2006, ILINE, MOCKINGBIRD).

IX. DISCUSSION

This section discusses open challenges and possible future research directions.

Small pieces of binary code. Many binary code similarity approaches ignore small pieces of binary code, setting a threshold on the minimum number of instructions or basic blocks to be considered. Oftentimes, only pieces of binary code with a handful of instructions are ignored, e.g., functions with fewer than 5 instructions, but some approaches use large thresholds like 100 basic blocks in TRACY. Small pieces of binary code are challenging because they are common, may comprise a single basic block that prevents structural analysis, and may have identical syntax despite different semantics. For example, setter functions that update the value of a field in an object have nearly identical syntax and structure, simply setting a memory variable with the value of a parameter. But, they may have very different semantics, e.g., setting a security level or updating a performance counter. Furthermore, semantic similarity techniques like instruction classification, symbolic formulas, and input-output pairs may fail to capture their differences, e.g., if they do not distinguish different memory variables. Similarity of small pieces of binary code remains an open challenge. One potential avenue would be to further incorporate context. Some structural approaches already consider the callgraph to match functions (e.g., F2004), but this does not always suffice. We believe that it may be possible to further incorporate other context like locality (e.g., how close the binary code is in the program structure) or data references (e.g., whether they use equivalent variables).


Source-to-binary similarity. Some applications, like plagiarism detection, may require source code to be compared with binary code. Early approaches for source-to-binary similarity used software birthmarks that capture inherent functionality of the source code [92], [93]. Recently, source-to-binary similarity has been applied to search whether a known bug in open source code exists in some target binary code [44]. The availability of source code provides more semantics (e.g., variable and function names, types, comments), potentially improving accuracy compared to simply compiling the source code and performing binary code similarity. We believe other applications remain that require determining if a target piece of binary code has been compiled, or has evolved, from some given source code. For example, there may be programs for which source code is only available for an old version, and there is a need to understand how newer binary versions have evolved from the original source code.

Data similarity. This survey has focused on binary code similarity, but programs comprise both code and data. There may be situations where the data used by the code is as important as the code, e.g., when the only change between versions of a program is in the data used, such as changing the parameters of a machine learning classifier. Furthermore, data may be stored in complex data structures that may be key to the functionality. There exists a long history of type inference techniques on binary code [35], which we believe could be combined with binary code similarity to compare the data structures used by different pieces of binary code, or how two pieces of binary code use the same data structure.

Semantic relationships. A different aspect of semantic similarity is to identify binary code with related functionality, e.g., cryptographic or networking functions. This is challenging, as code that is related in its functionality may not have the same semantics. For example, a decryption function is clearly related to its encryption function, but performs opposite operations. So far, most work has focused on domain-specific techniques such as those for identifying cryptographic functions (e.g., [115], [116]). But, recently some approaches have started exploring domain-agnostic techniques [117]. We believe further work is needed to better define such semantic relationships and their identification.

Scalability. The largest binary code similarity evaluation so far is by GENIUS and GEMINI on 420M functions from 8,126 firmware images (roughly 52K functions per image). While that is a significant step from prior approaches, if we consider instead 100K unique firmware, a conservative number since it is expected that there will be 20 billion IoT devices connected to the Internet by 2020 [118], we need a binary code similarity approach that can handle over 5 billion functions. Thus, further improvements on scalability will be needed to realize the vision of binary code search engines.

Obfuscation. Many challenges still remain for binary code similarity on obfuscated code. For example, a recent study has shown that state-of-the-art unpacking techniques can miss 20%–60% of the original code, e.g., due to incomplete function detection [89]. And, no binary code similarity approach currently handles virtualization-based packers such as Themida [119] and VMProtect [120], which generate a random bytecode instruction set, transform an input piece of

binary code into that bytecode instruction set, and attach an interpreter (or virtual machine) that executes the generated bytecode on the native CPU. Furthermore, obfuscation is best addressed with semantic similarity, which struggles with obfuscation transformations that do not respect the original semantics of the code, but still achieve its main goals.

Approach comparison. The variety of datasets and methodologies used to evaluate the approaches (e.g., top-k evaluation for OM approaches), together with the absence of source code for many of them, makes it hard to perform a fair comparison to understand their benefits and limitations. We believe the field would greatly benefit from open datasets for benchmarking and independent validation under the same experimental conditions. Furthermore, there is a need to improve the comparison of approaches that handle obfuscation, beyond changes of compiler and compiler options. Building a dataset that covers real-world obfuscations is fundamental, given the huge space of possible obfuscation transformations and the fact that each approach supports a different subset.

X. CONCLUSION

During the last 20 years, researchers have proposed many approaches to perform binary code similarity and have applied them to address important problems such as patch analysis, bug search, and malware detection and analysis. The field has evolved from binary code diffing to binary code search; from syntactic similarity to incorporating structural and semantic similarity; to covering multiple code granularities; to strengthening the robustness of the comparison (i.e., cross-optimization, cross-compiler, cross-OS, cross-architecture, obfuscation); to scaling the number of inputs compared (e.g., through hashing, embedding, indexing); and to automatically learning similarity features.

Despite its popularity, the area had not yet been systematically analyzed. This paper has presented a first survey of binary code similarity. The core of the paper has systematized 61 binary code similarity approaches on four dimensions: the applications they enable, their approach characteristics, how the approaches are implemented, and the benchmarks and evaluation methodologies used to evaluate them. It has discussed the advantages and limitations of different approaches, their implementation, and their evaluation. It has summarized the results into easy-to-access tables useful for both beginners and experts. Furthermore, it has discussed the evolution of the area and outlined the challenges that remain open, showing that the area has a bright future.

Acknowledgments. This research was partially supported by the Regional Government of Madrid through the BLOQUES-CM (S2018/TCS-4339) grant and by the Spanish Government through the SCUM (RTI2018-102043-B-I00) grant. This research is also supported by the EIE Funds of the European Union. All opinions, findings and conclusions, or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the sponsors.

REFERENCES

[1] Y. David and E. Yahav, “Tracelet-based Code Search in Executables,” in Conference on Programming Language Design and Implementation. ACM, 2014.

[2] J. Pewny, F. Schuster, L. Bernhard, T. Holz, and C. Rossow, “Leveraging Semantic Signatures for Bug Search in Binary Programs,” in Annual Computer Security Applications Conference. ACM, 2014.

[3] J. Pewny, B. Garmany, R. Gawlik, C. Rossow, and T. Holz, “Cross-architecture bug search in binary executables,” in IEEE Symposium on Security and Privacy. IEEE Computer Society, 2015.

[4] S. Eschweiler, K. Yakdan, and E. Gerhards-Padilla, “discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code,” in Network and Distributed System Security Symposium, 2016.

[5] Y. David, N. Partush, and E. Yahav, “Statistical Similarity of Binaries,” in Conference on Programming Language Design and Implementation. ACM, 2016.

[6] Q. Feng, R. Zhou, C. Xu, Y. Cheng, B. Testa, and H. Yin, “Scalable Graph-based Bug Search for Firmware Images,” in ACM Conference on Computer and Communications Security. ACM, 2016.

[7] M. Chandramohan, Y. Xue, Z. Xu, Y. Liu, C. Y. Cho, and H. B. K. Tan, “BinGo: Cross-architecture cross-OS Binary Search,” in International Symposium on Foundations of Software Engineering. ACM, 2016.

[8] H. Huang, A. M. Youssef, and M. Debbabi, “BinSequence: Fast, Accurate and Scalable Binary Code Reuse Detection,” in Asia Conference on Computer and Communications Security. ACM, 2017.

[9] Q. Feng, M. Wang, M. Zhang, R. Zhou, A. Henderson, and H. Yin, “Extracting Conditional Formulas for Cross-Platform Bug Search,” in Asia Conference on Computer and Communications Security. ACM, 2017.

[10] Y. David, N. Partush, and E. Yahav, “Similarity of Binaries Through Re-optimization,” in Conference on Programming Language Design and Implementation. ACM, 2017.

[11] J. Gao, X. Yang, Y. Fu, Y. Jiang, and J. Sun, “VulSeeker: A Semantic Learning Based Vulnerability Seeker for Cross-platform Binary,” in Conference on Automated Software Engineering, ser. ASE 2018, 2018.

[12] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, “Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection,” in ACM Conference on Computer and Communications Security. ACM, 2017.

[13] Y. David, N. Partush, and E. Yahav, “FirmUp: Precise Static Detection of Common Vulnerabilities in Firmware,” in Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2018.

[14] P. Shirani, L. Collard, B. L. Agba, B. Lebel, M. Debbabi, L. Wang, and A. Hanna, “BINARM: Scalable and Efficient Detection of Vulnerabilities in Firmware Images of Intelligent Electronic Devices,” in Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, 2018.

[15] B. Liu, W. Huo, C. Zhang, W. Li, F. Li, A. Piao, and W. Zou, “αDiff: Cross-version Binary Code Similarity Detection with DNN,” in Conference on Automated Software Engineering. ACM, 2018.

[16] X. Hu, T.-c. Chiueh, and K. G. Shin, “Large-scale Malware Indexing Using Function-call Graphs,” in ACM Conference on Computer and Communications Security. ACM, 2009.

[17] X. Hu, K. G. Shin, S. Bhatkar, and K. Griffin, “MutantX-S: Scalable Malware Clustering Based on Static Features,” in USENIX Annual Technical Conference, 2013.

[18] T. Kim, Y. R. Lee, B. Kang, and E. G. Im, “Binary executable file similarity calculation using function matching,” The Journal of Supercomputing, Dec. 2016.

[19] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, “Polymorphic Worm Detection Using Structural Information of Executables,” in Symposium on Recent Advances in Intrusion Detection. Springer Berlin Heidelberg, 2005.

[20] D. Bruschi, L. Martignoni, and M. Monga, “Detecting Self-mutating Malware Using Control-flow Graph Matching,” in Detection of Intrusions and Malware and Vulnerability Assessment. Springer-Verlag, 2006.

[21] S. Cesare, Y. Xiang, and W. Zhou, “Control Flow-Based Malware Variant Detection,” IEEE Transactions on Dependable and Secure Computing, vol. 11, no. 4, pp. 307–317, 2014.

[22] M. Lindorfer, A. Di Federico, F. Maggi, P. M. Comparetti, and S. Zanero, “Lines of Malicious Code: Insights into the Malicious Software Industry,” in Annual Computer Security Applications Conference. ACM, 2012.

[23] J. Jang, M. Woo, and D. Brumley, “Towards Automatic Software Lineage Inference,” in 22nd USENIX Security Symposium, 2013.

[24] J. Ming, D. Xu, and D. Wu, “Memoized semantics-based binary diffing with application to malware lineage inference,” in IFIP International Information Security Conference. Springer, 2015.

[25] B. S. Baker, U. Manber, and R. Muth, “Compressing Differences of Executable Code,” in ACM SIGPLAN Workshop on Compiler Support for System Software, 1999.

[26] H. Flake, “Structural comparison of executable objects,” in Detection of Intrusions and Malware & Vulnerability Assessment, 2004.

[27] T. Dullien and R. Rolles, “Graph-based Comparison of Executable Objects,” in Symposium Sur La Securite Des Technologies De L’Information Et Des Communications, 2005.

[28] D. Gao, M. K. Reiter, and D. Song, BinHunt: Automatically Finding Semantic Differences in Binary Programs. Springer Berlin Heidelberg, 2008.

[29] Y. Hu, Y. Zhang, J. Li, and D. Gu, “Cross-architecture binary semantics understanding via similar code comparison,” in Conference on Software Analysis, Evolution, and Reengineering. IEEE, 2016.

[30] Z. Xu, B. Chen, M. Chandramohan, Y. Liu, and F. Song, “SPAIN: Security Patch Analysis for Binaries Towards Understanding the Pain and Pills,” in International Conference on Software Engineering. IEEE Press, 2017.

[31] U. Kargen and N. Shahmehri, “Towards Robust Instruction-level Trace Alignment of Binary Code,” in International Conference on Automated Software Engineering. IEEE Press, 2017.

[32] Z. Wang, K. Pierce, and S. McFarling, “BMAT – A Binary Matching Tool,” in Second ACM Workshop on Feedback-Directed and Dynamic Optimization, 1999.

[33] L. Luo, J. Ming, D. Wu, P. Liu, and S. Zhu, “Semantics-based Obfuscation-resilient Binary Code Similarity Comparison with Applications to Software Plagiarism Detection,” in International Symposium on Foundations of Software Engineering. ACM, 2014.

[34] K. A. Roundy and B. P. Miller, “Binary-Code Obfuscations in Prevalent Packer Tools,” ACM Computing Surveys, vol. 46, no. 1, p. 4, 2013.

[35] J. Caballero and Z. Lin, “Type Inference on Executables,” ACM Computing Surveys, vol. 48, no. 4, pp. 1–35, May 2016.

[36] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, “A Survey on Automated Dynamic Malware-Analysis Techniques and Tools,” ACM Computing Surveys, vol. 44, no. 2, p. 6, 2012.

[37] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for Similarity Search: A Survey,” Tech. Rep., 2014.

[38] S.-H. Cha, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” vol. 1, Jan. 2007.

[39] S.-S. Choi and S.-H. Cha, “A survey of Binary similarity and distance measures,” Journal of Systemics, Cybernetics and Informatics, pp. 43–48, 2010.

[40] C. Collberg, S. Martin, J. Myers, and J. Nagra, “Distributed Application Tamper Detection via Continuous Software Updates,” in 28th Annual Computer Security Applications Conference, 2012.

[41] X. Ugarte-Pedrero, D. Balzarotti, I. Santos, and P. G. Bringas, “SoK: Deep packer inspection: A longitudinal study of the complexity of run-time packers,” in IEEE Symposium on Security and Privacy, 2015.

[42] A. Lakhotia, M. D. Preda, and R. Giacobazzi, “Fast Location of Similar Code Fragments Using Semantic ’Juice’,” in Program Protection and Reverse Engineering Workshop. ACM, 2013.

[43] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: a multilinguistic token-based code clone detection system for large scale source code,” IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654–670, 2002.

[44] H. Zhang and Z. Qian, “Precise and Accurate Patch Presence Test for Binaries,” in 27th USENIX Security Symposium, Baltimore, MD, 2018.

[45] J. Crussell, C. Gibler, and H. Chen, “AnDarwin: Scalable Detection of Semantically Similar Android Applications,” in European Symposium on Research in Computer Security. Springer Berlin Heidelberg, 2013.

[46] K. Chen, P. Liu, and Y. Zhang, “Achieving accuracy and scalability simultaneously in detecting application clones on android markets,” in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014.

[47] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, “A sense of self for unix processes,” in IEEE Symposium on Security and Privacy. IEEE, 1996.

[48] E. Kirda, C. Kruegel, G. Banks, G. Vigna, and R. Kemmerer, “Behavior-based spyware detection,” in USENIX Security Symposium, 2006.

[49] M. Christodorescu, S. Jha, and C. Kruegel, “Mining Specifications of Malicious Behavior,” in 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering. ACM, 2007.

[50] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda, “Scalable, Behavior-Based Malware Clustering,” in Network and Distributed System Security Symposium, 2009.

[51] G. Wicherski, “peHash: A Novel Approach to Fast Malware Clustering,” in Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More. USENIX Association, 2009.

[52] J. Kornblum, “Identifying Almost Identical Files Using Context Triggered Piecewise Hashing,” Digital Investigation, vol. 3, pp. 91–97, Sep. 2006.

[53] J. Oliver, C. Cheng, and Y. Chen, “TLSH – A Locality Sensitive Hash,” in Cybercrime and Trustworthy Computing Workshop. IEEE Computer Society, 2013.

[54] J. Newsome, B. Karp, and D. Song, “Polygraph: Automatically generating signatures for polymorphic worms,” in IEEE Symposium on Security and Privacy, 2005.

[55] Z. Li, M. Sanghi, Y. Chen, M.-Y. Kao, and B. Chavez, “Hamsa: Fast signature generation for zero-day polymorphic worms with provable attack resilience,” in IEEE Symposium on Security and Privacy, 2006.

[56] A. Sæbjørnsen, J. Willcock, T. Panas, D. Quinlan, and Z. Su, “Detecting Code Clones in Binary Executables,” in Symposium on Software Testing and Analysis. ACM, 2009.

[57] I. Santos, F. Brezo, J. Nieves, Y. K. Penya, B. Sanz, C. Laorden, and P. G. Bringas, “Idea: Opcode-Sequence-Based Malware Detection,” in Symposium on Engineering Secure Software and Systems. Springer Berlin Heidelberg, 2010.

[58] B. Kang, T. Kim, H. Kwon, Y. Choi, and E. G. Im, “Malware Classification Method via Binary Content Comparison,” in Research in Applied Computation Symposium. ACM, 2012.

[59] J. Ming, M. Pan, and D. Gao, “iBinHunt: Binary Hunting with Inter-procedural Control Flow,” in International Conference on Information Security and Cryptology. Springer-Verlag, 2013.

[60] W. Jin, S. Chaki, C. Cohen, A. Gurfinkel, J. Havrilla, C. Hines, and P. Narasimhan, “Binary function clustering using semantic hashes,” in International Conference on Machine Learning and Applications. IEEE, 2012.

[61] M. Bourquin, A. King, and E. Robbins, “BinSlayer: Accurate Comparison of Binary Executables,” in Program Protection and Reverse Engineering Workshop. ACM, 2013.

[62] W. M. Khoo, A. Mycroft, and R. Anderson, “Rendezvous: a Search Engine for Binary Code,” in Working Conference on Mining Software Repositories, 2013.

[63] B. H. Ng and A. Prakash, “Expose: Discovering potential binary code re-use,” in Annual Computer Software and Applications Conference. IEEE, 2013.

[64] Y. R. Lee, B. Kang, and E. G. Im, “Function Matching-based Binary-level Software Similarity Calculation,” in Research in Adaptive and Convergent Systems. ACM, 2013.

[65] M. R. Farhadi, B. C. M. Fung, P. Charland, and M. Debbabi, “BinClone: Detecting Code Clones in Malware,” in International Conference on Software Security and Reliability. IEEE Computer Society, 2014.

[66] B. Ruttenberg, C. Miles, L. Kellogg, V. Notani, M. Howard, C. LeDoux, A. Lakhotia, and A. Pfeffer, “Identifying Shared Software Components to Support Malware Forensics,” in Detection of Intrusions and Malware, and Vulnerability Assessment. Springer International Publishing, 2014.

[67] M. Egele, M. Woo, P. Chapman, and D. Brumley, “Blanket Execution: Dynamic Similarity Testing for Program Binaries and Components,” in 23rd USENIX Security Symposium, 2014.

[68] L. Luo, J. Ming, D. Wu, P. Liu, and S. Zhu, “Semantics-based obfuscation-resilient binary code similarity comparison with applications to software and algorithm plagiarism detection,” IEEE Transactions on Software Engineering, no. 12, pp. 1157–1177, 2017.

[69] S. Alrabaee, P. Shirani, L. Wang, and M. Debbabi, “Sigma: A semantic integrated graph matching approach for identifying reused functions in binary code,” 2015.

[70] J. Qiu, X. Su, and P. Ma, “Library functions identification in binary code by using graph isomorphism testings,” in Conference on Software Analysis, Evolution and Reengineering, 2015.

[71] Y. Xiao, S. Cao, Z. Cao, F. Wang, F. Lin, J. Wu, and H. Bi, “Matching Similar Functions in Different Versions of a Malware,” in Trustcom/BigDataSE/ISPA, 2016 IEEE. IEEE, 2016.

[72] N. Lageman, E. D. Kilmer, R. J. Walls, and P. D. McDaniel, “BinDNN: Resilient Function Matching Using Deep Learning,” in International Conference on Security and Privacy in Communication Systems. Springer, 2016.

[73] S. H. Ding, B. C. Fung, and P. Charland, “Kam1N0: MapReduce-based Assembly Clone Search for Reverse Engineering,” in International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

[74] Y. Hu, Y. Zhang, J. Li, and D. Gu, “Binary Code Clone Detection Across Architectures and Compiling Configurations,” in International Conference on Program Comprehension. IEEE Press, 2017.

[75] L. Nouh, A. Rahimian, D. Mouheb, M. Debbabi, and A. Hanna, BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables. Springer International Publishing, 2017.

[76] P. Shirani, L. Wang, and M. Debbabi, “BinShape: Scalable and Robust Binary Library Function Identification Using Function Shape,” in Detection of Intrusions and Malware, and Vulnerability Assessment. Springer International Publishing, 2017.

[77] J. Ming, D. Xu, Y. Jiang, and D. Wu, “BinSim: Trace-based Semantic Binary Diffing via System Call Sliced Segment Equivalence Checking,” in 26th USENIX Security Symposium, 2017.

[78] S. Wang and D. Wu, “In-memory Fuzzing for Binary Code Similarity Analysis,” in International Conference on Automated Software Engineering. IEEE Press, 2017.

[79] S. Alrabaee, P. Shirani, L. Wang, and M. Debbabi, “FOSSIL: A Resilient and Efficient System for Identifying FOSS Functions in Malware Binaries,” IEEE Transactions on Privacy and Security, vol. 21, no. 2, pp. 8:1–8:34, Jan. 2018.

[80] K. Redmond, L. Luo, and Q. Zeng, “A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis,” in Workshop on Binary Analysis Research, San Diego, CA, USA, 2019.

[81] F. Zuo, X. Li, Z. Zhang, P. Young, L. Luo, and Q. Zeng, “Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs,” in Network and Distributed System Security Symposium, 2019.

[82] S. H. H. Ding, B. C. M. Fung, and P. Charland, “Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization,” in IEEE Symposium on Security and Privacy. San Francisco, CA: IEEE Computer Society, 2019.

[83] L. Massarelli, G. A. Di Luna, F. Petroni, R. Baldoni, and L. Querzoni, “SAFE: Self-Attentive Function Embeddings for Binary Similarity,” in Detection of Intrusions and Malware, and Vulnerability Assessment, 2019.

[84] “BinDiff,” 2018, https://www.zynamics.com/bindiff.html.

[85] “Diaphora,” 2018, http://diaphora.re/.

[86] “DarunGrim,” 2018, http://www.darungrim.org/.

[87] R. Perdisci, W. Lee, and N. Feamster, “Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces,” in USENIX Symposium on Networked Systems Design and Implementation, 2010.

[88] M. Z. Rafique and J. Caballero, “FIRMA: Malware Clustering and Network Signature Generation with Mixed Network Behaviors,” in Symposium on Research in Attacks, Intrusions and Defenses, St. Lucia, October 2013.

[89] I. U. Haq, S. Chica, J. Caballero, and S. Jha, “Malware Lineage in the Wild,” Computers & Security, vol. 78, pp. 347–363, 2018.

[90] D. Brumley, P. Poosankam, D. Song, and J. Zheng, “Automatic patch-based exploit generation is possible: Techniques and implications,” in IEEE Symposium on Security and Privacy, 2008.

[91] Z. Wang, K. Pierce, and S. McFarling, “BMAT – A Binary Matching Tool for Stale Profile Propagation,” The Journal of Instruction-Level Parallelism, vol. 2, 2000.

[92] G. Myles and C. Collberg, “K-gram Based Software Birthmarks,” in Symposium on Applied Computing. ACM, 2005.

[93] S. Choi, H. Park, H.-I. Lim, and T. Han, “A Static Birthmark of Binary Executables Based on API Call Structure,” in Asian Computing Science Conference on Advances in Computer Science: Computer and Network Security. Springer-Verlag, 2007.

[94] M. J. Rochkind, “The Source Code Control System,” IEEE Transactions on Software Engineering, no. 4, pp. 364–370, 1975.

[95] W. F. Tichy, “RCS – A System for Version Control,” Software: Practice and Experience, vol. 15, no. 7, pp. 637–654, 1985.

[96] C. Reichenberger, “Delta Storage for Arbitrary Non-Text Files,” in Proceedings of the 3rd ACM International Workshop on Software Configuration Management, 1991.

[97] “RTPatch,” 1991, http://pocketsoft.com/rtpatch binary diff.html.

[98] K. Coppieters, “A Cross-platform Binary Diff,” 1995.

[99] “Xdelta3,” 1998, http://xdelta.org/.

[100] J. Jang, A. Agrawal, and D. Brumley, “ReDeBug: Finding Unpatched Code Clones in Entire OS Distributions,” in IEEE Symposium on Security and Privacy, 2012.

[101] V. A. Zakharov, “The Equivalence Problem for Computational Models: Decidable and Undecidable Cases,” in International Conference on Machines, Computations, and Universality, 2001.

[102] L. Jiang and Z. Su, “Automatic Mining of Functionally Equivalent Code Fragments via Random Testing,” in Symposium on Software Testing and Analysis. ACM, 2009.

[103] V. Ganesh and D. L. Dill, “A Decision Procedure for Bit-vectors and Arrays,” in Conference on Computer Aided Verification. Springer-Verlag, 2007.

[104] L. De Moura and N. Bjørner, “Z3: An Efficient SMT Solver,” in Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer-Verlag, 2008.

[105] C. B. Zilles, “Master/slave Speculative Parallelization And Approximate Code,” University of Wisconsin, USA, Tech. Rep., 2002.

[106] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations,” Journal of Computer and System Sciences, vol. 60, no. 3, pp. 630–659, 2000.

[107] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in International Conference on Very Large Data Bases (VLDB), 1999.

[108] “VirusTotal,” https://virustotal.com/.

[109] V. Roussev, “Data fingerprinting with similarity digests,” in IFIP International Conference on Digital Forensics. Springer, 2010.

[110] F. Breitinger and H. Baier, “Similarity preserving hashing: Eligible properties and a new algorithm mrsh-v2,” in Conference on Digital Forensics and Cyber Crime. Springer, 2012.

[111] F. Pagani, M. Dell’Amico, and D. Balzarotti, “Beyond Precision and Recall: Understanding Uses (and Misuses) of Similarity Hashes in Binary Analysis,” in Conference on Data and Application Security and Privacy. ACM, March 2018.

[112] C. Linn and S. Debray, “Obfuscation of Executable Code to Improve Resistance to Static Disassembly,” in ACM Conference on Computer and Communications Security. ACM, 2003.

[113] “ReadyNAS 2014 Firmware Image v6.1.6,” 2019, http://www.downloads.netgear.com/files/GDC/READYNAS-100/ReadyNASOS-6.1.6-arm.zip.

[114] “DD-WRT 2013 Firmware Image r21676,” 2019, ftp://ftp.dd-wrt.com/betas/2013/05-27-2013-r21676/senao-eoc5610/linux.bin.

[115] Z. Wang, X. Jiang, W. Cui, X. Wang, and M. Grace, “ReFormat: Automatic Reverse Engineering of Encrypted Messages,” in European Symposium on Research in Computer Security, 2009.

[116] D. Xu, J. Ming, and D. Wu, “Cryptographic Function Detection in Obfuscated Binaries via Bit-Precise Symbolic Loop Mapping,” in IEEE Symposium on Security and Privacy, 2017.

[117] V. Karande, S. Chandra, Z. Lin, J. Caballero, L. Khan, and K. Hamlen, “BCD: Decomposing Binary Code Into Components Using Graph-Based Clustering,” in ACM ASIA Conference on Information, Computer and Communications Security, 2018.

[118] “Gartner says 8.4 billion connected things will be in use in 2017, up 31 percent from 2016,” 2017, https://www.gartner.com/newsroom/id/3598917.

[119] “Themida,” 2018, https://www.oreans.com/themida.php/.

[120] “VMProtect,” 2018, http://vmpsoft.com/products/vmprotect/.

[121] “Angr: The Binary Analysis Framework,” 2016, http://angr.io/.

[122] D. Brumley, I. Jager, T. Avgerinos, and E. J. Schwartz, “BAP: A binary analysis platform,” in Conference on Computer Aided Verification, 2011.

[123] “BeaEngine,” 2018, http://www.beaengine.org/.

[124] “BitBlaze: Binary Analysis For Computer Security,” 2008, http://bitblaze.cs.berkeley.edu/.

[125] “Boomerang decompiler,” 2004, http://boomerang.sourceforge.net/.

[126] “Automated Malware Analysis,” 2012, https://cuckoosandbox.org/.

[127] “Dyninst: Putting the Performance in High Performance Computing,” 2009, http://www.dyninst.org/.

[128] “IDA,” 2018, https://www.hex-rays.com/products/ida/.

[129] “The LLVM Compiler Infrastructure,” 2004, http://llvm.org/.

[130] “IDA Processor Support,” 2019, http://www.fysnet.net/newbasic.htm.

[131] “McSema,” 2012, https://github.com/trailofbits/mcsema/.

[132] “Pin – A Dynamic Binary Instrumentation Tool,” 2005, https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.

[133] “QEMU: an open source processor emulator,” 2006, http://www.qemu.org/.

[134] “ROSE compiler infrastructure,” 2000, http://www.rosecompiler.org/.

[135] “Valgrind,” 2007, http://valgrind.org/.

[136] A. Srivastava, A. Edwards, and H. Vo, “Vulcan: Binary Transformation In A Distributed Environment,” Tech. Rep., 2001.

APPENDIX

TABLE V. COMPARISON OF THE BINARY ANALYSIS PLATFORMS USED IN THE IMPLEMENTATION OF BINARY CODE SIMILARITY APPROACHES. THIS TABLE OVERLAPS WITH PREVIOUS WORK ON TYPE INFERENCE [35]. WE HAVE ADDED THE FOLLOWING PLATFORMS: ANGR, BEAENGINE, CUCKOO, NEWBASIC, MCSEMA, VULCAN. (✓ = YES, ✗ = NO)

Platform        | IR        | Static | Dynamic | x86 | x86-64 | ARM | MIPS | SPARC | PowerPC | Windows/PE | Linux/ELF | Open source | Free binary | Commercial
Angr [121]      | VEX       |   ✓    |    ✗    |  ✓  |   ✗    |  ✓  |  ✓   |   ✗   |    ✗    |     ✓      |     ✓     |      ✓      |      ✗      |     ✗
BAP [122]       | BIL       |   ✓    |    ✓    |  ✓  |   ✓    |  ✓  |  ✗   |   ✗   |    ✗    |     ✓      |     ✓     |      ✓      |      ✗      |     ✗
BeaEngine [123] | ✗         |   ✓    |    ✗    |  ✓  |   ✗    |  ✗  |  ✗   |   ✗   |    ✗    |     ✓      |     ✗     |      ✓      |      ✗      |     ✗
BitBlaze [124]  | VINE      |   ✓    |    ✓    |  ✓  |   ✗    |  ✗  |  ✗   |   ✗   |    ✗    |     ✓      |     ✓     |      ✓      |      ✗      |     ✗
Boomerang [125] | RTL       |   ✓    |    ✗    |  ✓  |   ✗    |  ✗  |  ✓   |   ✓   |    ✓    |     ✓      |     ✓     |      ✓      |      ✗      |     ✗
Cuckoo [126]    | ✗         |   ✗    |    ✓    |  ✓  |   ✓    |  ✓  |  ✗   |   ✗   |    ✗    |     ✓      |     ✓     |      ✓      |      ✗      |     ✗
Dyninst [127]   | ✗         |   ✓    |    ✓    |  ✓  |   ✓    |  ✗  |  ✗   |   ✗   |    ✗    |     ✗      |     ✓     |      ✓      |      ✗      |     ✗
IDA [128]       | IDA       |   ✓    |    ✗    |  ✓  |   ✓    |  ✓  |  ✓   |   ✓   |    ✓    |     ✓      |     ✓     |      ✗      |      ✗      |     ✓
LLVM [129]      | LLVM-IR   |   ✓    |    ✗    |  ✗  |   ✗    |  ✗  |  ✗   |   ✗   |    ✗    |     ✗      |     ✗     |      ✓      |      ✗      |     ✗
NewBasic [130]  | ✗         |   ✓    |    ✗    |  ✓  |   ✗    |  ✗  |  ✗   |   ✗   |    ✗    |     ✗      |     ✗     |      ✓      |      ✗      |     ✗
McSema [131]    | LLVM-IR   |   ✓    |    ✗    |  ✓  |   ✓    |  ✓  |  ✗   |   ✗   |    ✗    |     ✓      |     ✓     |      ✓      |      ✗      |     ✗
PIN [132]       | ✗         |   ✗    |    ✓    |  ✓  |   ✓    |  ✗  |  ✗   |   ✗   |    ✗    |     ✓      |     ✓     |      ✗      |      ✓      |     ✗
QEMU [133]      | TCG       |   ✗    |    ✓    |  ✓  |   ✓    |  ✓  |  ✓   |   ✓   |    ✓    |     ✓      |     ✓     |      ✓      |      ✗      |     ✗
ROSE [134]      | SAGE-III  |   ✓    |    ✗    |  ✓  |   ✗    |  ✓  |  ✓   |   ✓   |    ✓    |     ✓      |     ✓     |      ✓      |      ✗      |     ✗
Valgrind [135]  | VEX       |   ✗    |    ✓    |  ✓  |   ✓    |  ✓  |  ✓   |   ✓   |    ✓    |     ✓      |     ✓     |      ✓      |      ✗      |     ✗
Vulcan [136]    | VULCAN-IR |   ✓    |    ✗    |  ✓  |   ✓    |  ✗  |  ✗   |   ✗   |    ✗    |     ✓      |     ✗     |      ✗      |      ✗      |     ✗

Implementation platforms. Table V shows the intermediate representation (IR) used by each platform, whether it supports static and dynamic analysis, the target architectures and operating systems it supports, and how it is released (open source, free binary, or commercial). Among the 17 platforms, 12 support static analysis and 7 dynamic analysis. The functionality provided by static analysis platforms varies widely. IDA is a disassembler, Boomerang is a decompiler, and the rest offer diverse static analysis functionality such as disassembly, building control flow graphs and call graphs, IR simplifications, and data flow propagation. All dynamic analysis platforms can run unmodified binaries (i.e., there is no need to recompile or relink). QEMU is a whole-system emulator that can run a full guest OS (e.g., Windows) on a potentially different host OS (e.g., Linux). Dyninst, PIN, and Valgrind execute an unmodified target program with customizable instrumentation (e.g., through binary rewriting) on the CPU of the host system. BAP and BITBLAZE build their dynamic analysis components on top of PIN and QEMU.