experimental,study,of,fuzzy,hashing,in, malware,clustering ... · •...

21
Experimental Study of Fuzzy Hashing in Malware Clustering Analysis Yuping Li 1 , Sathya Chandran Sundaramurthy 1 Alexandru G. Bardas 1 , Xinming Ou 1 Doina Caragea 1 , Xin Hu 2 , Jiyong Jang 2 1) Kansas State University 2) IBM Research 8/10/15 CSET15 1

Upload: others

Post on 23-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Experimental  Study  of  Fuzzy  Hashing  in  Malware  Clustering  Analysis  

Yuping  Li1,    Sathya  Chandran  Sundaramurthy1  

Alexandru  G.  Bardas1,  Xinming  Ou1  

Doina  Caragea1,  Xin  Hu2,  Jiyong  Jang2  1)  Kansas  State  University  2)  IBM  Research    

 

 8/10/15   CSET15   1  

Page 2: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

MoNvaNon  

•  Huge  volume  of  malware  samples:  

8/10/15   CSET15  

Source:  Malware  submission  staNsNcs  from  VirusTotal  (Jul  19,  2015  –  Jul  24,  2015)  

2  

Page 3: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Why  to  use  fuzzy  hashing?  

                                   •  Use  fuzzy  hashing  to  idenNfy  the  new  specie  -­‐  “cat”  for  prioriNzed  analysis  

8/10/15   CSET15   3  

Page 4: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Cryptographic  Hash  Overview  

         Input                          Hash  value  (digest)  

8/10/15   CSET15  

“The  quick  brown  fox  jumps  over  a  lazy  dog”  

“The  quick  brown  fox  jumps  over  A  lazy  dog”  

f8d517f0d3b6e442  56aaeb2a265811e3  

30ded807d65ee037  0fc6d73d6ab55a95  

Hash  genera?on  func?on  (e.g.  md5)  

Comparison  

Different  (or  0  similarity)  

Hash  genera?on  func?on  (e.g.  md5)  

4  

Page 5: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Fuzzy  Hash  Overview  

         Input                          Hash  value  (fingerprint)  

8/10/15   CSET15  

“The  quick  brown  fox  jumps  over  a  lazy  dog”  

“The  quick  brown  fox  jumps  over  A  lazy  dog”  

dfcaf6129c45f88a4  

dfcaf6129c45f8827  

Fuzzy  hash  fingerprint  genera?on  func?on  

Comparison  

Similarity:  95%  

Fuzzy  hash  fingerprint  genera?on  func?on  

Fuzzy  hash  fingerprint  comparison  func?on  

5  

Page 6: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Previous  Fuzzy  Hashing  ApplicaNons  

•  Using  fuzzy  hashing  techniques  to  idenNfy  malicious  code  (ShadowServer,  2007)  

•  VirusTotal  incorporated  SSDeep  hashes  into  their  data  set  (VirusTotal,  2012)  

•  Fuzzy  hashing  techniques  in  applied  malware  analysis  (CMU  CERT,  2011)  

8/10/15   CSET15   6  

Page 7: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Our  ContribuNons  

•  We  idenNfy  two  design  flaws  within  some  of  exisNng  fuzzy  hashing  algorithms.  

•  We  design  and  implement  a  generic  experimentaNon  framework  for  evaluaNng  the  performances  of  different  fuzzy  hash  funcNons.  

•  We  show  that  current  fuzzy  hashing  algorithms  can  be  further  improved  and  proposed  a  new  fuzzy  hash  funcNon.  

8/10/15   CSET15   7  

Page 8: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Design  Flaw  1:    Asymmetric  Distance  ComputaNon  

       Input          Hash  value  (fingerprint)  

8/10/15   CSET15  

Input  A  

Input  B   Fingerprint  B  

Fingerprint  A   Fuzzy  hash  fingerprint  comparison  funcNon  

Fuzzy  hash  fingerprint  generaNon  funcNon  

Comparison  

),(),( ABSimBASim ≠

8  

Page 9: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Design  Flaw  2:    IncompaNble  InterpretaNons  

•  Containment  analysis,  expected  opNmal  score:  1.0  

     •  ExplanaNon:  A  is  100%  contained  in  B  

8/10/15   CSET15  

•  Similarity  analysis,  expected  opNmal  score:  0.5  

•  ExplanaNon:  A  and  B  have  an  overall  similarity  of  50%  

a   b   c   d   e   f  Input  B:  

Input  A:   a   b   c  

a   b   c   d   e   f  Input  B:  

Input  A:   a   b   c  

9  

Page 10: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Generic  ExperimentaNon  Framework  For    Fuzzy  Hash  EvaluaNon

8/10/15   CSET15   10  

                   

 

Get  unpacked  samples  

MalwareSamples

For  each  malware,  apply

genHash()

For  each  malware  pair,  

applycompareHash()  

       

 

     

 

   

 

Malware  Families

For  each  threshold,  apply

hacluster()  

Time  complexity:  n Time  complexity:  n^2

Malware  Preprocess

Fingerprint  Generation

Distance  Computation

Clustering  Analysis

Page 11: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Malware  Data  PreparaNon  

•  Preparing  malware  dataset  that  are  reliable:          1)  Prepare  unpacked  samples    -­‐  ClassificaNon  of  packed  and  unpacked  samples  

         2)  Prepare  samples  with  accurate  family  name            -­‐  Majority  vote  of  VirusTotal  labels  

8/10/15   CSET15   11  

Page 12: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Dataset  StaNsNcs  (from  VirusShare)  Family   Size   Family   Size  

Viking   31   Vilsel   185  

Fesber   57   Jeefo   36  

Neshta   39   Turkojan   22  

Skintrim   41   Belersurf   300  

Ramnit   38   Koutodoor   30  

Zenosearch   99   Zbot   22  

Hupigon   28   Fosniw   22  

Domaiq   147   Wabot   27  

Xpaj   22   Total   1146  

8/10/15   CSET15   12  

Page 13: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Clustering  Accuracy  Measurement  

•  Precision  is  to  capture  how  well  the  clustering  algorithm  separates  samples  of  different  families  to  different  clusters  

•  Recall  is  to  capture  how  well  the  clustering  algorithm  assigns  samples  of  same  family  to  the  same  cluster  

•  Intersec?on  point  between  precision  and  recall  can  be  seen  as  good  balance  between  the  two  measurements  

8/10/15   CSET15   13  

Page 14: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Precision  &  Recall  of  SSDeep  

•  ModificaNon  of  SSDeep:  – Let  similarity  equals  0  if  two  SSDeep  fingerprints  can  not  be  meaningfully  compared  

8/10/15   CSET15   14  

(0.964,  0.797)  

Page 15: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Precision  &  Recall  of  sdHash  

•  ModificaNon  of  sdHash:  – Fix  the  asymmetric  distance  computaNon  problem  

8/10/15   CSET15   15  

(0.469,  0.807)  

Page 16: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Precision  &  Recall  of  mvHash-­‐B  

•  ModificaNon  of  mvHash-­‐B:  – Fix  the  asymmetric  distance  computaNon  problem  

8/10/15   CSET15   16  

(0.387,  0.792)  

Page 17: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Key  Elements  of  Fuzzy  Hashing  Algorithms  

•  Features  (characterisNcs  of  input):  – hash  values  of  substrings,  entropy  values  of  substrings,  ngrams  

•  Fingerprint  (representaNon  of  features):  – dynamic  length  strings,  mulNple  bit-­‐vectors,  single  bit-­‐vector    

•  Distance  funcNon  (comparison  of  fingerprints):  – edit  distance,  customized  distance,  jaccard  distance,  etc  

8/10/15   CSET15   17  

Page 18: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

New  Fuzzy  Hash  FuncNon:  nextGen-­‐hash  

•  Input:        2f  2a  0a  20  2a  20  43  6f  70  79  72  69  67  68  74  20  

•  4-­‐gram  Features:  2f2a0a20,  2a0a202a,  0a202a20,  202a2043,  …    

•  Fingerprint:        

•  Comparison:      

8/10/15   CSET15  

101010010100000101001010010101001010010010100010100  

similarity = bitcount( fa ∧ fb )bitcount( fa ∨ fb )

18  

Page 19: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Precision  &  Recall  of  nextGen-­‐hash  

•  Highlights  of  nextGen-­‐hash:  – Use  ngrams  as  features  – Use  bit-­‐vector  as  final  fingerprint  

8/10/15   CSET15   19  

(0.271,  0.791)  

Page 20: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Further  Improve  Fuzzy  Hashing  Algorithms  

Algorithm  Name  

Whole  Sample  as  Input  (f-­‐score)  

Code  Sec?on  as  Input  (f-­‐score)  

ssdeep   0.797   0.872  

sdHash   0.807   0.877  

mvHash-­‐B   0.792   0.893  

nextGen-­‐hash   0.791   0.919  

8/10/15   CSET15   20  

Page 21: Experimental,Study,of,Fuzzy,Hashing,in, Malware,Clustering ... · • We#idenNfy#two#design#flaws#within#some#of#exisNng#fuzzy# hashing#algorithms.# • We#design#and#implementageneric#experimentaon#

Summary  •  We  idenNfy  several  design  flaws  within  exisNng  fuzzy  hashing  algorithms  and  provide  algorithm  and  suggesNon  to  fix  the  problems  

•  We  design  an  evalua?on  framework  and  demonstrate  that  it  can  be  used  to  compare  the  effecNveness  of  different  fuzzy  hash  funcNons  

•  We  propose  a  new  fuzzy  hashing  algorithm  and  show  that  performances  of  exisNng  fuzzy  hash  funcNons  can  be  further  improved  

8/10/15   CSET15   21