基於機器學習的惡 意軟體分類實作 microsoft malware … › 2016 › pacific ›...
TRANSCRIPT
MachineLearning
Protect againsttomorrow’s
threats
基於機器學習的惡意軟體分類實作:MicrosoftMalwareClassificationChallenge經驗談Trend Microch0upimiaoskiKyle Chung2 Dec 2016
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threatsch0upi
• StaffengineerinTrendMicro• MachineLearning+DataAnalysis• Threatintelligenceservices• KDDCup 2014+KDDCup 2016:Top10• GoTrend:6th inUECCup2015
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threatsmiaoski
• SeniorthreatresearcherinTrendMicro• Threatintelligence• SmartCity• SDR• Arduino+RPi makers• Lovescats
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
4
Outline
• WhyMalwareClassification?• MachineLearning• MicrosoftChallenge• HowtoSolveit?• Conclusion
MachineLearning
Protect againsttomorrow’s
threats
MALWARECLASSIFICATION
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threatsWhat’sMalwareClassification?
• Identifymalwarefamily
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threatsWhyMalwareClassification?
• Knowhowtoclean• Possibleattribution• Setproperpriority
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threatsCurrentMalwareClassification
• Manuallygeneratedbyresearchers• Usesignaturetofingerprintmalware• YARArules
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threatsChallenge
• Manualprocessè wrongfamily• moreandmoremalwarefamilies
• Verylargevolume• daily1M+samples
• Increasingsignatures• Slowinscanning+needmorestorage
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threatsBSidesLV 2016:VirusShare
• JohnSeymour,LabelingtheVirusShare Corpus:LessonsLearned,BSidesLV 2016
• VirusShare Corpus:~20Mfiles
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threatsMachineLearningfor
• Automationofmalwarefamilyidentification• Saveresearcher’seffort
MachineLearning
Protect againsttomorrow’s
threats
MACHINE LEARNING
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
13
MachineLearningSteps
• PrepareData• GenerateFeature• TrainModel• MakePrediction• Evaluate
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
14
FruitClassification
• Apple
• Banana
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
15
FruitFeatures
• Color• Shape• Size• Weight
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
16
Learning
• Apple
• Banana
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
17
LearnedModelofFruit
• Apple• Color:Red• Shape:Round
• Banana• Color:Yellow• Shape:Long
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
18
NewFruitComing…
• Apple?Banana?
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
19
PredictionofFruits
• Fruit1• Color:Red=>Apple• Shape:Round=>Apple
• Fruit2• Color:Yellow=>Banana• Shape:Long=>Banana
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
20
EvaluationofFruit
• Accuracy:(9+9)/20=90%Apple Banana
Apple 9 1Banana 1 9Total 10 10
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
21
MachineLearningSteps
• PrepareData• GenerateFeature• TrainModel• MakePrediction• Evaluate
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
22
MachineLearningis...
• Mathematicalmethodsandalgorithms• Fromhistoricallabelleddata• Findaseparatinghyperplane• Applyitonfuturedata
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
23
Featureis...
• Measurablepropertyofaphenomenonbeingobserved• Usetodescribeentries
• Featurevector• inputofmachinelearningalgorithm
• Sourceoffeatures• dataexploring• domainknowledge
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
24
Modelis...
• Amathematicaldescriptionofhowtoclassifythedata
• Parameterstunedbycertainalgorithm• training
• Usedtomakeprediction
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
25
Prediction
• Identifytheclassofnewentities• Withtrainedmodelfromtrainingdata
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
26
Evaluation
• Reviewmodelresultbysomemeasurements• Crossvalidation• Evaluationfunctions
• Accuracy• logloss• AUC• precision,recall,F1
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
27
GlueLanguage
• GluethestepsofMachineLearningLearning• Batchrunningforlargeamountofdata• IntegrationwithHadoop,Spark• Richlibraries/Algorithmsupport• Easytodevelop/learn
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
28
scikit learn
• Opensourcemachinelearninglibraryforpython• Variousclassification,regressionandclustering
algorithms• InteroperatewithNumPy,SciPy,andunderlying
BLAS
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
29
scikit learncheat-sheet
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
30
SupportedAlgorithminscikit learn
• ClassificationAlgorithm• LogisticRegression:linear_model.LogisticRegression()• SVM:svm.SVC()• RandomForest:ensemble.RandomForestClassifier()
• Interface• fit(X,Y):trainmodel• Yp=predict(X):makeprediction
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
31
EvaluationFunctionsinscikit learn
• Evaluationfunctions• metrics.accuracy_score()• metrics.log_loss()• metrics.auc()• metrics.f1_score()
• metrics.confusion_matrix()• metrics.classification_report()
MachineLearning
Protect againsttomorrow’s
threats
MICROSOFTMALWARECLASSIFICATIONCHALLENGE
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
33
MicrosoftMalwareClassificationChallenge
• HostedbyWWW2015/BIG2015• MicrosoftMalwareProtectionCenter
MicrosoftAzureMachineLearningMicrosoftTalentManagement
• PEHexdump &Disassembled• Training:10,868(compressed:17.5GB)• Testing:10,873(compressed:17.7GB)
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
34
9 Classes
Category Count1 Ramnit 15412 Lollipop 24783 Kelihos_ver3 29424 Vundo 4755 Simda 426 Tracur 7517 Kelihos_ver1 3988 Obfuscator.ACY 12289 Gatak 1013
Total 10868
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
35
Class:Ramnit
• Stealsensitivepersonalinformation• Infectedthroughremovabledrivers• Copyitselfusingahard-codedname,orwitha
randomfilenametoarandomfolder• Injectcodesintosvchost.exe• InfectsDLL,EXE,HTML
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
36
Class:Lollipop
• Anadwareshowsadswhenbrowsingweb• Bundlewiththird-partysoftware• AutorunwhenWindowsstarting
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
37
Class:Kelihos
• ATrojanfamilydistributesspamemailwithmalwaredownloadlink
• CommunicatewithC&Cserver• SomevariantsinstallWinPcap tospynetwork
activity
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
38
PEHexdump w/oHeader
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
39
IDAProDump
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
40
IDAProDump
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
41
IDAProDump
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
42
EvaluationFunction
pij is the submitted probability of sample i is class jyij=1 if sample i is class j,yij=0 for others
𝑙𝑜𝑔𝑙𝑜𝑠𝑠 = −1𝑁))𝑦𝑖𝑗log(𝑝𝑖𝑗)
4
567
8
967
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
43
EvaluationExample
• Submission00000000,0.5,0.5,0,0,0,0,0,0,0,000000001,0,0,0.5,0.5,0,0,0,0,0,000000002,0,0,1,0,0,0,0,0,0,0
• logloss =- (log(0.5)+log(0)+log(1))/3• log(0)=>log(1e-15)
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
44
LeaderBoard
• Publicvs.Private
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
45
LeaderBoard
Public
Private
MachineLearning
Protect againsttomorrow’s
threats
HOWTOSOLVEIT?
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
47
FirstFeatureSet
• Binarysize• Hexcount• Stringlengthstats• TLSH
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
48
BinarySize
Category Avg. Size1 Ramnit 14821702 Lollipop 58295303 Kelihos_ver3 89826304 Vundo 11209505 Simda 45523306 Tracur 18011507 Kelihos_ver1 50519008 Obfuscator.ACY 8271189 Gatak 2555070
0
1000000
2000000
3000000
4000000
5000000
6000000
7000000
8000000
9000000
10000000
1 2 3 4 5 6 7 8 9
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
49
HexCount
• CountofHEX• 00,01,02,…,FE,FF,??• 257dimensions
• 1-gram
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
50
HexCountDistribution
0009121b242d36 3f 48515a636c757e879099a2abb4bdc6 cf d8e1ea f3 fc
0009121b242d36 3f 48515a636c757e879099a2abb4bdc6 cf d8e1ea f3 fc
0009121b242d36 3f 48515a636c757e879099a2abb4bdc6 cf d8e1ea f3 fc
1. Ramnit
2. Lollipop
3. Kelihos_v3
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
51
HexCountConfusionMatrix?
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
52
StringStats
• String:printablecharswherelength>4• Stringcount,avg.length,maxlength
0
2000
4000
6000
8000
10000
12000
14000
1 2 3 4 5 6 7 8 902468
101214161820
1 2 3 4 5 6 7 8 9
Avg. Count Avg. length
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
53
TLSH
• TrendMicroLocalitySensitiveHash• Fuzzymatchingforsimilaritycomparison• GetthemostsimilarclassbyvotingofTop5similar
filesfromtrainingdata
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
54
TLSH
0
500
1000
1500
2000
2500
3000
3500
1 2 3 4 5 6 7 8 9
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
55
MoreFeatures
• HEXn-gram• APIcall• Importtable• Instruction• Domainknowledge
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
56
2-gram/3-gram
• 2-gram:(256+1)^2=66,049• 3-gram:(256+1)^3=16,974,593
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
57
HEX2-gram/3-gram
• Important2-gramExample• Featureselection:reducefeaturesize
BiHEX 1. Ramnit 2. Lollipop 3. Kelihos_ver397 86 1.412 2.047 26.6514b e5 1.718 0.722 13.201f7 99 1.746 12.539 13.60675 08 228.09 288.78 13.1684e 47 146.318 12.159 13.512
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
58
APICall
• APIusedinPE
API 1. Ramnit 2. Lollipop 3. Kelihos_ver3IsWindow() 0.164 0.257 0.987DispatchMessageA() 0.159 0.845 0.987GetCommandLineA() 0.355 0.981 0.025DllEntryPoint() 0.656 0 0GetIconInfo() 0.023 0 0.936
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
59
ImportTable
• Alookuptableforcallingfunctionsinothermodule
1. Ramnit 2. Lollipop 3. Kelihos_ver3KERNEL32.dll KERNEL32.dll USER32.dllUSER32.dll USER32.dll KERNEL32.dllADVAPI32.dll ADVAPI32.dll MSASN1.dllole32.dll OPENGL32.dll UXTHEME.dllOLEAUT32.dll OLEAUT32.dll CLBCATQ.dllmsvcrt.dll GDI32.dll DPNET.dllAPPHELP.dll WS2_32.dll NTSHRUI.dll
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
60
OtherInfoinImportTable
• NumberofdistinctDLL
0
1
2
3
4
5
6
7
8
9
1 2 3 4 5 6 7 8 9
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
61
InstructionFrequency
• Verypowerful
instruction 1. Ramnit 2. Lollipop 3. Kelihos_ver3imul 86.768 2257.3 0.002movzx 289.17 118.79 0sbb 68.815 17.375 4.746jnz 1154.8 154.57 7.842mov 12336.6 7059.8 158.94
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
62
MoreDomainKnowledge
• Segment• Packer• Othertypeofbinary
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
63
Segments
• Commonsegmentname
• Uniquesegmentname1. Ramnit 2. Lollipop 3. Kelihos_ver3_data _text _rdata_text _data _text_rdata _rdata _data_bss _zenc_gnu_deb_tls
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
64
OtherInfofromSegments
• NumberofSegments
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
65
Packer
• CommonsegmentnameofPacker• UPX0/UPX1onlyinclass8.Obfuscator.ACY
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
66
OtherTypeofBinary
• RARfiles
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
67
OtherTypeofBinary
• MicrosoftOfficefiles
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
68
Ensemble:LinearBlending
• Combinetheresultfromseveralmodels• Voteofmodels
MachineLearning
Protect againsttomorrow’s
threats
WORKOFWINNINGTEAM
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
70
Features
• Instructionn-gram• ASMpixelmap
http://blog.kaggle.com/2015/05/26/microsoft-malware-winners-interview-1st-place-no-to-overfitting/
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
71
Features
• ASMpixelmap(intensityoffirst1000bytes)
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
72
xgboost
• Gradientboostingpackage• WidelyusedinKagglecompetition
MachineLearning
Protect againsttomorrow’s
threats
CONCLUSION
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
74
PhysicalMeaningofFeatures
• Hexn-gram• Opcode + imm/addr
• Instructionn-gram• Opcode
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
75
HappyEnding?
• Welcometotherealworld!• Newmalwarefamily• Mis-labelling
• Mechanismtomitigatetheissues.
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
76
TrendMicroXGen
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
77
TrendMicroMLContest
• MalwareIdentificationChallenge• 134teams,626players,from6+countries• Real-timescoring
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
78
HowtoImproveModel
• Usedomainknowledge• Unpack, unzip ...
• Improvefeaturerepresentation• Distinctive features for classes which you don’t do
well• Regulateoverfitting
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
79
HowtoImproveModel
• Findwhichitemscannotbecoveredbymodel• Adjustcurrentfeatures• Findnewfeatures• Tuningalgorithmparameters• Usedifferentalgorithm• Ensemble/Blending
MachineLearning
Protect againsttomorrow’s
threats
MachineLearning
Protect againsttomorrow’s
threats
80
LocalLibraryvs.CloudPlatform
Cloudplatformisnotnecessarilyeasier• Glue&Integration
• Data(pre-)processing• Modeltraining/prediction• Evaluation
• DiversityofMLalgorithms• Parametertuning
MachineLearning
Protect againsttomorrow’s
threats
THANKYOU