1
PREDICTING ZERO-DAY SOFTWARE VULNERABILITIES THROUGH DATA MINING: SECOND PRESENTATION
Su Zhang
2
Quick Review
Data Source – NVD
Six Most Popular/Vulnerable Vendors For Our Experiments
Why The Six Vendors Are Chosen
Data Preprocessing
Functions Available For Our Approach
Statistical Results
Plan For Next Phase
Outline
3
Quick Review
4
National Vulnerability Database
◦ U.S. government repository of standards-based vulnerability management data.
◦ Data included in each NVD entry
    Published Date Time
    Vulnerable software's CPE Specification
◦ Derived data
    Month and Day: from Published Date Time
    Version diff: diff(v1, v2) between the CPEs of two adjacent vulnerabilities
    Software Name: from CPE Specification
    ttpv (time to previous vulnerability): from adjacent, different Published Date Times
    ttnv (time to next vulnerability): from adjacent, different Published Date Times
Source Database – NVD
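As a rough illustration of how these derived attributes could be read out of a single NVD entry, the sketch below splits a CPE 2.2 URI into its leading fields and takes the month and day from a published timestamp. The names `CpeName`, `parse_cpe`, and `month_and_day`, the ISO timestamp format, and the sample timestamp are assumptions made for this example; they are not taken from the presentation.

```python
from datetime import datetime
from typing import NamedTuple, Tuple


class CpeName(NamedTuple):
    part: str      # "a" = application, "o" = operating system, "h" = hardware
    vendor: str
    software: str
    version: str


def parse_cpe(cpe: str) -> CpeName:
    """Split a CPE 2.2 URI such as 'cpe:/o:linux:linux_kernel:2.6.18'."""
    prefix = "cpe:/"
    if not cpe.startswith(prefix):
        raise ValueError(f"not a CPE 2.2 URI: {cpe!r}")
    fields = cpe[len(prefix):].split(":")
    # Pad with empty strings so entries without a version still unpack.
    fields += [""] * (4 - len(fields))
    return CpeName(*fields[:4])


def month_and_day(published: str) -> Tuple[int, int]:
    """Derive (month, day) from an ISO-style timestamp (assumed format)."""
    dt = datetime.fromisoformat(published)
    return dt.month, dt.day


name = parse_cpe("cpe:/o:linux:linux_kernel:2.6.18")
print(name.vendor, name.software, name.version)
# Hypothetical timestamp, used only to show the month/day extraction.
print(month_and_day("2007-05-02T10:19:00"))
```

Only the first four CPE fields are kept here because the slides use just the vendor, software name, and version.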
5
Linux: 56925 instances
Sun: 24726 instances
Cisco: 20120 instances
Mozilla: 19965 instances
Microsoft: 16703 instances
Apple: 14809 instances
Six Most Vulnerable/Popular Vendors
6
[Figure: bar chart "Instances Table". x-axis vendors: rest, Adobe, IBM, PHP, Apple, Microsoft, Mozilla, Cisco, Sun, Linux. y-axis: Instances, 0 to 60000.]
Why We Only Choose Instances Of Popular Vendors: Instances Table
7
[Figure: bar chart "Vulnerability Table". x-axis vendors: rest, HP, Linux, Mozilla, Cisco, Oracle, IBM, Apple, Sun, Microsoft. y-axis: Vul_Num, 0 to 2500.]
Why We Only Choose Instances Of Popular Vendors: Vulnerability Table
8
The huge number of nominal values (vendors and software) would cause a scalability issue.
The top six vendors account for 43.4% of all instances.
NVD contains too many vendors (10411).
The seventh most popular/vulnerable vendor has far fewer instances than the sixth.
Vendors are independent in our approach.
Why We Only Choose Instances Of Popular Vendors
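The 43.4% figure can be put next to the per-vendor counts listed earlier with a few lines of arithmetic. This is only a sanity check: the overall instance total is not reported on the slides, so the value computed below is implied by the rounded percentage rather than known.

```python
# Instance counts for the six chosen vendors, as listed on the slides.
top_six = {
    "Linux": 56925,
    "Sun": 24726,
    "Cisco": 20120,
    "Mozilla": 19965,
    "Microsoft": 16703,
    "Apple": 14809,
}
top_six_total = sum(top_six.values())

# Share stated on the slides; the overall total is back-computed from it
# and is therefore only approximate.
stated_share = 0.434
implied_total = round(top_six_total / stated_share)

print(top_six_total)
print(implied_total)
```

Under that reading, the six vendors contribute a little over 153,000 instances out of roughly 353,000.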
9
NVD data: training/testing dataset
◦ Start from 2005, since the data before that looks unstable.
◦ Correct some obvious errors in NVD (e.g. "cpe:/o:linux:linux_kernel:390").
Attributes
◦ Published time: only the month and day are used.
◦ Version diff: a normalized difference between two versions.
◦ Vendor: removed.
Data Preprocessing
10
Attributes
◦ "Group" vulnerabilities published on the same day, so that ttnv/ttpv are guaranteed to be non-zero.
◦ ttnv is the predicted attribute.
For each software
◦ Delete its first group of instances, since they have no previous vulnerability and so no ttpv.
◦ Delete its last group of instances, since they have no next vulnerability and so no ttnv.
Data Preprocessing (cont.)
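The grouping and trimming steps can be sketched as follows. This is a minimal illustration that assumes gaps are measured in whole days and that a "group" means all vulnerabilities of one piece of software published on the same day; the function name `ttpv_ttnv` and the sample dates are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def ttpv_ttnv(published: List[date]) -> Dict[date, Tuple[int, int]]:
    """Map each publication day to (ttpv, ttnv), both measured in days."""
    # Same-day vulnerabilities collapse into one group, so adjacent groups
    # are always at least one day apart and neither gap can be zero.
    days = sorted(set(published))
    gaps = {}
    # Walking in triples skips the first group (no previous vulnerability)
    # and the last group (no next vulnerability).
    for prev, cur, nxt in zip(days, days[1:], days[2:]):
        gaps[cur] = ((cur - prev).days, (nxt - cur).days)
    return gaps


# Hypothetical dates for one piece of software; the year is made up and
# the spacing is chosen to mirror the gaps in the slides' example rows.
d0 = date(2007, 5, 2)
d1 = d0 + timedelta(days=5)
published = [d0 - timedelta(days=70), d0, d0, d1, d1 + timedelta(days=281)]

gaps = ttpv_ttnv(published)
print(gaps[d0])  # (70, 5)
print(gaps[d1])  # (5, 281)
```

Five input dates yield only two usable rows here: the duplicate day is merged, and the first and last groups are dropped.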
11
v1 = 3.6.4; v2 = 3.6; MaxVersionLength = 4
v1 = expand(v1, 4) = 3.6.4.0
v2 = expand(v2, 4) = 3.6.0.0
$\mathrm{diff}(v_1, v_2) = (3-3)\cdot 100^{0} + (6-6)\cdot 100^{-1} + (4-0)\cdot 100^{-2} + (0-0)\cdot 100^{-3} = 4 \times 10^{-4}$
version diff Calculation
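The calculation above translates directly into code. The sketch below assumes every version component is a plain integer (a string such as "2.6.18-rc1" would need extra handling) and keeps the base of 100 and the maximum length of 4 from the worked example; the function names are illustrative.

```python
from typing import List


def expand(version: str, length: int) -> List[int]:
    """Pad a dotted version with zeros up to `length` components."""
    parts = [int(p) for p in version.split(".")]
    return parts + [0] * (length - len(parts))


def version_diff(v1: str, v2: str, max_len: int = 4, base: int = 100) -> float:
    """Weighted component-wise difference: position i is scaled by base**-i."""
    a, b = expand(v1, max_len), expand(v2, max_len)
    return sum((x - y) * base ** -i for i, (x, y) in enumerate(zip(a, b)))


print(version_diff("3.6.4", "3.6"))
print(version_diff("2.6.19.2", "2.6.18"))
```

The second call uses the two Linux kernel versions from the example rows and gives approximately 1.02E-4, matching the vdiff value shown there.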
12
vendor, soft, version, month, day, vdiff, ttpv, ttnv
linux, kernel, 2.6.18, 05, 02, 0, 70, 5
linux, kernel, 2.6.19.2, 05, 07, 1.02E-4, 5, 281
An Example
13
Least Mean Square
Linear Regression
Multilayer Perceptron
SMOreg
RBF Network
Gaussian Processes
Functions Available For Our Approach On Weka
14
Function: Linear Regression
Training Dataset: 66% of the Linux instances (randomly picked, from 2005 onward).
Test Dataset: the remaining 34%.
Test Result:
◦ Correlation coefficient 0.5127
◦ Mean absolute error 11.2358
◦ Root mean squared error 25.4037
◦ Relative absolute error 107.629 %
◦ Root relative squared error 86.0388 %
◦ Total Number of Instances 17967
Several Statistical Results
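The 66%/34% random split used for these experiments can be sketched in a few lines. This is not Weka's own implementation; `percentage_split` and its `seed` parameter are hypothetical names, and the integers stand in for dataset rows.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def percentage_split(rows: Sequence[T], train_fraction: float = 0.66,
                     seed: int = 1) -> Tuple[List[T], List[T]]:
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))          # placeholder rows
train, test = percentage_split(rows)
print(len(train), len(test))     # 66 34
```

Fixing the seed keeps the split repeatable, which matters when several functions are compared on the same data.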
15
Correlation Coefficient
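The formula on this slide did not survive extraction. For reference, the usual sample (Pearson) correlation between predicted and actual values is written below; the symbols $p_i$ (predicted ttnv), $a_i$ (actual ttnv), and $n$ (number of test instances) are notation assumed here, not taken from the slide.

```latex
r = \frac{\sum_{i=1}^{n} (p_i - \bar{p})(a_i - \bar{a})}
         {\sqrt{\sum_{i=1}^{n} (p_i - \bar{p})^2}\;\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}},
\qquad
\bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i,\quad
\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i
```

A value near 1 means predictions rise and fall with the actual values; a value near 0 or below means they do not.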
16
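Mean absolute error: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |p_i - a_i|$
Root mean square error: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2}$
Here $p_i$ is the predicted value and $a_i$ the actual value for test instance $i$, out of $n$ instances.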
Several Definitions About “Error”
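A minimal sketch of the two absolute measures, using their standard definitions. The sample ttnv values are made up purely for illustration.

```python
import math
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float],
                        actual: Sequence[float]) -> float:
    """Average size of the prediction errors, ignoring their sign."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_squared_error(predicted: Sequence[float],
                            actual: Sequence[float]) -> float:
    """Square root of the average squared error; large misses weigh more."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Made-up ttnv values in days, for illustration only.
actual = [5.0, 281.0, 12.0, 30.0]
predicted = [8.0, 200.0, 12.0, 25.0]

print(mean_absolute_error(predicted, actual))
print(root_mean_squared_error(predicted, actual))
```

Because squaring emphasizes large misses, one model can have a lower mean absolute error yet a higher root mean squared error than another, as Least Mean Square does relative to Linear Regression in the reported results.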
17
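Relative absolute error: $\mathrm{RAE} = \dfrac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}$
Root relative squared error: $\mathrm{RRSE} = \sqrt{\dfrac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}}$
Here $p_i$ is the predicted value, $a_i$ the actual value, and $\bar{a}$ the mean actual value, so both measures compare the model against always predicting the mean.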
Several Definitions About “Error”(Cont)
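The relative measures compare a model with a baseline that always predicts the mean. The sketch below uses the standard definitions and, as a simplifying assumption, takes the mean from the same actual values being scored; Weka may derive its baseline differently (for instance from the training data). The sample values are made up.

```python
import math
from typing import Sequence


def relative_absolute_error(predicted: Sequence[float],
                            actual: Sequence[float]) -> float:
    """Total absolute error as a percentage of the mean predictor's error."""
    mean = sum(actual) / len(actual)
    model = sum(abs(p - a) for p, a in zip(predicted, actual))
    baseline = sum(abs(a - mean) for a in actual)
    return 100.0 * model / baseline


def root_relative_squared_error(predicted: Sequence[float],
                                actual: Sequence[float]) -> float:
    """Squared-error counterpart of the relative absolute error, in percent."""
    mean = sum(actual) / len(actual)
    model = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    baseline = sum((a - mean) ** 2 for a in actual)
    return 100.0 * math.sqrt(model / baseline)


# Made-up ttnv values in days, for illustration only.
actual = [5.0, 281.0, 12.0, 30.0]
predicted = [8.0, 200.0, 12.0, 25.0]
mean_only = [sum(actual) / len(actual)] * len(actual)

print(relative_absolute_error(predicted, actual))
print(root_relative_squared_error(predicted, actual))
print(relative_absolute_error(mean_only, actual))  # 100.0
```

Predicting the mean scores exactly 100% on both measures, so a value above 100%, such as the 107.629% relative absolute error reported for Linear Regression, means the model does worse than that baseline on that measure.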
18
Function: Least Mean Square
Training Dataset: 66% of the Linux instances (randomly picked, from 2005 onward).
Test Dataset: the remaining 34%.
Test Result:
◦ Correlation coefficient -0.1501
◦ Mean absolute error 7.6676
◦ Root mean squared error 30.6038
◦ Relative absolute error 73.449 %
◦ Root relative squared error 103.6507 %
◦ Total Number of Instances 17967
Several Statistical Results
19
Function: Multilayer Perceptron
Training Dataset: 66% of the Linux instances (randomly picked, from 2005 onward).
Test Dataset: the remaining 34%.
Test Result:
◦ Correlation coefficient 0.9886
◦ Mean absolute error 0.4068
◦ Root mean squared error 4.6905
◦ Relative absolute error 3.7802 %
◦ Root relative squared error 15.1644 %
◦ Total Number of Instances 17967
Several Statistical Results
20
Function: RBF Network
Training Dataset: 66% of the Linux instances (randomly picked, from 2005 onward).
Test Dataset: the remaining 34%.
Test Result:
◦ Linear Regression Model: ttnv = -15.3206 * pCluster_0_1 + 21.6205
◦ Correlation coefficient 0.1822
◦ Mean absolute error 10.5857
◦ Root mean squared error 29.048
◦ Relative absolute error 101.4023 %
◦ Root relative squared error 98.3814 %
◦ Total Number of Instances 17967
Several Statistical Results
21
Linear Regression: Not accurate enough, but looks promising (correlation coefficient: 0.5127).
Least Mean Square: Probably not a good fit for our approach (negative correlation coefficient).
Multilayer Perceptron: Looks good, but it does not give us a linear model.
Summary Of Current Results
22
SMOreg: For most vendors, it takes too long to finish (usually more than 80 hours).
RBF Network: Not very accurate.
Gaussian Processes: Runs out of heap memory in most of our experiments.
Summary Of Current Results (Cont)
23
Add CVSS metrics as predictive attributes.
Binarize our predictive attributes (e.g. divide ttnv/ttpv into several categories).
Use regression SVM with multiple kernels.
Possible Ways To Improve The Accuracy Of Our Models
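One way to read "divide ttnv/ttpv into several categories" is to map each day count onto a small set of labeled ranges. The bin edges and labels below are hypothetical, chosen only to show the mechanics; the slides do not specify any.

```python
from bisect import bisect_left

# Hypothetical upper bounds (inclusive, in days) and matching labels.
EDGES = [7, 30, 90]
LABELS = ["within a week", "within a month", "within a quarter", "later"]


def ttnv_category(ttnv_days: int) -> str:
    """Return the label of the first range whose upper bound covers the value."""
    return LABELS[bisect_left(EDGES, ttnv_days)]


for days in (5, 70, 281):
    print(days, ttnv_category(days))
```

Turning the target into categories would make this a classification problem rather than a regression one, so a different set of learners and error measures would apply.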
24
Try to find an optimal model for our prediction.
If we get a good model, investigate how to apply it with MulVAL. Otherwise, find out why it is not accurate enough.
Plan For Next Phase
25
Thank you!