A Feature Selection and Evaluation Scheme for Computer Virus Detection

Olivier Henchiri and Nathalie Japkowicz
School of Information Technology and Engineering, University of Ottawa


Post on 01-Jan-2016


Page 1:

A Feature Selection and Evaluation Scheme for Computer Virus Detection

Olivier Henchiri and Nathalie Japkowicz

School of Information Technology and Engineering

University of Ottawa

Page 2:

Motivation

Traditional computer anti-virus systems are signature-based. This technique is appropriate to detect existing viruses, but it falls short of detecting new, unseen viruses or variants of existing ones.

Yet, virus writers strategically modify their viruses so that existing virus signatures do not match the new viruses. They do so in random and unpredictable ways, each time the virus replicates.

Heuristic scanners attempt to compensate for this lacuna by using more general features from viral code. However, the process requires human intervention and falls short of yielding both good detection rates for new viruses and low false positives. Automated searches for general features are needed.

Page 3:

Purpose: To improve on current automated search methods for general features

This talk presents:

- A Feature Search and Selection approach for Virus Detection that performs an exhaustive search on a data set of viruses, yielding a large number of short generic features that are then filtered with respect to how representative they are of viral properties.
- A Stringent Cross-Validation scheme allowing us to simulate real-world conditions of new virus outbreaks.
- Evidence that our Feature Selection approach has high predictive power.

Page 4:

Background

Computer viruses are often organized within sets of Virus Families. Virus families are characterized by their similarities in:

- Structure
- Code
- Methods of infection

Consideration of Virus Families is crucial to the task of detection. Indeed, the first virus of a family is usually devastating while its family variants are typically less so.

Our approach uses a priori knowledge of virus families, but our evaluation scheme focuses on evaluating classifiers in their detection of viruses of a family they were not trained on.

Page 5:

Feature Search and Selection I

Our feature search and selection algorithm comprises three steps:

- Scanning & Recording: A scanning window of length SequenceLength moves across the binary code, recording the frequency within each family of each sequence it encounters.
- Selection: The features whose family frequency is at or above the threshold IntraFamilySupport are selected. Only the features most representative of a family are retained.
- Elimination: The features that fall below the threshold InterFamilySupport are eliminated. Features that are too exclusive to a particular family are rejected.
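The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not define the frequency measures precisely, so the sketch assumes that a sequence's family frequency is the fraction of a family's viruses that contain it, and that InterFamilySupport is a minimum count of families in which the sequence passed the Selection step. The names `ngrams` and `select_features` are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(code, n):
    """Distinct byte sequences seen by a window of length n sliding over code."""
    return {code[i:i + n] for i in range(len(code) - n + 1)}


def select_features(families, sequence_length, intra_family_support,
                    inter_family_support):
    """families maps a family name to a list of virus binaries (bytes).

    Assumption: intra_family_support is a fraction of a family's members,
    inter_family_support is a number of families.
    """
    # Step 1, scanning and recording: for each family, count how many of its
    # viruses contain each sequence. This pass over the binaries happens once.
    family_counts = {
        name: Counter(seq for virus in viruses
                      for seq in ngrams(virus, sequence_length))
        for name, viruses in families.items()
    }

    # Step 2, selection: within each family, keep the sequences found in at
    # least the required fraction of that family's viruses.
    families_selecting = defaultdict(int)
    for name, counts in family_counts.items():
        size = len(families[name])
        for seq, count in counts.items():
            if count / size >= intra_family_support:
                families_selecting[seq] += 1

    # Step 3, elimination: drop sequences selected by too few families.
    return {seq for seq, n_families in families_selecting.items()
            if n_families >= inter_family_support}


# Toy data: one 4-byte sequence is shared by all three families.
toy_families = {
    "A": [b"\x90\x90\xcd\x21xx", b"yy\x90\x90\xcd\x21"],
    "B": [b"\x90\x90\xcd\x21", b"zz\x90\x90\xcd\x21"],
    "C": [b"qq\x90\x90\xcd\x21", b"onlyC!"],
}
features = select_features(toy_families, 4, 0.5, 3)
```

On the toy data, only the sequence shared by all three families survives the Elimination step; sequences such as `b"only"` pass Selection inside family C but are too family-specific to be kept.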

Page 6:

Feature Search and Selection II

Our Feature Search and Selection method is hierarchical and, thus, scalable to large datasets:

- The Scanning and Recording step is done only once.
- The Selection step is conducted on small family subsets.
- The Elimination step is conducted on shorter feature lists.

Our Feature Search and Selection method ensures that all retained features represent viral properties common to many types of viruses, as opposed to idiosyncrasies specific to one family.

Page 7:

Evaluation Scheme I

Purpose: To simulate an environment where a virus detection system is faced with the outbreak of a new, unseen virus.

Procedure:

- Form k folds f_1, ..., f_k such that, for each pair of folds (f_i, f_j) with i = 1..k, j = 1..k and i ≠ j, the set of families represented in f_i is disjoint from the set of families represented in f_j.
- Benign programs are added, at random, to each fold.
- Perform a regular cross-validation scheme.
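The fold construction can be sketched as below. This is an illustrative sketch only: the slides do not say how families are distributed over folds, so the round-robin assignment, the fixed `seed`, and the names `family_disjoint_folds` and `cross_validation_splits` are assumptions. Each fold item is a `(sample, family)` pair, with `family` set to `None` for benign programs.

```python
import random


def family_disjoint_folds(virus_families, benign, k, seed=0):
    """virus_families: family name -> list of samples; benign: list of samples."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]

    # Assign whole families to folds, so that no family is split across folds.
    names = sorted(virus_families)
    rng.shuffle(names)
    for i, name in enumerate(names):
        folds[i % k].extend((sample, name) for sample in virus_families[name])

    # Add the benign programs to the folds at random.
    shuffled = list(benign)
    rng.shuffle(shuffled)
    for i, sample in enumerate(shuffled):
        folds[i % k].append((sample, None))
    return folds


def cross_validation_splits(folds):
    """Yield (train, test) pairs, holding out each fold once."""
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i
                 for item in fold]
        yield train, test
```

Because families differ in size, assigning them round-robin does not balance the number of viruses per fold; a real experiment might need to balance fold sizes as well.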

Page 8:

Evaluation Scheme II

Page 9:

Results

Traditional Feature Search (best strategy to date): retain 16-byte sequences appearing with a support of at least 1% [Schultz et al., 2001].

Data Set: 1512 viruses + 1488 benign executables. The viruses belong to 110 families.

Parameter Setting:

- SequenceLength = 8
- IntraFamilySupport = 40%
- InterFamilySupport = 3

We obtain up to 93.65% accuracy, versus 65.04% obtained by the traditional feature search approach.
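For contrast with the family-based method, the traditional feature search can be sketched as a single global filter. This is a rough sketch, not the cited implementation: it assumes that "support" means the fraction of files in the data set that contain a sequence, and the name `frequent_sequences` is hypothetical.

```python
from collections import Counter


def frequent_sequences(files, length=16, min_support=0.01):
    """Return the byte sequences of the given length found in at least
    min_support (a fraction) of the files. files is a list of bytes objects."""
    counts = Counter()
    for code in files:
        # Count each distinct sequence once per file.
        counts.update({code[i:i + length]
                       for i in range(len(code) - length + 1)})
    return {seq for seq, count in counts.items()
            if count / len(files) >= min_support}
```

Unlike the three-step method, this filter never looks at family membership, so a sequence that is frequent only because one large family dominates the data set can still be retained.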

Page 10:

Other Observations

Extra Experiments Set-up:

- An additional set of experiments was performed in which the three search parameters were varied.
- The Intra-Family Support was modified according to the other two, so that a maximum of 500 features per family are selected in the second step of our algorithm.

Observations:

- Classifiers perform better with shorter sequence lengths. Sequence lengths of 5, 4 and 3 seem optimal.
- Low Inter-Family Support thresholds yield better results, especially for longer sequences.
- Performance generally decreases when the feature set contains fewer than 200 features.
- Large numbers of small features perform better than small numbers of large ones.
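The slides do not say how the Intra-Family Support was adjusted to respect the cap of 500 features per family. One possible rule, shown here purely as an assumption, is to pick for each family the smallest threshold that lets through no more than the cap; `capped_threshold` is a hypothetical helper that takes the within-family support values of a family's candidate sequences.

```python
def capped_threshold(supports, max_features=500):
    """Smallest threshold t such that at most max_features values in
    supports satisfy support >= t."""
    ranked = sorted(supports, reverse=True)
    if len(ranked) <= max_features:
        # Everything fits under the cap, so the lowest support is enough.
        return ranked[-1] if ranked else 0.0
    cutoff = ranked[max_features]  # highest-ranked value that must be excluded
    above = [s for s in ranked if s > cutoff]
    # Ties at the cutoff are all excluded; if nothing lies strictly above it,
    # no threshold can respect the cap while selecting anything.
    return min(above) if above else float("inf")
```

For example, with supports `[1.0, 0.9, 0.9, 0.5, 0.5, 0.2]` and a cap of 3, the rule returns 0.9, which selects exactly three sequences.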

Page 11:

Conclusion and Future Work

Summary:

- Our Feature Search and Selection and Evaluation methods focus on selecting generic features useful on new, unseen families of viruses.
- Our results demonstrate the usefulness of our method in this setting.

Future Work:

- To reduce the false positive rate further, by using a larger number of benign files for training or, more simply, stratification or cost-sensitive learning.
- To test our Feature Search and Selection method in a Retrospective Testing setting, which would involve a set of older viruses in the training set and a set of more recent ones in the test set.