whap: web-hacking profiling using case-based reasoningahreumka/doc/cns_whap.pdffig. 2. overview of...

2
WHAP: Web-Hacking Profiling Using Case-Based Reasoning Mee Lan Han * , Hee Chan Han * , Ah Reum Kang , Byung Il Kwak * , Aziz Mohaisen and Huy Kang Kim * * Korea University, South Korea, SUNY Buffalo, USA Emails: {blosst, huer2783, kwacka12, cenda}@korea.ac.kr, Emails: {ahreumka, mohaisen}@buffalo.edu Abstract—As in the real world’s criminal investigation, cy- ber criminal profiling is important to attribute cyber attacks. Every cyber crime committed by the same hacker or hacking group has unique characteristics such as attack purpose, attack methods, and target’s profile. Therefore, a complete analysis of the hacker’s activities can give investigators hard evidence to attribute attacks and unveil criminals. In this paper, we implemented WHAP, a profiling system that uses Case-Based Reasoning (CBR). We verified WHAP’s usefulness by analyzing large scale of web defacement cases including North Korean hacker’s attacks against South Korea, and unveiling a relation- ship between those attacks and another set of attacks against Sony Pictures Entertainment. I. I NTRODUCTION In the recent five years, South Korea suffered from various nation-wide cyber attacks that were presumably associated with the North Korean cyber army, known as the Dark Seoul Group, as chronologized in Figure1 [1]. The severest attack happened on March 20, 2013, and as a result, South Korea’s major broadcasting companies and top four banks were hacked and taken down. Interestingly, the malware used in and found in the aftermath of these attacks has several common strings and routines that can be an important clue to their origin and identity. North Korean cyber army’s attacks become the utmost threat. The same type of attacks is not limited to South Korea but included targets in United States initiated by the North Korean malicious actors, including the notorious Sony Pictures Entertainment hacking case (2014). Those attacks belong broadly to the class of Advanced Persistent Threats (APTs), which have become a significant source of the threat. These attacks are sophisticated in nature, thus it is hard to capture hard evidence needed to detect the malicious actors behind them. However, we speculate that 1) if one can find the most similar hacking cases from historical data to a case of interest, contextualizing and annotating such APT attacks would be made easier, and 2) if one has a well- defined profile vector that describes a hacking case, he can also efficiently search similar cases. To this end, we developed WHAP, a hacking profiling system. WHAP consists of a hacking case database, a hacking case crawling system, a similarity measure, and a case-based reasoning system, and is used to facilitate cyber crime investigation. II. WHAP: SYSTEM DESIGN AND I MPLEMENTATION The overall architectural view of the system is illustrated in Figure2. In the following, we describe the system using an instance of implementation. First, we built a large hacking case database which includes 212,093 web-hacking cases that happened during the past 15 years. We collected the hacking 2009 2010 2011 2012 2013 Trojan.Dozer (July 2009) Trojan.Koredos (March 2011) Trojan.Castov Trojan.Castdos (June 2013) Backdoor.Prioxer (June 2010) Backdoor.Prioxer.B (July 2012) Downloader.Castov (October 2012) Infostealer.Castov (May 2013) Trojan.Jokra (March 2013) DDoS and disk wiping attacks against South Korea DDoS attacks against South Korea government DNS server Attack on South Korea financial companies, banks and broadcasters Fig. 1. Recent attacks by North Korean cyber army TABLE I. Case vector design for WHAP Case type Selected features Web-hacking attack date, notify, target IP address, domain name, OS, web server, domain type, thanks to, hacker group, e-mail address, social media, messenger, country code, encoding, font, sound, link site, hacker’s message cases from Zone-H.org [2]. Second, we designed the case vector according to the 5W1H formula (Who, When, Where, What, How, and Why) to depict the attack case intuitively and informatively. Among many features found in the attack cases, and after a thorough review by trained security analysts, we chose the most significant features with the frequently found in web-hacking cases as highlighted in Table I. Third, we developed a measure to estimate the similarity between case vectors. Table II shows examples of how to cal- culate the similarity score between the non-numeric features. The similarity score between two cases is the similarity sum among all of their features (weights are set evenly, then updated based on features’ importance): Similarity Score (S)= n X i=1 (F eature i × W eight i ) We defined the extent of similarity between cases C i and C j as 0, 1 and a numeral value from 0 to 1. 0 means that case C i and case C j are unrelated while 1 means that case C i and case C j are identical. S (0 <S< 1) is the similarity between case C i and case C j . If S is closer to 1, the case C i is more similar to case C j [3]. Based on this similarity measure, we applied a Case-Based Reasoning algorithm [4] to search the most similar attack cases utilizing the aforementioned case database.

Upload: others

Post on 03-Feb-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

  • WHAP: Web-Hacking ProfilingUsing Case-Based Reasoning

    Mee Lan Han∗, Hee Chan Han∗, Ah Reum Kang†, Byung Il Kwak∗, Aziz Mohaisen† and Huy Kang Kim∗∗Korea University, South Korea, †SUNY Buffalo, USA

    Emails: {blosst, huer2783, kwacka12, cenda}@korea.ac.kr, Emails: {ahreumka, mohaisen}@buffalo.edu

    Abstract—As in the real world’s criminal investigation, cy-ber criminal profiling is important to attribute cyber attacks.Every cyber crime committed by the same hacker or hackinggroup has unique characteristics such as attack purpose, attackmethods, and target’s profile. Therefore, a complete analysisof the hacker’s activities can give investigators hard evidenceto attribute attacks and unveil criminals. In this paper, weimplemented WHAP, a profiling system that uses Case-BasedReasoning (CBR). We verified WHAP’s usefulness by analyzinglarge scale of web defacement cases including North Koreanhacker’s attacks against South Korea, and unveiling a relation-ship between those attacks and another set of attacks againstSony Pictures Entertainment.

    I. INTRODUCTION

    In the recent five years, South Korea suffered from variousnation-wide cyber attacks that were presumably associatedwith the North Korean cyber army, known as the DarkSeoul Group, as chronologized in Figure1 [1]. The severestattack happened on March 20, 2013, and as a result, SouthKorea’s major broadcasting companies and top four bankswere hacked and taken down. Interestingly, the malware usedin and found in the aftermath of these attacks has severalcommon strings and routines that can be an important clue totheir origin and identity. North Korean cyber army’s attacksbecome the utmost threat. The same type of attacks is notlimited to South Korea but included targets in United Statesinitiated by the North Korean malicious actors, including thenotorious Sony Pictures Entertainment hacking case (2014).

    Those attacks belong broadly to the class of AdvancedPersistent Threats (APTs), which have become a significantsource of the threat. These attacks are sophisticated in nature,thus it is hard to capture hard evidence needed to detect themalicious actors behind them. However, we speculate that 1)if one can find the most similar hacking cases from historicaldata to a case of interest, contextualizing and annotating suchAPT attacks would be made easier, and 2) if one has a well-defined profile vector that describes a hacking case, he canalso efficiently search similar cases. To this end, we developedWHAP, a hacking profiling system. WHAP consists of ahacking case database, a hacking case crawling system, asimilarity measure, and a case-based reasoning system, andis used to facilitate cyber crime investigation.

    II. WHAP: SYSTEM DESIGN AND IMPLEMENTATION

    The overall architectural view of the system is illustratedin Figure2. In the following, we describe the system usingan instance of implementation. First, we built a large hackingcase database which includes 212,093 web-hacking cases thathappened during the past 15 years. We collected the hacking

    2009 2010 2011 2012 2013

    Trojan.Dozer(July 2009)

    Trojan.Koredos(March 2011)

    Trojan.CastovTrojan.Castdos

    (June 2013)

    Backdoor.Prioxer(June 2010)

    Backdoor.Prioxer.B(July 2012)

    Downloader.Castov(October 2012)

    Infostealer.Castov(May 2013)

    Trojan.Jokra(March 2013)

    DDoS and disk wiping attacks against South KoreaDDoS attacks against South Korea

    government DNS server

    Attack on South Korea financial companies, banks and broadcasters

    Fig. 1. Recent attacks by North Korean cyber army

    TABLE I. Case vector design for WHAP

    Case type Selected features

    Web-hacking attack date, notify, target IP address, domain name,OS, web server, domain type, thanks to, hackergroup, e-mail address, social media, messenger,country code, encoding, font, sound, link site,hacker’s message

    cases from Zone-H.org [2]. Second, we designed the casevector according to the 5W1H formula (Who, When, Where,What, How, and Why) to depict the attack case intuitively andinformatively. Among many features found in the attack cases,and after a thorough review by trained security analysts, wechose the most significant features with the frequently foundin web-hacking cases as highlighted in Table I.

    Third, we developed a measure to estimate the similaritybetween case vectors. Table II shows examples of how to cal-culate the similarity score between the non-numeric features.

    The similarity score between two cases is the similaritysum among all of their features (weights are set evenly, thenupdated based on features’ importance):

    Similarity Score (S) =n∑

    i=1

    (Featurei ×Weighti)

    We defined the extent of similarity between cases Ci and Cjas 0, 1 and a numeral value from 0 to 1. 0 means that caseCi and case Cj are unrelated while 1 means that case Ci andcase Cj are identical. S (0 < S < 1) is the similarity betweencase Ci and case Cj . If S is closer to 1, the case Ci is moresimilar to case Cj [3]. Based on this similarity measure, weapplied a Case-Based Reasoning algorithm [4] to search themost similar attack cases utilizing the aforementioned casedatabase.

  • Data Collection Data Preprocessing Case Vector Design Reasoning Engine Evaluation

    Zone-H.org

    Case Centric DB

    Actual cases application

    Deduction of the high similarity

    cases

    Encoding gTLDccTLDOS

    Vector for Clustering

    Encoding gTLDccTLDOSDate

    Vector for Similarity

    - Expectation Mazimization (EM) algorithm

    Clustering Processing

    - Similarity based on Case-Based Reasoning- Setting weight value

    Similarity Processing

    Parsing from HTML source in mirror page

    Data Parsing

    Amending incomplete, improperly formatted, or duplicated data

    Data Cleaning

    Crawling

    Fig. 2. Overview of WHAP, the hacking profiling system

    TABLE II. Similarity score among attack IP address, domainname, date with condition of IP address and attack domain

    Condition of IP address Sif the same (e.g. 143.248.1.1 & 143.248.1.1) 1if class A,B and C are matched (e.g. 143.248.1.1 & 143.248.1.8) 0.75if class A and B are matched (e.g. 143.248.1.1 & 143.248.3.4) 0.5only class A is matched (e.g. 143.248.1.1 & 143.13.2.8) 0.25no common class (e.g. 143.248.1.1 & 163.13.2.5) 0Condition of attack domain San identical domain 1service name is matched, one of the gTLD and ccTLD is matched 0.8gTLD and ccTLD are matched 0.3service name is matched 0.1ccTLD is matched 0.1gTLD is matched 0.1non-identical domain 0

    III. ANALYSIS OF THE DARK SEOUL ATTACKS

    As a result of the Dark Seoul cyber-attack on March20, 2013, the groupware homepage of LGU+ and KoreanBroadcasting System’s homepage were defaced. The attackerleft a unique image and messages on the defaced websites.The significant characteristics found in the websites are theCalaveras (skull) image in LGU+ website and “HASTATI”strings in KBS website. All clues in the two cases havecommon characteristics: 1) the image is frequently found onEuropean sites, 2) “HASTATI” is a military word related toRoman Empire army, and 3) the message left is written in thesame encoding system as European languages based on theLatin language system, where most Korean web pages useanother encoding system.

    Fig. 3. LGU+ groupware case (left) and KBS case (right)

    We calculated all similarities between the past cases.Among the 212,093 collected cases of web-hackings, thesimilarity scores of the randomly selected two cases typicallyare distributed around 0.3 points. We took the distribution ofthe similarity score using the central limit theorem, whichis the process repeated ten thousand times after calculating

    the mean value by summarizing a set of ten of the similarityresults of the randomly selected two cases. Interestingly, wefound that the Sony Pictures Entertainment and Dark Seoulcases are very similar as shown in Figure4.

    1.0.8.6.4.2.0

    600

    400

    200

    0

    Similarity score

    Freq

    uenc

    y

    Mean = 0.2930 Var = 0.0866

    Sony Pictures Entertainment 0.615

    Dark Seoul 0.69

    Fig. 4. Similarity score distribution of two random cases

    IV. CONCLUSION

    Intelligence derived from WHAP cannot guarantee a fullaccuracy, but highlights hidden similarities between variouscases and assists investigator in timely decision-making. Itcontributes by providing insight into the confidential andundercover network of cyber-crime as well, especially whenthere is a lack of information. WHAP makes the analysiseasier and reduce the time required in searching possiblesuspects.

    ACKNOWLEDGMENT

    This work was supported by the ICT R&D Program ofMSIP/IITP.[14-912-06-002, The Development of Script-basedCyber Attack Protection Technology]

    REFERENCES[1] Symantec Security Response. “Four Years of DarkSeoul Cyberattacks

    against South Korea Continue on Anniversary of Korean War,”http://www.symantec.com/connect/blogs/four-years-darkseoul-cyberattacks-against-south-korea-continue-anniversary-korean-war,2013.

    [2] Zone-H.org, http://zone-h.org/.[3] H. K. Kim, K. H Im, and S. C. Park, “DSS for Computer Security

    Incident Response Applying CBR and Collaborative Response,” ExpertSystems with Applications, vol. 37, pp. 852-870, 2010.

    [4] D. B. Leake, “Case-Based Reasoning: Experiences, Lessons and FutureDirections,” MIT press, 1996.