Phishing Detection Using Named Entity Recognition

Upload: csea

Post on 02-Jun-2018


  • 8/10/2019 Phishing Detection Using Named Entity Recognition


    Phishing Detection Using Named Entity Recognition

    Dept of Computer Science And Engineering,SJCET,Palai Page 1

    ABSTRACT

    Phishing is an attempt to acquire sensitive information such as usernames,
    passwords and credit card details by masquerading as a trustworthy entity in
    an electronic communication. It is a major security threat to the online
    community, and phishing scams have been escalating in number and
    sophistication by the day. A phishing attack today reaches its audience
    through mass mailings to millions of email addresses around the world, as
    well as through messages to highly targeted groups of customers whose
    details have been enumerated through security faults in small
    clicks-and-mortar retail websites.

    This project proposes a methodology to detect phishing attacks and to
    discover the entity or organization that the attackers impersonate during
    them. The methodology first discovers

    (i) named entities, which include names of people, organizations, and
    locations; and

    (ii) hidden topics.

    Using topics and named entities as features, the next stage classifies each
    message as phishing or non-phishing. For messages classified as phishing,
    the final stage discovers the impersonated entity. Automatic discovery of
    the impersonated entity helps the legitimate organization take down the
    offending phishing site. This project also proposes a technique to
    discriminate phishing e-mails from legitimate e-mails using the distinct
    structural features present in them. The derived features can be used to
    efficiently classify phishing e-mails before they reach the user's inbox.


    TABLE OF CONTENTS

    SL.NO  TITLE

    1. Introduction
    2. Proposed System
    3. Proposed System Architecture
    4. Literature Survey
    5. Objectives
    6. Statement of how the objectives are to be tackled
    7. Time Schedule
    8. References


    1. INTRODUCTION

    Phishing is a word derived from 'fishing'. It refers to an attack in which
    the attacker lures users to a fake Web site by sending them fake e-mails (or
    instant messages) and stealthily obtains the victim's personal information,
    such as user name, password, and national security ID. This information can
    then be used for targeted advertisements or even identity theft (e.g.,
    transferring money from the victim's bank account). The most frequently
    used method is to send potential victims e-mails that appear to come from
    banks, online organizations, or ISPs. These e-mails make up a pretext, e.g.
    that the password of your credit card has been mis-entered many times, or
    that an upgrade service is being offered, to lure you to the attacker's Web
    site to confirm or modify your account number and password through the
    hyperlink provided in the e-mail. If you enter the account number and
    password, the attackers collect the information at the server side and can
    use it for their next steps (e.g., withdrawing money from your account).
    Phishing itself is not a new concept, but phishers have increasingly used it
    in recent years to steal user information and commit business crime. Within
    one to two years, the number of phishing attacks increased dramatically.

    Phishing is a type of deception designed to steal your valuable personal
    data, such as credit card numbers, passwords, account data, or other
    information. It is a form of social engineering that is executed via
    electronic means and can lead to identity theft and fraud. Phishing email
    messages take a number of forms:

    - They might appear to come from your bank or financial institution, a
      company you regularly do business with, such as Microsoft, or from your
      social networking site.
    - They might appear to be from someone in your email address book.
    - They might ask you to make a phone call. Phone phishing scams direct you
      to call a phone number where a person or an audio response unit waits to
      take your account number, personal identification number, password, or
      other valuable personal data.
    - They might include official-looking logos and other identifying
      information taken directly from legitimate websites, and they might
      include convincing details about your personal history that scammers
      found on your social networking pages.
    - They might include links to spoofed websites where you are asked to enter
      personal information.


    2. PROPOSED SYSTEM

    In this project we propose a method for classifying e-mails as legitimate
    or phishing using named entity recognition. E-mail features are used to
    detect phished mails. E-mails labeled as spam, ham or phishing are passed
    to a classifier, which identifies each mail as phishing or non-phishing.

    [Figure: block diagram with the labels Emails, Feature comparison,
    Classifier, Phishing, Non-phishing]


    3. PROPOSED SYSTEM ARCHITECTURE

    [Figure: system architecture with the labels Phisher sends e-mail, User,
    B-OnGuaRd, Feature Comparison, Classifier, Legitimate, Phished mail,
    Inbox, Alert]


    4. LITERATURE SURVEY

    Topic: Phishing Detection: A Literature Survey[1]

    This article surveys the literature on the detection of phishing attacks.
    Phishing attacks target vulnerabilities that exist in systems due to the
    human factor. This paper aims at surveying many of the recently proposed
    phishing mitigation techniques. A high-level overview of various categories
    of phishing mitigation techniques is also presented, such as: detection,
    offensive defense, correction, and prevention.

    The phishing detection survey proceeds by

    - defining the phishing problem
    - categorizing anti-phishing solutions from the perspective of the phishing
      campaign life cycle
    - presenting evaluation metrics that are commonly used in the phishing
      domain to evaluate the performance of phishing detection techniques
    - presenting a literature survey of anti-phishing detection techniques
    - presenting a comparison of the various phishing detection techniques
      proposed in the literature.

    Definition

    The definition of phishing attacks is not consistent in the literature,
    because the phishing problem is broad and incorporates varying scenarios.
    According to PhishTank:

    "Phishing is a fraudulent attempt, usually made through email, to steal
    your personal information."


    Categorizing anti-phishing solutions

    Fig 1: Life cycle of phishing campaign[1]

    Detection approaches

    - User training approaches: end-users can be educated to better understand
      the nature of phishing attacks, which ultimately leads them to correctly
      identify phishing and non-phishing messages.
    - Software classification approaches: these mitigation approaches aim at
      classifying phishing and legitimate messages on behalf of the user, in an
      attempt to bridge the gap left by human error or ignorance.


    Fig 2: Overview of phishing detection approaches[1]

    Evaluation Metrics

    Based on the survey's review of the literature, the following are the most
    commonly used evaluation metrics:

    - True Positive (TP) rate: measures the rate of correctly detected phishing
      attacks in relation to all existing phishing attacks.
    - False Positive (FP) rate: measures the rate of legitimate instances that
      are incorrectly detected as phishing attacks in relation to all existing
      legitimate instances.
    - True Negative (TN) rate: measures the rate of correctly detected
      legitimate instances in relation to all existing legitimate instances.
    - False Negative (FN) rate: measures the rate of phishing attacks that are
      incorrectly detected as legitimate in relation to all existing phishing
      attacks.
    - Precision (P): measures the rate of correctly detected phishing attacks
      in relation to all instances that were detected as phishing.
    - Recall (R): equivalent to the TP rate.
    - f1 score: the harmonic mean of P and R.
    - Accuracy (ACC): measures the overall rate of correctly detected phishing
      and legitimate instances in relation to all instances.
    - Weighted Error (WErr): measures the overall weighted rate of incorrectly
      detected phishing and legitimate instances in relation to all instances.
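
    The metrics above can be computed from the four confusion-matrix counts.
    The following is a minimal sketch, not taken from the survey: the names
    Confusion, metrics and legit_weight are illustrative, and the weighted
    error uses one common form in which each legitimate message counts
    legit_weight times, which is an assumption because the text above does not
    spell out the weighting. It also assumes both classes are non-empty.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts; phishing is treated as the positive class."""
    tp: int  # phishing correctly detected as phishing
    fp: int  # legitimate incorrectly detected as phishing
    tn: int  # legitimate correctly detected as legitimate
    fn: int  # phishing incorrectly detected as legitimate


def metrics(c: Confusion, legit_weight: float = 1.0) -> dict:
    n_phish = c.tp + c.fn   # all existing phishing instances
    n_legit = c.tn + c.fp   # all existing legitimate instances
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / n_phish  # identical to the TP rate
    # Assumed weighting: legitimate instances count legit_weight times.
    weighted_acc = (legit_weight * c.tn + c.tp) / (legit_weight * n_legit + n_phish)
    return {
        "tp_rate": recall,
        "fp_rate": c.fp / n_legit,
        "tn_rate": c.tn / n_legit,
        "fn_rate": c.fn / n_phish,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (c.tp + c.tn) / (n_phish + n_legit),
        "weighted_error": 1.0 - weighted_acc,
    }


result = metrics(Confusion(tp=90, fp=5, tn=95, fn=10))
```

    With legit_weight = 1 the weighted error reduces to 1 - ACC; larger values
    penalize false positives on legitimate mail more heavily.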


    Topic: Multi-tier phishing detection and filtering approach[2]

    Phishing attacks continue to pose serious risks for consumers and
    businesses and threaten global security and the economy. Developing
    countermeasures against such attacks is therefore an important step towards
    defending critical infrastructures such as banking. This paper presents a
    phishing email filtering approach using a multi-tier classification
    technique that combines multiple classification algorithms.

    The major contributions are summarised as follows. The paper:

    - Proposes a new method for extracting the features of phishing email based
      on weighting of message content and message header, and selects the
      features according to priority ranking.
    - Presents a new approach, called the multi-tier classification model, for
      filtering phishing emails.
    - Examines the impact of rescheduling the classifier algorithms in a
      multi-tier classification process, to find the optimum scheduling.
    - Provides empirical evidence that the proposed approach substantially
      reduces false positives with lower complexity.

    The multi-tier model

    In this approach, the email message is classified sequentially by the
    machine learning (ML) algorithms of the first two tiers, and their outputs
    are sent to the analyser section. The analyser examines the outputs and
    sends the message to the corresponding mailbox based on the labels assigned
    by the ML algorithms. If a message is misclassified by either of the first
    two tiers (T1 and T2), the analyser invokes the tier-3 (T3) ML algorithm.
    The T3 algorithm classifies the misclassified message and sends it to the
    corresponding mailbox based on that identification.
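
    The control flow of the three tiers can be sketched as follows. This is an
    illustrative sketch, not the paper's implementation: it assumes that a
    message counts as "misclassified" when the T1 and T2 labels disagree (the
    analyser has no ground truth at filtering time), and the names
    multi_tier_classify and Classifier are hypothetical.

```python
from typing import Callable

# A classifier maps a raw message to the label "phishing" or "legitimate".
Classifier = Callable[[str], str]


def multi_tier_classify(message: str, t1: Classifier, t2: Classifier,
                        t3: Classifier) -> str:
    """Run T1 and T2 in sequence; consult T3 only when they disagree."""
    label1 = t1(message)
    label2 = t2(message)
    if label1 == label2:
        # Both tiers agree, so the analyser routes the message directly.
        return label1
    # Assumed meaning of "misclassified": the two tiers disagree,
    # so the tier-3 algorithm makes the final decision.
    return t3(message)


# Toy stand-ins for trained ML algorithms (hypothetical rules).
def keyword_rule(m: str) -> str:
    return "phishing" if "verify your account" in m.lower() else "legitimate"


def link_rule(m: str) -> str:
    return "phishing" if "http://" in m else "legitimate"


def strict_rule(m: str) -> str:
    return "phishing"


label = multi_tier_classify("Please verify your account today",
                            keyword_rule, link_rule, strict_rule)
```

    Keeping T3 out of the common path is what lets the scheme add a third
    opinion without paying its cost on every message.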

    Fig 3: Block diagram for multi-tier classification model[2]

    Feature construction

    Features are extracted from each email based on weighting of message
    content and message header, and are selected according to priority ranking.
    Each phishing email is parsed as a text file to identify each header
    element and distinguish it from the body of the message. Every substring
    within the subject header and the message body that is delimited by white
    space is considered a token, and an alphabetic word is defined as a token
    that contains only English alphabetic characters (A-Z, a-z) or apostrophes.


    Category 1: features from the message subject header

    - Binary feature indicating 3 or more repeated characters.
    - Number of words with all letters in uppercase.
    - Number of words with at least 15 characters.
    - Number of words with at least two of the letters J, K, Q, X, Z.
    - Number of words with no vowels.
    - Number of words with non-English characters, special characters such as
      punctuation, or digits at the beginning or middle of the word.

    Category 2: features from the priority and content-type headers

    - Binary feature indicating whether the priority had been set to any level
      besides normal or medium.
    - Binary feature indicating whether a content-type header appeared within
      the message header.

    Category 3: features from the message body

    - Proportion of alphabetic words with no vowels and at least 7 characters.
    - Proportion of alphabetic words with at least two of the letters J, K, Q,
      X, Z.
    - Proportion of alphabetic words at least 15 characters long.
    - Binary feature indicating whether the strings "From:" and "To:" were
      both present.
    - Number of HTML opening comment tags.
    - Number of hyperlinks (href).
    - Number of clickable images represented in HTML.
    - Binary feature indicating whether a text colour was set to white.
    - Number of URLs in hyperlinks with digits or &, %, or @.
    - Number of colour elements (both CSS and HTML format).
    - Binary feature indicating whether JavaScript has been used or not.
    - Binary feature indicating whether CSS has been used or not.
    - Binary feature indicating the opening tag of a table.
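
    As an illustration of how such features might be computed, the sketch
    below covers the first five Category 1 (subject header) features. It is
    not the paper's code. The function name subject_features is hypothetical,
    and two readings are assumptions: "at least two of the letters" is counted
    by occurrences rather than distinct letters, and a "word with no vowels"
    must contain at least one letter.

```python
import re

RARE_LETTERS = set("jkqxz")
VOWELS = set("aeiou")


def subject_features(subject: str) -> dict:
    # Tokens are substrings delimited by white space, as described above.
    tokens = subject.split()

    def rare_count(tok: str) -> int:
        # Assumption: count occurrences of J, K, Q, X, Z (case-insensitive).
        return sum(ch in RARE_LETTERS for ch in tok.lower())

    def has_no_vowel(tok: str) -> bool:
        # Assumption: purely numeric or symbolic tokens are not counted.
        has_letter = any(ch.isalpha() for ch in tok)
        return has_letter and not (set(tok.lower()) & VOWELS)

    return {
        # The same non-space character three or more times in a row.
        "has_repeated_chars": bool(re.search(r"(\S)\1{2,}", subject)),
        "n_uppercase_words": sum(tok.isupper() for tok in tokens),
        "n_long_words": sum(len(tok) >= 15 for tok in tokens),
        "n_rare_letter_words": sum(rare_count(tok) >= 2 for tok in tokens),
        "n_no_vowel_words": sum(has_no_vowel(tok) for tok in tokens),
    }


features = subject_features("URGENT!!! Verify yr accnt nowww")
```

    The resulting dictionary can be turned into a fixed-order numeric vector
    before it is passed to a classifier.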

    Multi-tier filtering algorithm


    Topic: An efficacious method for detecting phishing webpages through target
    domain identification[3]

    Any anti-phishing technique is incomplete without identification of the
    phishing target. Hence, there is a need for a holistic approach that can
    identify the right phishing target even when attackers use masquerading
    techniques. Such a method is valuable because it alerts the target's owners
    so that they can take countermeasures and enhance security.

    This paper proposes a novel approach to detecting phishing webpages. The
    webpage under scrutiny is examined to identify all the direct and indirect
    links associated with it, which yield the domain sets S1 and S2
    respectively. From these sets the target domain set is identified and given
    as input to the Target Identification (TID) algorithm, which identifies the
    phishing target. Using DNS lookup, the domains of the suspicious webpage
    and of the phishing target are mapped to their corresponding IP addresses.
    Comparing the two sets of IP addresses determines the authenticity of the
    suspicious webpage. As this approach depends only on the content of the
    suspicious webpage, it requires neither prior knowledge about the site nor
    training data.

    System overview

    This system identifies phishing websites based on the following
    observation: for a phishing website, the identified target will be a
    different, legitimate site, whereas for a genuine website, the system will
    point to the genuine site itself as its own target. On this basis, a
    phishing webpage is identified by comparing the suspicious webpage with its
    target.

    For a given suspicious page, the method first identifies all the direct and
    indirect links associated with that page. The links directly associated
    with the webpage are extracted from the HTML source of the page and grouped
    by domain into a domain set S1. The indirectly associated links are
    retrieved by extracting keywords from the webpage and feeding them to a
    search engine. The first n links returned by the search engine are taken as
    the indirectly associated links and grouped by domain into a second set,
    S2. A reduced domain set S3 is constructed from only the domains common to
    S1 and S2. S3 is fed as input to the TID algorithm to identify the phishing
    target domain. DNS lookup is then used to map the domain of the identified
    phishing target to its IP address, and likewise the domain of the
    suspicious webpage. Comparing the two IP addresses determines the
    authenticity of the suspicious webpage.
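
    The construction of S1, S2 and S3 amounts to grouping links by domain and
    intersecting the two sets. The sketch below is illustrative only: the
    function names are hypothetical, the search-engine results are passed in
    as a plain list rather than fetched, and grouping by host name (with a
    leading "www." removed) stands in for proper registered-domain extraction.

```python
from typing import Iterable, Set
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def target_domain_set(direct_links: Iterable[str],
                      indirect_links: Iterable[str]) -> Set[str]:
    # S1: domains of links found in the page's HTML source.
    s1 = {domain_of(u) for u in direct_links} - {""}
    # S2: domains of the first n search results for the page's keywords.
    s2 = {domain_of(u) for u in indirect_links} - {""}
    # S3: only the domains common to both sets.
    return s1 & s2


s3 = target_domain_set(
    ["https://www.paypal.com/login", "http://evil.example/steal"],
    ["https://www.paypal.com/", "https://en.wikipedia.org/wiki/PayPal"],
)
```

    A phishing page that copies links from the brand it imitates will tend to
    leave that brand's domain in S3, which is what the TID step relies on.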

    Fig. 4. System design. A1-A3: extract links present in the webpage; group
    links according to domains; domain set S1 given for set comparison.
    B1-B5: extract keywords; feed keywords to search engine; extract the
    results; group links according to domains; domain set S2 given for set
    comparison. C1-C4: identified target domain set; input target domain set
    to TID algorithm; identify the target domain; supply domain name of the
    target domain to third-party DNS server. D1: supply domain name of the
    suspicious webpage to third-party DNS server. E1: label generation based
    on DNS comparison (phishing = 0, legitimate = 1).[3]

    Identifying the target domain

    The target domain is identified from the target domain set (S3), and the
    authenticity of the suspicious webpage is then checked against it. The set
    S3 contains the predicted target domains and, depending on the number of
    domains in it, two scenarios are possible.


    Fig 5: TID Algorithm

    Phishing detection using DNS lookup

    Here the target domain and the domain of the suspicious webpage P are
    taken, and a third-party DNS lookup is performed on each, yielding the
    corresponding IP addresses for both domains. Comparing these two sets of IP
    addresses determines the legitimacy of P. If the IP addresses of the domain
    of P match those retrieved for the target domain, P is declared a
    legitimate webpage; otherwise it is concluded to be a phishing webpage.
    Third-party DNS lookup is used to avoid pharming attacks, in which the user
    is redirected to a phished page even after entering the correct URL because
    attackers have exploited a vulnerability in DNS server software. IP
    addresses are compared instead of domain names to overcome discrepancies in
    domain names.
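
    The comparison step can be sketched as below. This is a hedged
    illustration rather than the paper's procedure: the default resolver uses
    the local system resolver through Python's socket module instead of a
    third-party DNS server, "match" is assumed to mean that the two address
    sets share at least one IP, and the function names are hypothetical.

```python
import socket
from typing import Callable, Set


def resolve_ips(domain: str) -> Set[str]:
    """Resolve a domain to its IPv4 addresses; empty set on failure.

    Substitution: this uses the system resolver, not a third-party server.
    """
    try:
        return set(socket.gethostbyname_ex(domain)[2])
    except OSError:
        return set()


def is_legitimate(suspicious_domain: str, target_domain: str,
                  resolver: Callable[[str], Set[str]] = resolve_ips) -> bool:
    suspicious_ips = resolver(suspicious_domain)
    target_ips = resolver(target_domain)
    # Assumed matching rule: any shared address counts as a match.
    return bool(suspicious_ips & target_ips)


# Usage with a stand-in resolver so the example needs no network access.
fake_dns = {
    "bank.example": {"192.0.2.10", "192.0.2.11"},
    "bank-login.example": {"203.0.113.7"},
}
lookup = lambda d: fake_dns.get(d, set())
genuine = is_legitimate("bank.example", "bank.example", lookup)
spoofed = is_legitimate("bank-login.example", "bank.example", lookup)
```

    Passing the resolver as a parameter makes it straightforward to swap in a
    client for a trusted external DNS server, as the text recommends.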


    Topic: Intelligent phishing detection and protection scheme for online
    transactions[4]

    Phishing is an instance of social engineering used to deceive users into
    giving up their sensitive information through an illegitimate website that
    looks and feels exactly like the target organization's website. Most
    phishing detection approaches use Uniform Resource Locator (URL) blacklists
    or phishing website features combined with machine learning techniques.
    Blacklist-based approaches cannot generalize well to new phishing attacks
    because of human weakness in verifying blacklists, while existing
    feature-based methods suffer from high false positive rates and
    insufficient phishing features. This leaves online transactions
    inadequately protected.

    To address the problem robustly, the paper builds a model using a
    Neuro-Fuzzy scheme with five inputs. Neuro-Fuzzy combines Fuzzy Logic and a
    Neural Network.

    Methodologies

    The proposed approach uses Neuro-Fuzzy with five inputs to detect phishing
    websites in online transactions while maximizing accuracy and minimizing
    false positives and operation time.

    Neuro-Fuzzy

    Neuro-Fuzzy is a combination of Fuzzy Logic and a neural network, with the
    ability to both reason and learn. This combination allows the use of
    numeric and linguistic properties. The advantage of the Neuro-Fuzzy
    approach is that it is a universal approximator with the ability to use
    Fuzzy IF...THEN rules. While a Neural Network performs well when dealing
    with raw data, Fuzzy Logic deals with reasoning on a higher level, using
    numerical and linguistic information from domain experts. Neuro-Fuzzy was
    chosen because it learns from data like a Neural Network and forms
    linguistic rules like a Fuzzy Inference system, thus allowing the power of
    both kinds of intelligent system to be used.

    Five Inputs

    The five inputs are five tables from which features are extracted and
    stored for reference. These are:

    1. Legitimate site rules: a summary of the law covering phishing crime.
    2. User-behavior profile: a list of people's behavior when interacting
       with phishing and legitimate websites.
    3. PhishTank: a free community website operated by OpenDNS where suspected
       websites are verified and voted as phish by community experts.
    4. User-specific site: contains binding requirements between a user and
       online transaction service providers.
    5. Pop-Ups from Email: phrases regularly used by phishers that appear on
       screen.

    These five inputs are used because they are wholly representative of
    phishing attack techniques and strategies. From the five inputs, 288
    features are extracted and used as training and testing input data for the
    Neuro-Fuzzy system, which generates Fuzzy IF...THEN rules and discriminates
    between phishing, suspicious and legitimate sites accurately in real time.
    If a phishing website is detected, then a voice alarm is generated.
    For a suspicious website, the system generates red.


    constant, we obtain a zero-order Sugeno Fuzzy model in which the consequent
    of a rule is specified by a singleton. When y is a first-order polynomial
    (y = k0 + k1x1 + k2x2 + ... + kmxm), we get a first-order Sugeno Fuzzy
    model.

    Layer 1 is the input layer. Neurons in this layer simply transmit external
    crisp signals straight to the next layer. Layer 2 is the fuzzification
    layer. Neurons in this layer undertake fuzzification and contain a bell
    activation function. The activation function of a membership neuron is set
    to the function that specifies the neuron's Fuzzy set; thus, the activation
    for the neurons in layer 2 is set to generalized bell (gbell) membership
    functions. Layer 3 is the rule base. Each neuron in this layer gets inputs
    from the individual fuzzification nodes and calculates the firing strength
    of the rule it represents. Layer 4 is the normalization layer. Each neuron
    in this layer gets inputs from every neuron in the rule layer and
    calculates the normalized firing strength of a given rule. The normalized
    firing strength is the ratio of the firing strength of a given rule to the
    sum of the firing strengths of all rules. Layer 5 is defuzzification. This
    neuron computes the sum of the outputs of all the combined neurons and
    produces the overall Adaptive Neuro-Fuzzy Inference System output, y.
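
    The forward pass through these layers can be sketched numerically. The
    code below is a minimal illustration of a first-order Sugeno system with
    generalized bell membership functions; it is not the paper's trained
    model. The function names, the parameter layout and the use of a product
    to combine memberships into a firing strength are assumptions.

```python
from typing import List, Sequence, Tuple

# (a, b, c): width, slope and centre of a generalized bell function.
BellParams = Tuple[float, float, float]
# A rule: one bell per input, plus coefficients [k0, k1, ..., km].
Rule = Tuple[Sequence[BellParams], Sequence[float]]


def gbell(x: float, a: float, b: float, c: float) -> float:
    """Generalized bell membership: 1 / (1 + |(x - c) / a|^(2b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def sugeno_first_order(inputs: Sequence[float], rules: List[Rule]) -> float:
    strengths = []
    consequents = []
    for bells, coeffs in rules:
        # Layers 2 and 3: fuzzify each input, then combine into a firing
        # strength (product T-norm, an assumption).
        w = 1.0
        for x, (a, b, c) in zip(inputs, bells):
            w *= gbell(x, a, b, c)
        strengths.append(w)
        # Rule consequent: y = k0 + k1*x1 + ... + km*xm.
        consequents.append(
            coeffs[0] + sum(k * x for k, x in zip(coeffs[1:], inputs)))
    # Layer 4: normalized firing strength of each rule.
    total = sum(strengths)
    normalized = [w / total for w in strengths]
    # Layer 5: weighted sum of rule outputs gives the overall output y.
    return sum(n * f for n, f in zip(normalized, consequents))


rules = [
    ([(1.0, 2, 0.0)], [0.0, 0.0]),  # "input is low"  -> y = 0
    ([(1.0, 2, 1.0)], [1.0, 0.0]),  # "input is high" -> y = 1
]
y = sugeno_first_order([0.5], rules)
```

    In a trained system the bell parameters and the coefficients k0..km are
    fitted from data; here they are fixed by hand for illustration.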

    Fig 7: Intelligent phishing detection fuzzy inference system structure[4]


    5. OBJECTIVES

    The aim of this project is to provide the anti-phishing industry with a
    solution that can detect sophisticated phishing attacks as well as simple
    ones. To achieve this aim, the following objectives and tasks must be
    carried out:

    - To survey and examine the current anti-phishing techniques and solutions
      and gain further knowledge through understanding them.
    - To investigate new phishing attacks and potential threats.
    - To collect the requirements of the proposed system.
    - To design the proposed system's architecture.
    - To implement the designed architecture as a working program.
    - To evaluate the resulting system.

    6. STATEMENT OF HOW THE OBJECTIVES ARE TO BE TACKLED

    Phishing is a continual threat that keeps growing to this day. The damage
    caused by phishing ranges from denial of access to email to substantial
    financial loss.

    To achieve the objectives, a survey of various papers was carried out,
    through which different phishing and anti-phishing techniques were
    identified and studied. A new system, B-OnGuaRd, is proposed that
    discriminates phished mails from legitimate mails before they reach the
    user's inbox, by comparing features present in the emails and classifying
    them. The architecture of the proposed system specifies its overall
    working, and the architecture is being refined for better results.


    7. TIME SCHEDULE


    8. REFERENCES

    [1] Mahmoud Khonji, Youssef Iraqi, and Andrew Jones. Phishing Detection: A
        Literature Survey.

    [2] Rafiqul Islam and Jemal Abawajy. A multi-tier phishing detection and
        filtering approach.

    [3] Gowtham Ramesh, Ilango Krishnamurthi, and K. Sampath Sree Kumar. An
        efficacious method for detecting phishing webpages through target
        domain identification.

    [4] P.A. Barraclough, M.A. Hossain, M.A. Tahir, G. Sexton, and N. Aslam.
        Intelligent phishing detection and protection scheme for online
        transactions.

    [5] Ian Fette, Norman Sadeh, and Anthony Tomasic. Learning to Detect
        Phishing Emails.