
Phishing Detection Using Named Entity Recognition

Dept of Computer Science and Engineering, SJCET, Palai

ABSTRACT

Phishing is an attempt to acquire sensitive information such as usernames, passwords and credit card details by masquerading as a trustworthy entity in an electronic communication. Phishing is a major security threat to the online community, and phishing scams are growing in number and sophistication by the day. A phishing attack today reaches its audience through mass mailings to millions of email addresses around the world, as well as through communication with highly targeted groups of customers that have been enumerated through security faults in small clicks-and-mortar retail websites.

This project proposes a methodology to detect phishing attacks and to discover the entity or organization that the attackers impersonate. The methodology first discovers

(i) named entities, which include names of people, organizations, and locations; and

(ii) hidden topics.

Using topics and named entities as features, the next stage classifies each message as phishing or non-phishing. For messages classified as phishing, the final stage discovers the impersonated entity. Automatic discovery of the impersonated entity helps the legitimate organization take down the offending phishing site. This project also proposes a technique to discriminate phishing e-mails from legitimate e-mails using the distinct structural features present in them. The derived features can be used to efficiently classify phishing emails before they reach the user's inbox.





TABLE OF CONTENTS

SL.NO TITLE

1. Introduction
2. Proposed System
3. Proposed System Architecture
4. Literature Survey
5. Objectives
6. Statement of how the objectives are to be tackled
7. Time Schedule
8. References





1. INTRODUCTION

The word 'phishing' is derived from 'fishing'. It refers to an attack in which the attacker lures users to a fake Web site by sending them fake e-mails (or instant messages) and stealthily obtains the victim's personal information, such as user name, password, and national security ID. This information can then be used for targeted advertising or for identity theft (e.g., transferring money out of the victim's bank account). The most common attack method is to send potential victims e-mails that appear to come from banks, online organizations, or ISPs. These e-mails invent a pretext, e.g. that the password of your credit card has been mis-entered too many times, or that an upgrade service is being offered, to lure you to the attacker's Web site to confirm or modify your account number and password through the hyperlink provided in the e-mail. If you enter the account number and password, the attackers collect the information at the server side and can use it for their next step (e.g., withdrawing money from your account). Phishing itself is not a new concept, but phishers have increasingly used it in recent years to steal user information and commit financial crime. Within one to two years, the number of phishing attacks increased dramatically.

Phishing is a type of deception designed to steal valuable personal data, such as credit card numbers, passwords, account data, or other information. It is a form of social engineering that is carried out by electronic means and can lead to identity theft and fraud. Phishing email messages take a number of forms:

• They might appear to come from your bank or financial institution, a company you regularly do business with, such as Microsoft, or your social networking site.
• They might appear to be from someone in your email address book.
• They might ask you to make a phone call. Phone phishing scams direct you to call a phone number where a person or an audio response unit waits to take your account number, personal identification number, password, or other valuable personal data.
• They might include official-looking logos and other identifying information taken directly from legitimate websites, and they might include convincing details about your personal history that scammers found on your social networking pages.
• They might include links to spoofed websites where you are asked to enter personal information.





2. PROPOSED SYSTEM

In this project we propose a method for classifying emails as legitimate or phishing using named entity recognition. Features extracted from each email are used to detect phished mails. Emails labeled as spam, ham or phishing are passed to a classifier, which identifies each mail as phishing or non-phishing.

[Figure: Emails → Feature comparison → Classifier → Phishing / Non-phishing]
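The flow of features into a classifier can be illustrated with a very small sketch. It is not the project's implementation: the gazetteer KNOWN_ORGS, the function names, and the rule that stands in for a trained classifier are all assumptions made for illustration. A real system would use a proper named entity recognizer and a learned model.

```python
import re

# Hypothetical gazetteer: organization name -> domains it legitimately mails from.
KNOWN_ORGS = {
    "paypal": {"paypal.com"},
    "microsoft": {"microsoft.com", "outlook.com"},
}

def named_entities(text):
    """Toy stand-in for NER: return known organization names found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(org for org in KNOWN_ORGS if org in words)

def classify(sender, body):
    """Return (label, impersonated organization or None) for one email.

    Assumed rule: an email that names a known organization but is sent from
    a domain that organization does not use is labeled as phishing.
    """
    domain = sender.rsplit("@", 1)[-1].lower()
    for org in named_entities(body):
        if domain not in KNOWN_ORGS[org]:
            return "phishing", org
    return "non-phishing", None
```

For example, classify("support@paypa1-secure.com", "Your PayPal account is locked") labels the message as phishing and reports "paypal" as the impersonated entity.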





3. PROPOSED SYSTEM ARCHITECTURE

[Figure: proposed system architecture. Labels: Phisher sends e-mail; B-OnGuaRd (Feature Comparison, Classifier); Legitimate → Inbox; Phished mail → Alert; User]





4. LITERATURE SURVEY

Topic: Phishing Detection: A Literature Survey[1]

This article surveys the literature on the detection of phishing attacks. Phishing attacks target vulnerabilities that exist in systems because of the human factor. The paper surveys many of the recently proposed phishing mitigation techniques and presents a high-level overview of the main categories of mitigation: detection, offensive defense, correction, and prevention.

The phishing detection survey proceeds by

• defining the phishing problem
• categorizing anti-phishing solutions from the perspective of the phishing campaign life cycle
• presenting evaluation metrics that are commonly used in the phishing domain to evaluate the performance of phishing detection techniques
• presenting a literature survey of anti-phishing detection techniques
• presenting a comparison of the various phishing detection techniques proposed in the literature.

Definition

The definition of phishing attacks is not consistent in the literature, because the phishing problem is broad and covers varying scenarios. According to PhishTank:

"Phishing is a fraudulent attempt, usually made through email, to steal your personal information."





Categorizing anti-phishing solutions

Fig 1: Life cycle of phishing campaign[1]

Detection approaches

• User training approaches: end-users can be educated to better understand the nature of phishing attacks, which ultimately leads them to correctly identify phishing and non-phishing messages.
• Software classification approaches: these mitigation approaches aim at classifying phishing and legitimate messages on behalf of the user, in an attempt to bridge the gap left by human error or ignorance.





Fig 2: Overview of phishing detection approaches[1]

Evaluation Metrics

Based on the review of the literature, the following are the most commonly used evaluation metrics:

• True Positive (TP) rate: measures the rate of correctly detected phishing attacks in relation to all existing phishing attacks.
• False Positive (FP) rate: measures the rate of legitimate instances that are incorrectly detected as phishing attacks in relation to all existing legitimate instances.
• True Negative (TN) rate: measures the rate of correctly detected legitimate instances in relation to all existing legitimate instances.
• False Negative (FN) rate: measures the rate of phishing attacks that are incorrectly detected as legitimate in relation to all existing phishing attacks.
• Precision (P): measures the rate of correctly detected phishing attacks in relation to all instances that were detected as phishing.
• Recall (R): equivalent to the TP rate.
• f1 score: the harmonic mean of P and R.
• Accuracy (ACC): measures the overall rate of correctly detected phishing and legitimate instances in relation to all instances.
• Weighted Error (WErr): measures the overall weighted rate of incorrectly detected phishing and legitimate instances in relation to all instances.
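The metrics above can be computed from four raw counts. The sketch below is illustrative only: the function name and dictionary keys are made up here, and the weighted error is taken in the form 1 - (weight * TN + TP) / (weight * N_legitimate + N_phishing), a form common in the spam-filtering literature, where the weight on legitimate instances is an assumed parameter.

```python
def phishing_metrics(tp, fp, tn, fn, weight=1.0):
    """Compute the evaluation metrics from raw counts.

    tp: phishing detected as phishing     fn: phishing detected as legitimate
    tn: legitimate detected as legitimate fp: legitimate detected as phishing
    weight: assumed importance of a legitimate instance relative to a phishing one.
    Assumes both classes are present and at least one instance is flagged.
    """
    n_phish = tp + fn
    n_legit = tn + fp
    precision = tp / (tp + fp)
    recall = tp / n_phish  # identical to the TP rate
    return {
        "tp_rate": tp / n_phish,
        "fp_rate": fp / n_legit,
        "tn_rate": tn / n_legit,
        "fn_rate": fn / n_phish,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (n_phish + n_legit),
        # Weighted error: with weight = 1 this reduces to 1 - accuracy.
        "w_err": 1 - (weight * tn + tp) / (weight * n_legit + n_phish),
    }
```

Raising the weight above 1 penalizes misclassified legitimate mail more heavily than missed phishing mail.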





Topic: Multi-tier phishing detection and filtering approach[2]

Phishing attacks continue to pose serious risks for consumers and businesses, and threaten global security and the economy. Developing countermeasures against such attacks is therefore an important step towards defending critical infrastructures such as banking. This paper presents a phishing email filtering approach using a multi-tier classification technique that combines multiple classification algorithms. The major contributions are summarised as follows:

• Proposes a new method for extracting the features of phishing email based on weighting of message content and message header, and selects the features according to priority ranking.
• Presents a new approach, called the multi-tier classification model, for filtering phishing emails.
• Examines the impact of rescheduling the classifier algorithms in a multi-tier classification process and finds the optimum scheduling.
• Provides empirical evidence that the proposed approach substantially reduces false positives with lower complexity.

The multi-tier model

In this approach, each email message is classified sequentially by the machine learning (ML) algorithms of the first two tiers, and the outputs are sent to the analyser section. The analyser examines the outputs and sends the messages to the corresponding mailboxes based on the labels assigned by the ML algorithms. If a message is misclassified by either of the first two tiers (T1 and T2), the analyser invokes the tier-3 (T3) ML algorithm. The T3 algorithm classifies the misclassified messages and sends them to the corresponding mailboxes based on its identification.
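The analyser logic can be sketched in a few lines. This is a simplified reading, not the algorithm from [2]: it assumes that "misclassified" means the first two tiers disagree, and the classifier arguments are hypothetical callables that return a label for an email.

```python
def multi_tier_classify(email, tier1, tier2, tier3):
    """Classify one email with a three-tier scheme.

    tier1, tier2, tier3 are callables mapping an email to a label such as
    "phishing" or "legitimate". Assumption: tier 3 is consulted only when
    the first two tiers disagree.
    """
    label1 = tier1(email)
    label2 = tier2(email)
    if label1 == label2:
        # Both tiers agree, so the analyser routes the email directly.
        return label1
    # Disagreement: the analyser invokes tier 3 to decide.
    return tier3(email)
```

Because tier 3 runs only on disputed messages, the order in which the three algorithms are assigned to tiers affects both accuracy and cost, which is why the paper studies rescheduling them.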

Fig 3: Block diagram for multi-tier classification model[2]

Feature construction

Features are extracted from each email based on weighting of message content and message header, and are selected according to priority ranking. Each phishing email is parsed as a text file to identify each header element and distinguish it from the body of the message. Every substring within the subject header and the message body that is delimited by white space is considered a token, and an alphabetic word is defined as a token that contains only English alphabetic characters (A-Z, a-z) or apostrophes.





Category 1: features from the message subject header

• Binary feature indicating 3 or more repeated characters.
• Number of words with all letters in uppercase.
• Number of words with at least 15 characters.
• Number of words with at least two of the letters J, K, Q, X, Z.
• Number of words with no vowels.
• Number of words with non-English characters, special characters such as punctuation, or digits at the beginning or middle of the word.
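A minimal sketch of how the Category 1 features might be computed is given below. The function and key names are made up for illustration, and several readings are assumptions: counts are taken over alphabetic words except for the last feature, "at least two of the letters" is read as at least two occurrences, "y" is not treated as a vowel, and a special character counts as "beginning or middle" when a letter follows it somewhere in the token.

```python
import re

VOWELS = set("aeiouAEIOU")
RARE_LETTERS = set("JKQXZ")

def subject_features(subject):
    """Compute the six subject-header features for one subject line."""
    tokens = subject.split()  # tokens are delimited by white space
    # Alphabetic words: tokens made only of English letters or apostrophes.
    words = [t for t in tokens if re.fullmatch(r"[A-Za-z']+", t)]
    return {
        "has_repeated_chars": bool(re.search(r"(.)\1\1", subject)),
        "uppercase_words": sum(1 for w in words if w.isupper()),
        "long_words": sum(1 for w in words if len(w) >= 15),
        "rare_letter_words": sum(
            1 for w in words if sum(c in RARE_LETTERS for c in w.upper()) >= 2
        ),
        "no_vowel_words": sum(1 for w in words if not VOWELS & set(w)),
        # A non-letter character that is followed later by a letter.
        "odd_char_words": sum(
            1 for t in tokens if re.search(r"[^A-Za-z'].*[A-Za-z]", t)
        ),
    }
```

For the subject "URGENT!!! Verify your acc0unt NOW", the sketch flags the repeated "!!!", counts one uppercase alphabetic word ("NOW"), and counts "acc0unt" as the one word with a digit in the middle.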

Category 2: features from the priority and content-type headers

• Binary feature indicating whether the priority had been set to any level besides normal or medium.
• Binary feature indicating whether a content-type header appeared within the message header.

Category 3: features from the message body

• Proportion of alphabetic words with no vowels and at least 7 characters
• Proportion of alphabetic words with at least two of the letters J, K, Q, X, Z
• Proportion of alphabetic words at least 15 characters long
• Binary feature indicating whether the strings From: and To: were both present
• Number of HTML opening comment tags
• Number of hyperlinks (href)
• Number of clickable images represented in HTML
• Binary feature indicating whether a text colour was set to white
• Number of URLs in hyperlinks with digits or &, %, or @
• Number of colour elements (both CSS and HTML format)
• Binary feature indicating whether JavaScript has been used or not
• Binary feature indicating whether CSS has been used or not
• Binary feature indicating opening tag of table
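Several of the HTML-based body features can be approximated with simple pattern matching. The sketch below covers only a subset of Category 3, uses made-up names, and relies on regular expressions rather than a full HTML parser, so it should be read as an illustration of the idea under those assumptions.

```python
import re

def body_features(html):
    """Approximate a subset of the message-body features from raw HTML."""
    lower = html.lower()
    # Targets of href attributes, quoted or unquoted.
    urls = re.findall(r'href\s*=\s*["\']?([^"\'\s>]+)', lower)
    white = re.search(
        r"(?<![-\w])color\s*[:=]\s*[\"']?\s*(?:white|#fff(?:fff)?)\b", lower
    )
    return {
        "html_comments": lower.count("<!--"),
        "hyperlinks": len(urls),
        "suspicious_urls": sum(1 for u in urls if re.search(r"[0-9&%@]", u)),
        "white_text": bool(white),
        "uses_javascript": "<script" in lower or "javascript:" in lower,
        "uses_css": "<style" in lower or "style=" in lower,
        "has_table": "<table" in lower,
    }
```

The lookbehind in the colour pattern keeps "background-color" from being counted as a text colour.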

Multi-tier filtering algorithm





Topic: An efficacious method for detecting phishing webpages through target domain identification[3]

An anti-phishing technique is incomplete without identification of the phishing target. Hence, there is a need for a holistic approach that can identify the right phishing target even when attackers use masquerading techniques. Such a method is significant among anti-phishing techniques because it alerts the target's owners so they can take the necessary countermeasures and enhance security.

In this paper, a novel approach to detecting phishing webpages is proposed. The webpage under scrutiny is examined to identify all the direct and indirect links associated with it, which generate the domain sets S1 and S2 respectively. From these sets the target domain set is identified and given as input to the Target Identification (TID) algorithm to identify the phishing target. Using DNS lookup, the domains of the suspicious webpage and of the phishing target are mapped to their corresponding IP addresses. Comparing the two sets of IP addresses determines the authenticity of the suspicious webpage. Because this approach depends only on the content of the suspicious webpage, it requires neither prior knowledge about the site nor training data.

System overview

The system identifies phishing websites based on the following observation: for a phishing website, the identified target will be a legitimate site, whereas for a genuine website, the system will point to the genuine site itself as its own target. On this basis, a phishing webpage is identified by comparing the suspicious webpage with its target.

For a given suspicious page, the method first identifies all the direct and indirect links associated with that page. The links directly associated with the webpage are extracted from the HTML source of the page and grouped by domain into a domain set S1. The indirectly associated links are retrieved by extracting keywords from the webpage and feeding them to a search engine. The first n links returned by the search engine are taken as the indirectly associated links and grouped into a second domain set S2. A reduced domain set S3 is constructed from only the domains common to S1 and S2. S3 is fed as input to the TID algorithm to identify the phishing target domain. DNS lookup is used to map the domain of the identified phishing target to its corresponding IP address. Similarly, the domain of the suspicious webpage is mapped to its corresponding IP address. Comparing the two IP addresses determines the authenticity of the suspicious webpage.
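The construction of S1, S2 and S3 can be sketched as follows. This is an illustration under stated assumptions, not the code from [3]: the search-engine step is replaced by a list of result URLs passed in by the caller, domains are approximated by full host names rather than registered domains, and all names are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect href and src attribute values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def domains_of(urls):
    """Group URLs by host name; relative links have no host and are skipped."""
    return {urlparse(u).hostname for u in urls if urlparse(u).hostname}

def candidate_targets(page_html, search_result_urls):
    """Return S3, the domains common to the direct (S1) and indirect (S2) links."""
    collector = LinkCollector()
    collector.feed(page_html)
    s1 = domains_of(collector.links)      # directly associated links
    s2 = domains_of(search_result_urls)   # first n search results for the page keywords
    return s1 & s2
```

A phishing page that copies a bank's logo and links will typically share the bank's domain with the search results for its keywords, so that domain survives the intersection.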

Fig. 4. System design (A1-A3: Extract links present in webpage; group links according to domains; domain set S1 given for set comparison; B1-B5: Extract keywords; feed keywords to search engine; extract the results; group links according to domains; domain set S2 given for set comparison; C1-C4: Identified target domain set; input target domain set to TID algorithm; identify the target domain; supply domain name of the target domain to third-party DNS server; D1: Supply domain name of the suspicious webpage to third-party DNS server; E1: Label generation based on DNS comparison (phishing = 0, legitimate = 1)).[3]

Identifying the target domain

The target domain is identified from the target domain set (S3), and the authenticity of the suspicious webpage is then checked against it. The set S3 contains the predicted target domains, and two scenarios are possible depending on the number of domains in it.





Fig 5: TID Algorithm

Phishing detection using DNS lookup

Here the target domain and the domain of the suspicious webpage P are taken and a third-party DNS lookup is performed on both. As a result, the corresponding IP addresses are obtained for both domains. Comparing these two sets of IP addresses determines the legitimacy of webpage P. If the IP addresses of the domain of P match those retrieved for the target domain, P is declared a legitimate webpage. Otherwise, it is concluded to be a phishing webpage. Third-party DNS lookup is used to avoid pharming attacks, in which the user is redirected to a phished page even though he enters a correct URL; attackers carry this out by exploiting vulnerabilities in DNS server software. IP addresses rather than domain names are compared when establishing the legitimacy of a webpage, to overcome discrepancies in domain names.
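The comparison step can be sketched as below. Two points are assumptions rather than details from [3]: the system resolver stands in for the third-party DNS server, and the two domains are treated as matching when their address sets share at least one IP address. The function names are illustrative.

```python
import socket

def resolve(domain):
    """Return the set of IPv4 addresses for a domain, or an empty set on failure.

    Uses the local system resolver as a stand-in for a third-party DNS server.
    """
    try:
        return set(socket.gethostbyname_ex(domain)[2])
    except OSError:
        return set()

def is_legitimate(suspicious_domain, target_domain, resolver=resolve):
    """Compare the IP addresses of the suspicious page and its identified target."""
    suspicious_ips = resolver(suspicious_domain)
    target_ips = resolver(target_domain)
    # Assumed matching rule: any shared address means the page is its own target.
    return bool(suspicious_ips & target_ips)
```

Passing the resolver as a parameter makes it possible to plug in a trusted DNS client and to exercise the logic without network access.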





Topic: Intelligent phishing detection and protection scheme for online transactions[4]

Phishing is a social engineering technique used to deceive users into giving up their sensitive information through an illegitimate website that looks and feels exactly like the target organization's website. Most phishing detection approaches use Uniform Resource Locator (URL) blacklists or phishing website features combined with machine learning techniques. Approaches based on URL blacklists do not generalize well to new phishing attacks because of human weakness in verifying blacklists, while existing feature-based methods suffer from high false positive rates and insufficient phishing features. As a result, protection for online transactions remains inadequate.

To address the problem robustly, the paper builds a model using a Neuro-Fuzzy scheme with five inputs. Neuro-Fuzzy combines Fuzzy Logic with a Neural Network.

Methodologies

The proposed approach uses Neuro-Fuzzy with five inputs to detect phishing websites in online transactions, maximizing accuracy while minimizing false positives and operation time.

Neuro-Fuzzy

Neuro-Fuzzy is a combination of Fuzzy Logic and a Neural Network with the ability to reason and to learn. This combination allows the use of both numeric and linguistic properties. The advantage of the Neuro-Fuzzy approach is that it is a universal approximator with the ability to use Fuzzy IF...THEN rules. While a Neural Network performs well when dealing with raw data, Fuzzy Logic deals with reasoning on a higher level, using numerical and linguistic information from domain experts. Neuro-Fuzzy was chosen because it can learn from data, from the Neural Network point of view, and form linguistic rules, from the Fuzzy Inference point of view, thus allowing the power of intelligent systems to be used.

Five Inputs

The five inputs are five tables in which extracted features are stored for reference. They are:

1. Legitimate site rules: a summary of the law covering phishing crime.
2. User-behavior profile: a list of people's behavior when interacting with phishing and legitimate websites.
3. PhishTank: a free community website operated by OpenDNS, where suspected websites are verified and voted as phish by community experts.
4. User-specific site: the binding requirements between a user and online transaction service providers.
5. Pop-ups from email: phrases regularly used by phishers as they appear on screen.

These five inputs are used because they are wholly representative of phishing attack techniques and strategies. From the five inputs, 288 features are extracted and used as training and testing data for the Neuro-Fuzzy system, which generates Fuzzy IF...THEN rules and discriminates between phishing, suspicious and legitimate sites accurately in real time. If a phishing website is detected, a voice alarm is generated. For a suspicious website, the system generates red.





constant, we obtain a zero-order Sugeno Fuzzy model in which the consequent of a rule is specified by a singleton. When y is a first-order polynomial (y = k0 + k1x1 + k2x2 + ... + kmxm), we get a first-order Sugeno Fuzzy model.

Layer 1 is the input layer. Neurons in this layer simply pass external crisp signals to the next layer. Layer 2 is the fuzzification layer. Neurons in this layer perform fuzzification and use a bell activation function. The activation function of a membership neuron is set to the function that specifies the neuron's Fuzzy set; here, the activation of the neurons in layer 2 is set to the generalized bell (gbell) membership function. Layer 3 is the rule base. Each neuron in this layer receives inputs from the respective fuzzification neurons and calculates the firing strength of the rule it represents. Layer 4 is the normalization layer. Each neuron in this layer receives inputs from every neuron in the rule layer and calculates the normalized firing strength of a given rule. The normalized firing strength is the ratio of the firing strength of a given rule to the sum of the firing strengths of all rules. Layer 5 is the defuzzification layer. Its neuron computes the sum of the outputs of all preceding neurons and produces the overall Adaptive Neuro-Fuzzy Inference System output, y.
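The layers described above can be traced in a short forward-pass sketch. It is a generic Sugeno-style computation, not the trained model from [4]: the generalized bell function is the standard 1 / (1 + |(x - c) / a|^(2b)), the firing strength of a rule is assumed to be the product of its membership values, and the parameter layout and function names are made up for illustration.

```python
def gbell(x, a, b, c):
    """Generalized bell membership function with width a, slope b, centre c."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def sugeno_output(inputs, rules):
    """Forward pass of a Sugeno fuzzy model.

    inputs: list of crisp input values x1..xm (layer 1).
    rules:  list of (memberships, coeffs) pairs, where memberships holds one
            (a, b, c) triple per input and coeffs is [k0, k1, ..., km].
            A zero-order rule simply has k1..km equal to 0.
    """
    strengths, consequents = [], []
    for memberships, coeffs in rules:
        w = 1.0
        for x, (a, b, c) in zip(inputs, memberships):
            w *= gbell(x, a, b, c)  # layer 2 fuzzification, layer 3 product
        strengths.append(w)
        consequents.append(coeffs[0] + sum(k * x for k, x in zip(coeffs[1:], inputs)))
    total = sum(strengths)
    # Layer 4 normalizes the firing strengths; layer 5 sums the weighted outputs.
    return sum(w / total * y for w, y in zip(strengths, consequents))
```

With two zero-order rules centred at 0 and 2 and consequents 0 and 1, an input of 1 fires both rules equally and the output is 0.5.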

Fig 7: Intelligent phishing detection fuzzy inference system structure[4]





5. OBJECTIVES

The aim of this project is to provide the anti-phishing industry with a solution that can detect sophisticated phishing attacks as well as simple ones. To achieve this aim, the following objectives and tasks need to be completed:

• To survey and examine current anti-phishing techniques and solutions, and gain further knowledge by understanding them.
• To investigate new phishing attacks and potential threats.
• To collect the requirements of the proposed system.
• To design the architecture of the proposed system.
• To implement the designed architecture as a working program.
• To evaluate the resulting system.

6. STATEMENT OF HOW THE OBJECTIVES ARE TO BE TACKLED

Phishing is a continual threat that keeps growing to this day. The damage caused by phishing ranges from denial of access to email to substantial financial loss.

To achieve the objectives, a survey of various papers was carried out, through which different phishing and anti-phishing techniques were identified and studied. A new system, B-OnGuaRd, is proposed that discriminates phished mails from legitimate emails before they reach the user's inbox, by comparing the features present in the emails and classifying them. The architecture of the proposed system specifies the overall working of the system. The architecture is refined to obtain better results.





7. TIME SCHEDULE



8. REFERENCES

[1]. Phishing Detection: A Literature Survey
Mahmoud Khonji, Youssef Iraqi, Senior Member, IEEE, and Andrew Jones

[2]. A multi-tier phishing detection and filtering approach
Rafiqul Islam, Jemal Abawajy

[3]. An efficacious method for detecting phishing webpages through target domain identification
Gowtham Ramesh, Ilango Krishnamurthi, K. Sampath Sree Kumar

[4]. Intelligent phishing detection and protection scheme for online transactions
P.A. Barraclough, M.A. Hossain, M.A. Tahir, G. Sexton, N. Aslam

[5]. Learning to Detect Phishing Emails
Ian Fette, Norman Sadeh, Anthony Tomasic
