phishing detection using named entity recognition
Post on 02-Jun-2018
8/10/2019 Phishing Detection Using Named Entity Recognition
Phishing Detection Using Named Entity Recognition
Dept of Computer Science And Engineering,SJCET,Palai Page 1
ABSTRACT
Phishing is a way of attempting to acquire sensitive information such as
usernames, passwords and credit card details by masquerading as a trustworthy entity in
an electronic communication. Phishing is a major security threat to the online
community. Phishing scams have been escalating in number and sophistication by the
day. Attackers now reach their audience both through mass mailings to millions of email
addresses around the world and through messages to highly targeted groups of
customers whose details have been harvested through security faults in small clicks-and-mortar
retail websites.
This project proposes a methodology to detect phishing attacks and to discover the
entity or organization that the attackers impersonate. The methodology first discovers
(i) named entities, which include names of people, organizations, and
locations; and
(ii) hidden topics.
Using topics and named entities as features, the next stage classifies each
message as phishing or non-phishing. For messages classified as phishing, the final stage
discovers the impersonated entity. Automatically discovering the impersonated entity
helps the legitimate organization take down the offending phishing site. This
project also proposes a technique to discriminate phishing e-mails from legitimate
e-mails using the distinct structural features present in them. The derived features can be
used to classify phishing e-mails efficiently before they reach the user's inbox.
TABLE OF CONTENTS
SL.NO TITLE
1. Introduction
2. Proposed System
3. Proposed System Architecture
4. Literature Survey
5. Objectives
6. Statement of how the objectives are to be tackled
7. Time Schedule
8. References
1. INTRODUCTION
The word 'phishing' is derived from 'fishing'. It refers to an attack in which the attacker
lures users to a fake Web site by sending them fake e-mails (or instant messages)
and stealthily obtains the victim's personal information, such as user name, password, and
national security ID. This information can then be used for targeted
advertising or for identity theft (e.g., transferring money from the victim's bank
account). The most frequently used method is to send potential victims e-mails that
appear to come from banks, online organizations, or ISPs. These e-mails
invent a pretext, e.g. that the password of your credit card has been mis-entered
many times, or that an upgraded service is being offered, to lure you to the attackers' Web site to
confirm or modify your account number and password through the hyperlink provided in
the e-mail. If you enter the account number and password, the attackers
collect the information at the server side and can use it for their next step
(e.g., withdrawing money from your account). Phishing itself is not
a new concept, but in recent years phishers have increasingly used it to steal user information and
commit business crime. Within one to two years, the number of phishing
attacks increased dramatically.
Phishing is a type of deception designed to steal your valuable personal data, such as
credit card numbers, passwords, account data, or other information. It is a form of social
engineering that is executed via electronic means and can lead to identity theft and
fraud. Phishing email messages take a number of forms:
- They might appear to come from your bank or financial institution, a company you
  regularly do business with, such as Microsoft, or your social networking site.
- They might appear to be from someone in your email address book.
- They might ask you to make a phone call. Phone phishing scams direct you to call
  a phone number where a person or an audio response unit waits to take your
  account number, personal identification number, password, or other valuable
  personal data.
- They might include official-looking logos and other identifying information taken
  directly from legitimate websites, and they might include convincing details about
  your personal history that scammers found on your social networking pages.
- They might include links to spoofed websites where you are asked to enter
  personal information.
2. PROPOSED SYSTEM
In this project we propose a method for classifying emails as legitimate or phishing using
named entity recognition. Features extracted from each email are used to detect phished mails.
Emails labeled as spam, ham, or phishing are passed to a classifier, which
identifies each mail as phishing or non-phishing.
[Figure: Emails -> Feature comparison -> Classifier -> Phishing / Non-phishing]
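The flow described above (extract features, classify, then name the impersonated entity) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the organization list `KNOWN_ORGS` is a hypothetical gazetteer standing in for a trained named entity recognizer, the phrases in `CUES` stand in for learned topic features, and the threshold rule in `classify` stands in for a trained classifier. None of these names come from the report.

```python
import re
from collections import Counter

# Hypothetical gazetteer standing in for a trained NER model.
KNOWN_ORGS = ("PayPal", "Microsoft", "eBay")
# Hypothetical cue phrases standing in for learned topic features.
CUES = ("verify your account", "password", "click here", "suspended")


def extract_features(body: str) -> dict:
    """Return organization mentions and cue-phrase hits found in an email body."""
    lowered = body.lower()
    orgs = Counter()
    for org in KNOWN_ORGS:
        hits = len(re.findall(r"\b" + re.escape(org.lower()) + r"\b", lowered))
        if hits:
            orgs[org] = hits
    cues = [cue for cue in CUES if cue in lowered]
    return {"orgs": orgs, "cues": cues}


def classify(features: dict, min_cues: int = 2) -> str:
    """Toy threshold rule standing in for a trained classifier."""
    if features["orgs"] and len(features["cues"]) >= min_cues:
        return "phishing"
    return "non-phishing"


def impersonated_entity(features: dict):
    """Most frequently mentioned organization, or None if none is named."""
    if not features["orgs"]:
        return None
    return features["orgs"].most_common(1)[0][0]


email = ("Dear customer, your PayPal account is suspended. "
         "Click here to verify your account and password with PayPal.")
feats = extract_features(email)
label = classify(feats)
print(label, impersonated_entity(feats) if label == "phishing" else None)
```

In a real system the gazetteer lookup would be replaced by an NER model and the threshold rule by a classifier trained on labeled spam, ham, and phishing mail.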
3. PROPOSED SYSTEM ARCHITECTURE
[Figure: proposed system architecture. Labels: Phisher sends e-mail; User; B-OnGuaRd (Feature Comparison, Classifier); Legitimate -> Inbox; Phished mail -> Alert]
4. LITERATURE SURVEY
Topic: Phishing Detection: A Literature Survey[1]
This article surveys the literature on the detection of phishing attacks. Phishing
attacks target vulnerabilities that exist in systems due to the human factor. The paper
surveys many of the recently proposed phishing mitigation techniques and presents a high-level
overview of the categories of mitigation techniques: detection, offensive defense,
correction, and prevention.
The phishing detection survey proceeds by
- defining the phishing problem
- categorizing anti-phishing solutions from the perspective of the phishing campaign
  life cycle
- presenting evaluation metrics that are commonly used in the phishing domain to
  evaluate the performance of phishing detection techniques
- presenting a literature survey of phishing detection techniques
- presenting a comparison of the various phishing detection techniques proposed in
  the literature.
Definition
The definition of phishing attacks is not consistent in the literature, because
the phishing problem is broad and covers varying scenarios. According to
PhishTank:
"Phishing is a fraudulent attempt, usually made through email, to steal your personal
information."
Categorizing anti-phishing solutions
Fig 1: Life cycle of phishing campaign[1]
Detection approaches
- User training approaches: end-users can be educated to better understand the
  nature of phishing attacks, which ultimately leads them to correctly
  identify phishing and non-phishing messages.
- Software classification approaches: these mitigation approaches aim at
  classifying phishing and legitimate messages on behalf of the user in an attempt
  to bridge the gap left by human error or ignorance.
Fig 2: Overview of phishing detection approaches[1]
Evaluation Metrics
According to the survey's review of the literature, the following are the most
commonly used evaluation metrics:
- True Positive (TP) rate: measures the rate of correctly detected phishing attacks
  in relation to all existing phishing attacks.
- False Positive (FP) rate: measures the rate of legitimate instances that are
  incorrectly detected as phishing attacks in relation to all existing legitimate instances.
- True Negative (TN) rate: measures the rate of correctly detected legitimate
  instances in relation to all existing legitimate instances.
- False Negative (FN) rate: measures the rate of phishing attacks that are
  incorrectly detected as legitimate in relation to all existing phishing attacks.
- Precision (P): measures the rate of correctly detected phishing attacks in
  relation to all instances that were detected as phishing.
- Recall (R): equivalent to the TP rate.
- f1 score: the harmonic mean of P and R.
- Accuracy (ACC): measures the overall rate of correctly detected phishing and
  legitimate instances in relation to all instances.
- Weighted Error (WErr): measures the overall weighted rate of incorrectly
  detected phishing and legitimate instances in relation to all instances.
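The metrics listed above can be computed directly from the four confusion-matrix counts. The following Python sketch shows one way to do so. The `Confusion` class and the `metrics` function are illustrative names, and the weighted-error formula is an assumed formulation in which correctly handled legitimate mail is weighted by a factor `weight`; the report itself does not give the formula.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # phishing detected as phishing
    fp: int  # legitimate detected as phishing
    tn: int  # legitimate detected as legitimate
    fn: int  # phishing detected as legitimate


def metrics(c: Confusion, weight: float = 1.0) -> dict:
    """Compute the listed metrics; assumes every denominator is non-zero."""
    n_phish = c.tp + c.fn   # all existing phishing instances
    n_legit = c.tn + c.fp   # all existing legitimate instances
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / n_phish
    return {
        "tp_rate": recall,
        "fp_rate": c.fp / n_legit,
        "tn_rate": c.tn / n_legit,
        "fn_rate": c.fn / n_phish,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (c.tp + c.tn) / (n_phish + n_legit),
        # Assumed form: legitimate instances count `weight` times as much.
        "weighted_error": 1 - (weight * c.tn + c.tp) / (weight * n_legit + n_phish),
    }


m = metrics(Confusion(tp=90, fp=5, tn=95, fn=10), weight=9.0)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With `weight=1.0` the weighted error reduces to `1 - accuracy`, which is a quick sanity check on the formula.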
Topic: Multi-tier phishing detection and filtering approach[2]
Phishing attacks continue to pose serious risks for consumers and businesses and
to threaten global security and the economy. Developing
countermeasures against such attacks is therefore an important step towards defending critical
infrastructures such as banking. This paper presents a phishing email filtering approach
based on a multi-tier classification technique that combines multiple classification algorithms.
The major contributions are summarised as follows:
- Proposes a new method for extracting the features of phishing email based on
  weighting of message content and message header, and selects the features
  according to priority ranking.
- Presents a new approach, called the multi-tier classification model, for filtering
  phishing emails.
- Examines the impact of rescheduling the classifier algorithms in a multi-tier
  classification process, to find the optimum scheduling.
- Provides empirical evidence that the proposed approach reduces the false
  positive problem substantially with lower complexity.
The multi-tier model
In this approach, the email message is classified sequentially
by the ML algorithms of the first two tiers, and the outputs are sent to the
analyser section. The analyser analyses the outputs and sends the messages to the
corresponding mailboxes based on the labels assigned by the ML algorithms. If an email
message is misclassified by either of the first two tiers (T1 and T2), the
analyser invokes the tier-3 (T3) ML algorithm. The T3 ML algorithm classifies
the misclassified email messages and sends them to the corresponding mailboxes based
on the identification.
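The analyser logic described above can be sketched as follows. Since the true label is not known at filtering time, this sketch assumes that a "misclassification" is detected as a disagreement between T1 and T2; that reading is an assumption, not something the report states. The three keyword-based classifiers are hypothetical stand-ins for trained ML algorithms.

```python
from typing import Callable, Tuple

Classifier = Callable[[str], str]  # returns "phishing" or "legitimate"


def multi_tier(email: str, t1: Classifier, t2: Classifier,
               t3: Classifier) -> Tuple[str, str]:
    """Return (label, deciding tier) for one email message."""
    first, second = t1(email), t2(email)
    if first == second:
        # The first two tiers agree: the analyser routes the mail directly.
        return first, "T1+T2"
    # Disagreement (assumed signal of misclassification): invoke tier 3.
    return t3(email), "T3"


# Hypothetical stand-in classifiers based on single keywords.
def t1(email: str) -> str:
    return "phishing" if "password" in email.lower() else "legitimate"


def t2(email: str) -> str:
    return "phishing" if "http://" in email.lower() else "legitimate"


def t3(email: str) -> str:
    return "phishing" if "urgent" in email.lower() else "legitimate"


print(multi_tier("Urgent: confirm your password", t1, t2, t3))
print(multi_tier("Meeting notes attached", t1, t2, t3))
```

Changing which algorithm plays T1, T2, or T3 corresponds to the "rescheduling" that the paper evaluates to find the best ordering.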
Fig 3: Block diagram for multi-tier classification model[2]
Feature construction
Features are extracted from each email based on weighting of message content and
message header, and are selected according to priority ranking. Each phishing email
is parsed as a text file to identify each header element and distinguish it from the body of
the message. Every substring within the subject header and the message body that is
delimited by white space is considered a token, and an alphabetic word is
defined as a token that contains only English alphabetic
characters (A-Z, a-z) or apostrophes.
Category 1: features from the message subject header
- Binary feature indicating 3 or more repeated characters.
- Number of words with all letters in uppercase.
- Number of words with at least 15 characters.
- Number of words with at least two of letters J, K, Q, X, Z.
- Number of words with no vowels.
- Number of words with non-English characters, special characters such as
  punctuation, or digits at beginning or middle of word.
Category 2: features from the priority and content-type headers
- Binary feature indicating whether the priority had been set to any level
  besides normal or medium.
- Binary feature indicating whether a content-type header appeared within the
  message header.
Category 3: features from the message body
- Proportion of alphabetic words with no vowels and at least 7 characters
- Proportion of alphabetic words with at least two of letters J, K, Q, X, Z
- Proportion of alphabetic words at least 15 characters long
- Binary feature indicating whether the strings "From:" and "To:" were
  both present
- Number of HTML opening comment tags
- Number of hyperlinks (href)
- Number of clickable images represented in HTML
- Binary feature indicating whether a text colour was set to white
- Number of URLs in hyperlinks with digits or &, %, or @
- Number of colour elements (both CSS and HTML format)
- Binary feature indicating whether JavaScript has been used or not
- Binary feature indicating whether CSS has been used or not
- Binary feature indicating opening tag of table
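A few of the features listed above can be extracted with regular expressions, as in the Python sketch below. It covers only a subset of the list, and several details are assumptions: "at least two of letters J, K, Q, X, Z" is read as two or more occurrences, the uppercase and no-vowel counts are restricted to alphabetic words, and the function and key names are illustrative.

```python
import re

VOWELS = set("aeiouAEIOU")
RARE = set("JKQXZ")


def tokens(text: str) -> list:
    """Whitespace-delimited tokens, as in the feature construction step."""
    return text.split()


def is_alphabetic(token: str) -> bool:
    """True if the token holds only English letters or apostrophes."""
    return re.fullmatch(r"[A-Za-z']+", token) is not None


def subject_features(subject: str) -> dict:
    words = tokens(subject)
    return {
        "repeated_chars": int(re.search(r"(.)\1\1", subject) is not None),
        "all_uppercase": sum(1 for w in words if is_alphabetic(w) and w.isupper()),
        "long_words": sum(1 for w in words if len(w) >= 15),
        "rare_letters": sum(
            1 for w in words if sum(ch.upper() in RARE for ch in w) >= 2
        ),
        "no_vowels": sum(
            1 for w in words if is_alphabetic(w) and not VOWELS & set(w)
        ),
    }


def body_features(body: str) -> dict:
    alpha = [w for w in tokens(body) if is_alphabetic(w)]
    total = len(alpha) or 1  # avoid division by zero on empty bodies
    return {
        "prop_long": sum(1 for w in alpha if len(w) >= 15) / total,
        "hyperlinks": len(re.findall(r"href\s*=", body, flags=re.IGNORECASE)),
        "html_comments": body.count("<!--"),
        "uses_javascript": int(
            re.search(r"<script\b", body, flags=re.IGNORECASE) is not None
        ),
        "has_table": int(
            re.search(r"<table\b", body, flags=re.IGNORECASE) is not None
        ),
    }


subj = subject_features("URGENT!!! Reset your pwd NOW")
body = body_features(
    '<!-- x --><a href="http://1.2.3.4/login">Click</a> <script>1</script>'
)
print(subj)
print(body)
```

The resulting dictionaries can be flattened into a numeric vector and ranked by priority before being passed to the classifiers.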
Multi-tier filtering algorithm
Topic: An efficacious method for detecting phishing webpages through target
domain identification[3]
An anti-phishing technique is incomplete without identification of the
phishing target. Hence, there is a need for a holistic approach that can identify the right
phishing target even when attackers use masquerading techniques. Such a method
is important among anti-phishing techniques because it alerts the target
owners so that they can take the necessary countermeasures and enhance security.
The paper proposes a novel approach to detecting phishing webpages. The
webpage under scrutiny is examined to identify all the direct and indirect links associated
with it, and domain sets S1 and S2 are generated from them respectively. From these sets the
target domain set is identified and given as input to the Target Identification (TID)
algorithm to identify the phishing target. Using DNS lookup, the domains of the suspicious
webpage and of the phishing target are mapped to their corresponding IP addresses. Comparing
the two IP addresses determines the authenticity of the suspicious webpage. Because
this approach depends only on the content of the suspicious webpage, it requires neither
prior knowledge about the site nor training data.
System overview
The system identifies phishing websites based on the following observation: for a
phishing website, the identified target will be a legitimate site, whereas for a genuine
website, the system will point to the genuine site itself as its own target. On this basis,
a phishing webpage is identified by comparing the suspicious webpage with its target.
For a given suspicious page, the method first identifies all the direct and indirect
links associated with that page. The links directly associated with the webpage
are extracted from the HTML source of the page and grouped by domain into
the domain set S1. The indirectly associated links are retrieved by
extracting the keywords in the webpage and feeding these keywords to a search engine.
The first n links returned by the search engine are taken as the indirectly associated links
and grouped by domain into a second set, S2. A reduced domain set S3 is constructed by
keeping only the domains common to S1 and S2. This set S3 is fed as
input to the TID algorithm to identify the phishing target domain. DNS lookup is used to
map the domain of the identified phishing target to its corresponding IP address.
Similarly, the domain of the suspicious webpage is mapped to its corresponding IP address.
Comparing the two IP addresses determines the authenticity of the suspicious webpage.
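The construction of S1, S2, and S3 amounts to grouping links by domain and intersecting two sets. A minimal Python sketch follows. It assumes the direct links and the search-engine results are already available as lists of URLs (HTML parsing and the search query are out of scope here), and it groups by host name with a leading "www." removed, which is a simplification of true registered-domain grouping. The function names and example URLs are illustrative.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Host name of a URL, lowercased, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_set(urls) -> set:
    """Group a list of URLs into the set of their domains."""
    return {domain_of(u) for u in urls if domain_of(u)}


def candidate_targets(direct_links, search_results, n: int = 10) -> set:
    """Return S3, the domains common to S1 (direct) and S2 (first n search hits)."""
    s1 = domain_set(direct_links)
    s2 = domain_set(search_results[:n])
    return s1 & s2


direct = [
    "https://www.paypal.com/help",
    "http://evil.example/login",
    "https://cdn.example.net/x.js",
]
search = [
    "https://www.paypal.com/",
    "https://en.wikipedia.org/wiki/PayPal",
]
s3 = candidate_targets(direct, search)
print(s3)
```

Here the phishing page links to the brand it imitates, so that brand's domain survives the intersection and becomes the candidate target passed on to the TID algorithm.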
Fig. 4. System design (A1-A3: extract links present in webpage; group links according to
domains; domain set S1 given for set comparison; B1-B5: extract keywords; keywords fed to
search engine; extract the results; group links according to domains; domain set S2 given for set
comparison; C1-C4: identified target domain set; input target domain set to TID algorithm;
identify the target domain; supply domain name of the target domain to third-party DNS server;
D1: supply domain name of the suspicious webpage to third-party DNS server; E1: label
generation based on DNS comparison (phishing = 0, legitimate = 1)).[3]
Identifying the target domain
The target domain is identified from the target domain set (S3), and the authenticity of
the suspicious webpage is then checked. The set S3 contains the predicted target domains, and
depending on the number of domains in it, two scenarios are possible.
Fig 5: TID Algorithm
Phishing detection using DNS lookup
The target domain and the domain of the suspicious webpage P are taken, and a
third-party DNS lookup is performed on each. This yields the corresponding IP addresses
for both domains. Comparing these two sets of IP addresses determines the legitimacy of
the webpage P. If the IP addresses of the domain of P match
those retrieved for the target domain, P is declared a legitimate webpage. Otherwise,
it is concluded to be a phishing webpage. Third-party DNS lookup is used to avoid
pharming attacks, in which the user is redirected to a phished page even though he enters a
correct URL; attackers carry this out by exploiting a vulnerability in DNS server software.
IP addresses rather than domain names are compared when judging the legitimacy of a
webpage, to overcome discrepancies in domain names.
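The comparison step can be sketched in Python as below. Two points are assumptions: "match" is interpreted as the two IP sets sharing at least one address, and the default resolver uses the operating system's resolver through `socket.getaddrinfo`, whereas the method calls for a third-party DNS server. The resolver is passed in as a parameter so that a trusted third-party lookup can be substituted. The labels follow the Fig. 4 caption (phishing = 0, legitimate = 1); the domain names and addresses in the usage example are made up.

```python
import socket
from typing import Callable, Set


def system_resolver(domain: str) -> Set[str]:
    """Resolve a domain with the local resolver (stand-in for third-party DNS)."""
    try:
        infos = socket.getaddrinfo(domain, None)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def label_page(page_domain: str, target_domain: str,
               resolve: Callable[[str], Set[str]] = system_resolver) -> int:
    """Return 1 (legitimate) if the domains share an IP address, else 0 (phishing)."""
    page_ips = resolve(page_domain)
    target_ips = resolve(target_domain)
    return 1 if page_ips & target_ips else 0


# Usage with a stand-in resolver (documentation-range addresses, no network needed).
table = {
    "bank.example": {"192.0.2.10"},
    "bank-login.example": {"203.0.113.7"},
}


def fake_resolver(domain: str) -> Set[str]:
    return table.get(domain, set())


print(label_page("bank.example", "bank.example", fake_resolver))
print(label_page("bank-login.example", "bank.example", fake_resolver))
```

A domain that fails to resolve yields an empty set and is therefore labeled 0, which errs on the side of treating the page as phishing.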
Topic: Intelligent phishing detection and protection scheme for online
transactions[4]
Phishing is an instance of social engineering used to deceive users into
giving up their sensitive information through an illegitimate website that looks and feels
exactly like the target organization's website. Most phishing detection approaches use
Uniform Resource Locator (URL) blacklists or phishing website features combined with
machine learning techniques. Approaches that
rely on URL blacklists do not generalize well to new phishing attacks because of
human weakness in verifying blacklists, while the existing feature-based methods suffer from
high false positive rates and insufficient phishing features. This leaves
online transactions inadequately protected.
To address the problem robustly, the paper builds a state-of-the-art model
using a Neuro-Fuzzy scheme with five inputs. Neuro-Fuzzy combines Fuzzy Logic with a Neural
Network.
Methodologies
The proposed approach uses Neuro-Fuzzy with five inputs to detect phishing
websites in online transactions while maximizing accuracy and
minimizing false positives and operation time.
Neuro-Fuzzy
Neuro-Fuzzy is a combination of Fuzzy Logic and a neural network with the ability
to reason and learn. This combination allows the use of both numeric and linguistic
properties. The advantage of the Neuro-Fuzzy approach is that it is a universal
approximator with the ability to use Fuzzy IF...THEN rules. While a Neural Network
performs well when dealing with raw data, Fuzzy Logic deals with reasoning on a higher
level, using numerical and linguistic information from domain experts. Neuro-Fuzzy was
chosen because it learns from data, as a Neural Network does, and
forms linguistic rules, as a Fuzzy Inference system does, thus allowing the power of
intelligent systems to be used.
Five Inputs
The five inputs are five tables in which features are extracted and stored for reference.
These include:
1. Legitimate site rules: a summary of the law covering
phishing crime.
2. User-behavior profile: a list of people's behavior when
interacting with phishing and legitimate websites.
3. PhishTank: a free community website operated by OpenDNS
where suspected websites are verified and voted as phish by
community experts.
4. User-specific site: contains binding requirements between a
user and online transaction service providers.
5. Pop-Ups from Email: regular phrases used by
phishers as they appear on screen.
These five inputs are used because they are wholly representative of phishing attack
techniques and strategies. From the five inputs, 288 features are extracted and used
as training and testing data for the Neuro-Fuzzy system, which generates Fuzzy
IF...THEN rules to discriminate between phishing, suspicious, and legitimate sites
accurately in real time. If a phishing website is detected, a voice alarm is generated.
For a suspicious website, the system generates red.
constant, we obtain a zero-order Sugeno Fuzzy model in which the consequent of a rule is
specified by a singleton. When y is a first-order polynomial (y = k0 + k1x1 + k2x2 + ... +
kmxm), we get a first-order Sugeno Fuzzy model.
Layer 1 is the input layer. Neurons in this layer simply transmit external crisp
signals straight to the next layer. Layer 2 is the fuzzification layer. Neurons in this layer
undertake fuzzification and contain a bell activation function. The activation function of
a fuzzification neuron is the membership function that specifies its Fuzzy set. Thus, the
activation for the neurons in layer 2
is set to the generalized bell (gbell) membership function. Layer 3 is the rule base. Each
neuron in this layer gets inputs from the respective fuzzification neurons and calculates
the firing strength of the rule it represents. Layer 4 is the normalization layer. Each
neuron in this layer gets inputs from every neuron
in the rule layer and calculates the normalized firing strength of a given rule. The
normalized firing strength is the ratio of the firing strength of a given rule to the
sum of the firing strengths of all rules. Layer 5 is defuzzification. It computes the
sum of the outputs of all the preceding neurons and produces the overall Adaptive
Neuro-Fuzzy Inference System output, y.
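The layer-by-layer computation described above can be illustrated with a small forward pass in Python. This is a sketch under stated assumptions, not the paper's implementation: firing strength is computed as the product of the memberships (the text does not name the operator), rule parameters are supplied by hand rather than learned, and the names `gbell` and `anfis_forward` are illustrative. The generalized bell function used is 1 / (1 + |(x - c) / a|^(2b)).

```python
from math import prod


def gbell(x: float, a: float, b: float, c: float) -> float:
    """Generalized bell membership function with width a, slope b, centre c."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def anfis_forward(inputs, rules) -> float:
    """Forward pass of a first-order Sugeno system.

    Each rule is (memberships, coeffs): one (a, b, c) triple per input,
    and consequent coefficients [k0, k1, ..., km].
    """
    # Layers 2-3: fuzzify each input, then combine memberships per rule
    # (product is an assumed choice of operator).
    strengths = [
        prod(gbell(x, *params) for x, params in zip(inputs, memberships))
        for memberships, _ in rules
    ]
    # Layer 4: normalized firing strength of each rule.
    total = sum(strengths)
    normalized = [s / total for s in strengths]
    # Layer 5: weight each consequent y = k0 + k1*x1 + ... + km*xm and sum.
    outputs = [
        k[0] + sum(ki * xi for ki, xi in zip(k[1:], inputs)) for _, k in rules
    ]
    return sum(w * y for w, y in zip(normalized, outputs))


rules = [
    ([(1.0, 1.0, 0.0)], [0.0, 1.0]),  # membership centred at 0, y = x1
    ([(1.0, 1.0, 1.0)], [3.0, 0.0]),  # membership centred at 1, y = 3 (zero-order)
]
y = anfis_forward([0.0], rules)
print(y)
```

Setting the coefficients k1 to km to zero, as in the second rule, gives the zero-order case in which the consequent is a singleton.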
Fig 7: Intelligent phishing detection fuzzy inference system structure[4]
5. OBJECTIVES
The aim of this project is to provide the anti-phishing industry with a solution that can
detect sophisticated phishing attacks as well as simple ones.
To achieve this aim, the following detailed objectives and tasks must be
performed:
- To survey and examine the current anti-phishing techniques and solutions and
  gain further knowledge through the understanding of these techniques.
- To conduct an investigation of new phishing attacks and potential threats.
- To collect the proposed system requirements.
- To design the proposed system's architecture.
- To implement the designed architecture as a working program.
- To evaluate the resulting system.
6. STATEMENT OF HOW THE OBJECTIVES ARE TO BE
TACKLED
Phishing is a continual threat that keeps growing to this day. The damage caused
by phishing ranges from denial of access to email to substantial financial loss.
To achieve the objectives, a survey of various papers was carried out, through which
different phishing and anti-phishing techniques were identified and studied. A new
system, B-OnGuaRd, is proposed that discriminates phished mails from legitimate
emails before they reach the user's inbox, by comparing features present in the emails and
through classification. The architecture of the proposed system specifies the overall
working of the system. Improvements are made to the architecture for better results.
7. TIME SCHEDULE
8. REFERENCES
[1]. Phishing Detection: A Literature Survey
Mahmoud Khonji, Youssef Iraqi, Senior Member, IEEE, and Andrew Jones
[2]. A multi-tier phishing detection and filtering approach
Rafiqul Islam, Jemal Abawajy
[3]. An efficacious method for detecting phishing webpages through target
domain identification
Gowtham Ramesh, Ilango Krishnamurthi, K. Sampath Sree Kumar
[4]. Intelligent phishing detection and protection scheme for online transactions
P.A. Barraclough, M.A. Hossain, M.A. Tahir, G. Sexton, N. Aslam
[5]. Learning to Detect Phishing Emails
Ian Fette, Norman Sadeh, Anthony Tomasic