Spam Detection in P2P Systems
Team Matrix: Abhishek Ghag, Darshan Kapadia, Pratik Singh
OVERVIEW
P2P Basics
Spam
The Spam Detection Problem
Approaches to the Spam Detection Problem
Proposal
References
P2P Basics
Used to connect nodes or machines via largely ad hoc connections.
No concept of a client or server.
All nodes or peers are equal.
The equal peer nodes function as both client and server.
Classification of P2P:-
Centralized P2P network – Napster.
Decentralized P2P network – KaZaA.
Structured P2P network – CAN.
Unstructured P2P network – Gnutella.
Hybrid P2P network – JXTA.
Advantages of P2P:-
All peers provide resources such as bandwidth, storage space and computing power (CPU cycles).
Replication of data over multiple peers eliminates single point of failure.
Applications of P2P:-
File Sharing
Internet Telephony e.g. Skype.
Streaming media files.
Spam
Spam is any file that is misrepresented deliberately.
A well known problem in P2P file sharing systems.
Used to manipulate established retrieval and ranking techniques.
Anonymous, decentralized and dynamic in nature.
Viruses in P2P
[Figure taken from the research paper "Malware Prevalence in the KaZaA File-Sharing Network", ACM.]
Why is Spam Harmful?
Degrades user search experience.
Assists the propagation of viruses in the network.
More than 200 viruses use P2P as a propagation vector.
Increases the traffic load on the network.
Spam
Spam is hard to detect automatically because:-
Insufficient and biased information is returned in response to a user query.
P2P systems are anonymous, decentralized and dynamic in nature.
The naïve spam detection technique is to download the file and check it manually.
Approaches to Spam Detection Problem
Mainly two approaches to the spam detection problem.
Detection after downloading file
User compares the file with known databases of genuine files.
User filters the file so that other users don't get the spammed copy.
Detection before downloading file
Rigid trust
Web of trust
Reputation system
Blocking IP addresses
Object Reputation:-
Involves users voting for a file either positively or negatively.
Based on the voting evaluation and the voting protocol, the file is regarded as genuine or spam.
Disadvantages: -
Consumes time and labor.
Wastes bandwidth and computing resources.
Risk of opening malware.
Thus there arises a need to develop an effective automatic spam detection technique.
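The voting idea behind object reputation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_by_votes`, the +1/-1 vote encoding, the `min_votes` threshold and the rule that ties count as spam are all hypothetical and do not describe any particular voting protocol.

```python
from collections import defaultdict

def classify_by_votes(votes, min_votes=3):
    """Tally votes per file key and return a verdict for each key.

    votes: iterable of (file_key, vote), vote = +1 (genuine) or -1 (spam).
    min_votes: assumed minimum number of votes before a verdict is given.
    """
    totals = defaultdict(int)  # net score per file
    counts = defaultdict(int)  # number of votes per file
    for key, vote in votes:
        totals[key] += vote
        counts[key] += 1

    verdicts = {}
    for key, n in counts.items():
        if n < min_votes:
            verdicts[key] = "unknown"   # too few votes to decide
        elif totals[key] > 0:
            verdicts[key] = "genuine"
        else:
            verdicts[key] = "spam"      # ties are treated as spam (assumption)
    return verdicts
```

Note that every vote still requires some user to download and inspect the file first, which is the cost listed under the disadvantages.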
Query Processing
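Client issues a query.
Each server compares the query with the descriptors of its shared files.
Each matching result carries a system identifier (key) and a descriptor.
The client groups the individual results by key.
The client ranks the groups.
After downloading a file, the client becomes a server for that file.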
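The query-processing steps above can be sketched roughly as follows. The matching rule (a descriptor matches when it contains every query term), ranking by group size, and the `servers` data layout are assumptions made for illustration, not a description of a specific P2P client.

```python
import re
from collections import defaultdict

def tokens(text):
    """Split text into lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def process_query(query, servers):
    """Match, group by key, and rank results.

    servers: {peer_id: [(key, descriptor), ...]} (hypothetical layout).
    Returns a list of (key, [(peer_id, descriptor), ...]), largest group first.
    """
    query_terms = tokens(query)
    groups = defaultdict(list)
    for peer_id, replicas in servers.items():
        for key, descriptor in replicas:
            # Assumed rule: a result matches if it contains all query terms.
            if query_terms <= tokens(descriptor):
                groups[key].append((peer_id, descriptor))
    # Assumed ranking: more replicas in a group means a higher rank.
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Because the rank depends only on what servers report, a server that lies about descriptors or inflates replica counts can push its file up the list.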
Spamming
Steps 1, 3 and 5.
Object Reputation on step 1.
Feature based Spam Detection on steps 3 and 5.
Feature Based Spam Detection
Characterizing Spam.
Characterizing Spammers.
Then implement techniques that use this characterization to rank the query results.
Classification of Spam
Type 1:-
• Files whose replicas have semantically different descriptors.
• The spammer might name a file after a currently popular song, or might give multiple different names to the same file.
E.g., different song titles for the same key 26NZUBS655CC66COLKMWHUVJGUXRPVUF:
“12 days after christmas.mp3”
“i want you thalia.mp3”
“come on be my girl.mp3” …
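One possible way to flag Type 1 spam is to measure how much the descriptors of one key overlap. The sketch below uses the average pairwise Jaccard similarity of descriptor terms; the similarity measure, the function names and the 0.2 threshold are assumptions chosen for illustration.

```python
import re
from itertools import combinations

def tokens(text):
    """Split text into lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two term sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def looks_like_type1(descriptors, threshold=0.2):
    """True if the descriptors of one key share few terms on average.

    threshold is a hypothetical cut-off, not a tuned value.
    """
    term_sets = [tokens(d) for d in descriptors]
    pairs = list(combinations(term_sets, 2))
    if not pairs:  # a single replica cannot disagree with itself
        return False
    average = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return average < threshold
```

For the three titles listed above, the only shared term is the `mp3` extension, so the average similarity is low and the key is flagged.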
Classification of Spam
Type 2:-
• Files with long descriptors
• Here the spammer gives the file a single long descriptor so that it matches many different queries.
• E.g., a single replica descriptor for key 1200473A4BB17724194C5B9C271F3DC4: “Aerosmith, Van Halen, Quiet Riot, Kiss, Poison, Acdc, Accept, Def Leappard, Boney M, Megadeth, Metallica, Offspring, Beastie Boys, Run Dmc, Buckcherry, Salty Dog Remix.mp3”
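A simple length check is enough to illustrate how Type 2 spam could be flagged. The function name and the `max_terms` limit of 15 are assumptions; a real system would have to pick the limit from observed descriptor lengths.

```python
import re

def looks_like_type2(descriptor, max_terms=15):
    """True if the descriptor has more terms than max_terms (assumed limit)."""
    terms = re.findall(r"[a-z0-9]+", descriptor.lower())
    return len(terms) > max_terms
```

The descriptor in the example above contains 25 terms, well past this limit, while an ordinary song title has only a handful.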
Classification of Spam
Type 3:-
Files with descriptors with no query terms.
A server that wants to push a file may return it regardless of whether its descriptor matches the query.
E.g., “Can you afford 0.09 www.BuyLegalMP3.com.mp3”
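Type 3 spam can be checked on the client by comparing each returned descriptor with the query that was sent. The sketch below flags a result whose descriptor shares no term with the query; the function names and the tokenization are assumptions.

```python
import re

def tokens(text):
    """Split text into lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def looks_like_type3(query, descriptor):
    """True if the returned descriptor contains none of the query terms."""
    return tokens(query).isdisjoint(tokens(descriptor))
```

This check needs nothing beyond the query and the result itself, so it can run before any download takes place.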
Classification of Spam
Type 4:-
Files that are highly replicated on a single peer.
Normal users do not create multiple replicas of the same file on a single server.
This is aimed at manipulating the group size.
It also hampers query routing techniques used for finding hard-to-find data.
E.g., 177 replicas of the file DY2QXX3MYW75SRCWSSUG6GY3FS7N7YC shared on a single peer.
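Type 4 spam can be illustrated by counting how many replicas of each key a single peer reports. The input layout (one `(peer_id, key)` pair per replica), the function name and the default limit of one replica per peer are assumptions.

```python
from collections import Counter

def highly_replicated(shared_files, max_replicas=1):
    """Return {(peer_id, key): count} for keys a peer shares too many times.

    shared_files: iterable of (peer_id, key) pairs, one pair per replica.
    max_replicas: assumed number of replicas a normal peer would hold.
    """
    counts = Counter(shared_files)
    return {pair: n for pair, n in counts.items() if n > max_replicas}
```

A client could then count each flagged peer only once when computing the group size, which removes the spammer's advantage.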
Proposal
We plan to implement the Feature based Spam Detection technique that characterizes the spam based on various features.
It includes a probing technique that gathers more descriptive information about result files and statistics about peers, and ranking functions that use this information.
Our implementation requires little new functionality in existing P2P file sharing systems, so it can easily be combined with other existing techniques.
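To show how spam features could feed into ranking, the sketch below discounts each result group's size by the number of spam features it triggers. This is not the ranking function from the cited papers; the scoring formula, the `penalty` value and the dictionary layout are hypothetical.

```python
def rank_groups(groups, penalty=0.5):
    """Rank result groups by size, discounted for each spam feature triggered.

    groups: list of dicts with 'key', 'size' (replica count) and
            'flags' (number of spam features triggered, e.g. 0 to 4).
    Assumed score: size * penalty ** flags.
    """
    scored = [(g["size"] * penalty ** g["flags"], g["key"]) for g in groups]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [key for _, key in scored]
```

With a penalty of 0.5, a group of 177 replicas that triggers all four features scores about 11 and falls below a clean group of 20 replicas.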
Papers
Author – Dongmei Jia
Title – Cost Effective Spam Detection Techniques in P2P File Sharing Systems.
Conference -- Proceedings of the 2008 ACM Workshop on Large-Scale Distributed Systems for Information Retrieval.
Date -- October 2008.
Publisher -- ACM.
URL -- http://portal.acm.org.ezproxy.rit.edu/results.cfm?coll=portal&dl=ACM&CFID=14901064&CFTOKEN=96029385
References
Author – Dongmei Jia, Wai Gen Yee, Ophir Frieder
Title – Spam Characterization and Detection in Peer to Peer File Sharing Systems.
Conference -- Proceedings of the 17th ACM Conference on Information and Knowledge Management.
Date -- October 2008.
Publisher -- ACM.
URL -- http://portal.acm.org.ezproxy.rit.edu/citation.cfm?id=1458082.1458128&coll=portal&dl=ACM&CFID=14901064&CFTOKEN=96029385
References
Author – Jian Liang, Rakesh Kumar, Yongjian Xi, Keith W. Ross
Title – Pollution in P2P File Sharing Systems.
Conference -- INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings IEEE.
Date -- March 2005.
Publisher -- IEEE.
URL -- http://ieeexplore.ieee.org.ezproxy.rit.edu/stamp/stamp.jsp?arnumber=1498344&isnumber=32100