LSDS-IR’08, www.ir.iit.edu. Cost-Effective Spam Detection in P2P File-Sharing Systems. Dongmei Jia, Information Retrieval Lab, Illinois Institute of Technology, [email protected]


Post on 18-Dec-2015



LSDS-IR’08 www.ir.iit.edu 1

Cost-Effective Spam Detection in P2P File-Sharing Systems

Dongmei Jia

Information Retrieval Lab

Illinois Institute of Technology

[email protected]


Goal

• Create cost-effective ways of automatically detecting P2P spam results without actually downloading the files


Introduction

• Spam:
  – Any file that is deliberately misrepresented, or described in a way that manipulates established retrieval and ranking techniques
• Spam is harmful
  – Degrades the user search experience
  – Assists the propagation of viruses in the network
  – Has a significant impact on P2P traffic load


Problem Statement

• Naïve spam detection method
  – Download and manually check
  – Cons:
    • Time- and labor-consuming
    • Wastes bandwidth and storage resources
    • Risks opening malware
• Hence, automatic spam detection is needed!


Emule Example

[Figure: eMule search results screenshot, with callouts for the query (number of results), descriptors, group size, and file key]

Hard to detect spam automatically in query result set!


Types of Spam

• Type 1: Files whose replicas have semantically different descriptors
  – E.g., different song titles for the same key 26NZUBS655CC66COLKMWHUVJGUXRPVUF:
    “12 days after christmas.mp3”
    “i want you thalia.mp3”
    “comon be my girl.mp3”


Types of Spam (Cont’d)

• Type 2: Files with long descriptors that contain semantically nonsensical term combinations
  – Single-descriptor problem
  – E.g., a single replica descriptor for key 1200473A4BB17724194C5B9C271F3DC4:
    “Aerosmith,Van Halen,Quiet Riot,Kiss, Poison, Acdc, Accept, Def Leappard, Boney M, Megadeth, Metallica, Offspring, Beastie Boys, Run Dmc, Buckcherry, Salty Dog Remix.mp3”


Types of Spam (Cont’d)

• Type 3: Files with descriptors that contain no query terms
  – Ads or warnings on the illegal distribution of copyrighted materials
  – E.g., “Can you afford 0.09 www.BuyLegalMP3.com.mp3”
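A Type 3 check can be sketched as a simple term-overlap test. This is only an illustration: the tokenization rule, the function names, and the example query are assumptions, not something the slides specify.

```python
def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed rule)."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return cleaned.split()


def has_no_query_terms(descriptor, query):
    """True if the descriptor shares no term with the query (Type 3 symptom)."""
    return not (set(tokenize(descriptor)) & set(tokenize(query)))


# Hypothetical query; the descriptor is the ad example from the slide.
print(has_no_query_terms("Can you afford 0.09 www.BuyLegalMP3.com.mp3",
                         "aerosmith dream on"))
```

Because regular results are matched conjunctively, a result whose descriptor contains none of the query terms is a strong hint that something other than normal matching produced it.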


Types of Spam (Cont’d)

• Type 4: Files that are highly replicated on a single peer
  – Normal users do not create multiple replicas of the same file on a single server
  – Manipulates “group size” ranking
  – E.g., 177 replicas of the file DY2QXX3MYW75SRCWSSUG6GY3FS7N7YC shared on a single peer


Feature-Based Spam Detection

• Basic idea
  – Detect spam results using P2P features that are strongly correlated with spam
    • Vocabulary size of a file’s group descriptor
    • Variance of terms in replica descriptors D of a file group G
      – Jaccard distance: 1 - |D ∩ G| / |D ∪ G|
      – Cosine distance: 1 - (V_G · V_D) / (|V_G| |V_D|)
    • Per-host replication degree of a file
      – numRep / numHost
    • …
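The features listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the tokenizer, the function names, and the choice to build the group descriptor G by pooling the terms of all replica descriptors are illustrative, not taken from the slides.

```python
import math
from collections import Counter


def tokenize(descriptor):
    """Lowercase and split on non-alphanumeric characters (assumed rule)."""
    cleaned = "".join(c if c.isalnum() else " " for c in descriptor.lower())
    return cleaned.split()


def jaccard_distance(replica_terms, group_terms):
    """1 - |D ∩ G| / |D ∪ G| over term sets."""
    d, g = set(replica_terms), set(group_terms)
    union = d | g
    if not union:
        return 0.0
    return 1.0 - len(d & g) / len(union)


def cosine_distance(replica_terms, group_terms):
    """1 - (V_G · V_D) / (|V_G| |V_D|) over term-frequency vectors."""
    vd, vg = Counter(replica_terms), Counter(group_terms)
    dot = sum(vd[t] * vg[t] for t in vd)
    norm = (math.sqrt(sum(v * v for v in vd.values()))
            * math.sqrt(sum(v * v for v in vg.values())))
    if norm == 0:
        return 0.0
    return 1.0 - dot / norm


def per_host_replication(num_rep, num_host):
    """Per-host replication degree: numRep / numHost."""
    return num_rep / num_host


# The three replica descriptors of the Type 1 example key.
replicas = [tokenize(d) for d in ("12 days after christmas.mp3",
                                  "i want you thalia.mp3",
                                  "comon be my girl.mp3")]
# Assumption: the group descriptor pools the terms of all replicas.
group = [term for replica in replicas for term in replica]
vocabulary_size = len(set(group))
```

For the Type 1 example, every replica shares only the term "mp3" with the others, so each replica sits far from the pooled group descriptor; for Type 4, 177 replicas on one host give a per-host replication degree of 177.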


Probe Query

• Problem:
  – Results have insufficient and biased description info
    • Conjunctive query matching
• Solution:
  – Gather more info for a result from the network
    • Other replica descriptors of the file
    • Statistics of peers who share the file
      – Num of files, num of unique files, peer ID
  – Implementation
    • A probe query contains only a file key, not a “term” query
  – Intuition
    • Probing helps to create a more complete view of a file
    • Ranking is more effective with more adequate file info
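A probe query can be illustrated against an in-memory stand-in for the network. The data layout (a dict from peer ID to shared replicas) and the response fields are assumptions made for this sketch; the slides only state what information a probe gathers.

```python
def probe(file_key, peers):
    """Collect replica descriptors and peer statistics for one file key.

    peers: dict mapping peer_id -> list of (file_key, descriptor) replicas
    (a hypothetical layout standing in for real network messages).
    """
    responses = []
    for peer_id, shared in peers.items():
        descriptors = [desc for key, desc in shared if key == file_key]
        if not descriptors:
            continue  # this peer does not share the probed file
        responses.append({
            "peer_id": peer_id,
            "descriptors": descriptors,
            "num_files": len(shared),
            "num_unique_files": len({key for key, _ in shared}),
        })
    return responses


peers = {
    "p1": [("K1", "a.mp3"), ("K1", "b.mp3"), ("K2", "c.mp3")],
    "p2": [("K2", "c.mp3")],
}
responses = probe("K1", peers)
```

Note that the probe matches on the file key alone, so it also returns replica descriptors that would never satisfy the user's original term query.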


Evaluation

• Dataset
  – P2P audio files crawled from the Gnutella network:
    • numRep = 25,137,217; numFile = 9,575,113; numPeer = 226,786
  – 50 most popular queries in the crawled dataset
    • Representative of most users, more likely target for spam
• Metric
  – Num spam in top-N ranked results, esp. for a small N
• Effectiveness
  – Improves performance by 9% for top-200 results, by 92.5% for top-20 results
    • Base case: noprobe+numRep
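The metric itself is simple to state in code. The sketch below assumes a ranking is a list of file keys and that spam labels are available as a set; both representations and the function names are illustrative.

```python
def spam_in_top_n(ranked_keys, spam_keys, n):
    """Number of spam results among the first n ranked results."""
    return sum(1 for key in ranked_keys[:n] if key in spam_keys)


def avg_spam_in_top_n(rankings, spam_keys, n):
    """Average of the metric over the rankings of several queries."""
    return sum(spam_in_top_n(r, spam_keys, n) for r in rankings) / len(rankings)


ranked = ["a", "b", "c", "d"]
spam = {"b", "d"}
print(spam_in_top_n(ranked, spam, 2))
```

Lower values are better, and small N matters most because users mainly look at the first few results.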


Cost Control

• Tradeoff
  – Performance vs. cost
• Cost
  – Num of responses for regular query and probe query
• Problem
  – Network cost is dramatically increased by probing
• How to reduce the cost?


Cost Control Approaches

• Random sampling of probe query results

• Piggy-backing of descriptor data in probe queries

• Limiting the scope of probing


Random Sampling

• Server-side random sampling of probe query results
  – A predefined probability P, 0 ≤ P ≤ 1
  – Reduces cost by a factor P predictably
  – Impact on effectiveness of spam detection?
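Server-side sampling can be sketched as an independent coin flip per probe result. The function name and the optional `rng` parameter are assumptions for this illustration; the slides only fix the probability P.

```python
import random


def sample_probe_results(results, p, rng=None):
    """Return each probe result independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so p = 1 keeps everything and p = 0 nothing.
    return [r for r in results if rng.random() < p]


results = list(range(10_000))
sampled = sample_probe_results(results, 0.25, rng=random.Random(0))
```

The expected number of returned results is P times the original count, which is why the cost reduction is predictable.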


Experimental Results

[Figure: Avg Num Spam vs. Top N Results; series: probe sampling rates 0.25, 0.5, 0.75, 1, and noprobe]

• In all sampling cases, overall performance is still 1.7%-9% better than noprobe
• Performance for top-20 results is 71%-92% better than noprobe

[Figure: Avg Total Cost vs. Probe Query Sampling Rate (noprobe, 0.25, 0.5, 0.75, 1)]

• Cost is reduced significantly by sampling fewer probe results
• But the cost is still high: with 25% sampling, cost is ~7 times higher than noprobe


Piggy-backing of Descriptor Data

• Piggy-backing of descriptor data in probe queries
  – New type of probe query
    • File key + descriptor of the result file being probed
  – A server does not respond if its descriptor contains no new term compared with the descriptor in the probe query
    • To limit the num of probe results returned to the client
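The server-side decision for a piggy-backed probe reduces to a set difference. The whitespace tokenization and the function name below are assumptions; a real system would reuse whatever term extraction it applies to descriptors elsewhere.

```python
def should_respond(server_descriptor, probe_descriptor):
    """Respond only if the server's descriptor adds at least one new term."""
    server_terms = set(server_descriptor.lower().split())
    probe_terms = set(probe_descriptor.lower().split())
    return bool(server_terms - probe_terms)


print(should_respond("i want you thalia", "i want you"))
```

A server whose descriptor is a subset of the piggy-backed one stays silent, which saves a response at the price of losing information about term frequencies across replicas.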


Experimental Results

[Figure: Avg Num Spam vs. Top N Results; series: probe sampling rates 0.25, 0.5, 0.75, 1, and noprobe]

• Compared with the original type of probe, overall performance drops by ~15%
• However, performance for top-20 results is improved by 71%-88% in all sampling cases

[Figure: Avg Total Cost vs. Probe Query Sampling Rate (noprobe, 0.25, 0.5, 0.75, 1)]

• Compared with the original type of probe, total cost decreases by 35%-39% for all sampling rates
• E.g., the cost with sampling rate 0.25 is ~4 times higher than noprobe


Limiting Probing Scope

• Limiting the scope of probing
  – Only probe a few top-ranked (e.g., top-20) regular query results
  – Intuition
    • Users tend to consider downloading a file only from a few top-ranked results
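Limiting the probing scope, together with the cost definition used in these slides (number of responses to regular and probe queries), can be sketched as follows. The function names and the stand-in probe function are hypothetical.

```python
def probe_top_k(ranked_keys, probe_fn, k=20):
    """Probe only the k top-ranked regular query results."""
    return {key: probe_fn(key) for key in ranked_keys[:k]}


def total_cost(num_regular_responses, probed):
    """Cost = regular query responses + probe query responses."""
    return num_regular_responses + sum(len(r) for r in probed.values())


# Toy setup: 50 regular results, each probe returns 3 responses.
ranked = ["k%d" % i for i in range(50)]
fake_probe = lambda key: ["r1", "r2", "r3"]
probed = probe_top_k(ranked, fake_probe, k=20)
print(total_cost(len(ranked), probed))
```

In this toy setup, probing all 50 results would cost 50 + 150 responses, while probing only the top 20 costs 50 + 60.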


Experimental Results

• Performance when probing only the top-20 results is always 22%-56% better than noprobe

[Figure: Avg Num Spam vs. Top N Results (top 20 only); series: probe sampling rates 0.25, 0.5, 0.75, 1, and noprobe]

• Probing only the top-20 results significantly reduces cost
• E.g., cost with sampling rate 0.25 is only twice that of noprobe

[Figure: Avg Total Cost vs. Probe Query Sampling Rate (noprobe, 0.25, 0.5, 0.75, 1)]


Conclusion

• Feature-based spam detection techniques successfully decrease the amount of spam
  – 9% in top-200 results; 92% in top-20 results
• Cost control methods are effective in reducing network cost
  – The factor increase in cost over noprobe drops from 7 to 2
  – At the same time, performance is at least 22% better than noprobe for top-20 results


References

• Limewire junk filter. http://wiki.limewire.org/index.php?title=Junk_Filter
• J. Liang, R. Kumar, Y. Xi, and K. Ross. Pollution in P2P File Sharing Systems. In INFOCOM’05, May 2005.
• K. Svore, Q. Wu, C.J.C. Burges, and A. Raman. Improving Web spam classification using Rank-time features. In Proc. AIRWeb workshop in WWW, 2007.
• Shlomo Hershkop, Salvatore J. Stolfo. Combining Email Models for False Positive Reduction. In Proc. KDD’05, Chicago, Aug. 2005.
• P. A. Chirita, J. Diederich, and W. Nejdl. MailRank: Using ranking for spam detection. In Proc. CIKM’05, Bremen, Germany, 2005.
• Alexandros Ntoulas, Marc Najork, Mark Manasse, Dennis Fetterly. Detecting spam web pages through content analysis. In Proc. of WWW'06.
• Sepandar D. Kamvar, Mario T. Schlosser, and Hector Garcia-Molina. The EigenTrust Algorithm for Reputation Management in P2P Networks. In Proc. of WWW, 2003.
• Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., Pedersen, J. Link spam detection based on mass estimation. In Proc. of the 32nd International Conference on Very Large Data Bases (VLDB), ACM Press (2006), 439-450.
• Limewire. www.limewire.org
• Runfang Zhou and Kai Hwang. Gossip-based Reputation Aggregation for Unstructured Peer-to-Peer Networks. 21st IEEE International Parallel & Distributed Processing Symposium (IPDPS'07), Los Angeles, March 26-30, 2007.
• Kevin Walsh, Emin Gun Sirer. Experience with an Object Reputation System for Peer-to-Peer Filesharing. In 3rd Symposium on Networked Systems Design & Implementation (NSDI), 2006.
• Uichin Lee, Min Choi, Junghoo Cho, Medy Y. Sanadidi, Mario Gerla. Understanding Pollution Dynamics in P2P File Sharing. In Proc. IPTPS'06.


• Questions?

• Contact info:
  – WWW: www.ir.iit.edu
  – Email: [email protected]

Thanks from IIT’s IR Lab!


Related Work

• Email spam detection
  – Hershkop et al., KDD’05
    • Analyze email content and syntax
  – Chirita et al., CIKM’05
    • Construct social networks for email addresses
• Web spam detection
  – Ntoulas et al., WWW’06
    • Analyze content of Web pages
  – Gyöngyi et al., VLDB’06
    • Analyze link structure of Web pages


Related Work (Cont’d)

• P2P spam detection
  – Spam filter in Limewire
    • User-controlled spam learning
  – Liang et al., INFOCOM’05
    • Detect spam using extra info, e.g., the official CD length of a media file
  – Kamvar et al., WWW’03
    • Build reputation systems to rank peers


Simulating P2P search

• Built a system to simulate P2P search on the client side
• Simulating query routing
  – A query is randomly sent to 50 peers
  – Repeat until either stop condition is satisfied
    • Condition 1: num of unique results reaches 200
    • Condition 2: num of peers that have received the query reaches 50K
  – Threshold values chosen based on specifications of real-world P2P systems (e.g., Limewire’s Gnutella)
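The routing loop described above can be sketched as follows. Several details are assumptions of this sketch rather than facts from the slides: peers are never contacted twice, "unique results" means unique file keys, stop conditions are checked between batches, and the `matches` callback stands in for query matching.

```python
import random


def simulate_query(peers, matches, batch_size=50, max_results=200,
                   max_peers=50_000, rng=None):
    """Send a query to random batches of peers until a stop condition holds.

    peers: dict mapping peer_id -> list of (file_key, descriptor) replicas
    matches: function(descriptor) -> bool, the (assumed) query matcher
    Returns (set of unique matching file keys, number of peers contacted).
    """
    rng = rng or random.Random()
    unvisited = list(peers)
    rng.shuffle(unvisited)  # random routing order, no peer visited twice
    results = set()
    contacted = 0
    while unvisited and len(results) < max_results and contacted < max_peers:
        batch, unvisited = unvisited[:batch_size], unvisited[batch_size:]
        for peer_id in batch:
            contacted += 1
            for file_key, descriptor in peers[peer_id]:
                if matches(descriptor):
                    results.add(file_key)
    return results, contacted


# Toy network: 1000 peers, each sharing one distinct file.
toy_peers = {i: [("K%d" % i, "song %d" % i)] for i in range(1000)}
found, contacted = simulate_query(toy_peers, lambda d: True)
```

Because the conditions are only checked between batches, the result set can slightly overshoot 200 when a single batch contributes many new keys.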


Experimental Results

[Figure: Avg Num Spam vs. Top N Results; series: noprobe+numRep, noprobe+CosineQD, probe+numRep, probe+Cosine, probe+Jaccard, probe+numUniqueTerms]

• Compared with noprobe+numRep, probe+Cosine improves performance by 9% for top-200 results and by 92.5% for top-20 results
• Compared with noprobe+CosineQD, the improvements are 21.6% and 97.8%


Experimental Results (Cont’d)

Compare Cosine/Jaccard distance with numUniqueTerms in a fair way by only considering multi-replica files

[Figure: Avg Num Spam vs. Top N Results; series: probe+Cosine, probe+Jaccard, probe+numUniqueTerms]