air web 05 slides

Web Spam, Propaganda and Trust P. Takis Metaxas Computer Science Department Wellesley College Joint work with Joe DeStefano

Upload: p-takis-metaxas

Post on 14-May-2015

DESCRIPTION

Web Spam, Propaganda and Trust

TRANSCRIPT

Page 1: Air Web 05 Slides

Web Spam, Propaganda and Trust

P. Takis Metaxas
Computer Science Department
Wellesley College

Joint work with Joe DeStefano

Page 2: Air Web 05 Slides

Outline of the Talk

The Web and its Spam
A Short History of the Search Engines
Web Spam as Propaganda
Propaganda Primer
Anti-propagandistic techniques on Spam
Experimental Results
Conclusions and Next Steps

Page 3: Air Web 05 Slides

The Web …

Has changed the way we get informed
Has changed the way we make decisions (financial, medical, political, …)
Is huge
  2-10 billion static pages publicly available, doubling every year
  Three times this, if you count the “deep web”
  Infinite, if you count dynamically created pages
Will be omnipresent
  Computers, cell phones, PDAs, thermostats, toasters ...

Can be unreliable

Page 4: Air Web 05 Slides

… and its Spam

Page 5: Air Web 05 Slides

… and its Spam

Page 6: Air Web 05 Slides

What is Web Spam?

The practice of manipulating web pages in order to cause search engines to rank them higher than they would without manipulation
  “… than they deserve”
  “… unjustifiably favorable [ranking wrt] the page’s true value”
  “… unethical web page positioning”
It is a problem not only for search engines
  Primarily for users
  As well as for content providers

It is first a social problem, then a technical one

Page 7: Air Web 05 Slides

Who is Spamming and Why?

Companies
  Big companies
  Small businesses
Advertisers and Promoters
  Search Engine Optimizers
Special interest groups
  Religious interests
  Financial interests
  Medical interests
  Political interests
  etc.
Everybody could/would
  My doctor
  You (?), Me (!)

85% of searchers do not go beyond the top-10
People (still) trust the written word
People trust the search engines

Page 8: Air Web 05 Slides

A Short History of Search Engines

1st Generation (ca 1994): AltaVista, Excite, Infoseek…
  Ranking based on Content
  Pure Information Retrieval
2nd Generation (ca 1996): Lycos
  Ranking based on Content + Structure
  Site Popularity
3rd Generation (ca 1998): Google, Teoma
  Ranking based on Content + Structure + Value
  Page Reputation
In the Works
  Ranking based on “the need behind the query”
  ??

Page 9: Air Web 05 Slides

1st Generation: Content Similarity

Boolean operations on query terms did not go very far

Content Similarity Ranking: the more rare words two documents share, the more similar they are

Similarity is measured by vector angles

Query results are ranked by sorting the angles between query and documents
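The ranking idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular engine: the smoothed inverse-document-frequency weighting is one common choice assumed here, and the names `tfidf_vectors`, `cosine`, `rank`, and the sample `docs` are all invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into {term: weight} vectors; rarer terms weigh more."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (1.0 means same direction)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document indices, smallest angle (largest cosine) first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scores = [cosine(q, v) for v in vectors]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Invented sample documents, already split into words.
docs = [
    "web spam hurts search engines".split(),
    "pasta recipes for busy cooks".split(),
    "search engines rank web pages".split(),
]
print(rank("pasta recipes".split(), docs))
```

Sorting by descending cosine is the same as sorting by ascending angle, which is why the code never computes the angle itself.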

How To Spam?

[Figure: document vectors d1 and d2 in a term space with axes t1, t2, t3]

Page 10: Air Web 05 Slides

1st Generation: How to Spam

Add keywords so as to confuse page relevance
Hide them from human eyes

Searching for Jennifer Aniston?

SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK
(the slide repeats this keyword block four times)
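A small Python sketch of why keyword stuffing works against a purely content-based ranker. It assumes plain term-frequency vectors compared by cosine; the page text, the query, and the helper name `cosine_tf` are all made up for illustration.

```python
import math
from collections import Counter


def cosine_tf(a, b):
    """Cosine similarity between two word lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(sum(c * c for c in cb.values()))
    return dot / norm if norm else 0.0


# Hypothetical page that has nothing to do with the query.
page = "cheap pills buy now".split()
# The same page with the query words appended twenty times (hidden from readers).
stuffed = page + "jennifer aniston".split() * 20
query = "jennifer aniston".split()

before = cosine_tf(query, page)     # no shared words
after = cosine_tf(query, stuffed)   # stuffed words dominate the vector
print(before, after)
```

The stuffed page looks almost identical to the query in vector space even though its visible content is unchanged.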

Page 11: Air Web 05 Slides

2nd Generation: Site Popularity

A link from a page in site A to some page in site B is considered a popularity vote from A to B

Rank similar pages according to popularity

Related implementation of popularity: DirectHit’s click-throughs

Rich get richer: users will always try the first few links returned
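Site popularity as described here can be sketched by counting, for each site, how many other sites link to it. The link list below is invented for the example, and `site_popularity` is a hypothetical helper name; counting each voting site at most once is an assumption of this sketch.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_popularity(links):
    """Count, for every site, the number of distinct other sites linking to it."""
    voters = defaultdict(set)
    sites = set()
    for source_url, target_url in links:
        source = urlparse(source_url).netloc
        target = urlparse(target_url).netloc
        sites.update((source, target))
        if source != target:  # links inside one site are not votes
            voters[target].add(source)
    return {site: len(voters[site]) for site in sites}


# Invented page-to-page links; only the host part matters for the vote.
example_links = [
    ("http://www.zz.com/a.html", "http://www.aa.com/"),
    ("http://www.aa.com/x.html", "http://www.bb.com/"),
    ("http://www.zz.com/b.html", "http://www.bb.com/"),
    ("http://www.bb.com/y.html", "http://www.cc.com/"),
    ("http://www.cc.com/z.html", "http://www.dd.com/"),
    ("http://www.aa.com/w.html", "http://www.dd.com/"),
]
popularity = site_popularity(example_links)
print(sorted(popularity.items(), key=lambda item: item[1], reverse=True))
```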

How To Spam?

[Figure: sites labeled with their popularity counts: www.aa.com 1, www.bb.com 2, www.cc.com 1, www.dd.com 2, www.zz.com 0]

Page 12: Air Web 05 Slides

2nd Generation: How to Spam

Heavily interconnected “link farms” spam popularity

Clicking robots spam click-throughs

Page 13: Air Web 05 Slides

3rd Generation: Page Reputation

A link from a page Px to page Py is considered a confidence vote from Px to Py
  Confidence builds reputation (as in academic co-citations)

The reputation “PageRank” of a page Pi = the sum of a fraction of the reputations of all pages Pj that point to Pi

Beautiful math behind it
  PR = principal eigenvector of the web’s link matrix
  PR equivalent to the chance of randomly surfing to the page

HITS algorithm tries to recognize “authorities” and “hubs”
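The reputation rule above can be sketched as a power iteration in Python. This is a textbook-style approximation, not Google's implementation: the damping factor of 0.85, the handling of pages without outgoing links, the fixed iteration count, and the function name `pagerank` are all assumptions of this sketch. The "fraction" each page passes on is taken to be its current rank divided by its number of outgoing links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a dict {page: [pages it links to]}."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline (the random-surfer jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # A page with no out-links spreads its rank evenly (assumed convention).
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Tiny invented web: a, b, c form a cycle and d only points at a.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
print(sorted(ranks.items(), key=lambda item: item[1], reverse=True))
```

Because the ranks always sum to one, each value can be read as the chance that a random surfer is on that page.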

How To Spam?

Page 14: Air Web 05 Slides

3rd Generation: How to Spam

Organize “mutual admiration societies” of irrelevant reputable sites

Page 15: Air Web 05 Slides

An Industry is Born

“SE Optimizer” Companies
Advertisement Consultants
Conferences

Page 16: Air Web 05 Slides

Web Spam as a major force behind Search Engine Evolution

Search Engine’s Action                            Web Spammers’ Response
1st Generation: Pure IR                           Add keywords so as to confuse page relevance
  Content
2nd Generation: Popularity                        Create “link farms” of heavily interconnected sites
  Content + Structure
3rd Generation: Reputation                        Organize “mutual admiration societies” of irrelevant sites
  Content + Structure + Value
In the Works                                      ??
  Ranking based on “the need behind the query”

Is there a pattern in how to spam?

Can you guess what they will do?

They will try to modify the Web Graph for their benefit

Page 17: Air Web 05 Slides

And Now For Something Completely Different(?)

Propaganda: attempt to modify human behavior, and thus influence people’s actions, in ways beneficial to propagandists

Theory of Propaganda
  Developed by the Institute for Propaganda Analysis, 1938-1942

Propagandistic techniques (and ways of detecting propaganda)
  Word games
    Name Calling
    Glittering Generalities
  Transfer
  Testimonial
  Bandwagon

Page 18: Air Web 05 Slides

Societal Trust is a Network

A Simplified Description of Societal Trust:

Weighted directed graph of nodes and weighted arcs
  Nodes = societal entities (people, ideas, …)
  Arcs = recommendation from one entity to another
  Arc weight = degree of entrustment

Then what is Propaganda?
  Attempt to modify the Trust Social Network in ways beneficial to the propagandist

And what is Web Spam?
  Attempt to modify the Web Graph in ways beneficial to the spammer

Page 19: Air Web 05 Slides

Web Spam as Propaganda

SE’s      Ranking             Spamming                        Propaganda
1st Gen   Doc Similarity      Keyword stuffing                Glittering generalities
2nd Gen   + Site popularity   + link farms                    + Bandwagon
3rd Gen   + Page reputation   + mutual admiration societies   + Testimonials

Web Spam is a major force behind Search Engine evolution

So what?
Can this understanding help us defend against web spam?

Page 20: Air Web 05 Slides

Anti-Propagandistic Lessons for Web

How do you deal with propaganda in real life?

Backward propagation of distrust
  The recommender of an untrustworthy message becomes untrustworthy

Can you transfer this technique to the web?
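One way to picture backward propagation of distrust on a link graph is the Python sketch below. It is only an illustration of the idea: the decay factor, the depth limit, and the name `propagate_distrust` are assumptions introduced here, not part of the talk.

```python
from collections import defaultdict, deque


def propagate_distrust(links, untrusted, decay=0.5, max_depth=3):
    """Walk links backwards from an untrusted node, assigning weaker distrust per hop."""
    recommenders = defaultdict(set)
    for source, target in links:
        recommenders[target].add(source)  # source recommends target

    distrust = {untrusted: 1.0}
    queue = deque([(untrusted, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for recommender in recommenders[node]:
            if recommender not in distrust:  # keep the score from the shortest path
                distrust[recommender] = distrust[node] * decay
                queue.append((recommender, depth + 1))
    return distrust


# Invented example: a recommends spam, b recommends a, spam links out to c.
scores = propagate_distrust([("a", "spam"), ("b", "a"), ("spam", "c")], "spam")
print(scores)
```

Note that `c` receives no distrust: being linked to by an untrustworthy site says nothing about the target, only recommending one does.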

Page 21: Air Web 05 Slides

An Anti-Propagandistic Algorithm

Start from untrustworthy site s
S = {s}
Using BFS for depth D do:
  Find the set U of sites linking to sites in S
    (using the Google API for up to B b-links/site)
  Ignore blogs, directories, edu’s
  S = S + U
Find the bi-connected component BCC of U that includes s

BCC shows multiple paths to boost the reputation of s
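The steps on this slide can be sketched in Python as follows. This is a rough reading of the slide, not the authors' code: the `backlinks` callable stands in for the backlink query (the slide used the Google API), `ignore` stands in for the blog/directory/edu filter, and all function names are hypothetical. When s lies in several biconnected components, the sketch returns the largest one, which is also an assumption.

```python
from collections import defaultdict


def explore_backlinks(start, backlinks, depth, max_links, ignore=lambda site: False):
    """BFS over backlinks; returns the explored neighborhood as undirected adjacency sets."""
    graph = defaultdict(set)
    graph[start]  # make sure the start site is a node
    seen, frontier = {start}, [start]
    for _ in range(depth):
        next_frontier = []
        for site in frontier:
            for linker in backlinks(site)[:max_links]:
                if linker == site or ignore(linker):
                    continue
                graph[linker].add(site)
                graph[site].add(linker)
                if linker not in seen:
                    seen.add(linker)
                    next_frontier.append(linker)
        frontier = next_frontier
    return graph


def biconnected_components(graph):
    """Biconnected components of an undirected graph, each as a set of nodes."""
    index, low, components, edge_stack = {}, {}, [], []

    def visit(node, parent):  # recursive DFS; fine for small neighborhoods
        index[node] = low[node] = len(index)
        for nbr in graph[node]:
            if nbr == parent:
                continue
            if nbr not in index:
                edge_stack.append((node, nbr))
                visit(nbr, node)
                low[node] = min(low[node], low[nbr])
                if low[nbr] >= index[node]:  # node cuts off nbr's subtree
                    component = set()
                    while True:
                        edge = edge_stack.pop()
                        component.update(edge)
                        if edge == (node, nbr):
                            break
                    components.append(component)
            elif index[nbr] < index[node]:  # back edge to an ancestor
                edge_stack.append((node, nbr))
                low[node] = min(low[node], index[nbr])

    for node in list(graph):
        if node not in index:
            visit(node, None)
    return components


def bcc_containing(start, graph):
    """Largest biconnected component that includes the start site."""
    candidates = [c for c in biconnected_components(graph) if start in c]
    return max(candidates, key=len, default={start})


# Invented backlink data: a and b link to s and to each other; c only links to a.
fake_web = {"s": ["a", "b", "x.edu"], "a": ["b", "c"], "b": ["a"]}
neighborhood = explore_backlinks(
    "s", lambda site: fake_web.get(site, []), depth=2, max_links=10,
    ignore=lambda site: site.endswith(".edu"),
)
print(bcc_containing("s", neighborhood))
```

In the example, `c` reaches s only through `a`, so it falls outside the component: the biconnected component keeps exactly the sites that have more than one independent path to s.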

Page 23: Air Web 05 Slides

Explored neighborhoods

Page 24: Air Web 05 Slides

Evaluated Experimental Results

Target                       |G|    |BCC|   Trustworthy    Directory   Untrustworthy
renuva.net                   1307   228     2% = 1/46      13%         74% = 34/46
coral-calcium-benefits.com   1380   266     4% = 2/54      7%          78% = 42/54
vespro.com                   875    97      0% = 0/20      15%         80% = 16/20
hardcorebodybuilding.com     457    63      0% = 0/13      15%         69% = 9/13
maxsportsmag.com             716    105     0% = 0/22      27%         64% = 14/22
coral1.com                   312    228     9% = 4/47      13%         60% = 28/47
genf20.com                   81     32      0% = 0/32      0%          100% = 32/32
1stHGH.com                   1547   200     5% = 2/40      10%         70% = 28/40
hgfound.org                  1429   164     56% = 19/34    26%         14% = 1/34
advice-hgh.com               241    13      77% = 10/13    8%          15% = 2/13

Page 25: Air Web 05 Slides

Evaluated Experimental Results

Page 26: Air Web 05 Slides

Conclusions and Next Steps

Web Spam / Cyberworld = Propaganda / Society
Particular spamming techniques can be uncovered - then what?
Spam becomes a necessity as the web grows
  “I spent all my life searching for the meaning of life…”
  “If you cannot find it on eBay or Google, it does not exist”
Spam to you, treasure to me
“Who do you trust?” is the right question to ask (and provide tools for managing the trusted and distrusted)
  Personalization of search: a search engine (component) per browser
  Or: specialized search engines
Education, critical thinking
  What we believe, why we believe it
Cyber-social structures and networks
  I inherit the trusted/distrusted networks of the societies I join

Page 27: Air Web 05 Slides

How (not) To Solve The Problem