Databases Computer Security Software Engineering Computer Graphics Networking Distributed Systems Web Search Engines: Technologies, Applications, and Opportunities Torsten Suel Associate Professor CSE Department Polytechnic Institute of NYU [email protected]

Post on 21-Dec-2015


Page 1:

Web Search Engines: Technologies, Applications, and Opportunities

Databases, Computer Security, Software Engineering, Computer Graphics, Networking, Distributed Systems

Torsten Suel
Associate Professor
CSE Department
Polytechnic Institute of NYU
[email protected]

Page 2:

What this talk is about:

- web search engines - how they work

- underlying technologies

- applications and impact

- what is next?

- opportunities and education

Page 3:

The Basics:

• type in some words
• get back results (usually 10 at a time)
• hopefully good results among these
• if not, change your query, try again

Page 5:

Web Search Technology

• as the web grew, we needed a way to find sites
• has grown into a many-billion-$ industry at the heart of the web
• search engines: largest supercomputers of the world
• hundreds of millions of queries per day, 0.1 sec latency/query
• evaluated over tens of billions of documents
• the majors: Google, MSFT Bing, Yahoo!(?), Baidu, Yandex
• but many other businesses use similar technologies, or rely on search engines for their business

Page 6:

Overview of this Lecture:

• 1. Search engines in a nutshell:
  - How does the web work?
  - How do search engines work?
  - Basic search architecture
  - Historical background

• 2. Technical challenges and opportunities:
  - link analysis
  - computational advertising

• 3. Education and opportunities:
  - the search landscape
  - search engines and the curriculum

Page 7:

1. Search Engines in a Nutshell

The Web:

text …

A lot of text …

>100 billion pages of text

and other stuff …

Page 8:

What is the web? (another view)

• pages containing (fairly unstructured) text
• images, audio, etc. embedded in (hanging off) pages
• structure defined using HTML (Hypertext Markup Language)
• hyperlinks between pages!
• over 100 billion pages
• over 3 trillion hyperlinks

a giant graph!

Page 9:

How the web is organized: site structure

• pages reside in servers
• sites often contain related pages
• site/host structure
• local versus global links

[Figure: three web servers (hosts): www.poly.edu, www.cnn.com, www.irs.gov]

Page 10:

How Browsing Works

Fetching “www.cnn.com/world/index.html”:

Desktop (with browser) → Web Server (www.cnn.com): give me the file “/world/index.html”

Web Server (www.cnn.com) → Desktop (with browser): here is the file: “...”

Page 11:

HTTP:

Request (desktop or crawler → web server):

GET /world/index.html HTTP/1.0
User-Agent: Mozilla/3.0 (Windows 95/NT)
Host: www.cnn.com
From: …
Referer: …
If-Modified-Since: ...

Response (web server → desktop or crawler):

HTTP/1.0 200 OK
Server: Netscape-Communications/1.1
Date: Tuesday, 8-Feb-99 01:22:04 GMT
Last-modified: Thursday, 3-Feb-99 10:44:11 GMT
Content-length: 5462
Content-type: text/html

<the html file>
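The exchange above can be sketched in a few lines of Python. This is only an illustration of the message format, not a full HTTP client: the function names and the `ExampleCrawler/0.1` user agent are made up here, and the response is a hard-coded sample string rather than data read from a socket.

```python
def build_get_request(host, path, user_agent="ExampleCrawler/0.1"):
    """Return the text of an HTTP/1.0 GET request like the one on the slide."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"User-Agent: {user_agent}",
        f"Host: {host}",
        "",  # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw):
    """Split a raw HTTP response into (status_code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status_code = int(status_line.split()[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


request = build_get_request("www.cnn.com", "/world/index.html")

# Hard-coded sample response (an assumption, standing in for a real server reply).
sample_response = (
    "HTTP/1.0 200 OK\r\n"
    "Server: Netscape-Communications/1.1\r\n"
    "Content-type: text/html\r\n"
    "\r\n"
    "<html>...</html>"
)
status, headers, body = parse_response(sample_response)
```

A crawler sends the same kind of request as a browser; only the User-Agent line and what it does with the returned file differ.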

Page 12:

Basic structure of a search engine:

[Figure: Crawler → disks → mining & indexing → Index; a query (“computer”) entered at Search.com is answered by a look up in the Index]

Page 13:

Crawling

[Figure: Crawler → disks]

• crawler: also called spider, web robot
• fetches pages from the web
• starts at set of “seed pages”
• parses fetched pages for hyperlinks
• then follows those links (e.g., BFS)
• until all pages are fetched (i.e., never)
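The crawling loop above can be sketched as a breadth-first search. This is a minimal sketch under stated assumptions: the "web" is a made-up in-memory dictionary (`TOY_WEB`) instead of real HTTP fetches, and `crawl`, `LinkExtractor`, and `max_pages` are illustrative names. A real crawler would also need politeness delays, robots.txt handling, and error recovery.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    queue = deque(seeds)
    seen = set(seeds)
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        fetched.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return fetched


# Hypothetical three-page "web", stored in memory instead of fetched over HTTP.
TOY_WEB = {
    "http://a.example/": '<a href="/b.html">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b.html": '<a href="http://c.example/">c</a>',
    "http://c.example/": '<a href="http://a.example/">a</a>',
}

order = crawl(["http://a.example/"], TOY_WEB.get)
```

The `seen` set is what keeps the crawler from fetching the same page twice when several pages link to it.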

Page 14:

Data Mining:

• data fetched by the crawler is analyzed
• many different tasks:
  - link analysis (later)
  - detection of spam pages and dangerous pages
  - analyzing data about past queries (clicks etc.)
  - data extraction (products, people, locations, …)
• mining becoming increasingly important
• needs experts in statistics & machine learning
• large-scale mining platforms: MapReduce (Google), Hadoop, Pig (Yahoo!), Dryad (MSFT), …

Page 15:

Indexing

[Figure: disks → indexing → “inverted index”]

• parse & build lexicon & build index
• index very large: I/O-efficient techniques needed

“inverted index”:

aardvark    3452, 11437, …
...
arm         4, 19, 29, 98, 143, ...
armada      145, 457, 789, ...
armadillo   678, 2134, 3970, ...
armani      90, 256, 372, 511, ...
...
zebra       602, 1189, 3209, ...
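An inverted index like the one above maps each term to the sorted list of IDs of the documents that contain it. The sketch below builds one in memory for three made-up documents; `build_inverted_index` and the tokenization rule are assumptions for illustration, and it ignores the I/O-efficiency issues that matter at web scale.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(doc_id)
    # Sort terms (the lexicon) and each postings list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical document collection (IDs and texts are made up).
docs = {
    1: "The armadillo and the zebra",
    2: "Zebra crossing",
    3: "An armada of ships",
}
index = build_inverted_index(docs)
```

Keeping each postings list sorted by document ID is what makes the list merges used in query processing cheap.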

Page 16:

Querying

Boolean queries: (zebra AND armadillo) OR armani
→ compute unions/intersections of lists

Ranked queries: zebra, armadillo
→ give scores to all docs in union or intersection of lists

look up:

aardvark    3452, 11437, …
...
arm         4, 19, 29, 98, 143, ...
armada      145, 457, 789, ...
armadillo   678, 2134, 3970, ...
armani      90, 256, 372, 511, ...
...
zebra       602, 1189, 3209, ...
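The Boolean query above reduces to merging sorted postings lists. A minimal sketch follows; the postings lists are made up for the example (the lists on the slide are elided, so they are not reproduced here), and `intersect` and `union` are illustrative names.

```python
def intersect(a, b):
    """AND: merge two sorted postings lists, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: merge two sorted postings lists, keeping each ID once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


# Made-up postings lists, not the ones from the slide.
index = {
    "zebra": [3, 7, 12, 20],
    "armadillo": [7, 9, 20, 31],
    "armani": [5, 12],
}

# (zebra AND armadillo) OR armani
result = union(intersect(index["zebra"], index["armadillo"]), index["armani"])
```

Both merges run in time linear in the combined length of the two lists, which is why postings are kept sorted.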

Page 17:

Ranked Querying:

• return best pages first
• term- vs. link- vs. log-based approaches
• multiple phases, many features
• also add meaningful “snippets”
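As one illustration of a term-based approach, the sketch below scores every document in the union of the query terms' lists with a simple TF-IDF sum. TF-IDF is an assumption chosen for the example (the slides do not name a specific ranking function), and the documents and the `rank` function are made up. Production engines combine many more features over multiple phases.

```python
import math
from collections import Counter


def rank(query_terms, docs, k=10):
    """Score docs containing at least one query term; return the top k."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many docs does each query term occur?
    df = {t: sum(1 for words in tokenized.values() if t in words) for t in query_terms}
    scores = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        if any(tf[t] > 0 for t in query_terms):
            scores[doc_id] = sum(
                tf[t] * math.log(n / df[t]) for t in query_terms if tf[t] > 0
            )
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical documents.
docs = {
    1: "zebra armadillo zebra",
    2: "zebra crossing",
    3: "armadillo facts",
    4: "armani suits",
}
top = rank(["zebra", "armadillo"], docs)
```

Document 1 ranks first because it contains both query terms, one of them twice; document 4 contains neither term and is not scored at all.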

Page 18:

Challenges for search engines:

• coverage (need to cover large part of the web)
  → need to crawl and store massive data sets
• good ranking (in the case of broad and narrow queries)
  → smart information retrieval techniques
• freshness (need to update content)
  → frequent recrawling of content
• user load (> 50000 queries/sec - Google)
  → many queries on massive data
• manipulation (sites want to be listed first)
  → naïve techniques will be exploited quickly

Page 19:

Summary:

• Four main tasks:
  - crawling (data acquisition)
  - data/web mining
  - index building and maintenance
  - query execution

• highly optimized ranking functions and systems

• tens to hundreds of thousands of machines

• major engines need hundreds of engineers

• and PhD-level experts in stats, ML, systems, etc.

• many more jobs in related companies

Page 20:

Historical Roots: Information Retrieval (IR)

“IR is concerned with the representation, storage, organization of, and access to information items”

• focus on automatic processing (indexing, clustering, search) of unstructured data (text, images, audio, ...)
• subfield of Computer Science, but with roots in Library Science, Information Science, and Linguistics
• main focus on text data, but also images, audio, video
• applications:
  - searching in a library catalog
  - categorizing a collection of medical articles by area
  - web search engines

Page 21:

Historical Perspective

• WW2 era computers built for number crunching: ballistics computations, code breaking
• since earliest days, also used to “organize information”
  - Memex (Vannevar Bush, 1945)
• today, this is the main application!
  - store and organize data and documents
  - model organizations and work processes
• Computer Organizer (also: communications/media)
• … however, no clean separation

Page 22:

Structured vs. Unstructured Data

• IR: lesser known cousin of field of Databases
• Databases: focus on structured data
• IR: unstructured data: “documents”
  - scientific articles, novels, poems, jokes, web pages
• IR focused on providing info directly to human users
• IR has false positives and negatives (is fuzzy)

Page 23:

IR Before 1945:

• Babylonians, Greeks, Romans, etc.
• Indexing & creation of concordances (by hand)
  - “algorithms for full-text indexing” !
• Library of Congress and Dewey Library Classifications
• Documentalism
• Bibliometric and Informetric distributions:
  - Bradford, Lotka, Zipf, Pareto, Yule (1920s-40s)
• Citation Analysis and Social Network Analysis
• Microfilm rapid selectors (e.g., E. Goldberg 1931)
• Memex (Vannevar Bush, 1939/45)

Page 24:

Memex: Vannevar Bush (1890-1974)

• “As We May Think”, Atlantic Monthly, 1945 (mostly written 1939)

Page 25:

IR Techniques and Applications:

• Querying documents by keywords
• Classifying documents by topic
• Hypertext and analyzing links between documents
• Library catalogs and digital libraries
• Searching a collection of news or medical articles
• National security:
  - analyzing communication streams, financial networks
• Most widely used application: web search!

Page 26:

2. Technical Challenges and Opportunities:

Graphs and Social Networks

• What is a graph?
  - nodes and edges
  - edges directed or undirected

• Graphs are used to model many scenarios
  - social relationships: who knows whom, who is friends with whom?
  - citations in literature: which physicist cites which other physicist?
  - email, telephone: who communicates with whom?
  - follow the money: who gives money to whom?
  - the web: who links to whom?

Page 27:

Example: Scientific Literature

• nodes are researchers
• connected by edge if one cites the other
• From: University of Cottbus

Page 28:

Example: Enron Email Network

• nodes are employees
• connected by edge if they exchanged more than 5 emails
• From: Shetty/Adibi (USC)
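The edge rule of this example is easy to express in code. The sketch below is an illustration under assumptions, not the method of the cited authors: the message log is made up, messages in both directions are counted together (the slide does not say how they were counted), and `build_email_graph` is a hypothetical name.

```python
from collections import Counter


def build_email_graph(messages, threshold=5):
    """Return undirected edges between pairs who exchanged more than `threshold` emails."""
    counts = Counter()
    for sender, recipient in messages:
        if sender != recipient:
            # frozenset makes (a, b) and (b, a) count toward the same pair
            counts[frozenset((sender, recipient))] += 1
    return {pair for pair, n in counts.items() if n > threshold}


# Hypothetical message log as (sender, recipient) pairs; names are made up.
messages = [("ann", "bob")] * 4 + [("bob", "ann")] * 2 + [("ann", "cat")] * 5
edges = build_email_graph(messages)
```

Here ann and bob exchanged 6 messages and get an edge, while ann and cat exchanged exactly 5 and do not.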

Page 29:

Social Networks: Why do we care?

• search engines: use hyperlink graph to improve ranking
• national security: who calls whom and what does it mean?
• social sciences: understanding societies

Page 30:

Link-Based Ranking Techniques

• Basic idea: exploits judgments by millions of web pages
• A page that is highly referenced should be better or more important
• PageRank (Brin & Page at Google): “significance of a page depends on significance of those referencing it”
• s(a) = s(b)/2 + s(c)/3 + s(d)/1
• System of equations
• Unique solution under some assumptions
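The example equation reads as one instance of a general rule: each page passes its significance on in equal shares to the pages it links to. Written out, with out(b) as notation introduced here for the number of outgoing links of page b:

```latex
s(a) \;=\; \sum_{b \,:\, b \to a} \frac{s(b)}{\mathrm{out}(b)}
```

In the example, page a is presumably referenced by b, c, and d, which have 2, 3, and 1 outgoing links respectively. One such equation per page gives the system of equations.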

Page 31:

PageRank

• initialize the rank value of each node to 1/n (0.2 for 5 nodes)
• a node with k outgoing links transmits a 1/k fraction of its current rank value over each of those edges to the corresponding neighbor
• iterate this process many times until it converges
• NOTE: this is a random walk on the link graph
• PageRank: stationary distribution of this random walk

[Figure: 5-node link graph; each edge is labeled with the fraction it transmits (1/2 or 1); every node starts with rank value 0.2]

Page 32:

[Figure: the same 5-node graph after successive iterations; rank values are listed in the order they appear in the figure]

(1) 0.2, 0.2, 0.2, 0.2, 0.2
(2) 0.1, 0.2, 0.2, 0.3, 0.2
(3) 0.1, 0.3, 0.15, 0.25, 0.2
..
(n) 0.143, 0.286, 0.143, 0.286, 0.143
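The iteration can be reproduced with a short Python sketch. The edge list is read off the matrix A given on the next page, with nodes numbered 1 to 5 by row. That numbering is an assumption and need not match the order in which values appear in the figure, so the values match the figure as a set rather than position by position. The function name and the fixed count of 200 iterations are choices made for this example.

```python
def pagerank_step(rank, out_links):
    """One iteration: every node splits its rank evenly over its outgoing links."""
    new_rank = {node: 0.0 for node in rank}
    for node, targets in out_links.items():
        share = rank[node] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank


# Edges read off matrix A; node numbers 1..5 follow its rows (an assumption).
out_links = {1: [2, 5], 2: [3, 5], 3: [4], 4: [3, 5], 5: [1]}

rank = {node: 1 / len(out_links) for node in out_links}  # 0.2 each
history = [rank]
for _ in range(200):
    rank = pagerank_step(rank, out_links)
    history.append(rank)
```

After one step the values are 0.2, 0.1, 0.2, 0.2, 0.3 for nodes 1 to 5, and the iteration settles at 2/7 ≈ 0.286 for nodes 1 and 5 and 1/7 ≈ 0.143 for nodes 2, 3, and 4.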

Page 33:

Matrix notation

• stationary distribution: vector x with xA = x
• A is primitive, and x is a (left) eigenvector of A with eigenvalue 1
• computed using Jacobi or Gauss-Seidel iteration

[Figure: the 5-node link graph with edge fractions (1/2 or 1) and initial rank value 0.2 at every node]

A =

0    1/2  0    0    1/2
0    0    1/2  0    1/2
0    0    0    1    0
0    0    1/2  0    1/2
1    0    0    0    0
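As a check, the stationary distribution for this matrix can be verified by hand. With the nodes numbered by the rows of A (an assumption about how the matrix maps onto the figure), the vector below satisfies xA = x component by component and sums to 1:

```latex
\begin{aligned}
x &= \left(\tfrac{2}{7},\ \tfrac{1}{7},\ \tfrac{1}{7},\ \tfrac{1}{7},\ \tfrac{2}{7}\right) \\
(xA)_1 &= x_5 = \tfrac{2}{7} \\
(xA)_2 &= \tfrac{1}{2}\,x_1 = \tfrac{1}{7} \\
(xA)_3 &= \tfrac{1}{2}\,(x_2 + x_4) = \tfrac{1}{7} \\
(xA)_4 &= x_3 = \tfrac{1}{7} \\
(xA)_5 &= \tfrac{1}{2}\,(x_1 + x_2 + x_4) = \tfrac{2}{7}
\end{aligned}
```

These are the values 0.286 and 0.143 reached in the final snapshot of the iteration.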

Page 34:

• Now every engine is using link-based ranking
• But the idea is much older!
• Who is the most important physicist?
  - citation analysis, 1960s
• Who is the most important person in town?
  - social network analysis, 1950s
• Who is the most important person on Facebook?
  - the most friends?
  - or the most important friends? (recursive)
  - or the best friends?
  - … and on Twitter?

Page 35:

Computational Advertising:

• New field dealing with mathematical techniques for targeting ads in electronic media
• a multi-billion $ business
• economic foundation of search engines and web
• many startups in New York City
• ITV: TV is moving to the internet
• how to match ads with the right consumers
• scary privacy implications

Page 36:

Comp. Adv. Basics

• Search ads: matching ads to queries
  - ads on top and right hand side of search results
  - based on query and past behavior (cookies etc.)
  - pay-per-click model with bids by advertisers
  - or pay-per-action: goal is immediate action by user

• Display ads: large, shiny ads for brands
  - banner ads on major sites for cars, movies, etc.
  - pay-per-display: e.g., $100 per million impressions
  - sold by contract: e.g., ads for upcoming movies

• AdSense (Google): text ads on 3rd-party pages
  - placing ads on pages based on content and user
  - Google sharing money with site owner
  - danger of manipulation by site owner
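The arithmetic behind the two pricing models can be sketched as follows. Ordering pay-per-click ads by bid times estimated click probability is an assumption used here as a common simplification; the slides do not describe the actual auction, and real systems are considerably more involved. All ads, bids, probabilities, and function names are made up.

```python
def revenue_per_impression_cpm(price_per_million):
    """Pay-per-display: revenue earned for a single impression."""
    return price_per_million / 1_000_000


def rank_ads_by_expected_revenue(ads):
    """Pay-per-click: order ads by bid * estimated click probability."""
    return sorted(ads, key=lambda ad: ad["bid"] * ad["click_prob"], reverse=True)


# Hypothetical ads; bids (in $) and click probabilities are made up.
ads = [
    {"name": "A", "bid": 2.00, "click_prob": 0.01},
    {"name": "B", "bid": 0.50, "click_prob": 0.05},
]
ranked = rank_ads_by_expected_revenue(ads)
display_revenue = revenue_per_impression_cpm(100)
```

Ad B has the lower bid but the higher expected revenue per showing ($0.025 versus $0.02), so it comes first; at $100 per million impressions, a display ad earns $0.0001 per impression.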

Page 37:

Comp. Adv. Marketplace

• Very complicated zoo of companies and roles
  - ad networks
  - ad campaign coordination/optimization
  - companies providing user data
  - arbitrage & manipulation
  - hundreds/thousands of companies
  - real-time auctions in tens of millisecs

• Emerging ads scenarios
  - monetizing social networks (Facebook, LinkedIn etc.)
  - mobile ads and ads in app-space (e.g., Flurry)
  - ITV and internet radio: ads not a broadcast (Hulu)
  - ads in games and virtual worlds

• Privacy: looking bad at the moment
  - (almost) everything is for sale …

Page 38:

Adversarial Information Retrieval

• search engine position is $$$
• web pages are cheap
• make lots of automatic junk …
• use ads to make $ (or $$$)

Page 39:

3. Education and Opportunities:

• Web search at center of the online world
• interesting technical challenges
• many professional opportunities
• NYC is a center of this industry
• Google, plus media, ads, e-commerce, mobile
• what qualifications are needed?

Page 40:

Perspective:

[Figure: web search shown together with related areas: algorithms, systems, information retrieval, databases, machine learning, natural language processing, AI, library & information science]

Page 41:

In the curriculum

• Search is somewhat Computer Science oriented
• … it’s all software (mostly)
• also, media/design exp. becoming important
• courses/topics to learn:
  - algorithms
  - distributed systems
  - databases
  - web search & information retrieval
  - machine learning, data mining, and statistics
  - natural language processing

Page 42:

Course Offering & Student Activities

• CS6913: Web Search Engines (Spring Sem.)

• Course Objectives
  - web and search engine architecture: how does it all work?
  - working with massive data sets: storing and analyzing terabytes
  - introduction to Information Retrieval: unstructured data, text
  - system building skills: building distributed systems
  - the Web as a social network: adversarial behavior, spam, communities

• Student Activities
  - course projects: build your own (small-scale) search engine
  - independent research projects, theses, assistantships