Web Research: Open Problems
Yury Lifshits
Steklov Institute of Mathematics at St.Petersburg
November 2006
Objective
To find and state key open algorithmic problems for future web technologies
Outline
1 Intro: Criteria and Questionnaire
2 Problem 1: Large-Scale Filtering
3 Problem 2: Large-Scale Matching
4 Problem 3: Tag Propagation
5 Problem 4: Structure Discovery
INTRO
What are my personal criteria for choosing open problems?
What kind of questions should I answer about proposed problems?
Criteria
Ultimate relation to technology challenge
Familiarity with the corresponding applied field
Interplay of several basic fields
Freshness (hence, badly formalized)
I do not use:
Difficulty
Popularity and age of the problem
Famous author
Your favorite criteria?
Questionnaire
Technology challenge?
Sample formalization?
Basic fields involved?
Research workflow?
Your constructive feedback?
References? Similar Ideas? [To be done]
Disclaimer
My style is
1 At first, think independently (e.g. pose new problems)
2 Only after that look into literature
Hence, the following problems might be already known and heavily studied!
PROBLEM 1
Large-Scale Filtering
What are the fastest algorithms for personal news aggregation?
1.1. Challenge
Personal news aggregation:
Every user has a preference profile: specified information sources, keywords, tags (topics), popularity, references to the preferences of others
Every news item has its own description: text, votes and recommendations, tags, author reputation, comments
Filtering problem:
To find, say, the ten most appropriate news items for every user
1.2. Formalization
Every profile is a normalized red vector (a point on the sphere) in n-dimensional space
Likewise, every news description is a normalized blue vector in the same space
We use the cosine measure (scalar product) for similarity
Computational problem: after preprocessing all blue points, for every incoming red point quickly compute the ten closest blue points
Data structures for storing all profiles and all news?
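To make the formalization concrete, here is a minimal sketch of the naive baseline that a faster algorithm would have to beat. It scans every blue (news) vector for one red (profile) vector and keeps the best k by cosine similarity. The function names `normalize` and `top_k_news` are hypothetical, chosen only for illustration; the slides do not prescribe any implementation.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so the dot product equals the cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]


def top_k_news(profile, news_items, k=10):
    """Return indices of the k news vectors most similar to the profile.

    Naive linear scan: one pass over all news items per profile.
    """
    red = normalize(profile)
    scored = []
    for idx, item in enumerate(news_items):
        blue = normalize(item)
        similarity = sum(r * b for r, b in zip(red, blue))
        scored.append((similarity, idx))
    # nlargest returns the k best (similarity, index) pairs, best first
    return [idx for _, idx in heapq.nlargest(k, scored)]
```

Running this for every user costs time proportional to (#users)×(#news)×n, which is the kind of quadratic behavior the problem asks to improve on.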
1.3. Fields Involved
Text classification, kNN algorithms
Computational Geometry
Data Structures
Compression (sparse sets)
Linear Algebra (singular decomposition trick)
What else?
1.4. Workflow
1 Find fast algorithms for the all-to-all filtering problem
2 Suggest data structures for storing profiles and news
3 Study filtering in dynamic settings, with profiles and descriptions evolving quickly over time
4 Describe spam prevention mechanisms for large filtering systems
1.5. Constructive Feedback
Do you know related results?
What is the most important theoretical question in this problem?
How to make my formalization better?
PROBLEM 2
Large-Scale Matching
What is the most effective algorithm for distributing sponsored links among all websites?
2.1. Challenge
Effective sponsored links (ads) distribution:
Every ad has a target description
Every website has an audience description
Business objective:
Maximize the ratio clicks/displays
2.2. Formalization
Every website’s audience profile is a normalized red vector in n-dimensional space
Likewise, every ad target is a normalized blue vector in the same space
We use the cosine measure for similarity
Computational problem: compute a matching between ads and websites that satisfies some constraints and minimizes the sum of distances (ad - website)
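The objective above can be sketched on a toy instance. The slides leave the constraints open, so the code below assumes one simple constraint for illustration: each ad goes to a distinct website. It finds the exact optimum by exhaustive search, which is only feasible for a handful of ads; the names `cosine_distance` and `best_matching` are hypothetical.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def best_matching(ads, sites):
    """Assign each ad to a distinct site, minimizing the sum of distances.

    Assumed constraint (not from the slides): one ad per website.
    Exhaustive search over all assignments, so exponential time.
    """
    if len(ads) > len(sites):
        raise ValueError("need at least as many sites as ads")
    best_cost, best_assignment = math.inf, None
    for perm in permutations(range(len(sites)), len(ads)):
        cost = sum(cosine_distance(ads[a], sites[s]) for a, s in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    # best_assignment[a] is the index of the site chosen for ad a
    return list(best_assignment), best_cost
```

The sketch only pins down what is being optimized. The open question in the slides is how to approximate this optimum at web scale.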
2.3. Fields Involved
Computational Geometry
Linear Algebra (singular decomposition trick)
Data Structures
Compression (sparse sets)
Game theory
Optimization
What else?
2.4. Workflow
1 State ads distribution as an optimization problem
2 Find algorithms that can approximately solve this problem faster than (#websites)×(#ads)
3 Introduce feedback to the model: after every click on any ad we receive some additional knowledge about the world and can use it to improve our matching
2.5. Constructive Feedback
Do you know related results?
What is the most important theoretical question in this problem?
How to make my formalization better?
PROBLEM 3
Tag Propagation
How to extend a partial categorization of websites to the whole web?
3.1. Challenge
Web categorization:
People use millions of keywords (tags)
There are billions of webpages
We have a very sparse training collection of pairs (website, tag)
Goal:
Get a fast algorithm that can characterize any given website
Applications:
Ads targeting
Search results annotations
Automatic web directories
3.2. Formalization
We have the graph of hyperlinks
Fix a tag. For every initially labelled website let $T_0(i) = 1$, for others $T_0(i) = 0$
Then we use a recursive equation and take a limit:
$$T_k(i) = T_{k-1}(i) + \alpha \sum_{j \text{ links to } i} T_{k-1}(j)$$
Computational problem: use some preprocessing for the initial tag distribution and then, for every given website, quickly compute the ten tags with the highest rank
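The recursion can be sketched directly for a single tag. The code below applies the update rule a fixed number of times on a small link graph. It does not take a limit: with this rule the raw values can keep growing from step to step, so the stopping rule (`steps`) and the value of `alpha` are assumptions made for illustration. The name `propagate_tag` and the dictionary-based graph format are hypothetical as well.

```python
def propagate_tag(links, seeds, alpha=0.1, steps=20):
    """Apply T_k(i) = T_{k-1}(i) + alpha * sum of T_{k-1}(j) over j linking to i.

    links: dict mapping a page to the list of pages it links to.
    seeds: set of pages initially labelled with the tag (T_0 = 1).
    Returns a dict page -> rank after `steps` iterations.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # incoming[i] lists every page j that links to i
    incoming = {page: [] for page in pages}
    for j, targets in links.items():
        for i in targets:
            incoming[i].append(j)

    rank = {page: (1.0 if page in seeds else 0.0) for page in pages}
    for _ in range(steps):
        # Build the new ranks from the previous step only
        rank = {
            i: rank[i] + alpha * sum(rank[j] for j in incoming[i])
            for i in pages
        }
    return rank
```

Repeating this for every tag and then sorting the tags per website would give the ten highest-ranked tags, but at a cost that the preprocessing asked for above is meant to avoid.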
3.3. Fields Involved
Data Structures
Compression (sparse sets)
Numerical Analysis (speed of convergence)
What else?
3.4. Workflow
1 Define formulas for tag “propagation”
2 Construct a fast algorithm for computing, say, the ten most relevant tags of an arbitrary website
3.5. Constructive Feedback
Do you know related results?
What is the most important theoretical question in this problem?
How to make my formalization better?
PROBLEM 4
Structure Discovery
Consider keywords we use in everyday life. Can we suggest an algorithm that computes the most appropriate hierarchy of these keywords?
4.1. Challenge
We can collect many huge data sets:
call graphs, shopping histories, search histories, social networks, RSS subscription graph
HOW TO BENEFIT FROM THEM?
Example: hierarchy discovery
We have some folksonomy
How to compute the “optimal” tag hierarchy?
Applications:
Visualization and better navigation
Solving the synonymy problem
4.2. Formalization
Every tag is characterized by the corresponding set of websites
We want to compute the optimal AND-OR tree of tags
Optimal means minimal correctness violation
Correctness: children of an OR vertex should be disjoint, the parent set contains the children sets, etc.
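One possible reading of "correctness violation" can be sketched as a score for a candidate hierarchy. The code below simplifies the setting by treating every inner vertex as an OR vertex and counting two kinds of violations: websites in a child set that are missing from the parent set, and websites shared by two siblings. Both the simplification and the way violations are counted are assumptions, since the slides leave the criterion open; the name `violation` is hypothetical.

```python
from itertools import combinations


def violation(tree, tag_sets, root):
    """Count correctness violations of a tag hierarchy.

    tree: dict mapping a parent tag to the list of its child tags
          (every inner vertex is treated as an OR vertex here).
    tag_sets: dict mapping each tag to its set of websites.
    """
    total = 0
    stack = [root]
    while stack:
        parent = stack.pop()
        children = tree.get(parent, [])
        for child in children:
            # Containment: websites of the child not covered by the parent
            total += len(tag_sets[child] - tag_sets[parent])
            stack.append(child)
        for a, b in combinations(children, 2):
            # Disjointness: websites shared by two siblings
            total += len(tag_sets[a] & tag_sets[b])
    return total
```

With such a score fixed, the search for an "optimal" hierarchy becomes a minimization over candidate trees, which is where the approximation questions raised in the slides come in.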
4.3. Fields Involved
Computational Biology (phylogeny algorithms)
Approximation algorithms
What else?
4.4. Workflow
1 Fix a format for tag descriptions and define an optimality criterion for a hierarchy of tags
2 Construct a fast algorithm for computing the optimal hierarchy
3 Study the interplay with algorithms for constructing phylogenetic trees
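As a point of comparison for step 2, here is a naive baseline sketch, not the fast algorithm the workflow asks for. It assumes each tag is described by its set of websites and attaches every tag to its smallest strict superset. The function name `naive_hierarchy` and the sample data are hypothetical. It compares all pairs of tags, so it is quadratic in the number of tags, and it tolerates no violations at all.

```python
def naive_hierarchy(sites):
    """Map each tag to its smallest strict-superset tag, or None for roots."""
    parent = {}
    for tag, s in sites.items():
        # All tags whose website set strictly contains this tag's set.
        supersets = [t for t, u in sites.items() if t != tag and s < u]
        # Pick the tightest superset; break ties by tag name for determinism.
        parent[tag] = min(supersets, key=lambda t: (len(sites[t]), t), default=None)
    return parent

# Made-up example data, not taken from the talk.
sites = {
    "music": {"a.com", "b.com", "c.com", "d.com"},
    "jazz": {"a.com", "b.com"},
    "bebop": {"a.com"},
    "travel": {"e.com"},
}
parent = naive_hierarchy(sites)
print(parent)
```

Real tag data is noisy, so exact containment will rarely hold; that is why the workflow calls for an optimality criterion that measures violations rather than forbidding them.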
4.5. Constructive Feedback
Do you know related results?
What is the most important theoretical question in this problem?
How to make my formalization better?
Voting
We discussed four problems. Which one do you like the most?
1 Large-Scale Filtering
2 Large-Scale Matching
3 Tag Propagation
4 Structure Discovery
Main points
My homepage: http://logic.pdmi.ras.ru/~yura/
Today we learned:
Technology challenges: personal aggregation, effective ads, use of huge data collections
Key algorithmic challenge: large-scale algorithms that are faster than naive (usually quadratic) approaches
Next steps: (1) survey, (2) formalizations, (3) public discussion
Thanks! Questions?