x-informatics web search; text mining b 2013 geoffrey fox [email protected] associate dean for

74
X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox [email protected] http:// www.infomall.org/X-InformaticsSpring2013/index.html Associate Dean for Research and Graduate Studies, School of Informatics and Computing Indiana University Bloomington 2013

Upload: gabriel-patrick

Post on 12-Jan-2016

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

X-Informatics Web Search; Text Mining B

2013

Geoffrey [email protected]

http://www.infomall.org/X-InformaticsSpring2013/index.html

Associate Dean for Research and Graduate Studies,  School of Informatics and Computing

Indiana University Bloomington

2013

Page 2: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 3: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

The Course in One Sentence

Study Clouds running Data Analytics processing Big Data to solve problems in X-Informatics

Page 4: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Document Preparation

Page 5: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 6: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 7: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 8: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 9: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 10: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 11: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 12: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 13: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 14: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 15: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Inverted Index

Page 16: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 17: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 18: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 19: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 20: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Index Construction

Page 21: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 22: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 23: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Then sort by termID and then docIDhttp://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 24: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 25: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 26: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 27: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Query Structure and Processing

Page 28: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 29: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 30: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 31: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 32: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 33: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 34: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Link Structure Analysisincluding PageRank

Page 35: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 36: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 37: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 38: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 39: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 40: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 41: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 42: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 43: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 44: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Size of face proportional to PageRank

Page 45: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

PageRank d=0.85

Page 46: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

d = 0.85

Page 47: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

PageRank• PageRank is probability that Page will be visited by a surfer is

clicks each link on page with equal probability– minor corrections for pages with no outgoing links

• Found Iteratively with each page getting at each iteration a contribution equal to its page rank divided by #Links on page

• PR(Page i) = Page j pointing at I PR(Page j)/(Number of Pages linked on Page j)

• One adds to this the chance 1-d that surfer types a random URL into web browser.

• That takes PageRank to d times above plus (1 - d) divided by total number of pages on web

• On general principles, this will converge whatever the starting point– It can be written as iterative matrix multiplication

Page 48: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Related Applications• Thinking of Page Rank as reputation• A version of PageRank has recently been proposed as a

replacement for the traditional Institute for Scientific Information (ISI) impact factor, and implemented at eigenfactor.org. Instead of merely counting total citation to a journal, the "importance" of each citation is determined in a PageRank fashion.– Impact Factor is number of citations of each article– The Eigenfactor score of a journal is an estimate of the percentage of time

that library users spend with that journal. The Eigenfactor algorithm corresponds to a simple model of research in which readers follow chains of citations as they move from journal to journal.

• A similar new use of PageRank is to rank academic doctoral programs based on their records of placing their graduates in faculty positions. In PageRank terms, academic departments link to each other by hiring their faculty from each other (and from themselves).

Page 49: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

EF= EigenfactorAI = Article Influence over the first five years after publication

Eigenfactor scores are scaled so that the sum of the Eigenfactor scores of all journals listed in Thomson's Journal Citation Reports (JCR) is 100

Article Influence scores are normalized so that the mean article in the entire Thomson Journal Citation Reports database has an article influence of 1.00

Page 50: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

None done here!

Page 51: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 52: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Summary Issues

Page 53: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 54: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 55: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 56: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 57: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Crawling the Web

Page 58: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 59: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 60: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 61: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Web Advertising and Search

Page 62: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 63: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 64: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 65: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 66: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 67: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 68: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for
Page 69: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

CS236621 Technion

Page 70: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Clustering and Topics

Page 71: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Grouping Documents Together• The responses to a search query give you a group

documents• If we represent documents as points in a space, we can try

to identify regions– Clustering: Nearby regions of points– Support Vector Machine: Chop space up into parts– (Gaussian) Mixture Models: A type of fuzzy clustering– K-Nearest Neighbors (if have examples)

• Alternatively we can determine “hidden meaning” with a topic model– Latent Semantic Indexing– Latent Dirichlet Allocation– With lots of variants of these methods to find “latent factors”

Page 72: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

Topic Models• Illustrated by Google News• These try to group documents by Topics such as

“Presidential Election” and not by inclusion of particular phrases

• You imagine each document is a set of topics (the latent factors) and each topic is a bag of words.

• Find the best set of topics and best set of words in topics

Page 73: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

A Latent Factor Finding Method

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

Page 74: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu  Associate Dean for

https://portal.futuregrid.org

An example of DA-PLSA

Top 10 popular words of the AP news dataset for 30 topics. Processed by DA-PLSI and showing only 5 topics among 30 topics