![Page 1: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/1.jpg)
http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm
http://en.wikipedia.org/wiki/Proton
Which of the two appears simpler to you?
1
2
![Page 2: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/2.jpg)
Search for a keyword
Results – Sometimes irrelevant and in a mixed order of readability
![Page 3: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/3.jpg)
Our Objective
Query
Retrieve web pages (considering relevance)
Re-rank web pages based on readability
Automatically accomplished
![Page 4: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/4.jpg)
An Unsupervised Technical Readability Ranking Model by Building a Conceptual Terrain in LSI
Shoaib Jameel
Xiaojun Qian
The Chinese University of Hong Kong
This is me!
![Page 5: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/5.jpg)
What has been done so far?
• Heuristic readability formulae
• Unsupervised approaches
• Supervised approaches
In this talk I will cover some popular works in this area. An exhaustive list of references can be found in my paper.
![Page 6: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/6.jpg)
Heuristic Readability Methods
• Have been around since the 1940s
• Semantic Component – Number of syllables per word, length of the syllables per word etc.
• Syntactic Component – Length of sentences etc.
![Page 7: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/7.jpg)
Example – Flesch Reading Ease
Semantic component
Syntactic component
Manually tuned numerical parameters
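The formula image from this slide did not survive extraction. As a reminder, the standard Flesch Reading Ease score is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words): the sentence-length term is the syntactic component and the syllable term is the semantic component. The sketch below computes it in Python. The vowel-group syllable counter and the names `count_syllables`, `flesch_reading_ease`, and `flesch_from_text` are illustrative assumptions, not part of the talk.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch Reading Ease; higher scores mean easier text."""
    syntactic = n_words / n_sentences   # average sentence length
    semantic = n_syllables / n_words    # average syllables per word
    return 206.835 - 1.015 * syntactic - 84.6 * semantic


def flesch_from_text(text: str) -> float:
    """Score raw text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)
```

The three constants are the manually tuned parameters the slide refers to; nothing in the formula adapts to a particular domain.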
![Page 8: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/8.jpg)
Supervised Learning Methods
• Language models
• SVMs (Support Vector Machines)
• Use of query logs and user profiles
![Page 9: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/9.jpg)
Smoothed Unigram Model [1]
[1] K. Collins-Thompson and J. Callan. 2005. "Predicting reading difficulty with statistical language models". Journal of the American Society for Information Science and Technology 56(13), pp. 1448-1462.
• Recast the well-studied problem of readability as text categorization, using straightforward techniques from statistical language modeling.
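To make the idea concrete, here is a minimal sketch of readability as text categorization with unigram language models: one model per difficulty level, and a text is assigned to the level whose model gives it the highest likelihood. It uses plain add-one (Laplace) smoothing as a stand-in; this is a simplification and not the smoothing scheme of [1]. All function names and the toy training data are hypothetical.

```python
import math
from collections import Counter


def train_unigram_models(labeled_texts):
    """labeled_texts: dict mapping a difficulty label to a list of training strings."""
    counts = {
        label: Counter(w for text in texts for w in text.lower().split())
        for label, texts in labeled_texts.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab


def log_likelihood(text, word_counts, vocab_size):
    """Log-probability of text under one model, with add-one smoothing.

    The extra +1 in the denominator reserves mass for unseen words.
    """
    total = sum(word_counts.values())
    return sum(
        math.log((word_counts[w] + 1) / (total + vocab_size + 1))
        for w in text.lower().split()
    )


def predict_level(text, counts, vocab):
    """Return the label whose unigram model best explains the text."""
    return max(counts, key=lambda label: log_likelihood(text, counts[label], len(vocab)))
```

For example, after training on a handful of "easy" and "hard" sentences, `predict_level("the cat ran", counts, vocab)` picks whichever level's word distribution fits the query best.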
![Page 10: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/10.jpg)
Smoothed Unigram Model
Limitation of their method: it requires training data, which can be difficult to obtain.
![Page 11: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/11.jpg)
Domain-specific Readability
• Jin Zhao and Min-Yen Kan. 2010. Domain-specific iterative readability computation. In Proceedings of the 10th Annual Joint Conference on Digital Libraries (JCDL '10).
Based on the web link-structure algorithms HITS (Hypertext Induced Topic Search) and SALSA (Stochastic Approach for Link-Structure Analysis).
• Xin Yan, Dawei Song, and Xue Li. 2006. Concept-based document readability in domain specific information retrieval. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM '06).
Based on an ontology. Tested only in the medical domain. I will focus on this work.
![Page 12: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/12.jpg)
Overview
• The authors state that Document Scope and Document Cohesion are important parameters in finding simple texts.
• The authors use a controlled-vocabulary thesaurus called Medical Subject Headings (MeSH).
• The authors point out that readability formulae are not directly applicable to web pages.
![Page 13: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/13.jpg)
MeSH Ontology
Concept difficulty increases
Concept difficulty decreases
![Page 14: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/14.jpg)
Overall Concept Based Readability Score
where,
DaCw = Dale-Chall readability measure
PWD = percentage of difficult words
AvgSL = average sentence length in d_i
Their work focused on word-level readability, hence they considered only the PWD.
len(c_i, c_j) = function that computes the shortest path between concepts c_i and c_j in the MeSH hierarchy
N = total number of domain concepts in document d_i
Depth(c_i) = depth of the concept c_i in the concept hierarchy
D = maximum depth of the concept hierarchy
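The two hierarchy quantities defined above, depth and shortest path length, can be sketched on a toy tree as follows. The hierarchy is invented for illustration and is not actual MeSH data; treating the hierarchy as a single-parent tree and giving the root depth 0 are both assumptions. The sketch does not attempt to reproduce the overall score formula, which was an image on the slide.

```python
# Hypothetical toy hierarchy (child -> parent); not actual MeSH data.
PARENT = {
    "diseases": None,
    "heart diseases": "diseases",
    "arrhythmia": "heart diseases",
    "lung diseases": "diseases",
}


def path_to_root(concept):
    """List of concepts from the given one up to the root, inclusive."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Number of edges between the concept and the root (root depth 0, an assumption)."""
    return len(path_to_root(concept)) - 1


def shortest_path_len(a, b):
    """Number of edges on the path between a and b via their lowest common ancestor."""
    steps_from_a = {c: i for i, c in enumerate(path_to_root(a))}
    for j, c in enumerate(path_to_root(b)):
        if c in steps_from_a:
            return steps_from_a[c] + j
    raise ValueError("concepts share no common ancestor")
```

Deeper concepts are more specific, and concepts that are far apart in the tree are less closely related, which is the intuition behind using these two quantities for scope and cohesion.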
![Page 15: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/15.jpg)
Our “Terrain-based” Method
So, what’s the connection?
![Page 16: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/16.jpg)
Latent Semantic Indexing
• Core component – Singular Value Decomposition
• SVD(C) = U S V^T
[Figure: the matrix C being decomposed]
![Page 17: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/17.jpg)
SVD(C) = U S V^T
[Figure: the factor matrices U, S, and V^T]
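As a rough illustration of the decomposition, the sketch below runs NumPy's SVD on a small made-up term-document matrix and keeps the top k singular values to obtain a low-dimensional latent space. The matrix values, the choice k = 2, and the scaling of term and document vectors by the singular values are all assumptions made for the example; the talk does not specify them.

```python
import numpy as np

# Toy term-document matrix C (rows = terms, columns = documents); values are made up.
C = np.array(
    [
        [2.0, 0.0, 1.0],
        [1.0, 0.0, 0.0],
        [0.0, 3.0, 1.0],
        [0.0, 1.0, 2.0],
    ]
)

# Thin SVD: C = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2  # number of latent dimensions kept (hypothetical choice)

# One common convention: scale both sides by the singular values so that
# terms and documents live in the same k-dimensional space.
term_vectors = U[:, :k] * s[:k]     # shape (n_terms, k)
doc_vectors = Vt[:k, :].T * s[:k]   # shape (n_docs, k)

# Rank-k approximation of C
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Each row of `term_vectors` and `doc_vectors` is a point in the latent space, which is what the distance-based measures on the following slides operate on.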
![Page 18: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/18.jpg)
![Page 19: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/19.jpg)
Three components
• Term Centrality – Is the term central to the document’s theme?
• Term Cohesion – Is the term closely related with the other terms in the document?
• Term Difficulty – Will the reader find it difficult to comprehend the meaning of the term?
![Page 20: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/20.jpg)
Term Centrality
• Closeness of the term vector with the document vector in the latent space.
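[Figure: term vectors T1 and T2 and document vector D in the LSI latent space; the term closer to D is more central, the term farther from D is less central]
Term Centrality = 1 / (Euclidean distance(T1, D) + small constant)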
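The centrality formula translates directly into code. In this sketch the vectors are plain coordinate sequences in the latent space, and the default value of the small constant `eps` is a hypothetical choice; the slide does not give one.

```python
import math


def term_centrality(term_vec, doc_vec, eps=1e-6):
    """Inverse Euclidean distance between a term vector and its document vector.

    eps is the 'small constant' that avoids division by zero; its default
    value here is an assumption.
    """
    return 1.0 / (math.dist(term_vec, doc_vec) + eps)
```

A term lying close to the document vector gets a large centrality, and one lying far away gets a small centrality.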
![Page 21: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/21.jpg)
Term Cohesion
[Figure: terms T1, T2, T3, and T4 and document D in the LSI latent space, with the distance between consecutive terms marked]
Normalization is done to standardize the values.
Term cohesion is obtained by computing the Euclidean distance between two consecutive terms, T1 and T2.
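A minimal sketch of this step: walk through the document's terms in order and measure the Euclidean distance between each consecutive pair in the latent space. The slide does not say which normalization is used, so dividing by the largest distance is an assumption here, as are the function names.

```python
import math


def consecutive_distances(term_vectors):
    """Euclidean distance between each pair of consecutive term vectors."""
    return [math.dist(a, b) for a, b in zip(term_vectors, term_vectors[1:])]


def normalize(values):
    """Scale values into [0, 1] by dividing by the maximum (assumed scheme)."""
    top = max(values, default=0.0)
    return [v / top if top > 0 else 0.0 for v in values]
```

A small distance between consecutive terms indicates that they are closely related in the latent space.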
![Page 22: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/22.jpg)
Term Difficulty
• Term Difficulty = Term Centrality x Inverse Document Frequency (idf)
• Idea of idf – If a term is used less often in the document collection, it should be regarded as important.
• For example, “proton” will not occur too often but it is an important term.
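The product above can be sketched as follows. The idf variant used here, log(N / df), is the textbook form and an assumption; the talk does not state which variant is used. Returning 0 for a term that appears in no document is also a choice made only for this example.

```python
import math


def idf(term, documents):
    """Inverse document frequency log(N / df); documents is a list of token sets."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


def term_difficulty(centrality, idf_value):
    """Term Difficulty = Term Centrality x idf, as on the slide."""
    return centrality * idf_value
```

With four documents that all contain "the" but only one that contains "proton", "the" gets an idf of 0 while "proton" gets log 4, so "proton" contributes far more to difficulty at equal centrality.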
![Page 23: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/23.jpg)
So, what do we now obtain?
[Figure: conceptual terrain, with Term Difficulty on the vertical axis and Term Cohesion on the horizontal axis]
A reader now has to hop from one term to the other in the LSI latent space
Something like this ->
![Page 24: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/24.jpg)
How is ranking done?
• Keep aggregating the individual transition scores.
• Finally, obtain a real number which will be used in ranking.
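The aggregation and ranking step might look like the sketch below. The slide does not define how a single transition score is computed or how scores are aggregated, so this example takes the per-transition scores as given, uses a plain sum, and assumes that a lower total means an easier document. All three choices, and the function names, are assumptions.

```python
def document_score(transition_scores):
    """Aggregate the per-transition scores into one real number (plain sum, assumed)."""
    return sum(transition_scores)


def rank_by_readability(docs):
    """docs: dict mapping a document id to its list of transition scores.

    Returns document ids ordered from lowest to highest aggregate score,
    assuming a lower score means an easier document.
    """
    return sorted(docs, key=lambda doc_id: document_score(docs[doc_id]))
```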
![Page 25: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/25.jpg)
Experiments and Results
• Collect web pages from domain-specific sites such as Science and Psychology websites.
• Test in two domains.
• Used NDCG as an evaluation metric.
• Retrieve relevant web pages given a query
• Annotate top ten web pages
• Re-rank search results based on readability
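For reference, NDCG@k can be computed as in the sketch below. It uses the common formulation with linear gain and a log2 position discount; whether the experiments used this variant or the exponential-gain one is not stated, so treat the choice as an assumption. The input is the list of annotated scores in the order produced by the ranker.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists the best-annotated pages first scores 1.0; any other ordering of the same pages scores lower.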
![Page 26: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/26.jpg)
Results - Psychology
[Figure: bar chart of NDCG@3, NDCG@5, NDCG@7, and NDCG@10 (vertical axis from 0 to 0.7) comparing ARI, Coleman-Liau, Flesch, Fog, LIX, SMOG, Terrain(Stop), and Terrain(No Stop)]
![Page 27: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/27.jpg)
Results - Science
[Figure: bar chart of NDCG@3, NDCG@5, NDCG@7, and NDCG@10 (vertical axis from 0 to 0.7) comparing ARI, Coleman-Liau, Flesch, Fog, LIX, SMOG, Terrain(Stop), and Terrain(No Stop)]
![Page 28: Http://scienceforkids.kidipede.com/chemistry/atoms/proton.htm Which of the two appears simple to you? 1 2](https://reader036.vdocuments.net/reader036/viewer/2022062718/56649ea25503460f94ba613d/html5/thumbnails/28.jpg)
END