Domain-Specific Iterative Readability Computation

Jin Zhao, 13/05/2011



TRANSCRIPT

Page 1: Domain-Specific Iterative Readability Computation

Domain-Specific Iterative Readability Computation

Jin Zhao, 13/05/2011

Page 2: Domain-Specific Iterative Readability Computation

Jin Zhao and Min-Yen Kan

13/05/2011, WING, NUS

Domain-Specific Resources

Page 3: Domain-Specific Iterative Readability Computation

Domain-Specific Resources

Domain-specific resources target varying audiences.

Modular arithmetic page from Wikipedia

Modular arithmetic page from Interactivate.com

Page 4: Domain-Specific Iterative Readability Computation

Challenge for a Domain-Specific Search Engine

How to measure readability for domain-specific resources?

Page 5: Domain-Specific Iterative Readability Computation

Literature Review

• Heuristic-based Readability Measures
– Weighted sum of text feature values
– Examples: Flesch-Kincaid Reading Ease (FKRE) [Flesch48], Dale-Chall readability formula [Dale&Chall48]

Quick and indicative, but these measures often oversimplify.
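To make the "weighted sum of text feature values" concrete, here is a minimal sketch of the Flesch Reading Ease formula (the measure the slides abbreviate as FKRE). The formula's constants are the standard published ones; the syllable counter is a rough vowel-group heuristic assumed here for illustration, and the function names are hypothetical.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (an assumption, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Weighted sum of two features: words per sentence and syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

easy = "The cat sat on the mat. It was warm."
hard = "Modular arithmetic characterises congruence relations between integers."
easy_score = flesch_reading_ease(easy)
hard_score = flesch_reading_ease(hard)
```

A higher score means easier text, so `easy_score` comes out above `hard_score`. Only surface features enter the sum, which is why such measures cannot tell a familiar long word from an unfamiliar domain-specific one.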

Page 6: Domain-Specific Iterative Readability Computation

Literature Review

• Natural Language Processing and Machine Learning Approaches
– Extract deep text features and use supervised learning methods to generate models for readability measurement
– Text features: unigram [Collins-Thompson04], parse tree height [Schwarm05], discourse relations [Pitler08]
– Supervised learning techniques: Support Vector Machine (SVM) [Schwarm05], k-Nearest Neighbor (KNN) [Heilman07]

More accurate, but an annotated corpus is required and the models are ignorant of domain-specific concepts.

Page 7: Domain-Specific Iterative Readability Computation

Literature Review

• Domain-Specific Readability Measures
– Derive information about domain-specific concepts from expert knowledge sources
– Examples: Open Access and Collaborative Consumer Health Vocabulary [Kim07], Medical Subject Headings ontology [Yan06]
– Handle domain-specific concepts, but expert knowledge sources are still expensive and not always available

Key qualities of a good readability measure: effective, portable and domain-aware.

Page 8: Domain-Specific Iterative Readability Computation

Intuitions

A domain-specific resource is less readable if it contains more difficult concepts.

A domain-specific concept is more difficult if it appears in less readable resources.

• Use an iterative computation algorithm to estimate these two scores from each other
• Example:
– Pythagorean theorem vs. ring theory

Page 9: Domain-Specific Iterative Readability Computation

Iterative Computation (IC) Algorithm

• Graph Construction
– Construct a graph representing resources, concepts and occurrence information
• Score Computation
– Initialize and iteratively compute the readability score of domain-specific resources and the difficulty score of domain-specific concepts
– Two versions: heuristic and probabilistic
• Required Input
– A collection of domain-specific resources
– A list of domain-specific concepts

Page 10: Domain-Specific Iterative Readability Computation

Graph Construction

Resource 1: …Pythagorean theorem can be written as a² + b² = c², where c represents the length of the hypotenuse…

Resource 2: …The sine function (sin) can be defined as the ratio of the side opposite the angle to the hypotenuse…

Concept List: … right triangle, Pythagorean theorem, hypotenuse, sine function, cosine function …

[Figure: graph with resource nodes (Resource 1, Resource 2) and concept nodes (right triangle, Pythagorean theorem, hypotenuse, sine function, cosine function)]
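The graph construction step can be sketched as follows, assuming that an edge is added whenever a listed concept occurs as a whole phrase in a resource's text. The matching rule and the `build_graph` helper are assumptions for illustration; the slides do not specify how occurrences are detected.

```python
import re
from collections import defaultdict

def build_graph(resources, concepts):
    """Link each resource to every listed concept that occurs in its text."""
    resource_to_concepts = {}
    concept_to_resources = defaultdict(set)
    for name, text in resources.items():
        lowered = text.lower()
        found = set()
        for concept in concepts:
            # Word boundaries keep "sine function" from matching inside "cosine function".
            pattern = r"\b" + re.escape(concept.lower()) + r"\b"
            if re.search(pattern, lowered):
                found.add(concept)
        resource_to_concepts[name] = found
        for concept in found:
            concept_to_resources[concept].add(name)
    return resource_to_concepts, dict(concept_to_resources)

resources = {
    "Resource 1": "Pythagorean theorem can be written as a^2 + b^2 = c^2, "
                  "where c represents the length of the hypotenuse",
    "Resource 2": "The sine function (sin) can be defined as the ratio of "
                  "the side opposite the angle to the hypotenuse",
}
concepts = ["right triangle", "Pythagorean theorem", "hypotenuse",
            "sine function", "cosine function"]

r2c, c2r = build_graph(resources, concepts)
```

With the two excerpts above, "hypotenuse" is linked to both resources, while "right triangle" and "cosine function" remain unconnected because neither excerpt mentions them.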

Page 11: Domain-Specific Iterative Readability Computation

Score Computation (Heuristic)

• Initialization
– Resource node: FKRE
– Concept node: average score of its adjacent nodes
• Iterative Computation
– Each node: original score + average of the original scores of its adjacent nodes (original = from the previous iteration)

Initialization
Resource nodes:  w 1.00   x 3.00   y 2.00   z 4.00
Concept nodes:   a 2.00   b 2.50   c 3.00

Iteration 1
Resource nodes:  w 3.00   x 5.25   y 4.75   z 7.00
Concept nodes:   a 4.00   b 5.00   c 6.00

Page 12: Domain-Specific Iterative Readability Computation

Score Computation (Heuristic)

Iteration 2
Resource nodes:  w 7.00   x 9.75   y 10.25   z 13.00
Concept nodes:   a 8.13   b 10.00   c 11.88

Iteration 3
Resource nodes:  w 15.13   x 18.82   y 21.19   z 24.88
Concept nodes:   a 16.51   b 20.00   c 23.51

• Termination Condition
– The rank order of the resource nodes stabilizes
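The heuristic update can be replayed on the slide's small example. The graph figure is not preserved in the transcript, so the edges below (w–a, x–a, x–b, y–b, y–c, z–c) are an assumption inferred from the printed scores; with them, the sketch matches the initialization and iterations 1 and 2 shown on the slides. The function names are hypothetical.

```python
# Edges inferred from the slide's numbers (assumption; the figure itself is missing).
EDGES = {"w": ["a"], "x": ["a", "b"], "y": ["b", "c"], "z": ["c"]}

def neighbours(edges):
    """Adjacency lists for both resource nodes and concept nodes."""
    adj = {r: list(cs) for r, cs in edges.items()}
    for r, cs in edges.items():
        for c in cs:
            adj.setdefault(c, []).append(r)
    return adj

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def initialise(edges, resource_scores):
    """Resources keep their given score; each concept gets the mean of its resources."""
    adj = neighbours(edges)
    scores = dict(resource_scores)
    for node in adj:
        if node not in scores:  # concept node
            scores[node] = mean(resource_scores[r] for r in adj[node])
    return adj, scores

def step(adj, scores):
    """One iteration: each node adds the mean of its neighbours' previous scores."""
    return {n: scores[n] + mean(scores[m] for m in adj[n]) for n in adj}

adj, scores = initialise(EDGES, {"w": 1.0, "x": 3.0, "y": 2.0, "z": 4.0})
history = [scores]
for _ in range(3):
    history.append(step(adj, history[-1]))

# Rank order of the resource nodes after each iteration (least to greatest score).
rankings = [sorted("wxyz", key=s.get) for s in history]
```

Under this reading the resource ranking changes from w, y, x, z to w, x, y, z at iteration 2 and stays there at iteration 3. Exact arithmetic gives 18.8125, 16.5 and 23.5 at iteration 3 where the slide shows 18.82, 16.51 and 23.51, which is consistent with the slide having rounded intermediate values. How many unchanged iterations count as "stabilized" is not stated in the slides, so the sketch runs a fixed three iterations rather than guessing a stopping rule.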

Page 13: Domain-Specific Iterative Readability Computation

Score Computation (Heuristic)

• Single-valued score for each node
– Unable to handle concepts of varying difficulties
• Simple averaging in score computation
– Difficult to incorporate sophisticated computational mechanisms

Page 14: Domain-Specific Iterative Readability Computation

Score Computation (Probabilistic)

• Initialization
– Resource node: sentence sampling
– Concept node: resource sampling

[Figure: resource nodes w, x, y, z and concept nodes a, b, c at initialization]

Page 15: Domain-Specific Iterative Readability Computation

Score Computation (Probabilistic)

• Iterative Computation
– Modified Naïve Bayes Classification

Original:

Modified:

Direct Adaptation:

Resource Nodes

Concept Nodes

Page 16: Domain-Specific Iterative Readability Computation

Evaluation

• Key qualities of a good readability measure
– Effectiveness
– Portability
– Domain-awareness

Page 17: Domain-Specific Iterative Readability Computation

Effectiveness

• Corpus of Math Webpages
• Metrics:
– Pairwise accuracy
– Spearman’s rho
• Baseline:
– Heuristic: FKRE
– Supervised learning: NB, SVM, Maxent (binary concept features only)

Method   Pairwise   Spearman   Iterations
FKRE     .72        .48        -
NB       .72        .52        -
SVM      .80        .70        -
Maxent   .82        .67        -
HIC      .87        .75        18
PIC      .85        .73        7
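The two metrics can be sketched in plain Python. The slides do not define them precisely, so the definitions below are assumptions: pairwise accuracy is taken as the fraction of item pairs with different gold levels that the predicted scores order the same way, and Spearman's rho as the Pearson correlation of the two rank vectors (ties share an average rank). The data is hypothetical, and both lists are assumed to point in the same direction (higher = harder).

```python
from itertools import combinations

def pairwise_accuracy(gold, predicted):
    """Fraction of pairs with distinct gold levels that are ordered correctly."""
    correct = total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue  # no gold ordering to test for this pair
        total += 1
        if (gold[i] - gold[j]) * (predicted[i] - predicted[j]) > 0:
            correct += 1
    return correct / total

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the rank vectors of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

gold = [1, 2, 3, 4, 5]                  # hypothetical annotated difficulty levels
predicted = [0.2, 0.1, 0.5, 0.9, 0.7]   # hypothetical scores from some measure
acc = pairwise_accuracy(gold, predicted)
rho = spearman_rho(gold, predicted)
```

In this toy case two of the ten pairs are swapped, so both metrics come out at 0.8; on real data they generally differ, as the table above shows.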

Page 18: Domain-Specific Iterative Readability Computation

Portability

• Different selection strategies
– Resource selection at random
– Concept selection at random
– Resource selection by quality
– Concept selection by TF.IDF
• Performance measurement at 5 levels
– 20%, 40%, 60%, 80% and 100% of the original resource collection / concept list
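The selection strategies can be sketched as subsetting functions applied at each of the five levels. The slides do not say how TF.IDF is computed for a concept or how resource quality is judged, so the scoring below (total frequency times log of inverse document frequency) is an assumption, the quality-based strategy is omitted, and all names and counts are hypothetical.

```python
import math
import random

LEVELS = (0.2, 0.4, 0.6, 0.8, 1.0)

def subset_size(n_items, fraction):
    """Number of items to keep at a given level (at least one)."""
    return max(1, round(n_items * fraction))

def select_random(items, fraction, seed=0):
    """Random selection: keep a random subset of the given relative size."""
    items = list(items)
    return random.Random(seed).sample(items, subset_size(len(items), fraction))

def tfidf_scores(counts):
    """counts[resource][concept] = occurrences.
    Assumed score: total frequency * log(N / document frequency)."""
    n_docs = len(counts)
    tf, df = {}, {}
    for concept_counts in counts.values():
        for concept, c in concept_counts.items():
            if c > 0:
                tf[concept] = tf.get(concept, 0) + c
                df[concept] = df.get(concept, 0) + 1
    return {c: tf[c] * math.log(n_docs / df[c]) for c in tf}

def select_top(scores, fraction):
    """Selection by score: keep the highest-scoring fraction of the concepts."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:subset_size(len(ranked), fraction)]

counts = {  # hypothetical occurrence counts
    "r1": {"hypotenuse": 2, "sine function": 1},
    "r2": {"hypotenuse": 1, "ring": 3},
    "r3": {"hypotenuse": 1},
}
scores = tfidf_scores(counts)
concept_subsets = {level: select_top(scores, level) for level in LEVELS}
resource_subsets = {level: select_random(counts, level) for level in LEVELS}
```

Each subset would then be fed to the readability measure and scored with the same metrics, giving one performance point per level and strategy.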

Page 19: Domain-Specific Iterative Readability Computation

Portability

[Figures: Resource Selection Strategies; Concept Selection Strategies]

Page 20: Domain-Specific Iterative Readability Computation

Portability

Method   Pairwise   Spearman
FKRE     .63        .28
NB       .73        .53
SVM      .82        .70
Maxent   .76        .60
HIC      .74        .49
PIC      .75        .55

Page 21: Domain-Specific Iterative Readability Computation

Domain-awareness

• Handling of domain-specific concepts
– Simple yet effective
– Concepts of multiple difficulty levels?
   Converge to single value even in PIC
   Splitting? (K-Means, GMM, etc.)
   Other computational mechanisms?

Page 22: Domain-Specific Iterative Readability Computation

Conclusion

• Iterative Computation
– Estimate the readability of domain-specific resources and the difficulty of domain-specific concepts in an iterative manner
– Effective, portable and domain-aware
• Future Work
– Handling of concepts of multiple difficulty levels