automatic blog monitoring and summarization ka cheung “richard” sia phd prospectus
Post on 22-Dec-2015
![Page 1: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/1.jpg)
Automatic Blog Monitoring and Summarization
Ka Cheung “Richard” Sia
PhD Prospectus
![Page 2: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/2.jpg)
With/without organized access
![Page 3: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/3.jpg)
Inaccessible?
[Figure: % of feeds vs. # of subscribers. x-axis: # of subscribers (1+, 20+, 50+, 1000+, 5000+); y-axis: % of feeds (0% to 100%). Source: AskJeeves]
![Page 4: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/4.jpg)
Introduction
- Organized access to blogs
  - Full coverage
  - Reflect changes quickly
  - Filtered and organized presentation
- Intended contributions
  - Efficient techniques to harvest blogs
  - Algorithms to monitor frequently changing data sources
  - Algorithms to reconstruct implicit networks and compose topic summaries
![Page 5: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/5.jpg)
Modules
- Monitoring
- Collection (future work)
- Topic detection and tracking (future work)
- Conclusion
![Page 6: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/6.jpg)
Monitoring
Preliminary results
![Page 7: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/7.jpg)
Framework
A central server monitors data source changes and provides succinct summaries to users
![Page 8: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/8.jpg)
Overview
- New challenges
  - Content changes more rapidly, with a recurring pattern
  - More time-sensitive requirements
- Modeling of posting updates
- Definition of delay
- Strategies for allocation and scheduling
![Page 9: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/9.jpg)
Characteristics
- Homogeneous Poisson model: λ(t) = λ at any t
- Periodic inhomogeneous Poisson model: λ(t) = λ(t − nT), n = 1, 2, …
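To make the two posting models concrete, here is a minimal sketch that draws posting times from a periodic inhomogeneous Poisson process. It uses the standard thinning technique, which is a choice made for this example and not something the slides specify. The names `simulate_posts` and `daily_rate`, the rate 2 + 2 sin(2πt), and the period T = 1 day are all illustrative assumptions.

```python
import math
import random

def simulate_posts(rate, rate_max, horizon, rng):
    """Draw posting times on [0, horizon) by thinning a Poisson process of rate rate_max."""
    t = 0.0
    posts = []
    while True:
        t += rng.expovariate(rate_max)         # next candidate from the homogeneous process
        if t >= horizon:
            return posts
        if rng.random() < rate(t) / rate_max:  # keep it with probability rate(t) / rate_max
            posts.append(t)

def daily_rate(t):
    """Hypothetical periodic posting rate with period T = 1 day; never exceeds 4."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)

rng = random.Random(0)
posts = simulate_posts(daily_rate, rate_max=4.0, horizon=500.0, rng=rng)
print(len(posts) / 500.0)  # empirical posts per day; the mean of daily_rate is 2
```

Passing a constant function as `rate` (with `rate_max` equal to that constant) gives the homogeneous model as a special case.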
![Page 10: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/10.jpg)
Definition of metrics
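- Delay of a post: time elapsed between its posting time $t_i$ and the next retrieval time $\tau_j$

$$D(t_i) = \tau_j - t_i$$

- Delay of a data source $O$ with $k$ posts: sum of elapsed time for every post

$$D(O) = \sum_{i=1}^{k} D(t_i)$$

- Delay experienced by the aggregator $A$ over $n$ data sources with weights $w_i$

$$D(A) = \sum_{i=1}^{n} w_i \, D(O_i)$$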
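The delay metrics can be sketched in a few lines of code. This is an illustration under stated assumptions, not the author's implementation: the function names `source_delay` and `aggregator_delay` and the sample timestamps are hypothetical, and every post is assumed to be followed by at least one retrieval.

```python
from bisect import bisect_left

def source_delay(post_times, retrieval_times):
    """D(O): sum over posts of (next retrieval time - posting time)."""
    retrievals = sorted(retrieval_times)
    total = 0.0
    for t in post_times:
        j = bisect_left(retrievals, t)  # index of the first retrieval at or after t
        if j == len(retrievals):
            raise ValueError(f"post at t={t} is never retrieved")
        total += retrievals[j] - t
    return total

def aggregator_delay(sources):
    """D(A): weighted sum of D(O_i); sources is a list of (weight, posts, retrievals)."""
    return sum(w * source_delay(posts, retrievals) for w, posts, retrievals in sources)

# Hypothetical data: three posts, two retrievals.
posts = [0.2, 0.5, 1.4]
retrievals = [1.0, 2.0]
print(source_delay(posts, retrievals))
print(aggregator_delay([(2.0, posts, retrievals), (1.0, [0.5], [1.0])]))
```

Here the posts at 0.2 and 0.5 wait for the retrieval at 1.0, and the post at 1.4 waits for the retrieval at 2.0.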
![Page 11: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/11.jpg)
Definition of metrics
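- $\tau_j$: retrieval time
- $\lambda(t)$: posting rate

Expected delay

- Homogeneous Poisson model

$$D(O) = \frac{\lambda \, (\tau_j - \tau_{j-1})^2}{2}$$

- Inhomogeneous Poisson model

$$D(O) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t) \, (\tau_j - t) \, dt$$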
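As a quick numerical check, the expected delay between two consecutive retrievals can be estimated by integrating rate × waiting time, and for a constant rate it should agree with the closed form λ(τj − τj−1)²/2. The sketch below assumes a plain midpoint rule; `expected_delay` and `homogeneous_delay` are hypothetical helper names.

```python
import math

def expected_delay(rate, tau_prev, tau_next, steps=10_000):
    """Midpoint-rule estimate of the integral of rate(t) * (tau_next - t) over [tau_prev, tau_next]."""
    h = (tau_next - tau_prev) / steps
    total = 0.0
    for k in range(steps):
        t = tau_prev + (k + 0.5) * h
        total += rate(t) * (tau_next - t)
    return total * h

def homogeneous_delay(lam, tau_prev, tau_next):
    """Closed form for a constant posting rate lam."""
    return lam * (tau_next - tau_prev) ** 2 / 2

# Constant rate: the numerical estimate and the closed form should agree.
print(expected_delay(lambda t: 3.0, 1.0, 3.0), homogeneous_delay(3.0, 1.0, 3.0))

# Time-varying rate (illustrative): only the numerical estimate applies.
print(expected_delay(lambda t: 2.0 + 2.0 * math.sin(2.0 * math.pi * t), 0.0, 1.0))
```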
![Page 12: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/12.jpg)
Problem formulation
Minimization of expected delay experienced by the aggregator under constraint of limited resources.
Schedule the $\tau_j$'s such that

$$D(A) = \sum_{i=1}^{n} w_i \, D(O_i)$$

is minimized.
![Page 13: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/13.jpg)
Approach
- Resource allocation
  - How often to contact data sources?
  - O1 is more active than O2; how much more often should we contact O1 than O2?
- Retrieval scheduling
  - When to contact a data source?
  - 3 retrievals are allocated for O1; when should these 3 retrievals be scheduled?
![Page 14: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/14.jpg)
Resource allocation
Consider $n$ data sources $O_1, \dots, O_n$

- $\lambda_i$: posting rate of $O_i$
- $w_i$: weight of $O_i$
- $N$: total number of retrievals per day
- $m_i$: number of retrievals per day allocated to $O_i$

Optimal allocation

$$m_i \propto \sqrt{w_i \lambda_i}$$
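The allocation step can be sketched as follows. The example assumes the square-root rule, mi proportional to √(wi λi), which is what minimizing Σ wi λi / (2 mi) subject to Σ mi = N yields when each source follows the homogeneous model and its retrievals are evenly spaced. The function names and the sample rates are hypothetical, and fractional retrieval counts are left unrounded.

```python
import math

def allocate_retrievals(rates, weights, total_retrievals):
    """Split total_retrievals across sources in proportion to sqrt(w_i * lambda_i)."""
    scores = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    scale = total_retrievals / sum(scores)
    return [scale * s for s in scores]

def daily_delay(rates, weights, allocation):
    """Expected weighted delay per day with evenly spaced retrievals: sum of w_i * lambda_i / (2 * m_i)."""
    return sum(w * lam / (2 * m) for lam, w, m in zip(rates, weights, allocation))

rates = [9.0, 1.0]    # O1 posts nine times as often as O2 (illustrative numbers)
weights = [1.0, 1.0]
allocation = allocate_retrievals(rates, weights, total_retrievals=8)
print(allocation)
print(daily_delay(rates, weights, allocation))
print(daily_delay(rates, weights, [4.0, 4.0]))  # uniform split, for comparison
```

Under this rule a source that is nine times more active receives three times as many retrievals, not nine times as many.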
![Page 15: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/15.jpg)
Retrieval scheduling
m retrieval(s) per day are allocated to a data source O. How should we schedule these m retrievals?

- m = 1
- m > 1
![Page 16: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/16.jpg)
Single retrieval per period
- λ(t) = 1 for t ∈ [0,1], λ(t) = 0 for t ∈ [1,2]
- Periodicity T = 2
- τ = 0.5, expected delay = 0.75
- τ = 1, expected delay = 0.5
- τ = 2, expected delay = 1.5
![Page 17: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/17.jpg)
Single retrieval per period
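For a data source with posting rate $\lambda(t)$ and period $T$, the expected delay when retrieved at time $\tau$ is given by:

$$D(\tau) = \int_{0}^{\tau} \lambda(t) \, (\tau - t) \, dt + \int_{\tau}^{T} \lambda(t) \, (T + \tau - t) \, dt$$

Criteria for optimality:

$$\lambda(\tau) = \frac{1}{T} \int_{0}^{T} \lambda(t) \, dt \quad \text{and} \quad \frac{d\lambda(\tau)}{d\tau} < 0$$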
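A small numerical sketch of the single-retrieval case: posts made before τ wait until τ, and posts made after τ wait until the retrieval in the next period. The code assumes a midpoint rule and a coarse scan over candidate retrieval times; `single_retrieval_delay` and `step_rate` are hypothetical names, and the step-shaped rate with T = 2 is the one from the earlier example slide.

```python
def single_retrieval_delay(rate, tau, period, steps=20_000):
    """Midpoint estimate of the expected delay per period for one retrieval at time tau."""
    h = period / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        # Posts before tau are picked up at tau; later posts wait for tau in the next period.
        wait = tau - t if t <= tau else period + tau - t
        total += rate(t) * wait
    return total * h

def step_rate(t):
    """lambda(t) = 1 on [0, 1) and 0 on [1, 2), repeating with period 2."""
    return 1.0 if t % 2.0 < 1.0 else 0.0

print(single_retrieval_delay(step_rate, 1.0, 2.0))  # retrieve right after the active phase
print(single_retrieval_delay(step_rate, 2.0, 2.0))  # retrieve at the end of the period

# Coarse scan over candidate retrieval times in (0, T].
candidates = [i / 10 for i in range(1, 21)]
best_tau = min(candidates, key=lambda tau: single_retrieval_delay(step_rate, tau, 2.0))
print(best_tau)
```

The scan picks the moment the posting rate drops, which matches the optimality criteria: the rate crosses its period average on the way down.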
![Page 18: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/18.jpg)
Multiple retrievals per period
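When $m$ retrievals per period are allocated and scheduled at times $\tau_1, \dots, \tau_m$, the expected delay is given by:

$$D(O) = \sum_{i=1}^{m} \int_{\tau_i}^{\tau_{i+1}} \lambda(t) \, (\tau_{i+1} - t) \, dt, \qquad \tau_{m+1} = T + \tau_1$$

Criteria for optimality:

$$\lambda(\tau_j) \, (\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t) \, dt$$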
![Page 19: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/19.jpg)
Example
6 retrievals for λ(t)=2+2sin(2πt)
Criteria for optimality:

$$\lambda(\tau_j) \, (\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t) \, dt$$
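The 6-retrieval example can be explored numerically. The sketch below evaluates the expected delay of a cyclic schedule for λ(t) = 2 + 2 sin(2πt), assuming period T = 1, and improves on evenly spaced retrievals with a simple local search. The local search is a stand-in chosen for brevity, not the method from the slides (which solves the optimality criteria), and all function names are hypothetical.

```python
import math

def rate(t):
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)

def interval_delay(rate, start, end, steps=200):
    """Midpoint estimate of the integral of rate(t) * (end - t) over [start, end]."""
    h = (end - start) / steps
    total = 0.0
    for k in range(steps):
        t = start + (k + 0.5) * h
        total += rate(t) * (end - t)
    return total * h

def schedule_delay(rate, taus, period):
    """Expected delay per period; each post waits for the next retrieval, cyclically."""
    taus = sorted(taus)
    total = 0.0
    for i, tau in enumerate(taus):
        prev = taus[i - 1] if i > 0 else taus[-1] - period
        total += interval_delay(rate, prev, tau)
    return total

def improve_schedule(rate, taus, period, step=0.02, rounds=40):
    """Greedy local search: nudge one retrieval at a time, keep moves that lower the delay."""
    taus = sorted(taus)
    best = schedule_delay(rate, taus, period)
    for _ in range(rounds):
        improved = False
        for j in range(len(taus)):
            for delta in (step, -step):
                trial = taus.copy()
                trial[j] += delta
                ordered = all(a < b for a, b in zip(trial, trial[1:]))
                if not ordered or trial[-1] - trial[0] >= period:
                    continue  # skip moves that reorder retrievals or span a full period
                delay = schedule_delay(rate, trial, period)
                if delay < best:
                    taus, best, improved = trial, delay, True
        if not improved:
            step /= 2  # refine the step once no single move helps
    return taus, best

uniform = [(j + 1) / 6 for j in range(6)]
uniform_delay = schedule_delay(rate, uniform, 1.0)
best_taus, best_delay = improve_schedule(rate, uniform, 1.0)
print(uniform_delay, best_delay)
print([round(t, 3) for t in best_taus])
```

Because the search only accepts moves that reduce the delay, the returned schedule is never worse than the evenly spaced starting point; it is a local improvement, not a proven optimum.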
![Page 20: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/20.jpg)
Experiment
Data – 10k RSS feeds over Oct – Dec 2004
![Page 21: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/21.jpg)
Performance
- CGM03: optimizes for “age”
- Ours: both resource allocation and retrieval scheduling
![Page 22: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/22.jpg)
Size of estimation window
- Resource constraint: 4 retrievals per day per feed on average
- 2 weeks is an appropriate choice
![Page 23: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/23.jpg)
Predictability of posting rate
90% of the RSS feeds post consistently
![Page 24: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/24.jpg)
Summaries and extensions
- Resource allocation is more aggressive
- Retrieval scheduling optimizes within an individual data source
- Include user access patterns
- Variable retrieval cost
![Page 25: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/25.jpg)
Collection
Future work
![Page 26: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/26.jpg)
Collection
- Blog hosting website
- Central repository
  - ~5.3M URLs from weblogs.com
  - Limited and contaminated
- Crawling
  - Retrieve the maximum number of blogs while reducing the number of irrelevant pages downloaded
| Domain | Count | Category |
|---|---|---|
| spaces.msn.com | 839,663 | Blog |
| blogspot.com | 362,957 | Blog |
| wretch.cc | 116,161 | Blog |
| search-net101.com | 89,750 | Spam/ads |
| abalty.com | 86,329 | Spam/ads |
| search-now854.com | 80,109 | Spam/ads |
| bigebiz.org | 79,059 | Spam/ads |
![Page 27: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/27.jpg)
Collection
- Blogs are inter-connected (blogrolls)
- Selectively following links, discovering hubs for blogs

[1] Chakrabarti et al., “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery”, International WWW Conference, 1999
![Page 28: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/28.jpg)
Relinquishment of blogs
Detection of abandoned blogs to save resources

[2] D. R. Cox, “Regression Models and Life-Tables (with discussion)”, Journal of the Royal Statistical Society, B(34), 1972

[3] Gina Venolia, “A Matter of Life or Death: Modeling Blog Mortality”, Technical report, Microsoft Research
![Page 29: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/29.jpg)
Topic detection and tracking
Future work
![Page 30: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/30.jpg)
Overview
- Characteristics
  - Document stream
  - Traces of information propagation among blogs
- Challenges
  - Modeling growth and death of a topic
  - Ranking of blog articles
  - Malicious content
![Page 31: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/31.jpg)
Influence network in blogs
Information is “diffused” among blogs

- Indicator of popularity
- Social relationship among bloggers
![Page 32: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/32.jpg)
Influence network in blogs
- Four major patterns of propagation
- Reconstruction of implicit network
  - Ranking (source authority)
  - Advertising campaign
![Page 33: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/33.jpg)
Data characteristics
~97-98% of daily content is new
![Page 34: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/34.jpg)
Data characteristics
The same content lasts for ~8 days
![Page 35: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/35.jpg)
Topics
- Topics with different lifespans
  - Bursty
  - Mid-range
  - Sustaining
- Evolution of topics

[4] J. Kleinberg, “Bursty and Hierarchical Structure in Streams”, SIGKDD 2002

[5] J. Kleinberg, “Temporal Dynamics of On-Line Information Streams”, in Data Stream Management: Processing High-Speed Data Streams, Springer, 2005
![Page 36: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/36.jpg)
Document similarity
- Sparse and diverse
- ~400 articles clustered into 21 clusters out of 10,000 daily articles (by DBSCAN)
![Page 37: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/37.jpg)
Framework
- Document stream approach
  - Filtering
  - Aggregation
![Page 38: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/38.jpg)
Problems
- Selecting a representative subset of documents from a topic cluster
  - Coverage
  - Distinctiveness among subset
- Ranking of documents
  - Time
  - Source authority
![Page 39: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/39.jpg)
Conclusion
1. Efficient collection of blogs and modeling the relinquishment
2. Monitoring and retrieval scheduling of rapidly changing data sources
3. Composing topic summary
   1. Reconstruction of an implicit influence network
   2. Representative document selection problem
![Page 40: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/40.jpg)
End
Questions?
![Page 41: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/41.jpg)
More examples
![Page 42: Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649d7d5503460f94a5fe59/html5/thumbnails/42.jpg)
Major posting patterns
K-means clustering