Internals of an Aggregated Web News Feed newsfeed.ijs.si Mitja Trampuš and Blaž Novak AI Lab, Jozef Stefan Institute

Post on 24-Jun-2015

Page 1: Internals Of An Aggregated Web News Feed

Internals of an Aggregated Web News Feed

newsfeed.ijs.si

Mitja Trampuš and Blaž Novak
AI Lab, Jozef Stefan Institute

Page 2: Internals Of An Aggregated Web News Feed

[Pipeline diagram: Monitor. Download. Clean. Enrich. Expose. Use.]

Page 3: Internals Of An Aggregated Web News Feed

[Pipeline diagram: Monitor. Download. Clean. Enrich. Expose. Use.]

Page 4: Internals Of An Aggregated Web News Feed

Monitor. Download.

• Sources: RSS, Google News, private feeds
  – 150 000 feeds
  – 15 000 publishers

• Sources of sources:
  – Bootstrap from public listings
  – Parse news articles for <link> entries
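
The "parse news articles for <link> entries" step can be sketched as feed autodiscovery: scan a downloaded page for <link rel="alternate"> tags that advertise an RSS or Atom feed. This is a minimal illustration using only the Python standard library; the names FeedLinkParser and discover_feeds are hypothetical and not taken from the newsfeed codebase.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally mark a feed in <link rel="alternate">.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects absolute URLs of feeds advertised through <link> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the article URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return feed URLs found in one article page (hypothetical helper)."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Newly discovered URLs would then be added to the list of monitored feeds; how duplicates and low-quality feeds are filtered is not described on the slide.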

Page 5: Internals Of An Aggregated Web News Feed

Monitor. Download.

• Quality management:
  – Punish technical errors
  – Adjustable crawl time

• Discovery delay for articles: 3 hours
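
One way to read "punish technical errors" together with "adjustable crawl time" is a per-feed polling interval that grows after failures and shrinks when a crawl finds new articles. The sketch below is an assumption about how such a policy could look; the bounds, the multipliers, and the names FeedState and update_schedule are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass

MIN_INTERVAL = 5 * 60          # assumed lower bound, in seconds
MAX_INTERVAL = 24 * 60 * 60    # assumed upper bound, in seconds


@dataclass
class FeedState:
    interval: float = 60 * 60  # seconds to wait before the next crawl
    error_streak: int = 0      # consecutive failed crawls


def update_schedule(feed, ok, new_articles=0):
    """Adjust the crawl interval of one feed after a crawl attempt."""
    if not ok:
        # Technical error (timeout, HTTP 5xx, unparsable feed): back off.
        feed.error_streak += 1
        feed.interval *= 2
    else:
        feed.error_streak = 0
        if new_articles > 0:
            feed.interval /= 2      # active feed: poll more often
        else:
            feed.interval *= 1.5    # quiet feed: poll less often
    feed.interval = max(MIN_INTERVAL, min(MAX_INTERVAL, feed.interval))
    return feed.interval
```

The error streak is kept so that a feed failing many times in a row could eventually be dropped; that cutoff is left out here because the slides do not give one.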

Page 6: Internals Of An Aggregated Web News Feed

[Pipeline diagram: Monitor. Download. Clean. Enrich. Expose. Use.]

Page 7: Internals Of An Aggregated Web News Feed

Clean. 1/2

• Methods in published papers work great
  – If evaluated on 10 sites

• Heuristic: Find the first block-level HTML element with lots of <p>aragraphs
  – failing that, a <td> or <div> with lots of text
  – avoid elements with lots of markup
  – site-independent

• Support for rNews/Schema.org
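
The first rule of the heuristic above can be sketched with the standard-library HTML parser: attribute every <p> to its nearest enclosing container, then return the text of the first container that holds enough paragraphs. This is a simplified illustration only. The threshold MIN_PARAGRAPHS, the set of container tags, and the function name extract_body are assumptions, and the fallback to text-heavy <td>/<div> elements and the markup-density check are not implemented.

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "td", "article", "section", "body"}  # assumed set
MIN_PARAGRAPHS = 3  # assumed meaning of "lots of" paragraphs


class ParagraphCounter(HTMLParser):
    """Records, per container element, the text of its direct paragraphs."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # every container, in document order
        self.open = []     # stack of currently open containers
        self.in_p = False  # True while inside a <p> of self.open[-1]

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            block = {"tag": tag, "paragraphs": []}
            self.blocks.append(block)
            self.open.append(block)
            self.in_p = False
        elif tag == "p":
            self.in_p = bool(self.open)
            if self.open:
                self.open[-1]["paragraphs"].append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag in CONTAINERS:
            # Close the innermost open container with this tag name.
            for i in range(len(self.open) - 1, -1, -1):
                if self.open[i]["tag"] == tag:
                    del self.open[i:]
                    break
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.open[-1]["paragraphs"][-1] += data


def extract_body(html):
    """Return the text of the first paragraph-rich container, or None."""
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        paragraphs = [p.strip() for p in block["paragraphs"] if p.strip()]
        if len(paragraphs) >= MIN_PARAGRAPHS:
            return "\n\n".join(paragraphs)
    return None
```

Returning None for pages without a paragraph-rich container matters because content-less pages are one of the pitfalls listed on the next slide.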

Page 8: Internals Of An Aggregated Web News Feed

Clean. 2/2

• Pitfalls
  – Pages with no content
  – Comments
  – Copyright notices

• Evaluation
  – 150 sites, one page per site
    • include content-less pages
  – 95% precision, 95% recall
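
The slide does not say at which granularity precision and recall were measured. As an illustration only, the sketch below assumes a word-level comparison between the extracted text and a hand-labelled gold text, and scores a content-less page as perfect when the extractor also returns nothing. The function name precision_recall and these conventions are assumptions.

```python
from collections import Counter


def precision_recall(extracted, gold):
    """Word-level precision and recall of extracted text against gold text."""
    ext = Counter(extracted.split())
    ref = Counter(gold.split())
    overlap = sum((ext & ref).values())  # multiset intersection
    # Convention (assumed): an empty side counts as perfect for its metric,
    # so an empty extraction on a content-less page scores (1.0, 1.0).
    precision = overlap / sum(ext.values()) if ext else 1.0
    recall = overlap / sum(ref.values()) if ref else 1.0
    return precision, recall
```

Under this convention, extracting a comment thread or copyright notice from a content-less page yields a precision of 0, which is how the pitfalls above would show up in the score.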

Page 9: Internals Of An Aggregated Web News Feed

[Pipeline diagram: Monitor. Download. Clean. Enrich. Expose. Use.]

Page 10: Internals Of An Aggregated Web News Feed

Enrich. 1/2

• Language detection:
  – 50 common languages: Chromium CLD
  – Long tail: Naive Bayes on character trigrams

• Language stats:
  – English 52%, German 7%, Spanish 7%, French 4%, Russian 3%, ..., Chinese 1%, Slovene 0.2%
  – 40 languages with >100 articles daily
  – 99% accuracy
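
The long-tail classifier is described only as "Naive Bayes on character trigrams". A minimal sketch of that technique follows; the add-one smoothing, the lowercasing and padding, and the class name TrigramNB are assumptions, and a real model would be trained on far more text per language.

```python
import math
from collections import Counter


def trigrams(text):
    """Overlapping character trigrams of the lowercased, padded text."""
    padded = "  " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNB:
    """Multinomial Naive Bayes over character trigrams (toy sketch)."""

    def __init__(self):
        self.counts = {}       # language -> Counter of trigram counts
        self.docs = Counter()  # language -> number of training texts
        self.vocab = set()     # all trigrams seen in training

    def train(self, lang, text):
        grams = trigrams(text)
        self.counts.setdefault(lang, Counter()).update(grams)
        self.docs[lang] += 1
        self.vocab.update(grams)

    def classify(self, text):
        grams = trigrams(text)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best, best_score = None, -math.inf
        for lang, counts in self.counts.items():
            total = sum(counts.values())
            # Log prior plus add-one-smoothed log likelihood per trigram.
            score = math.log(self.docs[lang] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best, best_score = lang, score
        return best
```

For example, after training on one English and one Slovene sentence, short inputs in either language are assigned to the language whose trigram counts explain them best.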

Page 11: Internals Of An Aggregated Web News Feed

Enrich. 2/2

• enrycher.ijs.si
  – DMOZ categorization
  – Named entity detection, resolution
  – (Sentiment)
  – (Deep parsing)
  – English, Slovene, more languages coming

• Geo-tagging
  – Publisher (WHOIS, public listings)
  – Content (named entities)

Page 12: Internals Of An Aggregated Web News Feed

[Pipeline diagram: Monitor. Download. Clean. Enrich. Expose. Use.]

Page 13: Internals Of An Aggregated Web News Feed

Expose. Use.

• XML, gzip filesystem cache
• HTTP service (polling)
• Command-line client

• Live demo, API: http://newsfeed.ijs.si/
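
A consumer of gzip-compressed XML batches could read them as a stream, as sketched below. The slides do not show the actual schema, so the element names (article, title, lang, body) and the function name read_articles are purely hypothetical; the real format is documented at the URL above. The polling request itself is omitted because the service's URL scheme is not given here.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def read_articles(gz_bytes):
    """Yield (title, language, body) tuples from one gzipped XML batch.

    Element names are hypothetical placeholders, not the real schema.
    """
    with gzip.open(io.BytesIO(gz_bytes), "rb") as fh:
        # iterparse avoids building the whole batch in memory at once.
        for _, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == "article":
                yield (
                    elem.findtext("title"),
                    elem.findtext("lang"),
                    elem.findtext("body"),
                )
                elem.clear()  # free the subtree once it has been consumed
```

A polling client would call such a reader on each newly fetched file and remember the last file it has seen.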

Page 14: Internals Of An Aggregated Web News Feed

Technology.

• Data volume: 100 000 articles/day
  Peak throughput: 10 articles/second

• One machine for semantic processing
  One machine for everything else

• Processing: Python, (Java, C++)
  Infrastructure: PostgreSQL, zeromq
  – Downloaders communicate through the DB
  – Processing strictly sequential, service-oriented
    • Each service: In case of errors, pass through
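
The "strictly sequential" processing with "in case of errors, pass through" can be sketched as a chain of services where a failing stage is skipped and the article continues unchanged. This is an in-process illustration under stated assumptions: the real services talk over zeromq, and run_pipeline and the three example services are hypothetical names.

```python
import logging


def run_pipeline(article, services):
    """Apply services in order; a failing service leaves the article as-is."""
    for service in services:
        try:
            # Each service gets a copy, so a crash cannot leave the
            # article half-modified.
            result = service(dict(article))
        except Exception:
            name = getattr(service, "__name__", repr(service))
            logging.exception("service %s failed; passing article through", name)
            continue
        article = result
    return article


# Hypothetical example services.
def add_language(article):
    article["lang"] = "en"
    return article


def broken_enricher(article):
    raise RuntimeError("enrichment backend unavailable")


def add_length(article):
    article["length"] = len(article["text"])
    return article
```

With this design a downstream consumer always receives every article, possibly with some enrichment fields missing, rather than losing articles when one stage is down.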

Page 15: Internals Of An Aggregated Web News Feed

The Bright Future.

• Feed quality management

• Increase the number of sources
  – Non-western in particular

• Compute news clusters