Internals of an Aggregated Web News Feed
newsfeed.ijs.si
Mitja Trampuš and Blaž Novak
AI Lab, Jozef Stefan Institute
Pipeline: Monitor. Download. → Clean. Enrich. → Expose. Use.
Monitor. Download.
• Sources: RSS, Google News, private feeds
  – 150 000 feeds
  – 15 000 publishers
• Sources of sources:
  – Bootstrap from public listings
  – Parse news articles for <link> entries
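Parsing article pages for <link> entries can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' crawler; the names `FeedLinkParser` and `discover_feeds` and the set of accepted MIME types are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that commonly advertise a feed (assumed list, not from the slides).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects the targets of <link rel="alternate"> tags that point at feeds."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k: (v or "") for k, v in attrs}
        rel = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and a.get("href"):
            # Resolve relative hrefs against the URL of the article page.
            self.found.append(urljoin(self.base_url, a["href"]))


def discover_feeds(html, base_url):
    """Return absolute URLs of all feeds advertised in the given HTML page."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.found
```

Newly discovered URLs would then be added to the list of monitored feeds; deduplication and validation are left out of the sketch.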
Monitor. Download.
• Quality management:
  – Punish technical errors
  – Adjustable crawl time
• Discovery delay for articles: 3 hours
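One way to combine "punish technical errors" with an adjustable crawl time is to keep a per-feed polling interval that grows after failures and shrinks when a poll yields new articles. The sketch below is only an assumption about how such a policy could look; the constants, the `FeedState` class and the `update_after_poll` function are hypothetical and not taken from the system described here.

```python
from dataclasses import dataclass

MIN_INTERVAL = 5 * 60      # assumed lower bound on the polling interval, seconds
MAX_INTERVAL = 24 * 3600   # assumed upper bound on the polling interval, seconds


@dataclass
class FeedState:
    url: str
    interval: float = 3600.0  # seconds until the next poll (assumed default)
    errors: int = 0           # consecutive technical errors


def update_after_poll(feed, ok, new_articles=0):
    """Adjust the feed's polling interval after one poll and return it."""
    if not ok:
        # Technical error (timeout, HTTP 5xx, malformed XML): back off.
        feed.errors += 1
        feed.interval *= 2
    else:
        feed.errors = 0
        if new_articles > 0:
            feed.interval /= 2    # active feed: poll more often
        else:
            feed.interval *= 1.5  # quiet feed: poll less often
    feed.interval = max(MIN_INTERVAL, min(MAX_INTERVAL, feed.interval))
    return feed.interval
```

The clamp keeps a persistently failing feed from dropping out of the schedule entirely while still limiting how much crawl time it consumes.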
Clean. 1/2
• Methods in published papers work great
  – If evaluated on 10 sites
• Heuristic: Find the first block-level HTML element with lots of <p>aragraphs
  – failing that, a <td> or <div> with lots of text
  – avoid elements with lots of markup
  – site-independent
• Support for rNews/Schema.org
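The first rule of the heuristic can be sketched with the standard library alone. The code below is a simplified illustration, not the authors' cleaner: it covers only the "block-level element with lots of <p> children" rule, it omits the text-density fallback and the markup penalty, and the threshold `MIN_PARAGRAPHS`, the `BLOCK_TAGS` set and all class and function names are assumptions.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "td", "article", "section", "body"}  # assumed candidates
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}
MIN_PARAGRAPHS = 3  # assumed meaning of "lots of" paragraphs


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a rough element tree; tolerant of stray end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.order = []  # element nodes in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "p" and self.current.tag == "p":
            self.current = self.current.parent  # implicit </p>
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.order.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def text_of(node):
    """Concatenate all text below a node, in document order."""
    return "".join(c if isinstance(c, str) else text_of(c) for c in node.children)


def extract_body(html):
    """Return the paragraphs of the first block element with enough <p> children."""
    builder = TreeBuilder()
    builder.feed(html)
    for node in builder.order:
        if node.tag not in BLOCK_TAGS:
            continue
        paragraphs = [c for c in node.children
                      if isinstance(c, Node) and c.tag == "p"]
        if len(paragraphs) >= MIN_PARAGRAPHS:
            return "\n".join(" ".join(text_of(p).split()) for p in paragraphs)
    return None  # no candidate found: treat the page as content-less
```

Because the rule looks only at page structure and never at site-specific class names or ids, the same code applies to any publisher, which is what makes the heuristic site-independent.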
Clean. 2/2
• Pitfalls
  – Pages with no content
  – Comments
  – Copyright notices
• Evaluation
  – 150 sites, one page per site
    • include content-less pages
  – 95% precision, 95% recall
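The slides do not say how precision and recall were computed. A common choice for content extraction is token-level overlap between the extracted text and a hand-labelled reference, sketched below under that assumption. The function name and the convention for empty inputs (needed for content-less pages) are illustrative choices, not the authors' evaluation protocol.

```python
from collections import Counter


def precision_recall(extracted, gold):
    """Token-level precision and recall of extracted text against a reference."""
    ext, ref = Counter(extracted.split()), Counter(gold.split())
    overlap = sum((ext & ref).values())  # multiset intersection
    n_ext, n_ref = sum(ext.values()), sum(ref.values())
    # Assumed convention: an empty extraction contains no wrong tokens, and an
    # empty reference (content-less page) has nothing that could be missed.
    precision = overlap / n_ext if n_ext else 1.0
    recall = overlap / n_ref if n_ref else 1.0
    return precision, recall
```

Under this convention, extracting a copyright notice from a content-less page scores zero precision, so the pitfalls listed above are penalized rather than ignored.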
Enrich. 1/2
• Language detection:
  – 50 common languages: Chromium CLD
  – Long tail: Naive Bayes on character trigrams
• Language stats:
  – English 52%, German 7%, Spanish 7%, French 4%, Russian 3%, ..., Chinese 1%, Slovene 0.2%
  – 40 languages with >100 articles daily
  – 99% accuracy
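For the long-tail languages, a Naive Bayes classifier over character trigrams can be sketched in a few lines. This is a toy version with add-one smoothing, assuming one training call per labelled document; the class name `TrigramNB` and the padding scheme are assumptions, and the production model and its training data are not described in the slides.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Character trigrams of lowercased text, padded to mark the boundaries."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNB:
    """Multinomial Naive Bayes over character trigrams with add-one smoothing."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> trigram counts
        self.docs = Counter()               # language -> number of documents
        self.vocab = set()                  # all trigrams seen in training

    def train(self, lang, text):
        grams = trigrams(text)
        self.counts[lang].update(grams)
        self.docs[lang] += 1
        self.vocab.update(grams)

    def predict(self, text):
        grams = trigrams(text)
        total_docs = sum(self.docs.values())
        v = len(self.vocab)
        best, best_score = None, -math.inf
        for lang, counter in self.counts.items():
            total = sum(counter.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[lang] / total_docs)
            for g in grams:
                score += math.log((counter[g] + 1) / (total + v))
            if score > best_score:
                best, best_score = lang, score
        return best
```

Character trigrams need no tokenizer or dictionary, which makes the approach practical for languages with little tooling.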
Enrich. 2/2
• enrycher.ijs.si
  – DMOZ categorization
  – Named entity detection, resolution
  – (Sentiment)
  – (Deep parsing)
  – English, Slovene, more languages coming
• Geo-tagging
  – Publisher (WHOIS, public listings)
  – Content (named entities)
Expose. Use.
• XML, gzip filesystem cache
• HTTP service (polling)
• Command-line client
• Live demo, API: http://newsfeed.ijs.si/
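A gzip-compressed XML cache on the filesystem can be illustrated with the standard library. The element names (`article-set`, `article`, `title`, `body`) and both function names are hypothetical; the actual newsfeed.ijs.si schema is not given in the slides, so this is only a sketch of the storage idea.

```python
import gzip
import xml.etree.ElementTree as ET


def write_batch(path, articles):
    """Serialize a batch of article dicts into one gzip-compressed XML file."""
    root = ET.Element("article-set")  # hypothetical element names throughout
    for art in articles:
        el = ET.SubElement(root, "article", id=str(art["id"]))
        ET.SubElement(el, "title").text = art["title"]
        ET.SubElement(el, "body").text = art["body"]
    with gzip.open(path, "wb") as f:
        f.write(ET.tostring(root, encoding="utf-8"))


def read_batch(path):
    """Read a batch back; ids come back as strings (XML attribute values)."""
    with gzip.open(path, "rb") as f:
        root = ET.fromstring(f.read())
    return [
        {"id": el.get("id"),
         "title": el.findtext("title"),
         "body": el.findtext("body")}
        for el in root.iter("article")
    ]
```

Storing immutable, compressed batches as plain files lets an HTTP service hand them to polling clients without touching the database.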
Technology.
• Data volume: 100 000 articles/day
  Peak throughput: 10 articles/second
• One machine for semantic processing
  One machine for everything else
• Processing: Python, (Java, C++)
  Infrastructure: PostgreSQL, zeromq
  – Downloaders communicate through the DB
  – Processing strictly sequential, service-oriented
    • Each service: In case of errors, pass through
The Bright Future.
• Feed quality management
• Increase the number of sources
  – Non-western in particular
• Compute news clusters