jsi news crawler
Post on 23-Feb-2016
JSI News Crawler
Blaz Novak, Mitja Trampus, Blaz Fortuna, Marko Grobelnik
JSI
JSI News Crawler
• The goal is to collect most of the world's news articles, including relevant blog posts
• Why collect data?
  • To be independent of commercial data providers
    • Commercial data providers (like Spinn3r, GNIP, DataSift) are expensive and not flexible in terms of data sources and additional services
  • To provide a data stream free of charge for research
• What data is available?
  • Database dumps
  • Articles annotated with Enrycher metadata
  • Clusters of similar articles
  • Real-time feed
Architecture
Open Web
JSI Crawler
Database of Collected Articles
Web Service API
Archive Explorer
Content in form:
• Clean text
• Linguistics
• Social Graph
• LOD Links
• Time
Control Panel
Enrycher
Real-Time Analytics
Developers
XML/RDF
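The slides do not show the actual record schema, so the following is only a minimal sketch of how one collected article could be represented and serialized to XML for the Web Service API. The `Article` class, its field names, and the element names are all hypothetical; they simply mirror the "Content in form" list above (clean text, LOD links, time).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Article:
    # All field names are hypothetical; they mirror the slide's content list.
    url: str
    clean_text: str
    language: str
    published: datetime
    lod_links: List[str] = field(default_factory=list)


def to_xml(article: Article) -> str:
    """Serialize one article to a small XML fragment (illustrative schema only)."""
    root = ET.Element("article", {"url": article.url, "lang": article.language})
    ET.SubElement(root, "time").text = article.published.isoformat()
    ET.SubElement(root, "text").text = article.clean_text
    links = ET.SubElement(root, "lod-links")
    for uri in article.lod_links:
        ET.SubElement(links, "link", {"href": uri})
    return ET.tostring(root, encoding="unicode")


# Made-up example record, not taken from the real stream.
example = Article(
    url="http://example.com/news/1",
    clean_text="Sample article body.",
    language="en",
    published=datetime(2012, 1, 15, 12, 0, tzinfo=timezone.utc),
    lod_links=["http://dbpedia.org/resource/Ljubljana"],
)
xml_fragment = to_xml(example)
print(xml_fragment)
```

A consumer could parse such a fragment with any standard XML library; the real feed format may differ substantially.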
Current statistics
• Data sources: ~110,000 unique websites
• Stream size: ~192,000 articles/day
• ~150 distinct languages
  • Good coverage of minority languages
• Current archive of ~35,000,000 articles
• Clear-text and language identification available
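To put these figures in perspective, a short back-of-the-envelope calculation derives a few rates from them. The inputs are the approximate numbers quoted on the slide; the derived values are rough, and the "days of stream" figure is only an equivalent at the current rate, not the true age of the archive (the download volume has grown over time).

```python
# Approximate figures quoted on the slide.
ARTICLES_PER_DAY = 192_000
UNIQUE_SOURCES = 110_000
ARCHIVE_SIZE = 35_000_000
SECONDS_PER_DAY = 24 * 60 * 60

articles_per_second = ARTICLES_PER_DAY / SECONDS_PER_DAY
articles_per_source_per_day = ARTICLES_PER_DAY / UNIQUE_SOURCES
archive_in_days_of_stream = ARCHIVE_SIZE / ARTICLES_PER_DAY

print(f"{articles_per_second:.2f} articles/second")              # 2.22
print(f"{articles_per_source_per_day:.2f} articles/source/day")  # 1.75
print(f"{archive_in_days_of_stream:.0f} days of stream")         # 182
```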
Sample Article from the stream
Download volume, yearly scale (2010)
Today's download volume, after adding 3k new sources + 1 week of backlog
Average and maximum number of story articles in a cluster (today)
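The chart on this slide is not reproduced in the transcript. As a sketch of how the two plotted statistics could be computed, assume each article carries a cluster identifier; the `(article_id, cluster_id)` pairs below are invented sample data, not values from the crawler.

```python
from collections import Counter
from statistics import mean

# Hypothetical (article_id, cluster_id) assignments for one day.
assignments = [
    ("a1", "c1"), ("a2", "c1"), ("a3", "c2"),
    ("a4", "c1"), ("a5", "c3"), ("a6", "c2"),
]

# Count how many articles fall into each story cluster.
cluster_sizes = Counter(cluster_id for _, cluster_id in assignments)

average_size = mean(cluster_sizes.values())
maximum_size = max(cluster_sizes.values())

print(f"average cluster size: {average_size}")
print(f"maximum cluster size: {maximum_size}")
```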
Control Panel
Plans
• In the first half of 2012, the plan is to release the service for public use
• …in the future, additional semantic annotation services will be added to provide additional value to the streamed data