JSI News Crawler

Blaz Novak, Mitja Trampus, Blaz Fortuna, Marko Grobelnik

JSI News Crawler

• The goal is to collect most of the world's news articles, including relevant blog posts

• Why collect data?
  • To be independent of commercial data providers
    • Commercial data providers (like Spinn3r, GNIP, DataSift) are expensive and not flexible in terms of data sources and additional services
  • To provide a data stream free of charge for research

• What data is available?
  • Database dumps
  • Articles annotated with Enrycher metadata
  • Clusters of similar articles
  • Real-time feed

Architecture

[Figure: architecture diagram. Labelled components: Open Web, JSI Crawler, Database of Collected Articles, Enrycher, Web Service API, Archive Explorer, Real-Time Analytics, Control Panel, Developers (XML/RDF).]

Content in form:
• Clean text
• Linguistics
• Social Graph
• LOD Links
• Time
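To make the flow between the components named on the architecture slide more concrete, here is a minimal sketch of one article passing from the crawler through text cleaning and enrichment into the article database. All class and function names (`Article`, `extract_clean_text`, `enrich`, `ArticleStore`, `process`) are hypothetical and chosen for illustration; the regex-based tag stripping and the token count are simple stand-ins, not the actual JSI Crawler or Enrycher logic.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Article:
    """One crawled article (hypothetical record layout)."""
    url: str
    raw_html: str
    fetched_at: datetime
    clean_text: str = ""
    annotations: dict = field(default_factory=dict)


def extract_clean_text(raw_html: str) -> str:
    """Strip tags and collapse whitespace; a stand-in for real clean-text extraction."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return " ".join(text.split())


def enrich(article: Article) -> Article:
    """Placeholder for the enrichment step; here it only records a token count."""
    article.annotations["token_count"] = len(article.clean_text.split())
    return article


class ArticleStore:
    """In-memory stand-in for the database of collected articles."""

    def __init__(self) -> None:
        self._articles: list[Article] = []

    def add(self, article: Article) -> None:
        self._articles.append(article)

    def since(self, timestamp: datetime) -> list[Article]:
        """Return articles fetched at or after the given time (time-based access)."""
        return [a for a in self._articles if a.fetched_at >= timestamp]


def process(url: str, raw_html: str, store: ArticleStore) -> Article:
    """Run one fetched page through cleaning, enrichment and storage."""
    article = Article(url, raw_html, datetime.now(timezone.utc))
    article.clean_text = extract_clean_text(raw_html)
    enrich(article)
    store.add(article)
    return article


store = ArticleStore()
example = process("http://example.com/a", "<p>Hello <b>world</b></p>", store)
print(example.clean_text, example.annotations)
```

A service layer or real-time feed would then read from the store, for instance with `store.since(...)`, rather than from the crawler directly.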

Current statistics

• Data sources: ~110,000 unique websites
• Stream size: ~192,000 articles/day
  • ~150 distinct languages
  • Good coverage of minority languages
• Current archive of ~35,000,000 articles
• Clear text and language identification available
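The quoted figures can be put in proportion with a little arithmetic. The sketch below uses only the approximate numbers from this slide; the function names are illustrative, and the "days of archive" value assumes a constant daily rate, which is a simplification since the stream has grown over time.

```python
# Approximate figures quoted on the "Current statistics" slide.
SOURCES = 110_000            # unique websites
ARTICLES_PER_DAY = 192_000   # stream size
ARCHIVE_SIZE = 35_000_000    # articles in the archive

SECONDS_PER_DAY = 24 * 60 * 60


def articles_per_source_per_day(articles_per_day: int, sources: int) -> float:
    """Average number of articles one source contributes per day."""
    return articles_per_day / sources


def articles_per_second(articles_per_day: int) -> float:
    """Average ingest rate the pipeline has to sustain."""
    return articles_per_day / SECONDS_PER_DAY


def archive_days_at_current_rate(archive_size: int, articles_per_day: int) -> float:
    """How many days of today's stream the archive corresponds to (constant-rate assumption)."""
    return archive_size / articles_per_day


per_source = articles_per_source_per_day(ARTICLES_PER_DAY, SOURCES)
per_second = articles_per_second(ARTICLES_PER_DAY)
archive_days = archive_days_at_current_rate(ARCHIVE_SIZE, ARTICLES_PER_DAY)

print(f"{per_source:.2f} articles per source per day")
print(f"{per_second:.2f} articles per second")
print(f"{archive_days:.0f} days of stream at the current rate")
```

In other words, an average source contributes fewer than two articles per day, the stream averages a little over two articles per second, and the archive corresponds to roughly half a year of today's volume.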

Sample Article from the stream

Download volume, yearly scale (2010)

Today's download volume, after adding 3k new sources + 1 week of backlog

Average and maximum number of story articles in a cluster (today)

Control Panel

• In the first half of 2012, the plan is to release the service for public use

• …in the future, additional semantic annotation services will be added to provide additional value to the streamed data

