BlogForever Project Presentation at MTSR 2013

Post on 10-May-2015

DESCRIPTION

BlogForever, a collaborative European Commission funded project, developed an exciting new system to harvest, preserve, manage and reuse blog content.

TRANSCRIPT

The BlogForever Project

http://blogforever.eu

MTSR 2013, 22 Nov 2013, Thessaloniki

Vangelis Banos, BlogForever Project Manager

Contents

The Disappearing Web

Web Archiving

The BlogForever Project

BlogForever Applications

Web content disappears

Web Archiving

The Internet Archive comes to the rescue!

Web archiving is the process of collecting portions of the World Wide Web so that the information is preserved in an archive for future researchers, historians, and the public.

The challenge of web archiving

[Diagram: Generic file archiving operation. Labels: File(s), Software, Hardware, Record.]

[Diagram: Web archiving operation. Labels: Website, Hardware, Software (several), File(s) (many), Record(s), with "???" marking the unclear steps that lead to the record(s).]

We are focusing on blogs

Blogs have become fairly established as an online communication and web publishing tool.

Hundreds of millions of blogs are published about every conceivable subject.

[Chart: Number of blogs, in millions, from July 2004 to Jan 2012 (source: blogpulse.com); vertical axis from 0 to 200.]

Examples, 12/9/2013

70+ million sites in the world
369 million people viewing more than 11.8 billion pages each month
38 million new posts and 62.3 million new comments each month

136.5 million blogs
61 billion posts
83.7 million daily posts

Blog Archiving: Objectives & Concerns

Blog characteristics:

Database-driven, dynamic websites
High frequency of updates
Special structure, metadata, semantics & communication protocols
Highly interconnected
Quantity and range of resources
Ownership and DRM

Our aims: harvest, preserve, manage and reuse blogs and their resources.

The BlogForever Project

Collaborative EC-funded project
Duration: 1 Mar 2011 to 31 Aug 2013
Aims: theoretical and applied research on blog archiving
Coordinated by AUTH
Partners: [partner logos]

BlogForever project achievements

BlogForever has created a novel blog archiving approach. It is not only about archiving pages; it is about archiving information entities (posts, comments, authors, metadata, dates, pingbacks, etc.).

Blog modelling and semantics
Case studies and validation
Preservation strategies
Implementation of the BlogForever platform
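To make the idea of archiving information entities rather than pages more concrete, here is a minimal sketch in Python. It is not the BlogForever data model: the class names, fields and the helper build_example_blog are assumptions chosen only to mirror the entities named on the slide (posts, comments, authors, dates, pingbacks).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Author:
    name: str
    profile_url: str | None = None


@dataclass
class Comment:
    author: Author
    published: datetime
    body: str


@dataclass
class Post:
    url: str
    title: str
    author: Author
    published: datetime
    body: str
    comments: list[Comment] = field(default_factory=list)
    # URLs of other posts that link to this one (hypothetical representation)
    pingbacks: list[str] = field(default_factory=list)


@dataclass
class Blog:
    url: str
    title: str
    posts: list[Post] = field(default_factory=list)

    def author_names(self) -> set[str]:
        """Names of everyone who wrote a post or a comment on this blog."""
        names = {p.author.name for p in self.posts}
        names.update(c.author.name for p in self.posts for c in p.comments)
        return names


def build_example_blog() -> Blog:
    """Build a tiny in-memory blog with one post and one comment."""
    alice, bob = Author("Alice"), Author("Bob")
    post = Post(
        url="http://example.org/blog/1",
        title="First post",
        author=alice,
        published=datetime(2013, 11, 22),
        body="Hello world",
    )
    post.comments.append(Comment(bob, datetime(2013, 11, 23), "Nice post"))
    return Blog("http://example.org/blog", "Example blog", [post])
```

Because each entity is a separate object, a question such as "who contributed to this blog?" becomes a simple traversal (author_names) instead of a text search over stored HTML pages.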

[Diagram: the BlogForever platform]

Blog crawlers (harvesting)
Inputs: unstructured information, web services, blog APIs
Real-time monitoring
HTML data extraction engine
Spam filtering
Web services extraction engine
Output: original data and XML metadata

Blog digital repository (preserving, managing and reusing)
Digital preservation
Quality assurance
Collections curation
Public access APIs
Personalised services
Information retrieval
Public web interface: browse, search, export
Access: web services, web interface
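As an illustration of the harvesting side, where crawlers monitor blogs and extract structured data, the following Python sketch parses an RSS 2.0 feed into post records and filters out posts that were already harvested. It is only a sketch under stated assumptions: the sample feed, the function names extract_posts and new_posts, and the use of RSS as the input are all hypothetical, and the actual BlogForever crawler is not described at this level in the slides.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; a real crawler would fetch this over HTTP.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.org/blog</link>
    <item>
      <title>First post</title>
      <link>http://example.org/blog/1</link>
      <pubDate>Fri, 22 Nov 2013 10:00:00 GMT</pubDate>
      <description>Hello world</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/blog/2</link>
      <pubDate>Sat, 23 Nov 2013 09:30:00 GMT</pubDate>
      <description>More text</description>
    </item>
  </channel>
</rss>
"""


def extract_posts(feed_xml: str) -> list[dict]:
    """Turn the <item> elements of an RSS 2.0 feed into plain dictionaries."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        posts.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            # RSS dates use the RFC 822 format; keep None if the date is missing
            "published": parsedate_to_datetime(raw_date) if raw_date else None,
            "summary": item.findtext("description", default=""),
        })
    return posts


def new_posts(posts: list[dict], seen_urls: set[str]) -> list[dict]:
    """Keep only posts whose URL has not been harvested before."""
    return [p for p in posts if p["url"] not in seen_urls]
```

Running extract_posts periodically and passing the set of already archived URLs to new_posts gives a very simple form of monitoring; spam filtering and HTML extraction would be separate steps.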

BlogForever Added Value

BlogForever structures the archived blog content. It is not only about archiving HTML pages; it is about archiving information entities (posts, comments, authors, metadata, dates, pingbacks, etc.) based on a special data model.

BlogForever is based on Invenio, an open source, state-of-the-art digital library management system developed by CERN.

Better metadata and higher information granularity.
Open standards and interoperability (MARCXML, web services).
Better management of archived information, increasing the utility of the web archive.
Easy to facilitate added-value services, e.g. analytics.
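To show what exposing a post as MARCXML could look like, here is a small Python sketch that serialises a few post fields into a MARC 21 record in the MARCXML namespace. The field mapping (100 for the author, 245 for the title, 260 for the date, 856 for the URL) is an illustrative assumption, not the mapping used by BlogForever or Invenio, and indicators are left blank for simplicity. The function name post_to_marcxml is hypothetical.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def post_to_marcxml(title: str, author: str, url: str, date: str) -> str:
    """Serialise basic post metadata as a single MARCXML <record>."""
    # Emit the MARC namespace as the default namespace (no prefix)
    ET.register_namespace("", MARC_NS)
    record = ET.Element(f"{{{MARC_NS}}}record")

    def datafield(tag: str, code: str, value: str) -> None:
        # One datafield with a single subfield; indicators left blank
        df = ET.SubElement(
            record,
            f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": " ", "ind2": " "},
        )
        sf = ET.SubElement(df, f"{{{MARC_NS}}}subfield", {"code": code})
        sf.text = value

    datafield("100", "a", author)  # main entry, personal name
    datafield("245", "a", title)   # title statement
    datafield("260", "c", date)    # date of publication
    datafield("856", "u", url)     # electronic location (URI)
    return ET.tostring(record, encoding="unicode")
```

For example, post_to_marcxml("First post", "Alice", "http://example.org/blog/1", "2013-11-22") returns a record that any MARCXML-aware tool can parse, which is the kind of interoperability the slide refers to.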

BlogForever Impact

Blog archiving methods and policies which are reusable and generic.
A blog archiving solution that any institution could use to preserve their collections of blogs, ensuring authenticity, integrity, completeness, usability and long-term accessibility.
A blog archiving solution that any researcher could use to gather, analyse and reuse blog data.

BlogForever Applications

CERN is currently implementing a high energy physics blogs repository.
AUTH is designing an academic blogs repository.
The Linguistics Department of the University of Hannover is doing a diachronic analysis of certain linguistic and textual phenomena / features using German blogs.
The University of Warwick Computer Science Department is doing social web analytics using blog data.

Thank you!

Visit http://blogforever.eu
Access all BlogForever Deliverables (Open Access).
Download the Open Source BlogForever Platform.

Contact us:
Project Manager: Vangelis Banos, vbanos@gmail.com
Exploitation Manager: Efstratios Arampatzis, sa@tero.gr