BlogForever project presentation at MTSR 2013
BlogForever, a collaborative European Commission-funded project, developed a new system to harvest, preserve, manage and reuse blog content.
The BlogForever Project
http://blogforever.eu
MTSR 2013, 22 Nov 2013, Thessaloniki
Vangelis Banos, BlogForever Project Manager
Contents
The Disappearing Web
Web Archiving
The BlogForever Project
BlogForever Applications
Web content disappears
Web Archiving
The Internet Archive comes to the rescue!
Web Archiving
The process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public.
The challenge of web archiving
[Diagram: generic file archiving operation. File(s), software and hardware are captured as one record.]
[Diagram: web archiving operation. A website spans many files, several software components and hardware; how these map to record(s) is left open ("???").]
We are focusing on blogs
Blogs have become fairly established as an online communication and web publishing tool.
Hundreds of millions of blogs are published about every conceivable subject.
Number of blogs, in millions (blogpulse.com)
July 2004: 3
Oct 2004: 4
Aug 2005: 14.2
Oct 2005: 19.6
Feb 2006: 27.2
Apr 2006: 34.5
Aug 2006: 50
Nov 2006: 57
Apr 2007: 70
Sep 2008: 133
Feb 2011: 156
July 2011: 164
Jan 2012: 182
Examples 12/9/2013
70+ million sites in the world
369 million people viewing more than 11.8 billion pages each month
38 million new posts and 62.3 million new comments each month

136.5 million blogs
61 billion posts
83.7 million daily posts
Blog Archiving: Objectives & Concerns
Blog characteristics:
Database-driven, dynamic websites
High frequency of updates
Special structure, metadata, semantics & communication protocols
Highly interconnected
Quantity and range of resources
Ownership and DRM
Our aims: harvest, preserve, manage and reuse blogs and their resources.
The BlogForever Project
Collaborative EC-funded project
Duration: 1 Mar 2011 to 31 Aug 2013
Aims: Theoretical and applied research on blog archiving
Coordinated by AUTH.
Partners:
BlogForever project achievements
BlogForever has created a novel blog archiving approach. It is not only about archiving pages. It is about archiving information entities (posts, comments, authors, metadata, dates, pingbacks, etc.).
Blog modelling and semantics
Case studies and validation
Preservation strategies
Implementation of the BlogForever platform
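To make the idea of archiving information entities rather than pages more concrete, here is a minimal sketch in Python. The class and field names are illustrative assumptions based only on the entity types listed on the slide; they are not the actual BlogForever data model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


# Hypothetical entity classes: names and fields are assumptions for illustration.
@dataclass
class Author:
    name: str
    profile_url: Optional[str] = None


@dataclass
class Comment:
    author: Author
    body: str
    published: datetime


@dataclass
class Post:
    url: str
    title: str
    body: str
    author: Author
    published: datetime
    comments: List[Comment] = field(default_factory=list)
    # URLs of other pages that announced a link to this post
    pingbacks: List[str] = field(default_factory=list)


@dataclass
class Blog:
    url: str
    title: str
    posts: List[Post] = field(default_factory=list)

    def entity_count(self) -> int:
        """Count posts plus their comments, i.e. the archived entities."""
        return len(self.posts) + sum(len(p.comments) for p in self.posts)
```

Keeping each entity as its own object means a comment or an author can be searched, counted or exported on its own, which a plain page snapshot does not allow.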
BlogForever project achievements
[Diagram: BlogForever platform architecture]
Harvesting: Blog crawlers
Real-time monitoring
HTML data extraction engine
Spam filtering
Web services extraction engine
Inputs: unstructured information, web services, blog APIs
Output: original data and XML metadata
Preserving, managing and reusing: Blog digital repository
Digital preservation
Quality assurance
Collections curation
Public access APIs
Personalised services
Information retrieval
Public web interface / Browse, search, export
Access: web services, web interface
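The harvesting side (blog crawlers that monitor blogs and extract structured data from feeds and web services) could be sketched roughly as follows. This is not the BlogForever crawler: the sample feed, the function names and the "seen URLs" bookkeeping are assumptions made for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set

# Hypothetical RSS 2.0 feed used as input; a real crawler would fetch this over HTTP.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://blog.example.org/</link>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/first-post</link>
      <pubDate>Thu, 21 Nov 2013 09:00:00 +0000</pubDate>
      <description>Hello, world.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.org/second-post</link>
      <pubDate>Fri, 22 Nov 2013 09:00:00 +0000</pubDate>
      <description>More news.</description>
    </item>
  </channel>
</rss>"""


def extract_posts(feed_xml: str) -> List[Dict[str, str]]:
    """Turn each <item> of an RSS feed into a dictionary of post fields."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "summary": item.findtext("description", default=""),
        })
    return posts


def new_posts(feed_xml: str, seen_urls: Set[str]) -> List[Dict[str, str]]:
    """Return only posts not harvested before, and remember their URLs."""
    fresh = [p for p in extract_posts(feed_xml) if p["url"] not in seen_urls]
    seen_urls.update(p["url"] for p in fresh)
    return fresh
```

Calling `new_posts` repeatedly with the same `seen_urls` set is a simple polling stand-in for monitoring: each call yields only the posts that appeared since the previous one.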
BlogForever Added Value
BlogForever structures the archived blog content. BlogForever is not only about archiving html pages. It is about archiving information entities (posts, comments, authors, metadata, dates, pingbacks, etc) based on a special data model.
BlogForever is based on Invenio, an open source, state-of-the-art digital library management system developed by CERN.
Better metadata and higher information granularity.
Open standards and interoperability (MARCXML, web services).
Better management of archived information, increasing the utility of the web archive.
Added-value services, e.g. analytics, are easy to build on top.
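As an illustration of the MARCXML interoperability mentioned above, the sketch below serialises one blog post as a MARCXML record with the Python standard library. The choice of MARC fields (100 for the author, 245 for the title, 856 for the URL), the blank indicators and the function name are simplifying assumptions; the mapping actually used by BlogForever is not described in these slides.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC 21 XML schema ("MARCXML").
MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)


def post_to_marcxml(title: str, author: str, url: str) -> str:
    """Build a minimal MARCXML record for one blog post (illustrative mapping)."""
    record = ET.Element(f"{{{MARC_NS}}}record")

    def add_field(tag: str, code: str, value: str) -> None:
        # Indicators are left blank here for simplicity.
        datafield = ET.SubElement(
            record,
            f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": " ", "ind2": " "},
        )
        subfield = ET.SubElement(datafield, f"{{{MARC_NS}}}subfield", {"code": code})
        subfield.text = value

    add_field("100", "a", author)  # main entry: personal name
    add_field("245", "a", title)   # title statement
    add_field("856", "u", url)     # electronic location (URL)
    return ET.tostring(record, encoding="unicode")
```

Because the record is standard MARCXML, any library system that reads MARC could in principle ingest it, which is the interoperability benefit the slide points to.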
BlogForever Impact
Blog archiving methods and policies which are reusable and generic.
A blog archiving solution that any institution could use to preserve their collections of blogs, ensuring authenticity, integrity, completeness, usability and long-term accessibility.
A blog archiving solution that any researcher could use to gather, analyse and reuse blog data.
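One common way to support the integrity goal named above is a fixity check: store a checksum when content is archived and recompute it later. The sketch below shows this with SHA-256 from the Python standard library. It is a generic digital preservation technique offered as an assumption, not a description of how the BlogForever platform verifies integrity; the function names are hypothetical.

```python
import hashlib


def fixity_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest recorded when content enters the archive."""
    return hashlib.sha256(content).hexdigest()


def verify_fixity(content: bytes, recorded_digest: str) -> bool:
    """Check that archived content still matches the digest recorded at ingest."""
    return fixity_digest(content) == recorded_digest
```

For example, an archive could store `fixity_digest(post_bytes)` next to each harvested post and run `verify_fixity` periodically to detect silent corruption or tampering.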
BlogForever Applications
CERN is currently implementing a high energy physics blogs repository.
AUTH is designing an academic blogs repository.
The Linguistics Department of the University of Hannover is doing a diachronic analysis of certain linguistic and textual phenomena / features using German blogs.
The University of Warwick Computer Science Department is doing social web analytics using blog data.
Thank you!
Visit http://blogforever.eu
Access all BlogForever Deliverables (Open Access).
Download the Open Source BlogForever Platform.
Contact us:
Project Manager: Vangelis Banos [email protected]
Exploitation Manager: Efstratios Arampatzis