Automatic Preservation Watch Using Information Extraction on the Web
DESCRIPTION
iPRES 2013 presentation of a proof-of-concept experiment that uses information extraction technologies to perform automatic preservation watch on natural-language information on the Web.
TRANSCRIPT
![Page 1: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/1.jpg)
Luis Faria
KEEP SOLUTIONS www.keep-solutions.com
Alan Akbik, Barbara Sierman, Marcel Ras, Miguel Ferreira, José Carlos Ramalho
iPRES 2013, Lisbon, September 2, 2013
Automatic Preservation Watch Using Information Extraction on the Web
![Page 2: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/2.jpg)
Why do we need monitoring?
Repository
• Format obsolescence
• Emerging technology
• Consumer trends
• New standards
• Organisation mission
• Bit rot
• Resource capability
• System availability
• Security breach
• Economical limitations
• Social and political factors
• Producer trends
• Organisation policies
![Page 3: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/3.jpg)
Why do we need monitoring?
Risks / Opportunities
![Page 4: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/4.jpg)
Survey on: Risk Assessment
[Pie chart with two segments, 60% and 40%. Legend: "Yes, but manual and ad hoc" and "None".]
![Page 5: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/5.jpg)
Scout: a preservation watch system
Monitors aspects of the world to detect preservation risks and opportunities
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
![Page 6: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/6.jpg)
Information Sources
• Format registries & software catalogues
• Digital repositories & web archives
• Organizational objectives
• Experiments
• Simulation
• Human knowledge
![Page 7: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/7.jpg)
Currently supported information sources
• PRONOM
• Repository content and events
• Web archive content
• Web archive renderability experiments
• SCAPE Policy model
![Page 8: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/8.jpg)
Define triggers
• Notify me when there are tools that can render the format X.
![Page 9: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/9.jpg)
Define triggers
Simple query with templates
![Page 10: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/10.jpg)
Receive notifications
HTTP Push API
There are tools that can render format X.
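As a rough illustration of the trigger-and-notify idea on these slides, the sketch below evaluates a templated trigger against a small in-memory list of facts and calls a notification callback when the trigger fires. All names here (`Fact`, `make_render_trigger`, `check_triggers`, the tool names) are hypothetical; this is not Scout's actual query language or push API, and a real setup would presumably replace the callback with an HTTP request.

```python
from typing import Callable, Iterable, List, NamedTuple


class Fact(NamedTuple):
    """One piece of knowledge gathered from an information source (hypothetical shape)."""
    tool: str
    action: str
    fmt: str


def make_render_trigger(fmt: str) -> Callable[[Iterable[Fact]], List[str]]:
    """Template: 'notify me when there are tools that can render format X'."""
    def condition(facts: Iterable[Fact]) -> List[str]:
        return sorted({f.tool for f in facts if f.action == "render" and f.fmt == fmt})
    return condition


def check_triggers(facts, triggers, notify):
    """Evaluate each named trigger and notify once for every trigger that fires."""
    fired = []
    for name, condition in triggers.items():
        tools = condition(facts)
        if tools:
            notify(f"{name}: {', '.join(tools)}")
            fired.append(name)
    return fired


# Made-up facts, standing in for data collected from registries and repositories.
facts = [
    Fact("ToolA", "render", "X"),
    Fact("ToolB", "convert", "X"),
    Fact("ToolC", "render", "Y"),
]
triggers = {"There are tools that can render format X": make_render_trigger("X")}
messages = []  # the callback just collects messages instead of pushing them over HTTP
fired = check_triggers(facts, triggers, messages.append)
```

Keeping the condition separate from the notification channel means the same trigger could feed e-mail, a log, or a push endpoint without changes.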
![Page 11: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/11.jpg)
Automatic Watch Limitations
Machine readable data
• Explicit and formally specified information
• Controlled vocabulary
• Ontology
• All instances use same structure and set of values
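To make the requirement above concrete, here is a minimal sketch of what "same structure and set of values" could mean in practice: every record has the same fields, and one field is restricted to a controlled vocabulary. The record shape, the field names and the vocabulary are assumptions made up for this example, not taken from any registry.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary for one property.
ALLOWED_STATUS = {"supported", "deprecated", "obsolete"}


@dataclass(frozen=True)
class FormatRecord:
    """Every instance carries the same fields, so a watch rule can rely on them."""
    format_id: str
    status: str


def validate(record: FormatRecord) -> list:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    if not record.format_id:
        problems.append("missing format_id")
    if record.status not in ALLOWED_STATUS:
        problems.append(f"status {record.status!r} is not in the controlled vocabulary")
    return problems


good = FormatRecord("fmt/example-1", "obsolete")
bad = FormatRecord("fmt/example-2", "kind of old")
```

A free-text value such as "kind of old" is easy for a person to read but cannot be matched reliably by an automatic rule, which is the limitation this slide points at.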
![Page 12: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/12.jpg)
Case study: e-Depot coverage
[Chart: "Publishers" and "Titles per publisher" (scale 0 to 600) against % of journal titles (40% to 100%). Callouts: "97% publishers", "1-10 titles".]
![Page 13: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/13.jpg)
e-journal coverage questions
• Which publisher provides which journal titles
• Publisher changes:
  • Ceases to provide journal
  • Transfers journal to other publisher(s)
  • Publishers merge
• Journal changes:
  • Name changes
  • ISSN changes
  • Ceased to exist
![Page 14: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/14.jpg)
Where is this information?
“In 1991, two years before the merger with Reed, Elsevier acquired Pergamon Press in the UK.”
“The Asia-Europe Foundation (ASEF) sold the Asia Europe Journal and transferred the copyright to its long-time partner Springer.”
“Acta Chirurgica Iugoslavica is available free of charge as an Open Access journal on the Internet.”
On the publisher's website!
![Page 15: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/15.jpg)
Where is this information?
On the publisher's website! Not machine readable!
![Page 16: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/16.jpg)
Information Extraction
• Extract structured information from unstructured data
• Pattern-based information extraction
• Some training and supervision may be needed
“[X] acquired [Y]”
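A minimal sketch of what applying the "[X] acquired [Y]" pattern could look like, using a regular expression over one of the sentences quoted earlier in the talk. Treating any run of capitalised words as an entity is a simplifying assumption of this example; the slides do not say how entities were detected, and real systems typically rely on NLP tooling for that step.

```python
import re

# A run of capitalised words, used as a crude stand-in for entity detection.
NAME = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*"
ACQUIRED = re.compile(rf"\b({NAME})\s+acquired\s+({NAME})")


def extract_acquisitions(sentence: str):
    """Return (acquirer, acquired) pairs matched by the '[X] acquired [Y]' pattern."""
    return [(m.group(1), m.group(2)) for m in ACQUIRED.finditer(sentence)]


sentence = ("In 1991, two years before the merger with Reed, "
            "Elsevier acquired Pergamon Press in the UK.")
pairs = extract_acquisitions(sentence)
print(pairs)
```

The capitalisation heuristic cuts entity names at the first lower-case word, which hints at why detecting name boundaries is a source of false positives.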
![Page 17: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/17.jpg)
Experiment
1. Data acquisition and pre-processing
2. Relation discovery
3. Information extraction
4. Validation of results
![Page 18: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/18.jpg)
1. Data acquisition and pre-processing
• Focused crawler with seed words (12,000 entries)
  • Publisher names
  • Journal titles
➡ 500,000 Web pages
• Pre-process with NLP tools
➡ 18 million sentences
➡ 8 GB
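The pre-processing step can be pictured roughly as follows: split page text into sentences and keep those that mention a seed term. The slides do not name the NLP tools that were used, so the regex-based splitter and the seed filter below are naive stand-ins chosen for this sketch, and the function names are hypothetical.

```python
import re

# Split after ., ! or ? when followed by whitespace and an upper-case letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str):
    """Very naive sentence splitter; real NLP toolkits also handle abbreviations."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def mentions_seed(sentence: str, seeds):
    """True if the sentence mentions at least one seed term (case-insensitive)."""
    lowered = sentence.lower()
    return any(seed.lower() in lowered for seed in seeds)


page_text = ("Springer is a publisher. The Asia Europe Journal is published by Springer. "
             "Contact us for details!")
seeds = ["Asia Europe Journal"]
sentences = split_sentences(page_text)
relevant = [s for s in sentences if mentions_seed(s, seeds)]
```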
![Page 19: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/19.jpg)
2. Relation discovery

| Prominent pattern | Rank |
|---|---|
| [X] journal of [Y] | 1 |
| [X] published by [Y] | 2 |
| [X] journal on [Y] | 3 |
| [X] journal published by [Y] | 4 |
| [X] available as [Y] journal | 5 |
| PubMed [X] [Y] | 9 |
| [X] science proceedings of [Y] | 25 |
| [X] subscription available to [Y] | 30 |
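One simple way to arrive at a ranked pattern list like the one above is to collect the words that appear between known entity pairs and count how often each pattern occurs. The slides do not describe the ranking method actually used, so plain frequency counting is a swapped-in simplification here, and the journal and publisher names are made up.

```python
import re
from collections import Counter


def patterns_between(sentence, xs, ys):
    """Yield '[X] <words> [Y]' patterns for each known X followed by a known Y."""
    for x in xs:
        for y in ys:
            m = re.search(re.escape(x) + r"\s+(.+?)\s+" + re.escape(y), sentence)
            if m:
                yield f"[X] {m.group(1)} [Y]"


def rank_patterns(sentences, xs, ys):
    """Count each pattern over the corpus and return them most frequent first."""
    counts = Counter(p for s in sentences for p in patterns_between(s, xs, ys))
    return counts.most_common()


# Toy corpus with made-up journal and publisher names.
journals = ["Journal A", "Journal B", "Journal C"]
publishers = ["Publisher P", "Publisher Q"]
corpus = [
    "Journal A is published by Publisher P.",
    "Journal B is published by Publisher Q.",
    "Journal C was acquired by Publisher P.",
]
ranking = rank_patterns(corpus, journals, publishers)
```

A person would then review the top-ranked patterns and keep those that express the relation of interest, which is where some supervision comes in.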
![Page 20: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/20.jpg)
![Page 21: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/21.jpg)
3. Information extraction
2,000 journal titles
500 journal-publisher attributions
![Page 22: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/22.jpg)
4. Validation of results
[Two pie charts. Legend: Should add, Existing, False-positives.]
• Journal titles in e-Depot: segments of 4%, 10% and 86%
• Title-publisher in the Keepers registry: segments of 15%, 50% and 35%
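The three categories in the legend can be read as a simple classification of each extracted item. The sketch below assumes that "Existing" means the item is already in the reference registry, "Should add" means it is missing but was confirmed as valid by a reviewer, and anything else is a false positive; that reading, the function names and the sample data are assumptions of this example and do not reproduce the experiment's figures.

```python
def classify(extracted, registry, confirmed_valid):
    """Sort extracted items into the three categories named in the legend."""
    result = {"Existing": [], "Should add": [], "False-positives": []}
    for item in extracted:
        if item in registry:
            result["Existing"].append(item)
        elif item in confirmed_valid:
            result["Should add"].append(item)
        else:
            result["False-positives"].append(item)
    return result


def shares(result):
    """Percentage of items per category, rounded to whole percent."""
    total = sum(len(v) for v in result.values())
    return {k: round(100 * len(v) / total) for k, v in result.items()}


# Made-up sample data.
extracted = ["Journal A", "Journal B", "Journal C", "Journl D%"]
registry = {"Journal A", "Journal B"}
confirmed_valid = {"Journal C"}
result = classify(extracted, registry, confirmed_valid)
percentages = shares(result)
```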
![Page 23: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/23.jpg)
False-positives
• Detecting boundaries of titles and publisher names
• Use of abbreviations in titles and publisher names
• Technical problems such as encoding
“European Journal of Nuclear Medicine and Molecular Imaging”
IAAE - “International Association of Agricultural Economists”
“├ó╦å┼buda University”
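The last example looks like text that was decoded with the wrong character encoding. The sketch below reproduces that kind of damage on a made-up name and shows one common repair attempt; which encodings were actually involved in the experiment is not stated in the slides, so UTF-8 read as Latin-1 is only an assumed scenario.

```python
def garble(text: str) -> str:
    """Simulate the bug: UTF-8 bytes wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def try_repair(text: str) -> str:
    """Undo the bug when possible; return the input unchanged otherwise."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "Éditions Exemple"  # made-up publisher name with a non-ASCII letter
broken = garble(original)
repaired = try_repair(broken)
```

Text that was never damaged passes through `try_repair` unchanged, so the function can be applied to every extracted name before matching it against a registry.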
![Page 24: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/24.jpg)
Conclusions
• We need data to support digital preservation
• Explicitly and formally specified, to allow automation
• Registries tend to be incomplete and outdated
• Information Extraction Technologies can help
• Still, some supervision may be needed
![Page 25: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/25.jpg)
Send us your use cases!
Alan Akbik
Luis Faria
Preservation Watch: What risks to monitor?
Information Extraction: What to extract from the web?
![Page 26: Automatic Preservation Watch Using Information Extraction on the Web](https://reader033.vdocuments.net/reader033/viewer/2022052522/54833d03b4af9f870d8b4992/html5/thumbnails/26.jpg)
Thank you, questions?
• Scout - a preservation watch system
  • Site: http://openplanets.github.io/scout/
  • Demo: http://scout.scape.keep.pt
• SCAPE Planning and Watch suite iPRES poster
  • http://bit.ly/scape-pw
• SCAPE
  • http://www.scape-project.eu