Large Scale Preservation Workflows with Taverna – SCAPE Training Event, Guimarães 2012
DESCRIPTION
Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.
TRANSCRIPT
Sven Schlarb, Austrian National Library
Keeping Control: Scalable Preservation Environments for Identification and Characterisation Guimarães, Portugal, 07/12/2012
Large scale preservation workflows with Taverna
SCAPE
What do you mean by "workflow"?
• Data flow rather than control flow
• (Semi-)automated data processing pipeline
• Defined inputs and outputs
• Modular and reusable processing units
• Easy to deploy, execute, and share
Modularise complex preservation tasks
• Assuming that complex preservation tasks can be separated into processing steps
• Together the steps represent the automated processing pipeline
Migrate → Characterise → Quality Assurance → Ingest
Experimental workflow development
• Easy to execute a workflow on standard platforms from anywhere
• Experimental data available online or downloadable
• Reproducible experiment results
• Workflow development as a community activity
Taverna
• Workflow language and computational model for creating composite data-intensive processing chains
• Developed since 2004 by the myGrid team at the University of Manchester, UK, as a tool for life scientists and bioinformaticians
• Available for Windows/Linux/OSX and as open source (LGPL)
SCUFL/T2FLOW/SCUFL2
• Alternative to other workflow description languages, such as the Business Process Execution Language (BPEL)
• SCUFL2 is Taverna's new workflow specification language (Taverna 3), workflow bundle format, and Java API
• SCUFL2 will replace the t2flow format (which replaced the SCUFL format)
• Adopts Linked Data technology
Creating workflows using Taverna
• Users interactively build data processing pipelines
• A set of nodes represents the data processing elements
• Nodes are connected by directed edges, and the workflow itself is a directed graph
• Nodes can have multiple inputs and outputs
• Workflows can contain other (embedded) workflows
Processors
• Web service clients (SOAP/REST)
• Local scripts (R and Beanshell languages; see the sketch below)
• Remote shell script invocations via SSH ("Tool" service)
• XML splitters - XSLT (interoperability!)
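As an illustration of a local script processor, here is a minimal sketch of a Beanshell script (Beanshell uses Java syntax). The port names are hypothetical; Taverna binds each input port as a script variable and reads each output port back from a variable of the same name.

    // Hypothetical Beanshell processor: derive the file extension from a path.
    // "filepath" is an input port; "extension" is an output port.
    String path = filepath;
    int dot = path.lastIndexOf('.');
    if (dot >= 0) {
        extension = path.substring(dot + 1);
    } else {
        extension = "";
    }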
List handling: Implicit iteration over multiple inputs
• A "single value" input port (list depth 0) processes values iteratively (foreach)
• A flat value list has list depth 1
• List depth > 1 for tree structures
• Multiple input ports with lists are combined as cross product or dot product
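The two combination strategies behave like nested loops versus a pairwise zip. A small Java sketch of the semantics (the file and tool names are made up for illustration):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class IterationStrategies {
        // Dot product: pair the items of both lists by index (like zip)
        static List<String> dotProduct(List<String> a, List<String> b) {
            List<String> out = new ArrayList<String>();
            int n = Math.min(a.size(), b.size());
            for (int i = 0; i < n; i++) out.add(a.get(i) + "+" + b.get(i));
            return out;
        }

        // Cross product: every combination of items from both lists
        static List<String> crossProduct(List<String> a, List<String> b) {
            List<String> out = new ArrayList<String>();
            for (String x : a) for (String y : b) out.add(x + "+" + y);
            return out;
        }

        public static void main(String[] args) {
            List<String> files = Arrays.asList("a.jp2", "b.jp2");
            List<String> tools = Arrays.asList("exiftool", "jpylyzer");
            System.out.println(dotProduct(files, tools));   // 2 pairs
            System.out.println(crossProduct(files, tools)); // 4 combinations
        }
    }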
Example: Tika Preservation Component
• Input: "file"
• Processor: Tika web service (SOAP)
• Output: MIME type
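In the workflow this runs as a SOAP call; purely for illustration, the same detection with Apache Tika's local Java facade might look like the sketch below (the command line argument stands in for the "file" input port).

    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;

    public class DetectMimeType {
        public static void main(String[] args) throws IOException {
            // Detect the MIME type of a file with Tika's facade class
            Tika tika = new Tika();
            String mimeType = tika.detect(new File(args[0]));
            System.out.println(mimeType); // e.g. "image/jp2" for a JPEG2000 file
        }
    }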
Workflow development and execution
• Local development: Taverna Workbench
Workflow registry
• Web 2.0 style registry: myExperiment
Remote workflow execution
• Web client using the REST API of Taverna Server
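A minimal sketch of what such a client might do, assuming a Taverna Server 2 instance on localhost:8080 (host, port, and the workflow file name are assumptions; the endpoint path and MIME type follow the Taverna Server REST documentation):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            byte[] t2flow = Files.readAllBytes(Paths.get("scanARC_TIKA.t2flow"));
            // POST the workflow definition to the runs collection; the server
            // answers with a Location header pointing at the new run resource
            URL url = new URL("http://localhost:8080/taverna-server/rest/runs");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "application/vnd.taverna.t2flow+xml");
            OutputStream os = con.getOutputStream();
            os.write(t2flow);
            os.close();
            System.out.println(con.getResponseCode() + " " + con.getHeaderField("Location"));
        }
    }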
Hadoop
• Open source implementation of MapReduce (Dean & Ghemawat, Google, 2004)
• Hadoop = MapReduce + HDFS
• HDFS: distributed file system, data stored in 64 MB (default) blocks
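A minimal sketch of putting a file into HDFS with the Hadoop Java API (the NAS path is one of the example paths from the slides; the HDFS target path is an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopyIn {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml
            FileSystem fs = FileSystem.get(conf);
            // Copy a local file into HDFS; it is stored as 64 MB blocks
            // (the default) replicated across the data nodes
            fs.copyFromLocalFile(new Path("/NAS/Z119585409/00000001.jp2"),
                                 new Path("/example/00000001.jp2"));
            System.out.println("Block size: " + fs.getDefaultBlockSize());
        }
    }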
Hadoop
• Job tracker (master) manages job execution on task trackers (workers)
• Each machine is configured to dedicate processing cores to MapReduce tasks (each core is a worker)
• Name node manages HDFS, i.e. the distribution of data blocks across data nodes
Hadoop job building blocks
• MapReduce application (JAR)
• Job configuration: set or override configuration parameters
• Map method: create intermediate key/value pair output
• Reduce method: aggregate the intermediate key/value pair output from the map phase
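The three building blocks map onto a Java MapReduce application roughly as in this generic sketch (a toy counting job, not one of the SCAPE components):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobSkeleton {

        // Map method: emit one intermediate key/value pair per input line
        public static class MyMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(new Text(value.toString()), new IntWritable(1));
            }
        }

        // Reduce method: aggregate the intermediate values per key
        public static class MyReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        // Job configuration: wire the pieces together and submit the JAR
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "job-skeleton");
            job.setJarByClass(JobSkeleton.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }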
Large scale execution environment
[Diagram: an Apache Tomcat web application hosting the Taverna Server (REST API), which submits jobs to the Hadoop job tracker on the cluster; a file server holds the data.]
Example: Characterisation of a large document collection
• Using the "Tool" service, remote SSH execution
• Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive)
• Available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
Reading image metadata
Create a text file containing the JPEG2000 input file paths and read the image metadata using ExifTool via the Hadoop Streaming API.
[Diagram: a "find" step reads the files from NAS and writes one JPEG2000 path per line:]
/NAS/Z119585409/00000001.jp2 /NAS/Z119585409/00000002.jp2 /NAS/Z119585409/00000003.jp2 … /NAS/Z117655409/00000001.jp2 /NAS/Z117655409/00000002.jp2 /NAS/Z117655409/00000003.jp2 … /NAS/Z119585987/00000001.jp2 /NAS/Z119585987/00000002.jp2 /NAS/Z119585987/00000003.jp2 … /NAS/Z119584539/00000001.jp2 /NAS/Z119584539/00000002.jp2 /NAS/Z119584539/00000003.jp2 … /NAS/Z119599879/00000001.jp2 /NAS/Z119589879/00000002.jp2 /NAS/Z119589879/00000003.jp2 ...
Data: 1.4 GB path list; 1.2 GB ExifTool output
Runtime: ~5 h + ~38 h = ~43 h for 60,000 books (24 million pages)
Workflow components: Jp2PathCreator, HadoopStreamingExiftoolRead
Z119585409/00000001 2345 Z119585409/00000002 2340 Z119585409/00000003 2543 … Z117655409/00000001 2300 Z117655409/00000002 2300 Z117655409/00000003 2345 … Z119585987/00000001 2300 Z119585987/00000002 2340 Z119585987/00000003 2432 … Z119584539/00000001 5205 Z119584539/00000002 2310 Z119584539/00000003 2134 … Z119599879/00000001 2312 Z119589879/00000002 2300 Z119589879/00000003 2300 ...
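Jp2PathCreator is realised in the workflow as a "Tool" invocation of find; a plain Java equivalent of that step might look like this sketch (the NAS mount point and output file name are assumptions):

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class Jp2PathLister {
        public static void main(String[] args) throws IOException {
            // Walk the NAS mount and write one JPEG2000 file path per line,
            // mirroring the "find" step that feeds the Hadoop Streaming job
            try (PrintWriter out = new PrintWriter("jp2-paths.txt");
                 Stream<Path> files = Files.walk(Paths.get("/NAS"))) {
                files.map(Path::toString)
                     .filter(p -> p.endsWith(".jp2"))
                     .forEach(out::println);
            }
        }
    }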
SequenceFile creation
Create a text file containing the HTML input file paths and create one sequence file with the complete file content in HDFS.
[Diagram: a "find" step reads the files from NAS and writes one HTML path per line:]
/NAS/Z119585409/00000707.html /NAS/Z119585409/00000708.html /NAS/Z119585409/00000709.html … /NAS/Z138682341/00000707.html /NAS/Z138682341/00000708.html /NAS/Z138682341/00000709.html … /NAS/Z178791257/00000707.html /NAS/Z178791257/00000708.html /NAS/Z178791257/00000709.html … /NAS/Z967985409/00000707.html /NAS/Z967985409/00000708.html /NAS/Z967985409/00000709.html … /NAS/Z196545409/00000707.html /NAS/Z196545409/00000708.html /NAS/Z196545409/00000709.html ...
[The sequence file maps record keys such as Z119585409/00000707, Z119585409/00000708, Z119585409/00000709, … to the page content.]
Data: 1.4 GB path list; 997 GB sequence file (uncompressed)
Runtime: ~5 h + ~24 h = ~29 h for 60,000 books (24 million pages)
Workflow components: HtmlPathCreator, SequenceFileCreator
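A sketch of the SequenceFileCreator step with the Hadoop 0.20-era API (the input list name and HDFS target path are assumptions; the workflow derives shorter record keys such as Z119585409/00000707 from the path):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class HtmlSequenceFileCreator {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // One record per HTML page: key = source path, value = content
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/example/html.seq"), Text.class, Text.class);
            BufferedReader paths = new BufferedReader(new FileReader("html-paths.txt"));
            String line;
            while ((line = paths.readLine()) != null) {
                byte[] content = Files.readAllBytes(Paths.get(line));
                writer.append(new Text(line), new Text(new String(content, "UTF-8")));
            }
            paths.close();
            writer.close();
        }
    }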
HTML Parsing
Execute a Hadoop MapReduce job that uses the sequence file created before to calculate the average paragraph block width.
[Diagram: the map phase reads each page (keys Z119585409/00000001 … Z119585409/00000005 …) from the sequence file and emits one width value per paragraph block:]
Z119585409/00000001 2100 Z119585409/00000001 2200 Z119585409/00000001 2300 Z119585409/00000001 2400
Z119585409/00000002 2100 Z119585409/00000002 2200 Z119585409/00000002 2300 Z119585409/00000002 2400
Z119585409/00000003 2100 Z119585409/00000003 2200 Z119585409/00000003 2300 Z119585409/00000003 2400
Z119585409/00000004 2100 Z119585409/00000004 2200 Z119585409/00000004 2300 Z119585409/00000004 2400
Z119585409/00000005 2100 Z119585409/00000005 2200 Z119585409/00000005 2300 Z119585409/00000005 2400
Runtime: ~6 h for 60,000 books (24 million pages)
[The reduce phase averages the widths per page:]
Z119585409/00000001 2250 Z119585409/00000002 2250 Z119585409/00000003 2250 Z119585409/00000004 2250 Z119585409/00000005 2250
Workflow component: HadoopAvBlockWidthMapReduce (Map/Reduce; input: sequence file, output: text file)
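A sketch of the reduce side of such a job: the map phase (not shown) parses each HTML page and emits one (page id, block width) pair per paragraph block, and the reducer averages them, which is how 2100, 2200, 2300, 2400 collapse to 2250 in the sample above. The class names are made up:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AvgBlockWidth {
        // Average the block widths emitted by the map phase per page id
        public static class AvgWidthReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text pageId, Iterable<IntWritable> widths, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0;
                int count = 0;
                for (IntWritable w : widths) {
                    sum += w.get();
                    count++;
                }
                // e.g. (2100 + 2200 + 2300 + 2400) / 4 = 2250
                ctx.write(pageId, new IntWritable((int) (sum / count)));
            }
        }
    }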
Analytic Queries
Create the Hive tables and load the generated data into the Hive database.
~6 h; 60,000 books (24 million pages)
Workflow components: HiveLoadExifData & HiveLoadHocrData
The jp2width and htmlwidth tables:
CREATE TABLE jp2width (jid STRING, jwidth INT)
CREATE TABLE htmlwidth (hid STRING, hwidth INT)
Sample rows loaded into the two tables:
Z119585409/00000001 1870 Z119585409/00000002 2100 Z119585409/00000003 2015 Z119585409/00000004 1350 Z119585409/00000005 1700
Z119585409/00000001 2250 Z119585409/00000002 2150 Z119585409/00000003 2125 Z119585409/00000004 2125 Z119585409/00000005 2250
Analytic Queries
~6 h; 60,000 books (24 million pages)
Workflow component: HiveSelect
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
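HiveSelect issues this query from the workflow; run manually through the HiveServer1 JDBC driver of that era, the same join might look like this sketch (driver class, connection URL, host and port are assumptions):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJoinQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer1 JDBC driver (Hive 0.x era)
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(
                    "select jid, jwidth, hwidth from jp2width "
                  + "inner join htmlwidth on jid = hid");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t"
                        + rs.getInt(2) + "\t" + rs.getInt(3));
            }
            con.close();
        }
    }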
Run a simple Hive query to test whether the database has been created successfully.
Example: Web Archiving
Hands on – Virtual machine
• Hadoop 0.20.2+923.421 in pseudo-distributed configuration
• Chromium web browser with Hadoop admin links
• Taverna Workbench 2.3.0
• NetBeans IDE 7.1.2
• SampleHadoopCommand.txt (executable Hadoop command for DEMO1)
• Latest patches
Hands on – VM setup
• Unpack scape4youTraining.tar.gz
• VirtualBox: Machine => Add => browse to the folder => select the VBOX file
• VM instance login:
  • user: scape
  • pw: scape123
Hands on – Demo1
• Using Hadoop for analysing ARC files
• Located at: /example/sampleIN/ (HDFS)
• Execution via the command in SampleHadoopCommand.txt (on the Desktop)
• The result can then be found at: /example/sample_OUT/
Hands on – Demo2
• Using Taverna for analysing ARC files
• Workflow: /home/scape/scanARC/scanARC_TIKA.t2flow
• Add the input as a file location (not "add value"!)
• Input: /home/scape/scanARC/input/ONBSample.txt
• Result: ~/scanARC/outputCSV/fullTIKAReport.csv
• See ~/scanARC/outputGraphics/graphicsTIKA/tika-