Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Sven Schlarb, Austrian National Library. Keeping Control: Scalable Preservation Environments for Identification and Characterisation, Guimarães, Portugal, 07/12/2012.

Upload: scape-project

Post on 05-Dec-2014


DESCRIPTION

Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.

TRANSCRIPT

Page 1: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

SCAPE

Sven Schlarb Austrian National Library

Keeping Control: Scalable Preservation Environments for Identification and Characterisation Guimarães, Portugal, 07/12/2012

Large scale preservation workflows with Taverna

Page 2: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

What do you mean by "Workflow"?

• Data flow rather than control flow
• (Semi-)automated data processing pipeline
• Defined inputs and outputs
• Modular and reusable processing units
• Easy to deploy, execute, and share

Page 3: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Modularise complex preservation tasks

• Assuming that complex preservation tasks can be separated into processing steps
• Together the steps represent the automated processing pipeline
• Example steps: Migrate, Characterise, Quality Assurance, Ingest
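The idea of chaining separate processing steps can be sketched in a few lines of Python. This is only an illustration: the step functions, the record fields and the `.tif` input path are hypothetical placeholders, not SCAPE or Taverna components.

```python
from functools import reduce
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[Record], Record]


def migrate(rec: Record) -> Record:
    # Hypothetical migration: only renames the extension to .jp2.
    stem = str(rec["path"]).rsplit(".", 1)[0]
    return {**rec, "path": stem + ".jp2"}


def characterise(rec: Record) -> Record:
    # Hypothetical characterisation: derives the format from the extension.
    return {**rec, "format": str(rec["path"]).rsplit(".", 1)[-1]}


def quality_assurance(rec: Record) -> Record:
    # Hypothetical check: the migrated file must be a JPEG2000 file.
    return {**rec, "qa_passed": rec.get("format") == "jp2"}


def ingest(rec: Record) -> Record:
    # Hypothetical ingest: only records that passed QA are ingested.
    return {**rec, "ingested": bool(rec.get("qa_passed"))}


def run_pipeline(steps: List[Step], rec: Record) -> Record:
    # Feed the output of each step into the next one (data flow).
    return reduce(lambda acc, step: step(acc), steps, rec)


PIPELINE = [migrate, characterise, quality_assurance, ingest]
result = run_pipeline(PIPELINE, {"path": "/NAS/Z119585409/00000001.tif"})
print(result)
```

Each step has a defined input and output, so steps can be reordered, replaced or reused in other pipelines.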

Page 4: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Experimental workflow development

• Easy to execute a workflow on standard platforms from anywhere
• Experimental data available online or downloadable
• Reproducible experiment results
• Workflow development as a community activity

Page 5: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Taverna

• Workflow language and computational model for creating composite data-intensive processing chains

• Developed since 2004 as a tool for life scientists and bio-informaticians by myGrid, University of Manchester, UK

• Available for Windows/Linux/OSX and as open source (LGPL)

Page 6: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

SCUFL/T2FLOW/SCUFL2

• Alternative to other workflow description languages, such as the Business Process Execution Language (BPEL)

• SCUFL2 is Taverna's new workflow specification language (Taverna 3), workflow bundle format, and Java API

• SCUFL2 will replace the t2flow format (which replaced the SCUFL format)

• Adopts Linked Data technology

Page 7: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Creating workflows using Taverna

• Users interactively build data processing pipelines
• Set of nodes represents data processing elements
• Nodes are connected by directed edges and the workflow itself is a directed graph
• Nodes can have multiple inputs and outputs
• Workflows can contain other (embedded) workflows
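To illustrate what "the workflow is a directed graph" means for execution, the sketch below orders the nodes of a small graph so that each node runs after the nodes it depends on. It uses Python's standard `graphlib` module, not Taverna's own engine or API; the node names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes whose output it consumes.
workflow = {
    "migrate": set(),
    "characterise": {"migrate"},
    "quality_assurance": {"migrate", "characterise"},
    "ingest": {"quality_assurance"},
}

# A valid execution order visits every node after all of its inputs.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

A node with several incoming edges, such as `quality_assurance` here, corresponds to a processor with multiple input ports.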

Page 8: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Processors

• Web service clients (SOAP/REST)
• Local scripts (R and Beanshell languages)
• Remote shell script invocations via ssh (Tool)
• XML splitters - XSLT (interoperability!)

Page 9: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

List handling: Implicit iteration over multiple inputs

• A "single value" input port (list depth 0) processes values iteratively (foreach)
• A flat value list has list depth 1
• List depth > 1 for tree structures
• Multiple input ports with lists are combined as cross product or dot product
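The difference between the two combination strategies can be shown with plain Python lists. This is a simplified sketch, not Taverna code: the function names and sample values are made up, the results are flat lists of pairs rather than nested lists, and raising an error on lists of unequal length is a choice of this sketch.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def implicit_iteration(processor: Callable, values: Sequence) -> List:
    # A depth-0 port that receives a depth-1 list is invoked once per item.
    return [processor(value) for value in values]


def cross_product(a: Sequence, b: Sequence) -> List[Tuple]:
    # Every item of the first list is paired with every item of the second.
    return list(product(a, b))


def dot_product(a: Sequence, b: Sequence) -> List[Tuple]:
    # Items are paired by position: first with first, second with second.
    if len(a) != len(b):
        raise ValueError("dot product needs lists of equal length")
    return list(zip(a, b))


files = ["a.jp2", "b.jp2"]
tools = ["exiftool", "tika"]
print(cross_product(files, tools))  # 4 invocations
print(dot_product(files, tools))    # 2 invocations
```

With n and m items, a cross product triggers n × m processor invocations, a dot product only n, which matters when the lists hold millions of file paths.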

Page 10: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Example: Tika Preservation Component

• Input: "file"

• Processor: Tika web service (SOAP)

• Output: Mime-Type

Page 11: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Workflow development and execution

• Local development: Taverna Workbench

Page 12: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Workflow registry

• Web 2.0 style registry: myExperiment

Page 13: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Remote Workflow Execution

• Web client using REST API of Taverna Server

Page 14: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hadoop

• Open source implementation of MapReduce (Dean & Ghemawat, Google, 2004)
• Hadoop = MapReduce + HDFS
• HDFS: Distributed file system, data stored in 64 MB (default) blocks
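As a rough feel for what the block size means, the following sketch computes how many blocks a file occupies. It assumes binary units (1 MB = 1024 × 1024 bytes) and ignores replication; the 997 GB example size is the sequence file mentioned later in this talk.

```python
BLOCK_SIZE = 64 * 1024 ** 2  # 64 MB default block size, in bytes


def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    # Ceiling division: the last block may be only partially filled.
    return -(-file_size // block_size)


sequence_file_size = 997 * 1024 ** 3  # 997 GB, assuming binary gigabytes
print(blocks_needed(sequence_file_size))
```

Each block can be processed by a separate map task, which is why a single large file still parallelises well across the cluster.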

Page 15: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hadoop

• Job tracker (master) manages job execution on task trackers (workers)

• Each machine is configured to dedicate processing cores to MapReduce tasks (each core is a worker)

• Name node manages HDFS, i.e. distribution of data blocks on data nodes

Page 16: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hadoop job building blocks

Map/reduce application (JAR):

• Job configuration: Set or overwrite configuration parameters
• Map method: Create intermediate key/value pair output
• Reduce method: Aggregate intermediate key/value pair output from map

Page 17: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Cluster

Page 18: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Large scale execution environment

[Diagram: Apache Tomcat web application, Taverna Server (REST API), Hadoop Jobtracker, file server, cluster]

Page 19: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Example: Characterisation on a large document collection

• Using "Tool" service, remote ssh execution
• Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive)
• Available on myExperiment: http://www.myexperiment.org/workflows/3105
• See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna

Page 20: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Create text file containing JPEG2000 input file paths and read image metadata using Exiftool via the Hadoop Streaming API.

Page 21: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Reading image metadata

• Jp2PathCreator: find on the NAS produces a text file of JPEG2000 file paths (1.4 GB), e.g.
  /NAS/Z119585409/00000001.jp2
  /NAS/Z119585409/00000002.jp2
  /NAS/Z119585409/00000003.jp2
  …
  /NAS/Z117655409/00000001.jp2
  …
• HadoopStreamingExiftoolRead: reads the files from the NAS and outputs one value per page identifier (1.2 GB), e.g.
  Z119585409/00000001 2345
  Z119585409/00000002 2340
  Z119585409/00000003 2543
  …
  Z117655409/00000001 2300
  …
• Runtime: ~ 5 h + ~ 38 h = ~ 43 h for 60,000 books (24 million pages)
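A Hadoop Streaming mapper for this step could look roughly like the sketch below. It is an assumption-laden illustration, not the script used in the workflow: the function names are invented, the key is simply derived from the last two path components, and `exiftool -s3 -ImageWidth` is one plausible way to obtain the width, which requires exiftool on each worker node.

```python
import subprocess
from typing import Callable, Iterable, Iterator


def page_key(path: str) -> str:
    # "/NAS/Z119585409/00000001.jp2" -> "Z119585409/00000001"
    parts = path.strip().split("/")
    book, page = parts[-2], parts[-1]
    return f"{book}/{page.rsplit('.', 1)[0]}"


def exiftool_width(path: str) -> str:
    # Assumption: exiftool is installed; -s3 prints the bare tag value.
    completed = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout.strip()


def map_lines(lines: Iterable[str],
              read_width: Callable[[str], str] = exiftool_width) -> Iterator[str]:
    # Streaming mappers emit tab-separated key/value lines.
    for line in lines:
        path = line.strip()
        if not path:
            continue  # skip empty input lines
        yield f"{page_key(path)}\t{read_width(path)}"


# Dry run with a stand-in reader instead of calling exiftool.
sample = ["/NAS/Z119585409/00000001.jp2\n"]
print(list(map_lines(sample, read_width=lambda p: "2345")))
```

In an actual streaming job the script would iterate over `sys.stdin` and print each record, since Hadoop Streaming passes input lines on standard input and collects standard output.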

Page 22: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Create text file containing HTML input file paths and create one sequence file with the complete file content in HDFS.

Page 23: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

SequenceFile creation

• HtmlPathCreator: find on the NAS produces a text file of HTML file paths (1.4 GB), e.g.
  /NAS/Z119585409/00000707.html
  /NAS/Z119585409/00000708.html
  /NAS/Z119585409/00000709.html
  …
  /NAS/Z138682341/00000707.html
  …
• SequenceFileCreator: reads the files from the NAS and writes one sequence file (997 GB uncompressed) keyed by page identifier, e.g.
  Z119585409/00000707
  Z119585409/00000708
  Z119585409/00000709
  …
• Runtime: ~ 5 h + ~ 24 h = ~ 29 h for 60,000 books (24 million pages)

Page 24: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Execute Hadoop MapReduce job using the sequence file created before in order to calculate the average paragraph block width.

Page 25: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

HTML Parsing

• HadoopAvBlockWidthMapReduce: MapReduce job that reads the SequenceFile and writes a text file
• Map: emits one width per paragraph block, keyed by page identifier, e.g.
  Z119585409/00000001 2100
  Z119585409/00000001 2200
  Z119585409/00000001 2300
  Z119585409/00000001 2400
  Z119585409/00000002 2100
  …
• Reduce: averages the widths per page, e.g.
  Z119585409/00000001 2250
  Z119585409/00000002 2250
  Z119585409/00000003 2250
  …
• Runtime: ~ 6 h for 60,000 books (24 million pages)
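The map and reduce roles in this step can be mimicked locally in plain Python. The sketch below is not the Hadoop job itself: it skips the HTML parsing and takes the block widths as given, and the function names are invented for illustration. The sample values are the ones shown on the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_page(page_id: str, block_widths: List[int]) -> Iterator[Tuple[str, int]]:
    # Map: emit one (page identifier, width) pair per paragraph block.
    for width in block_widths:
        yield page_id, width


def reduce_average(pairs: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    # Reduce: group the pairs by key and average the widths per page.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


pages = {
    "Z119585409/00000001": [2100, 2200, 2300, 2400],
    "Z119585409/00000002": [2100, 2200, 2300, 2400],
}
intermediate = [pair for pid, widths in pages.items() for pair in map_page(pid, widths)]
averages = reduce_average(intermediate)
print(averages)
```

On the cluster, the grouping by key is done by Hadoop's shuffle phase between map and reduce, so the reduce method only sees the values of one key at a time.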

Page 26: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Create Hive tables and load generated data into the Hive database.

Page 27: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Analytic Queries

• Workflow steps: HiveLoadExifData & HiveLoadHocrData
• Runtime: ~ 6 h for 60,000 books (24 million pages)
• Tables htmlwidth and jp2width, with sample rows:

Z119585409/00000001 1870
Z119585409/00000002 2100
Z119585409/00000003 2015
Z119585409/00000004 1350
Z119585409/00000005 1700

Z119585409/00000001 2250
Z119585409/00000002 2150
Z119585409/00000003 2125
Z119585409/00000004 2125
Z119585409/00000005 2250

CREATE TABLE jp2width (jid STRING, jwidth INT)

CREATE TABLE htmlwidth (hid STRING, hwidth INT)

Page 28: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Analytic Queries

• Workflow step: HiveSelect
• Runtime: ~ 6 h for 60,000 books (24 million pages)
• Joins the htmlwidth and jp2width tables on the page identifier:

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
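The effect of the join can be tried out locally without a Hive installation. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, with column names taken from the query (`jid`, `hid`). Which sample values belong to which table is an assumption made here, and the unmatched fourth row is added only to show what an inner join drops.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

conn.executemany("INSERT INTO jp2width VALUES (?, ?)", [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),  # no HTML row: dropped by the join
])
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
])

# Same join as on the slide, with an ORDER BY for a stable result.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
).fetchall()
for row in rows:
    print(row)
```

Having both widths side by side per page makes it possible to query for pages where the image width and the average text block width differ markedly.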

Page 29: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Run a simple Hive query to test whether the database has been created successfully.

Page 30: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Example: Web Archiving

Page 31: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hands on – Virtual machine

• Hadoop 0.20.2+923.421 in pseudo-distributed configuration
• Chromium web browser with Hadoop admin links
• Taverna Workbench 2.3.0
• NetBeans IDE 7.1.2
• SampleHadoopCommand.txt (executable Hadoop command for DEMO1)
• Latest patches

Page 32: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hands on – VM setup

• Unpack scape4youTraining.tar.gz
• VirtualBox: Machine => Add => Browse to folder => select VBOX file
• VM instance login:
  • user: scape
  • pw: scape123

Page 33: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hands on – Demo1

• Using Hadoop for analysing ARC files
• Located at: /example/sampleIN/ (HDFS)
• Execution via command in: SampleHadoopCommand.txt (on Desktop)
• Result can then be found at: /example/sample_OUT/

Page 34: Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Hands on – Demo2

• Using Taverna for analysing ARC files
• Workflow: /home/scape/scanARC/scanARC_TIKA.t2flow
• ADD FILE LOCATION (not add value!!)
• Input: /home/scape/scanARC/input/ONBSample.txt
• Result: ~/scanARC/outputCSV/fullTIKAReport.csv

• See ~/scanARC/outputGraphics/ graphicsTIKA/tika-