Large-Scale Data Storage and Processing for Scientists with Hadoop

Large-Scale Data Storage and Processing for Scientists in The Netherlands
[email protected]
January 21, 2011, NBIC BioAssist Programmers' Day

DESCRIPTION

A presentation for the NBIC BioAssist Programmers' Day on Friday, January 21st, 2011.

TRANSCRIPT

Page 1: Large-Scale Data Storage and Processing for Scientists with Hadoop

Large-Scale Data Storage and Processing

for Scientists in The Netherlands

[email protected]
January 21, 2011
NBIC BioAssist Programmers' Day

Page 2: Large-Scale Data Storage and Processing for Scientists with Hadoop

Super computing

Cluster computing

Grid computing

Cloud computing

GPU computing

Page 3: Large-Scale Data Storage and Processing for Scientists with Hadoop

Status Quo: Storage separate from Compute

Page 4: Large-Scale Data Storage and Processing for Scientists with Hadoop

Case Study: Virtual Knowledge Studio

● How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)

● 2.7 TB raw text, single file

● Java application that searches for categories in Wiki markup, like [[Category:NAME]] (a minimal sketch of such matching follows below)

● Executed on the Grid

http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
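
The slides don't show the VKS application itself; purely as an illustration, a minimal Java sketch of this kind of [[Category:NAME]] matching could look as follows (the class name and sample input are made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryGrep {
    // Capture the category name; [^\]|]+ stops at "]]" or at a sort key like [[Category:Foo|Bar]]
    private static final Pattern CATEGORY =
        Pattern.compile("\\[\\[Category:([^\\]|]+)");

    public static void main(String[] args) {
        String line = "Some article text [[Category:Bioinformatics]] more text";
        Matcher m = CATEGORY.matcher(line);
        while (m.find()) {
            System.out.println(m.group(1).trim());   // prints "Bioinformatics"
        }
    }
}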

Page 5: Large-Scale Data Storage and Processing for Scientists with Hadoop

1.1) Copy the file from the local machine to Grid storage

2.1) Stream the file from Grid storage to a single machine
2.2) Cut it into pieces of 10 GB
2.3) Stream the pieces back to Grid storage

3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back

Page 6: Large-Scale Data Storage and Processing for Scientists with Hadoop

Status Quo: Arrange your own (Data-)parallelism

● Cut the dataset up into “processable chunks”:
– The size of a chunk depends on the local space on the processing node...
– ... on the total processing capacity available...
– ... on the smallest unit of work (“largest grade of parallelism”)...
– ... on the substance (sometimes you don't want many output files, e.g. when building a search index).

● Submit the number of jobs you consider necessary:
– To a cluster close to your data (270 x 10 GB over the WAN is a bad idea)
– The number might depend on cluster capacity, the number of chunks, the smallest unit of work, the substance...

When dealing with large data, let's say 100 GB+, this is ERROR PRONE, TIME CONSUMING AND NOT FOR NEWBIES!

Page 7: Large-Scale Data Storage and Processing for Scientists with Hadoop

Timeline: Nutch* (2002) → MapReduce / GFS papers** (2004) → Hadoop (2006)

*  http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
   http://labs.google.com/papers/gfs.html

Page 8: Large-Scale Data Storage and Processing for Scientists with Hadoop

http://wiki.apache.org/hadoop/PoweredBy

2010/2011: A Hype in Production

Page 9: Large-Scale Data Storage and Processing for Scientists with Hadoop

Page 10: Large-Scale Data Storage and Processing for Scientists with Hadoop

Hadoop Distributed File System (HDFS)

● Very large DFS. Order of magnitude:
– 10k nodes
– millions of files
– PetaBytes of storage

● Assumes “commodity hardware”:
– redundancy through replication
– failure handling and recovery

● Optimized for batch processing:
– locations of data exposed to computation
– high aggregate bandwidth

http://www.slideshare.net/jhammerb/hdfs-architecture

Page 11: Large-Scale Data Storage and Processing for Scientists with Hadoop

HDFS Continued...

● Single Namespace for the entire cluster
● Data coherency
– Write-once-read-many model
– Only appending is supported for existing files

● Files are broken up into chunks (“blocks”)
– Block sizes ranging from 64 to 256 MB, depending on configuration
– Blocks are distributed over nodes (a single FILE, consisting of N blocks, is stored on M nodes)
– Blocks are replicated, and the replicas are distributed

● The client accesses the blocks of a file at the nodes directly (see the sketch below)
– This creates high aggregate bandwidth!

http://www.slideshare.net/jhammerb/hdfs-architecture
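
Block locations are also exposed through the Java API, which is what makes data-local scheduling possible. A minimal sketch (the file path is hypothetical; the configuration is assumed to point at an HDFS) that asks the NameNode where a file's blocks live:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/wikipedia.xml"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block: offset, length and the DataNodes holding replicas
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset() + ", " + block.getLength()
                + " bytes on " + java.util.Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}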

Page 12: Large-Scale Data Storage and Processing for Scientists with Hadoop

HDFS NameNode & DataNodes

NameNode

● Manages File System Namespace

– Mapping filename to blocks

– Mapping blocks to DataNodes

● Cluster Configuration

● Replication Management

DataNode

● A “Block Server”

– Stores data in local FS

– Stores metadata of a block (e.g. hash)

– Serves (meta)data to clients

● Facilitates a pipeline to other DNs

Page 13: Large-Scale Data Storage and Processing for Scientists with Hadoop

Metadata operations
● Communicate with the NN only
– ls, lsr, df, du, chmod, chown... etc.

R/W (block) operations
● Communicate with the NN and DNs
– put, copyFromLocal, copyToLocal, tail... etc.

http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
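
The same split shows up in the Java API. As an illustration (paths are hypothetical), a metadata call such as listStatus() is answered by the NameNode alone, while opening and reading a file also involves the DataNodes that hold its blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetaVersusRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Metadata only: a NameNode round trip, no DataNode involved
        for (FileStatus s : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(s.getPath() + "  " + s.getLen() + " bytes");
        }

        // Block read: NameNode for the block locations, DataNodes for the bytes
        FSDataInputStream in = fs.open(new Path("/user/demo/part-00000"));
        byte[] buffer = new byte[4096];
        int n = in.read(buffer);
        System.out.println("read " + n + " bytes");
        in.close();
        fs.close();
    }
}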

Page 14: Large-Scale Data Storage and Processing for Scientists with Hadoop

HDFS Application Programming Interface (API)

● Enables programmers to access any HDFS from their code

● Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html

● Written in (and thus available for) Java, but...

● Is also exposed through Apache Thrift, so it can be accessed from:
– C++, Python, PHP, Ruby, and others
– See http://wiki.apache.org/hadoop/HDFS-APIs

● Has a separate C-API (libhdfs)

So: you can enable your services to work with HDFS
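
For instance, a service can point the FileSystem API at an HDFS by its URI and read and write files directly. A minimal sketch, assuming a hypothetical NameNode address and path and the 0.20-era Java API:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFromAService {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode address; normally taken from core-site.xml
        FileSystem fs = FileSystem.get(
            URI.create("hdfs://namenode.example.org:8020"), new Configuration());

        Path path = new Path("/user/demo/hello.txt");    // hypothetical path

        // Write once...
        FSDataOutputStream out = fs.create(path, true);  // overwrite if it exists
        out.writeBytes("Hello, HDFS!\n");
        out.close();

        // ...read many
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        System.out.println(in.readLine());
        in.close();
        fs.close();
    }
}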

Page 15: Large-Scale Data Storage and Processing for Scientists with Hadoop

MapReduce

● Is a framework for distributed (parallel) processing of large datasets
● Provides a programming model
● Lets users plug in their own code
● Uses a common pattern (sketched in Java below):

cat   | grep | sort    | unique > file
input | map  | shuffle | reduce > output

● Is useful for large-scale data analytics and processing
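
In the Java API that pattern becomes a Mapper and a Reducer. A minimal sketch of the model (an illustration only, not the VKS code), counting [[Category:...]] tags per category name with the 0.20-era org.apache.hadoop.mapreduce API:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CategoryCount {

    // map: (offset, line of wiki markup) -> (category, 1)
    public static class CategoryMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Pattern CATEGORY = Pattern.compile("\\[\\[Category:([^\\]|]+)");
        private static final IntWritable ONE = new IntWritable(1);
        private final Text category = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = CATEGORY.matcher(value.toString());
            while (m.find()) {
                category.set(m.group(1).trim());
                context.write(category, ONE);
            }
        }
    }

    // reduce: (category, [1, 1, ...]) -> (category, count)
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

The shuffle between map and reduce is done by the framework: it groups all (category, 1) pairs by key and hands each group to a single reduce call.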

Page 16: Large-Scale Data Storage and Processing for Scientists with Hadoop

MapReduce Continued...

● Is great for processing large datasets!
– Sends computation to the data, so little data goes over the lines
– Uses the blocks stored in the DFS, so no splitting required
  (this is a bit more sophisticated depending on your input)
● Handles parallelism for you
– One map per block, if possible
● Scales basically linearly (see the example below)
– time_on_cluster = time_on_single_core / total_cores
● Java, but streaming possible (plus others, see later)
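
To put a number on that idealized formula: a job that needs 100 hours on a single core would take roughly 100 / 20 = 5 hours on a 20-core cluster. The figures are made up for illustration; scheduling and I/O overhead eat into this in practice.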

Page 17: Large-Scale Data Storage and Processing for Scientists with Hadoop

MapReduce: JobTracker & TaskTrackers

JobTracker

● Holds job metadata

– Status of job

– Status of Tasks running on TTs

● Decides on scheduling

● Delegates creation of 'InputSplits'

TaskTracker

● Requests work from the JT

– Fetches the code to execute from the DFS

– Applies job-specific configuration

● Communicates with the JT on tasks:

– Sending output, killing tasks, task updates, etc.

Page 18: Large-Scale Data Storage and Processing for Scientists with Hadoop

MapReduce client

Page 19: Large-Scale Data Storage and Processing for Scientists with Hadoop

MapReduce Application Programming Interface (API)

● Enables programmers to write MapReduce jobs
● More info on MR jobs:
http://www.slideshare.net/evertlammerts/infodoc-6107350

● Enables programmers to communicate with a JobTracker
– Submitting jobs, getting statuses, cancelling jobs, etc. (see the driver sketch below)

● Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
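
As a sketch of what that looks like in code (illustration only; the input and output paths are hypothetical), a driver for the CategoryCount example above configures a Job and submits it to the JobTracker:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CategoryCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "category-count");             // 0.20-era constructor
        job.setJarByClass(CategoryCountDriver.class);

        job.setMapperClass(CategoryCount.CategoryMapper.class);
        job.setCombinerClass(CategoryCount.SumReducer.class);  // local pre-aggregation
        job.setReducerClass(CategoryCount.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths
        FileInputFormat.addInputPath(job, new Path("/data/wikipedia"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/category-counts"));

        // Submits the job to the JobTracker and polls its status until completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}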

Page 20: Large-Scale Data Storage and Processing for Scientists with Hadoop

Case Study: Virtual Knowledge Studio

1) Load file into HDFS

2) Submit code to MR

Page 21: Large-Scale Data Storage and Processing for Scientists with Hadoop

What's more on Hadoop?

Lots!

● Apache Pig http://pig.apache.org

– Analyze datasets in a high level language, “Pig Latin”

– Simple! SQL like. Extremely fast experiments.

– N-stage jobs (MR chaining!)

● Apache Hive http://hive.apache.org

– Data Warehousing

– Hive QL

● Apache HBase http://hbase.apache.org

– BigTable implementation (Google)

– In-memory operation

– Performance good enough for websites (Facebook built its Messaging Platform on top of it)

● Yahoo! Oozie http://yahoo.github.com/oozie/

– Hadoop workflow engine

● Apache [AVRO | Chukwa | Hama | Mahout] and so on

● 3rd Party:

– ElephantBird

– Cloudera's Distribution for Hadoop

– Hue

– Yahoo's Distribution for Hadoop

Page 22: Large-Scale Data Storage and Processing for Scientists with Hadoop

Hadoop @ SARA

A prototype cluster

● Since December 2010

● 20 cores for MR (TT's)

● 110 TB gross for HDFS (DNs) (55 TB net)

● Hue web-interface for job submission & management

● SFTP interface to HDFS

● Pig 0.8

● Hive

● Available for scientists / scientific programmers until May / June 2011

Towards a production infrastructure?

● Depending on results

It's open for you all as well: ask me for an account!

Page 23: Large-Scale Data Storage and Processing for Scientists with Hadoop

Page 24: Large-Scale Data Storage and Processing for Scientists with Hadoop

Hadoop for:
● Large-scale data storage and processing
● Fundamental difference: data locality!
● Small files? Don't, but... Hadoop Archives (HAR)
● Archival? Don't. Use tape storage. (We have lots!)
● Very fast analytics (Pig!)
● For data-parallel applications (not good at cross products – use Huygens or Lisa!)
● Legacy applications possible through piping / streaming (Weird dependencies? Use Cloud!)

We'll do another Hackathon on Hadoop. Interested? Send me a mail!