Attacking Data Intensive Science with Distributed Computing

Prof. Douglas Thain
University of Notre Dame
http://www.cse.nd.edu/~dthain



TRANSCRIPT

Page 1: Attacking Data Intensive Science with Distributed Computing

Attacking Data Intensive Science with Distributed Computing

Prof. Douglas Thain

University of Notre Dame

http://www.cse.nd.edu/~dthain

Page 2: Attacking Data Intensive Science with Distributed Computing

Outline

Large Scale Distributed Computing
– Plentiful Computing Resources World-Wide
– Challenges: Data and Debugging

The Cooperative Computing Lab
– Distributed Data Management
– Applications to Scientific Computing
– Debugging Complex Systems

Open Problems in Distributed Computing
– Proposal: The All-Pairs Abstraction

Page 3: Attacking Data Intensive Science with Distributed Computing

Plentiful Computing Power

As of 04 Sep 2006:

TeraGrid
– 21,972 CPUs / 220 TB / 6 sites

Open Science Grid
– 21,156 CPUs / 83 TB / 61 sites

Condor Worldwide:
– 96,352 CPUs / 1608 sites

At Notre Dame:
– CRC: 500 CPUs
– BOB: 212 CPUs
– Lots of little clusters!

Page 4: Attacking Data Intensive Science with Distributed Computing

Who is using all of this? Anyone with unlimited computing needs!

High Energy Physics:
– Simulating the detector of a particle accelerator before turning it on allows one to understand the output.

Biochemistry:
– Simulate complex molecules under different forces to understand how they fold/mate/react.

Biometrics:
– Given a large database of human images, evaluate matching algorithms by comparing all to all.

Climatology:
– Given a starting global climate, simulate how climate develops under varying assumptions or events.

Page 5: Attacking Data Intensive Science with Distributed Computing

Buzzwords

Distributed Computing
Cluster Computing
Beowulf
Grid Computing
Utility Computing
Something@Home

= A bunch of computers.

Page 6: Attacking Data Intensive Science with Distributed Computing

Some Outstanding Successes

TeraGrid:
– AMANDA project uses 1000s of CPUs over months to calibrate and process data from a neutrino telescope.

PlanetLab:
– Hundreds of nodes used to test and validate a wide variety of dist. and P2P systems: Chord, Pastry, etc...

Condor:
– MetaNEOS project solves a 30-year-old optimization problem using brute force on 1000 heterogeneous CPUs across multiple sites over several weeks.

Seti@Home:
– Millions of CPUs used to analyze celestial signals.

Page 7: Attacking Data Intensive Science with Distributed Computing

And now the bad news...And now the bad news...

Large distributed systems fall to pieces

when you have lots of data!

Page 8: Attacking Data Intensive Science with Distributed Computing

Example: Grid3 (OSG)

Robert Gardner, et al. (102 authors)
“The Grid3 Production Grid: Principles and Practice”
IEEE HPDC 2004

“The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory that has sustained for several months the production-level services required by… ATLAS, CMS, SDSS, LIGO…”

Page 9: Attacking Data Intensive Science with Distributed Computing

Problem: Data Management

The good news:
– 27 sites with 2800 CPUs
– 40985 CPU-days provided over 6 months
– 10 applications with 1300 simultaneous jobs

The bad news:
– 40-70 percent utilization
– 30 percent of jobs would fail
– 90 percent of failures were site problems
– Most site failures were due to disk space!

Page 10: Attacking Data Intensive Science with Distributed Computing

Problem: Debugging

“Most groups reported problems in which a job had been submitted... and something had not performed correctly, but they were unable to determine where, why, or how to fix that problem...”

Jennifer Schopf and Steven Newhouse, “State of Grid Users: 25 Conversations with UK eScience Users”

Argonne National Lab Tech Report ANL/MCS-TM-278, 2004.

Page 11: Attacking Data Intensive Science with Distributed Computing

Both Problems: Debugging I/O

A user submits 1000 jobs to a grid.

Each requires 1 GB of input.

100 start at once. (Quite normal.)

The interleaved transfers all fail.

The “robust” system retries...

(Happened last week in this department!)
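A rough estimate shows why (the link speed below is an assumption for illustration, not a figure from the talk): if the inputs come from a single file server on a 1 Gbit/s link (about 125 MB/s), then 100 concurrent 1 GB transfers get roughly 1.25 MB/s each and need on the order of 800 seconds to finish, so any per-transfer timeout shorter than that fails all of them at once, and a blind retry recreates exactly the same overload.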

Page 12: Attacking Data Intensive Science with Distributed Computing

A Common Thread:

Each of these problems:
– “I can’t make storage do what I want!”
– “I have no idea why this system is failing!”

Arises from the following:
– Both service providers and users are lacking the tools and models that they need to harness and analyze complex environments.

Page 13: Attacking Data Intensive Science with Distributed Computing

Outline

Large Scale Distributed Computing
– Plentiful Computing Resources World-Wide
– Challenges: Data and Debugging

The Cooperative Computing Lab
– Distributed Data Management
– Applications to Scientific Computing
– Debugging Complex Systems

Open Problems in Distributed Computing
– Proposal: The All-Pairs Abstraction

Page 14: Attacking Data Intensive Science with Distributed Computing

Cooperative Computing Lab at the University of Notre Dame

Basic Computer Science Research
– Overlapping categories: operating systems, distributed systems, grid computing, filesystems, databases...

Modest Local Operation
– 300 CPUs, 20 TB of storage, 6 stakeholders
– Keeps us honest + eat our own dog food.

Software Development and Publication
– http://www.cctools.org
– Students learn engineering as well as science.

Collaboration with External Users
– High energy physics, bioinformatics, molecular dynamics...

http://www.cse.nd.edu/~ccl

Page 15: Attacking Data Intensive Science with Distributed Computing

Computing Environment

[Diagram: CPUs and disks from the Fitzpatrick Workstation Cluster, the CCL Research Cluster, the CVRL Research Cluster, and miscellaneous CSE workstations are pooled behind the Condor MatchMaker, which pairs queued jobs with machines according to each owner's policy: "I will only run jobs when there is no-one working at the keyboard," "I will only run jobs between midnight and 8 AM," "I prefer to run a job submitted by a CCL student."]

Page 16: Attacking Data Intensive Science with Distributed Computing

[Charts: CPU History and Storage History.]

Page 17: Attacking Data Intensive Science with Distributed Computing

Flocking Between Universities

Notre Dame: 300 CPUs
Wisconsin: 1200 CPUs
Purdue A: 541 CPUs
Purdue B: 1016 CPUs

http://www.cse.nd.edu/~ccl/operations/condor/

Page 18: Attacking Data Intensive Science with Distributed Computing

Problems and Solutions

“I can’t make storage do what I want!”
– Need root access, configure, reboot, etc...
– Solution: Tactical Storage Systems

“I have no idea why this system is failing!”
– Multiple services, unreliable networks...
– Solution: Debugging Via Data Mining

Page 19: Attacking Data Intensive Science with Distributed Computing

Why is Storage Hard?

Easy within one cluster:
– Shared filesystem on all nodes.
– But, limited to a few disks provided by admin.
– Even a “macho” file server has limited BW.

Terrible across two or more clusters:
– No shared filesystem on all nodes.
– Too hard to move data back and forth.
– Limited to using storage on head nodes.
– Unable to become root to configure.

Page 20: Attacking Data Intensive Science with Distributed Computing

Conventional Clusters

[Diagram: racks of CPU-only compute nodes, with all storage concentrated on two shared disks.]

Page 21: Attacking Data Intensive Science with Distributed Computing

Tactical Storage Systems (TSS)

A TSS allows any node to serve as a file server or as a file system client.

All components can be deployed without special privileges – but with standard grid security (GSI).

Users can build up complex structures.
– Filesystems, databases, caches, ...
– Admins need not know/care about larger structures.

Takes advantage of two resources:
– Total Storage (200 disks yields 20 TB)
– Total Bandwidth (200 disks at 10 MB/s = 2 GB/s)

Page 22: Attacking Data Intensive Science with Distributed Computing

Tactical Storage System

[Diagram: the same clusters, but every node has both a CPU and a disk, and disks can be grouped into logical volumes. 1 – Uniform access between any nodes in either cluster. 2 – Ability to group together multiple disks for a common purpose.]

Page 23: Attacking Data Intensive Science with Distributed Computing

Tactical Storage Structures

[Diagram: applications reach disk servers through an adapter, secured by grid GSI credentials. Three example structures are shown: a WAN file system (one application, one remote disk server); a replicated file system (several applications reading replicas on several servers: scalable bandwidth for small data); and an expandable file system (several disk servers grouped into a logical volume: scalable capacity and bandwidth for large data).]

Page 24: Attacking Data Intensive Science with Distributed Computing

Applications and Examples

Bioinformatics:
– A WAN filesystem for BLAST on the EGEE grid.

Atmospheric Physics:
– A cluster filesystem for scalable data analysis.

Biometrics:
– Distributed I/O for high-throughput image comparison.

Molecular Dynamics:
– GEMS: Scalable distributed database.

High Energy Physics:
– Global access to software distributions.

Page 25: Attacking Data Intensive Science with Distributed Computing

Simple Wide Area File System

Bioinformatics on the European Grid
– Users want to run BLAST on standard DBs.
– Cannot copy every DB to every node of the grid!

Many databases of biological data in different formats around the world:
– Archives: Swiss-Prot, TrEMBL, NCBI, etc...
– Replicas: Public, Shared, Private, ???

Goal: Refer to data objects by logical name.
– Access the nearest copy of the non-redundant protein database; don’t care where it is.

Credit: Christophe Blanchet, Bioinformatics Center of Lyon, CNRS IBCP, France
http://gbio.ibcp.fr/cblanchet, [email protected]

Page 26: Attacking Data Intensive Science with Distributed Computing

Wide Area File System

[Diagram: "Run BLAST on LFN://ncbi.gov/nr.data." BLAST calls open(LFN://ncbi.gov/nr.data); the adapter asks the EGEE File Location Service "Where is LFN://ncbi.gov/nr.data?" and is answered "Find it at: FTP://ibcp.fr/nr.data." The adapter then calls open(FTP://ibcp.fr/nr.data), issues RETR nr.data, and streams the file back to BLAST. RFIO, gLite, HTTP, and FTP servers holding copies of nr.data are all reachable through the same adapter.]
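The message flow above is essentially a name-resolution step in front of an ordinary open(). A minimal Python sketch of that idea follows; the catalog dictionary and the names locate_replica and open_logical are stand-ins for the EGEE File Location Service and the adapter, not their actual APIs, and only the FTP/HTTP protocols that the standard library can speak are handled here.

    from urllib.parse import urlparse
    from urllib.request import urlopen  # stands in for protocol-specific clients

    # Hypothetical replica catalog: logical file name -> physical replicas.
    # In the real system this lookup is a query to the File Location Service.
    REPLICA_CATALOG = {
        "LFN://ncbi.gov/nr.data": ["FTP://ibcp.fr/nr.data"],
    }

    def locate_replica(lfn):
        """Ask the (hypothetical) location service for a physical copy."""
        replicas = REPLICA_CATALOG.get(lfn)
        if not replicas:
            raise FileNotFoundError(lfn)
        return replicas[0]  # a real adapter would rank replicas by proximity

    def open_logical(lfn):
        """What the adapter does when the application calls open(LFN://...)."""
        pfn = locate_replica(lfn)              # e.g. FTP://ibcp.fr/nr.data
        scheme = urlparse(pfn).scheme          # ftp, http, ...
        return urlopen(pfn.replace(scheme.upper() + "://", scheme + "://", 1))

    # BLAST itself is unmodified; it just sees a readable file handle:
    # handle = open_logical("LFN://ncbi.gov/nr.data")

The point of the design is that the application never changes; only the adapter knows about logical names and replica locations.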

Page 27: Attacking Data Intensive Science with Distributed Computing

Performance of Bio Apps on EGEE

[Plot: runtime (sec), roughly 0 to 450, versus protein database size (up to about 1,200,000 sequences), for six configurations: BLAST+Parrot, FastA+Parrot, SSearch+Parrot, BLAST+copy, FastA+copy, and SSearch+copy.]

Page 28: Attacking Data Intensive Science with Distributed Computing

Expandable Filesystem for Experimental Data

Credit: John Poirier @ Notre Dame Astrophysics Dept.

Project GRAND, http://www.nd.edu/~grand

[Diagram: a buffer disk receives 2 GB/day today (could be lots more) and spools onto daily tapes that accumulate into a 30-year archive; the analysis code can only analyze the most recent data.]

Page 29: Attacking Data Intensive Science with Distributed Computing

Expandable Filesystem for Experimental Data

Credit: John Poirier @ Notre Dame Astrophysics Dept.

Project GRAND, http://www.nd.edu/~grand

[Diagram: the same buffer disk and 30-year tape archive, but the data is also written onto a logical volume built from multiple file servers; the analysis code, running through an adapter, can analyze all data over large time scales.]

Page 30: Attacking Data Intensive Science with Distributed Computing

Scalable I/O for Biometrics

Computer Vision Research Lab in CSE
– Goal: Develop robust algorithms for identifying humans from (non-ideal) images.
– Technique: Collect lots of images. Think up clever new matching function. Compare them.

How do you test a matching function?
– For a set S of images,
– Compute F(Si,Sj) for all Si and Sj in S.
– Compare the result matrix to known functions.

Credit: Patrick Flynn at Notre Dame CSE
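In code, the test is a loop over all pairs of images. A minimal single-machine sketch, assuming a hypothetical match() function that returns a score in [0, 1] (the real matching functions and image formats come from the CVRL, not from this sketch):

    import itertools

    def match(image_a, image_b):
        # Hypothetical stand-in for the matching function F; a real one would
        # compare face or iris features and return a score in [0, 1].
        return 1.0 if image_a == image_b else 0.0

    def similarity_matrix(images):
        """Compute F(Si, Sj) for all Si, Sj in S (F assumed symmetric here)."""
        n = len(images)
        scores = [[0.0] * n for _ in range(n)]
        for i, j in itertools.combinations_with_replacement(range(n), 2):
            scores[i][j] = scores[j][i] = match(images[i], images[j])
        return scores

The matrix on the next slide is one such result; only the upper triangle is shown, which is also why the I/O estimate two slides later carries a factor of 1/2.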

Page 31: Attacking Data Intensive Science with Distributed Computing

Computing Similarities

[Matrix of scores F(Si, Sj) for six images, upper triangle shown:]

    1   .8   .1    0    0   .1
         1    0   .1   .1    0
               1    0   .1   .7
                     1    0    0
                           1   .1
                                 1

Page 32: Attacking Data Intensive Science with Distributed Computing

A Big Data Problem

Data Size: 10k images of 1 MB = 10 GB

Total I/O: 10k * 10k * 2 MB * 1/2 = 100 TB

Would like to repeat many times!

In order to execute such a workload, we must be careful to partition both the I/O and the CPU needs, taking advantage of distributed capacity.
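To unpack the estimate: there are about 10,000 × 10,000 / 2 = 5 × 10^7 unordered pairs, and each comparison reads two 1 MB images (2 MB of input), so a naive run that re-reads both images for every pair moves roughly 5 × 10^7 × 2 MB ≈ 100 TB, even though the underlying data set is only 10 GB.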

Page 33: Attacking Data Intensive Science with Distributed Computing

Conventional Solution

[Diagram: jobs running on every CPU/disk node pull their inputs from two central disks: "Move 200 TB at Runtime!"]

Page 34: Attacking Data Intensive Science with Distributed Computing

Using Tactical Storage

[Diagram: the same CPU/disk nodes, used as follows.]

1. Break array into MB-size chunks.

2. Replicate data to many disks.

3. Jobs find nearby data copy, and make full use before discarding.
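A minimal Python sketch of that placement idea; the chunk size, the round-robin replication policy, and the helper names are illustrative assumptions, not the CCL tools (which do this with real file servers):

    import itertools

    def make_chunks(images, chunk_size):
        """Step 1: break the image array into fixed-size (MB-scale) chunks."""
        return [images[i:i + chunk_size] for i in range(0, len(images), chunk_size)]

    def replication_plan(num_chunks, nodes, copies=3):
        """Step 2: assign each chunk to several disks (simple round-robin plan).
        Returns node -> list of chunk indices; a real system copies the bytes."""
        placement = {node: [] for node in nodes}
        for chunk in range(num_chunks):
            for k in range(copies):
                placement[nodes[(chunk + k) % len(nodes)]].append(chunk)
        return placement

    def job_plan(num_chunks):
        """Step 3: one job per pair of chunks, so each job makes full use of the
        two chunks it holds (many image pairs) before discarding them."""
        return list(itertools.combinations_with_replacement(range(num_chunks), 2))

Dispatching each (i, j) job to a node whose placement already holds chunks i and j is what lets jobs find a nearby copy instead of dragging the full 200 TB across the network.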

Page 35: Attacking Data Intensive Science with Distributed Computing
Page 36: Attacking Data Intensive Science with Distributed Computing
Page 37: Attacking Data Intensive Science with Distributed Computing

Problems and Solutions

“I can’t make storage do what I want!”
– Need root access, configure, reboot, etc...
– Solution: Tactical Storage Systems

“I have no idea why this system is failing!”
– Multiple services, unreliable networks...
– Solution: Debugging Via Data Mining

Page 38: Attacking Data Intensive Science with Distributed Computing

It’s Ugly in the Real World

Machine related failures:
– Power outages, network outages, faulty memory, corrupted file system, bad config files, expired certs, packet filters...

Job related failures:
– Crash on some args, bad executable, missing input files, mistake in args, missing components, failure to understand dependencies...

Incompatibilities between jobs and machines:
– Missing libraries, not enough disk/cpu/mem, wrong software installed, wrong version installed, wrong memory layout...

Load related failures:
– Slow actions induce timeouts; kernel tables: files, sockets, procs; router tables: addresses, routes, connections; competition with other users...

Non-deterministic failures:
– Multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...

Page 39: Attacking Data Intensive Science with Distributed Computing

A “Grand Challenge” Problem:

A user submits one million jobs to the grid.

Half of them fail.

Now what?
– Examine the output of every failed job?
– Login to every site to examine the logs?
– Resubmit and hope for the best?

We need some way of getting the big picture.

Need to identify problems not seen before.

Page 40: Attacking Data Intensive Science with Distributed Computing

An Idea:

We have lots of structured information about the components of a grid. Can we perform some form of data mining to discover the big picture of what is going on?
– User: Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2.
– Admin: Joe is running 1000s of jobs with 10 TB of data that fail immediately; perhaps he needs help.

Can we act on this information?
– User: Avoid resources that aren’t working for you.
– Admin: Assist the user in understanding and fixing the problem.

Page 41: Attacking Data Intensive Science with Distributed Computing

Job ClassAd (excerpt; the slide shows the full dump of roughly one hundred attributes):

    MyType = "Job"
    TargetType = "Machine"
    ClusterId = 11839
    Owner = "dcieslak"
    User = "[email protected]"
    Cmd = "ripper-cost-can-9-50.sh"
    ImageSize = 40000
    DiskUsage = 110000
    TransferInput = "scripts.tar.gz,can-ripper.tar.gz"
    Requirements = (OpSys == "LINUX") && (Arch == "INTEL") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)
    GlobalJobId = "cclbuild02.cse.nd.edu#1150231069#11839.0"
    NumJobMatches = 34
    JobRunCount = 24
    TotalSuspensions = 73
    RemoteWallClockTime = 432493.000000
    RemoteUserCpu = 62319.000000
    LastRemoteHost = "hobbes.helios.nd.edu"
    JobStatus = 2
    ...

Machine ClassAd (excerpt of the full attribute dump shown on the slide):

    MyType = "Machine"
    TargetType = "Job"
    Name = "ccl00.cse.nd.edu"
    MachineGroup = "ccl"
    MachineOwner = "dthain"
    Arch = "INTEL"
    OpSys = "LINUX"
    Cpus = 1
    Memory = 498
    Disk = 19072712
    LoadAvg = 1.130000
    CondorLoadAvg = 1.000000
    KeyboardIdle = 817093
    State = "Claimed"
    Activity = "Busy"
    Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
    Rank = ((Owner == "dthain") || (Owner == "psnowber") || (Owner == "cmoretti") || (Owner == "jhemmes") || (Owner == "gniederw")) * 2 + (PoolName =?= "ccl00.cse.nd.edu") * 1
    RemoteUser = "[email protected]"
    ClientMachine = "cclbuild00.cse.nd.edu"
    ...

User Job Log:

    Job 1 submitted.
    Job 2 submitted.
    Job 1 placed on ccl00.cse.nd.edu
    Job 1 evicted.
    Job 1 placed on smarty.cse.nd.edu.
    Job 1 completed.
    Job 2 placed on dvorak.helios.nd.edu
    Job 2 suspended
    Job 2 resumed
    Job 2 exited normally with status 1.
    ...

Page 42: Attacking Data Intensive Science with Distributed Computing

[Diagram: job ClassAds, machine ClassAds, and the user job log are joined and split by failure criteria (exit != 0, core dump, evicted, suspended, bad output) into a success class and a failure class; the RIPPER rule learner is run over the two classes and emits rules such as “Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2.”]
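A minimal sketch of the two stages in Python. The attribute names, thresholds, and the frequency-based rule finder are illustrative only; the actual experiments joined Condor ClassAds with the job log and ran the RIPPER learner, which this stand-in does not implement.

    from collections import Counter

    def label(record):
        """Label one joined record (job ad + machine ad + log events) using the
        failure criteria from the slide: exit != 0, core dump, evicted,
        suspended, or bad output."""
        failed = (record.get("exit_code", 0) != 0 or record.get("core_dump")
                  or record.get("evicted") or record.get("suspended")
                  or record.get("bad_output"))
        return "failure" if failed else "success"

    def suspicious_values(records, attributes, min_support=20, min_failure_rate=0.8):
        """Crude stand-in for a rule learner: report attribute values whose
        failure rate is high, e.g. ("OpSysVersion", "12.2", 0.97)."""
        totals, failures = Counter(), Counter()
        for record in records:
            outcome = label(record)
            for attr in attributes:
                key = (attr, record.get(attr))
                totals[key] += 1
                if outcome == "failure":
                    failures[key] += 1
        return [(attr, value, failures[(attr, value)] / totals[(attr, value)])
                for (attr, value) in totals
                if totals[(attr, value)] >= min_support
                and failures[(attr, value)] / totals[(attr, value)] >= min_failure_rate]

Fed with ClassAd attributes such as OpSys, Memory, and Disk, even this crude counter can surface correlations like those on the next slide; the real pipeline produces human-readable rules from the same labeled data.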

Page 43: Attacking Data Intensive Science with Distributed Computing

Unexpected Discoveries

Purdue TeraGrid (91343 jobs on 2523 CPUs)
– Jobs fail on machines with (Memory > 1920 MB).
– Diagnosis: Linux machines with > 3 GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic.

UND & UW (4005 jobs on 1460 CPUs)
– Jobs fail on machines with less than 4 MB disk.
– Diagnosis: Condor failed in an unusual way when the job transfers input files that don’t fit.

Page 44: Attacking Data Intensive Science with Distributed Computing

Many Open Problems

Strengths and Weaknesses of Approach
– Correlation != Causation -> could be enough?
– Limits of reported data -> increase resolution?
– Not enough data points -> direct job placement?

Acting on Information
– Steering by the end user.
– Applying learned rules back to the system.
– Evaluating (and sometimes abandoning) changes.

Data Mining Research
– Continuous intake + incremental construction.
– Creating results that non-specialists can understand.

Next Step: Monitor 21,000 CPUs on the OSG!

Page 45: Attacking Data Intensive Science with Distributed Computing

Problems and Solutions

“I can’t make storage do what I want!”
– Need root access, configure, reboot, etc...
– Solution: Tactical Storage Systems

“I have no idea why this system is failing!”
– Multiple services, unreliable networks...
– Solution: Debugging Via Data Mining

Page 46: Attacking Data Intensive Science with Distributed Computing

Outline

Large Scale Distributed Computing
– Plentiful Computing Resources World-Wide
– Challenges: Data and Debugging

The Cooperative Computing Lab
– Distributed Data Management
– Applications to Scientific Computing
– Debugging Complex Systems

Open Problems in Distributed Computing
– Proposal: The All-Pairs Abstraction

Page 47: Attacking Data Intensive Science with Distributed Computing

Some Ruminations

These tools attack technical problems.

But, users still have to be clever:
– Where should my jobs run?
– How should I partition data?
– How long should I run before a checkpoint?

Can we provide an interface such that:
– Scientific users state what to compute.
– The system decides where, when, and how.

Previous attempts didn’t incorporate data.

Page 48: Attacking Data Intensive Science with Distributed Computing

The All-Pairs Abstraction

All-Pairs:
– For a set S and a function F:
– Compute F(Si,Sj) for all Si and Sj in S.

The end user provides:
– Set S: A bunch of files.
– Function F: A self-contained program.

The computing system determines:
– Optimal decomposition in time and space.
– What resources to employ. (F easy to distr.)
– What to do when failures occur.

(A sketch of this division of labor follows.)
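A minimal sketch of the interface this implies, assuming F is packaged as an executable that takes two file names and prints a score; the function name all_pairs and the sequential loop are illustrative only, not the proposed facility itself, which would choose the partitioning, the resources, and the failure handling:

    import itertools
    import subprocess

    def all_pairs(files, function_exe):
        """Compute F(Si, Sj) for all Si and Sj in S.

        `files` is the set S (a bunch of files); `function_exe` is F, a
        self-contained program invoked as: function_exe fileA fileB.
        This sequential version only demonstrates the semantics."""
        n = len(files)
        result = [[None] * n for _ in range(n)]
        for i, j in itertools.product(range(n), repeat=2):
            completed = subprocess.run([function_exe, files[i], files[j]],
                                       capture_output=True, text=True, check=True)
            result[i][j] = float(completed.stdout.strip())
        return result

    # Example (hypothetical comparison program):
    # matrix = all_pairs(["a.img", "b.img", "c.img"], "./compare")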

Page 49: Attacking Data Intensive Science with Distributed Computing

An All-Pairs Facility at Notre Dame

[Diagram: an All-Pairs web portal in front of 100s-1000s of machines, each with a CPU and disk. 1 – User uploads S and F into the system. 2 – Backend decides where to run, how to partition, when to retry failures... 3 – Return result matrix to user.]

Page 50: Attacking Data Intensive Science with Distributed Computing

Our Mode of Research

Find researchers with systems problems.

Solve them by developing new tools.

Generalize the solutions to new domains.

Publish papers and software!

Page 51: Attacking Data Intensive Science with Distributed Computing

Acknowledgments

Science Collaborators:
– Christophe Blanchet
– Sander Klous
– Peter Kunzst
– Erwin Laure
– John Poirier
– Igor Sfiligoi
– Francesco Delli Paoli

CSE Students:
– Paul Brenner
– Tim Faltemier
– James Fitzgerald
– Jeff Hemmes
– Chris Moretti
– Gerhard Niederwieser
– Phil Snowberger
– Justin Wozniak

CSE Faculty:
– Jesus Izaguirre
– Aaron Striegel
– Patrick Flynn
– Nitesh Chawla

Page 52: Attacking Data Intensive Science with Distributed Computing

For more information...

Cooperative Computing Lab
http://www.cse.nd.edu/~ccl

Cooperative Computing Tools
http://www.cctools.org

Douglas Thain
– [email protected]
– http://www.cse.nd.edu/~dthain