
Timm Morten Steinbeck, Computer Science/Computer Engineering Group

Kirchhoff Institute f. Physics, Ruprecht-Karls-University Heidelberg

A Framework for Building Distributed Data Flow Chains in Clusters

Requirements

ALICE: a relativistic heavy-ion physics experiment

Very large multiplicity: 15,000 particles/event

Full event size > 70 MB

Last trigger stage (High Level Trigger, HLT) is the first stage with complete event data

Data rate into the HLT: up to 25 GB/s

High Level Trigger

The HLT has to process a large volume of data: it performs reconstruction of particle tracks from raw ADC data.

[Figure: raw ADC values (1, 2, 123, 255, 100, 30, 5, 1, 4, 3, 2, 3, 4, 5, 3, 4, 60, 130, 30, 5, ...) reconstructed into particle tracks]

High Level Trigger

HLT consists of a Linux PC farm (500-1000 nodes) with a fast network

Nodes are arranged hierarchically:

The first stage reads data from the detector and performs first-level processing

Each stage sends its produced data to the next stage

Each further stage performs processing or merging on its input data

High Level Trigger

The HLT needs a software framework to transport data through the PC farm.

Requirements for the framework:

Efficiency:

The framework should not use too many CPU cycles (which are needed for data analysis)

The framework should transport the data as fast as possible

Flexibility:

The framework should consist of components which can be plugged together in different configurations

The framework should allow reconfiguration at runtime

The Data Flow Framework

Components are single processes

Communication based on publisher-subscriber principle:

One publisher can serve multiple subscribers

[Diagram: each Subscriber sends a Subscribe request to the Publisher; the Publisher then sends New Data notifications to all of its Subscribers]
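To make the principle concrete, here is a minimal C++ sketch of a publisher-subscriber interface. All names (Publisher, Subscriber, EventDescriptor) are illustrative assumptions, not the framework's actual API.

    #include <cstdint>
    #include <vector>

    // Hypothetical descriptor announcing one event to subscribers.
    struct EventDescriptor {
        uint64_t eventID;
        // ... locations of the event's data blocks would go here ...
    };

    // Illustrative subscriber interface: called when new data is announced.
    class Subscriber {
    public:
        virtual ~Subscriber() {}
        virtual void NewData(const EventDescriptor& event) = 0;
    };

    // Illustrative publisher: serves any number of subscribers.
    class Publisher {
    public:
        void Subscribe(Subscriber* s) { fSubscribers.push_back(s); }
        void AnnounceEvent(const EventDescriptor& event) {
            // One publisher notifies all of its subscribers.
            for (Subscriber* s : fSubscribers)
                s->NewData(event);
        }
    private:
        std::vector<Subscriber*> fSubscribers;
    };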

Framework Architecture

Framework consists of mutually dependent packages:

Utility Classes

Communication Classes

Publisher & Subscriber Base Classes

Data Source, Sink, & Processing Components

Data Source, Sink, & Processing Templates

Data Flow Components

Utility Classes

General helper classes (e.g. timer, thread)

Rewritten classes:

Thread-safe string class

Faster vector class

Communication Classes

Two abstract base classes:

For small message transfers

For large data blocks

Derived classes for each network technology

Currently existing: message and block classes for TCP and SCI (Scalable Coherent Interface)

[Class hierarchy: Message Base Class → TCP Message Class, SCI Message Class; Block Base Class → TCP Block Class, SCI Block Class]
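A minimal C++ sketch of this hierarchy, purely illustrative (class and method names are assumptions, not the framework's real declarations):

    // Abstract base class for small message transfers.
    class MessageCom {
    public:
        virtual ~MessageCom() {}
        virtual int Connect() = 0;                          // explicit connection setup
        virtual int Send(const void* msg, unsigned len) = 0;
        virtual int Disconnect() = 0;
    };

    // Abstract base class for large data block transfers.
    class BlockCom {
    public:
        virtual ~BlockCom() {}
        virtual int Transfer(const void* block, unsigned len) = 0;
    };

    // One derived pair per network technology, e.g.:
    class TCPMessageCom : public MessageCom { /* would override using TCP sockets */ };
    class SCIMessageCom : public MessageCom { /* would override using SCI */ };
    class TCPBlockCom   : public BlockCom   { /* would override using TCP sockets */ };
    class SCIBlockCom   : public BlockCom   { /* would override using SCI */ };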

Communication Classes

Implementations foreseen for the ATOLL network (University of Mannheim) and the Scheduled Transfer Protocol

API partly modelled after the socket API (Bind, Connect)

Both explicit and implicit connections are possible:

Explicit connection: user calls Connect (system: connect), Send for each transfer (system: transfer data), and finally Disconnect (system: disconnect)

Implicit connection: user only calls Send; for each Send the system performs connect, transfer data, disconnect
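An illustrative usage sketch of the two connection modes, continuing the hypothetical MessageCom interface sketched above (not the framework's real API):

    // Explicit connection: the user manages the connection lifetime.
    void SendExplicit(MessageCom& com, const char* msg, unsigned len) {
        com.Connect();          // system: connect
        com.Send(msg, len);     // system: transfer data
        com.Send(msg, len);     // further sends reuse the open connection
        com.Disconnect();       // system: disconnect
    }

    // Implicit connection: conceptually, each Send performs its own
    // connect, transfer data, disconnect inside the derived class.
    void SendImplicit(MessageCom& com, const char* msg, unsigned len) {
        com.Send(msg, len);
        com.Send(msg, len);
    }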

Publisher Subscriber Classes

Implement publisher-subscriber interface

Abstract interface for communication mechanism between processes/modules

Currently, named pipes or shared memory are available

Multi-threaded implementation


Publisher Subscriber Classes

Efficiency: event data is not sent to the subscriber components directly

The publisher process places the data into shared memory (ideally already during production)

Descriptors holding the location of the data in shared memory are sent to the subscribers

Requires buffer management in the publisher

[Diagram: the Publisher manages Event M's data blocks (Data Block 0 ... Data Block n) in shared memory and sends the Subscriber a New Event message with the Event M descriptor, which references Block 0 ... Block n in the shared memory]
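A minimal sketch of what such a descriptor could look like in C++, refining the EventDescriptor placeholder from the earlier publisher-subscriber sketch; the field names and fixed-size layout are assumptions for illustration only.

    #include <cstdint>

    // Hypothetical reference to one data block inside a shared memory segment.
    struct BlockReference {
        uint32_t shmKey;      // identifies the shared memory segment
        uint32_t offset;      // byte offset of the block inside the segment
        uint32_t size;        // block size in bytes
        uint32_t dataType;    // what kind of data the block holds
    };

    // Hypothetical event descriptor sent from publisher to subscribers.
    // Only this small structure crosses the process boundary; the event
    // data itself stays in shared memory.
    struct EventDataDescriptor {
        uint64_t       eventID;
        uint32_t       blockCount;
        BlockReference blocks[1];   // followed by blockCount-1 further entries
    };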

Data Flow Components

Several components to shape the data flow in a chain:

EventMerger: merges data streams belonging to one event
[Diagram: Event N, Part 1 / Part 2 / Part 3 → EventMerger → Event N]

EventScatterer / EventGatherer: split and rejoin a data stream, e.g. for load balancing (see the sketch below)
[Diagram: Events 1, 2, 3 → EventScatterer → three parallel Processing components → EventGatherer → Events 1, 2, 3]
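As a hedged illustration of the scattering idea (not the framework's actual EventScatterer logic), a simple round-robin distribution could look like this:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical round-robin event scatterer: each incoming event is
    // assigned to one of N output paths; the matching gatherer rejoins the
    // streams using the event IDs.
    class RoundRobinScatterer {
    public:
        explicit RoundRobinScatterer(std::size_t outputCount)
            : fOutputCount(outputCount), fNext(0) {}

        // Returns the index of the output path that should receive this event.
        std::size_t ChooseOutput(uint64_t /*eventID*/) {
            std::size_t out = fNext;
            fNext = (fNext + 1) % fOutputCount;
            return out;
        }
    private:
        std::size_t fOutputCount;
        std::size_t fNext;
    };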

Data Flow Components

Several components to shape the data flow in a chain:

Bridge: transparently transports data over the network to other computers

The SubscriberBridgeHead uses a subscriber class for the incoming data; the PublisherBridgeHead uses a publisher class to announce the data on the remote node

[Diagram: Node 1: SubscriberBridgeHead → Network → Node 2: PublisherBridgeHead]

Component Template

Templates for user components are provided:

Data Source Template: reads out data from a source and inserts it into a chain

Analysis Template: accepts data from the chain, processes it, and reinserts the result

Data Sink Template: accepts data from the chain and processes it, e.g. stores it

[Diagram of the templates' internal stages.
Data Source Template: data source addressing & handling, buffer management, data announcing.
Analysis Template: accepting input data, input data addressing, data analysis & processing, output buffer management, output data announcing.
Data Sink Template: accepting input data, input data addressing, data sink addressing & writing.]
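To illustrate how a user might fill in such a template, here is a hedged C++ sketch; the base class name and its virtual method are assumptions, not the framework's actual template API.

    #include <cstring>

    // Hypothetical analysis template: the framework part handles accepting
    // input data, input data addressing, output buffer management, and
    // output data announcing; the user only fills in the processing step.
    class AnalysisComponentTemplate {
    public:
        virtual ~AnalysisComponentTemplate() {}
    protected:
        // Called by the template for each input event; the user writes the
        // processed result into the output buffer and returns the number of
        // bytes produced.
        virtual unsigned ProcessEvent(const void* input, unsigned inputSize,
                                      void* output, unsigned outputCapacity) = 0;
    };

    // Example user component: copies the input unchanged (a trivial "analysis").
    class PassThroughAnalysis : public AnalysisComponentTemplate {
    protected:
        unsigned ProcessEvent(const void* input, unsigned inputSize,
                              void* output, unsigned outputCapacity) override {
            unsigned n = inputSize < outputCapacity ? inputSize : outputCapacity;
            std::memcpy(output, input, n);
            return n;
        }
    };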

Benchmarks

Performance tests of the framework:

Dual Pentium III, 733 MHz - 800 MHz

Tyan Thunder 2500 or Thunder HEsl motherboard, ServerWorks HE/HEsl chipset

512 MB RAM, > 20 GB disk space (system, swap, tmp)

Fast Ethernet, switch with 21 Gb/s backplane

SuSE Linux 7.2 with 2.4.16 kernel

Benchmarks

Benchmark of the publisher-subscriber interface:

Publisher process with 16 MB output buffer, event size 128 B

The publisher does buffer management and copies the data into the buffer; the subscriber just replies to each event

Maximum performance: more than 12.5 kHz

Benchmarks

Benchmark of the TCP message class:

Client sending messages to a server on another PC, TCP over Fast Ethernet, message size 32 B

Maximum message rate: more than 45 kHz

Benchmarks

Publisher-Subscriber Network Benchmark:

Publisher on node A, subscriber on node B, connected via a Bridge, TCP over Fast Ethernet

31 MB buffer in the publisher and in the receiving bridge

Message size from 128 B to 1 MB

[Diagram: Node A: Publisher → SubscriberBridgeHead → Network → Node B: PublisherBridgeHead → Subscriber]

Benchmarks

Publisher-Subscriber Network Benchmark

[Plot] Notes: dual-CPU nodes, 100% corresponds to 2 CPUs

Theoretical rate: 100 Mb/s divided by the event size
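As an illustrative calculation of this formula (not from the slides): 100 Mb/s corresponds to 12.5 * 10^6 B/s, so for 1 kB events the theoretical rate is 12.5 * 10^6 B/s / 1024 B ≈ 12.2 kHz, and for 1 MB events roughly 12 Hz.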

Benchmarks

CPU load and event rate decrease with larger blocks

Receiver is more loaded than sender

Minimum CPU load: 20% sender, 30% receiver

Maximum CPU load (at 128 B events): 90% receiver, 80% sender

Management of more than 250,000 events in the system!

Benchmarks

Publisher-Subscriber Network Benchmark

Benchmarks

From 32 kB event size on, the network bandwidth becomes the limit: more than 10 MB/s

At maximum event size: 12.3 * 10^6 B/s out of the theoretical 12.5 * 10^6 B/s

"Real-World" Test

13-node test setup

Simulation of read-out and processing of 1/36 (one slice) of the ALICE Time Projection Chamber (TPC)

Simulated piled-up (overlapped) proton-proton events

Target processing rate: 200 Hz (maximum read-out rate of the TPC)

"Real-World" Test

[Diagram: 13-node setup processing the 6 patches (patches 1-6) of one TPC slice.
Legend: FP = File Publisher, AU = ADC Unpacker, CF = Cluster Finder, ES = Event Scatterer, EG = Event Gatherer, T = Tracker (2x), EM = Event Merger, PM = Patch Merger, SM = Slice Merger.]

"Real-World" Test

[Diagram detail: the FP → AU → CF part of the chain. The ADC Unpacker expands the run-length-encoded, zero-suppressed ADC data:

1, 2, 123, 255, 100, 30, 5, 4*0, 1, 4, 3*0, 4, 1, 2, 60, 130, 30, 5, ...

becomes

1, 2, 123, 255, 100, 30, 5, 0, 0, 0, 0, 1, 4, 0, 0, 0, 4, 1, 2, 60, 130, 30, 5, ...]

"Real-Word" Test

First line of nodes: unpacks zero-suppressed, run-length-encoded ADC values and calculates 3D space-point coordinates of the charge depositions (see the sketch below)

Second line of nodes: finds curved particle track segments going through the charge space-points

Third line (a single node): connects track segments to form complete tracks (track merging)
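A hedged C++ sketch of the unpacking step, assuming a hypothetical encoding in which a zero marker is followed by a repeat count (the real TPC data format differs in detail):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Expand run-length-encoded, zero-suppressed ADC values.
    // Hypothetical format: a 0 value is followed by a count giving the
    // length of the zero run; non-zero ADC values are stored literally.
    std::vector<uint8_t> UnpackADC(const std::vector<uint8_t>& packed) {
        std::vector<uint8_t> unpacked;
        for (std::size_t i = 0; i < packed.size(); ++i) {
            if (packed[i] == 0 && i + 1 < packed.size()) {
                uint8_t runLength = packed[i + 1];
                for (uint8_t k = 0; k < runLength; ++k)
                    unpacked.push_back(0);   // expand the zero run
                ++i;                         // skip the count byte
            } else {
                unpacked.push_back(packed[i]);   // literal ADC value
            }
        }
        return unpacked;
    }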


Test Results

[Diagram: test setup annotated with the measured results]

Rate: 270 Hz

CPU loads per node: 2 * 100%, 2 * 75% - 100%, and 2 * 60% - 70% (across the stages of the chain)

Network: 6 * 1.8 MB/s and 6 * 130 kB/s of event data

Conclusion & Outlook

The framework allows flexible creation of data flow chains while still maintaining efficiency

No dependencies are created during compilation

Applicable to a wide range of tasks

Performance is already good enough for many applications using TCP on Fast Ethernet

Future work: use the dynamic configuration ability for fault tolerance purposes; further performance improvements

More information: http://www.ti.uni-hd.de/HLT