
Advanced parallel implementation of real-time data processing algorithms for microwave reflectometry applied to plasma position feedback control purposes

Gonçalo Maria Paes de Vasconcellos Silva e Santos
[email protected]

Instituto Superior Técnico, Lisboa, Portugal

October 2015

Abstract

Research in the field of plasma diagnostics for control purposes in nuclear fusion experiments has become increasingly important. In particular, the plasma's control inside the chamber of reactor-grade experimental devices such as the ITER international tokamak will be one of the most critical issues when operating these power-generating devices. At the ASDEX Upgrade tokamak, an ITER-relevant experimental device, a plasma position control technique based on data from a microwave reflectometer has been demonstrated as an alternative to the usual feedback of magnetic field measurements. The reflectometry diagnostic used in this demonstration, a system capable of producing measurements of the plasma density profile in Real-time (RT), was recently upgraded to improve its performance and operating density range and to satisfy the requirements of the new synchronization and data-exchange software infrastructure used between all the RT measurement diagnostics and the experiment's discharge control system (DCS). Part of the upgrade consisted of replacing the ageing data acquisition and RT data processing server, changing from the original UMA to a NUMA architecture. Hence, the diagnostic software infrastructure needed to be changed or adapted in order to take full advantage of the gains achievable with such an architecture while satisfying all the control application requirements. This thesis addresses all the introduced changes and their impact on the system's performance, with special focus on the utilization of advanced parallelism approaches as well as advanced core isolation and memory management techniques. The results obtained with extensive system benchmarking prove the validity of the adopted implementation.

Keywords: NUMA architectures, Concurrent programming, Parallel processing, Microwave reflectometry, Real-time plasma position control, ASDEX Upgrade

1. Introduction

During the last few years, the use of elaborate plasma measurements for control purposes in nuclear fusion reactors has become a very important research topic. The position control of the hot plasma column inside the nuclear reactor's fusion chamber is one of the most critical issues in the operation of the next fusion power generating devices.

To investigate alternatives to the control of the plasma, a radar technique for determining the radial profile of the plasma density is being researched at the ASDEX Upgrade tokamak. This technique, based on the Real-time (RT) reflectometry system, is being upgraded to improve its performance and operating density range, as well as to satisfy the requirements of the new software infrastructure used to synchronize and exchange data between all the RT measurement diagnostics and the tokamak's discharge control system (DCS) [17]

(responsible for the feedback control of the experimental reactor).

The data processing software running on the diagnostic's server (also referred to as the Level-1 software) uses the raw data supplied by the data acquisition software (also referred to as the Level-0 software) to calculate the plasma's electron density profile.

This calculation is performed for both the outer Low Field Side (LFS) and the inner High Field Side (HFS) of the plasma. HFS and LFS reflectometry data originate from the antennas near the inner and outer walls of the toroidal reactor, respectively [12].

The corresponding density profile, when also supplied with the plasma's average density, allows for an estimation of the separatrix absolute position [12]. The position of the separatrix, the last closed magnetic flux surface, is used in the feedback loops of the plasma position controllers.


At ASDEX Upgrade the plasma position feedback loop is operated on a fast ≈ 1 ms cycle [14]. Previously, the AUG PPR single DAQ system only acquired the data needed to produce one RT profile separatrix estimation per 1 ms cycle.

The objective of the upgrade is to be capable of a reflectometry profile acquisition rate four times higher than before (data corresponding to 1 profile acquired every 250 µs for analysis) whilst retaining the same RT measurement rate (1 RT profile/plasma-wall gap estimate per 1 ms) [13], and the same 25 µs microwave sweep rate. The number of channels of the original data acquisition system was doubled, making it capable of a ≈ 2.6 GB/s aggregate bandwidth [13].

With that in mind, the hardware upgrades to the diagnostic were not only of a quantitative (faster electronics) but also of a qualitative nature. The diagnostic's server (DS) now hosts twice the number of input channels and moved from a Uniform Memory Access (UMA) system to a Non-Uniform Memory Access (NUMA) system [5] [10]. Consequently, the software running on the DS needed to be improved to take advantage of those hardware upgrades.

My work focused on utilizing advanced parallelism approaches on the upgraded NUMA system, as well as advanced core isolation and memory management techniques, with the objective of performing the necessary changes and fine-tuning of the original data processing software.

2. Original Diagnostic

The original diagnostic's UMA server, shown in Figure 1, was built around a Tyan S5375AG2NR motherboard [14], populated with two quad-core 3.0 GHz Xeon X5450 processors, with two 6 MB L2 banks shared between pairs of cores inside each CPU, and two channels of DDR2 RAM clocked at 667 MHz. It was based on the "Harpertown" architecture, where CPUs interface with the rest of the components through the system's Northbridge, which handles communications among the CPUs, RAM and PCIe devices, and through the Southbridge, which handles all of the computer's I/O functions as well as some PCI devices. The Northbridge of the system was the Intel 5100 Memory Controller Hub (MCH), also referred to as the Integrated Memory Controller Hub (IMCH). The MCH allows only sequential access to memory and has a bandwidth of at most 10.6 GB/s, shared between the CPUs.

It contained one custom-built, COTS-based, 8-channel data-acquisition board, from which both the HFS and LFS raw data originated, represented as DAS in Figure 1. It interfaced with the Northbridge through the PCIe x8 interface with a maximum bandwidth of 2 GB/s (a 1.3 GB/s sustained data bandwidth was measured using the acquisition board).

The Southbridge of the system, the I/O Controller Hub ICH9R chip, controlled the communication of the CPUs with the Ethernet interface and with the VGA port. It was controlled by the Northbridge through the ESI x4 interface, with a maximum bandwidth of 1 GB/s.

[Figure 1 diagram: two Xeon X5450 CPUs at 3.0 GHz on 1066 MT/s FSBs connect to the Intel 5100 MCH, which hosts dual-channel DDR2-667 (10.6 GB/s shared), the 8-channel DAQ board (HFS/LFS K, Ka, Q, V) on PCIe x8 (2 GB/s), the UTDC trigger/timestamping board on PCIe x1 (250 MB/s) fed by the central timer, and the ICH9R Southbridge via ESI x4 (1 GB/s), which serves the 1 GigE and VGA interfaces.]

Figure 1: Diagnostic's Server Original Hardware [13]

The original Level-1 software had only one application, named RCRorg, which was responsible for producing the plasma's density profiles and separatrix position estimation.

RCRorg interfaced with the Level-0 application RTR through shared memory blocks. These blocks were used both for raw data communication and for synchronization between applications, and were allocated at run time, for every experiment, by whichever of RTR or RCRorg started first.

RTR retrieved the raw data bursts from the DAQ board and subsequently placed them in shared memory for Level-1 processing. RCRorg got the data from the shared memory block, calculated the plasma's density profiles and performed the separatrix position estimation.

In the initialization stage, RCRorg read the configuration files and allocated or connected to the shared memories. Subsequently, it entered the data-processing loop, processing a raw data burst in every iteration. RCRorg terminated a plasma discharge by storing the density profiles and the separatrix estimations in a database, as shown in Algorithm 1.

2.1. Data Processing

As shown in Figure 3, for every raw-data burst, calculating each (HFS or LFS) density profile and estimating the separatrix positions required five distinct sequential steps [12], all applied to an input of 12 × 1 kB raw signal samples (see Figures 3 and 2):

• Data Linearisation.


Algorithm 1: High-level pseudo-code of RCRorg's Online Mode

Data: number of bursts
 1  Read Configuration Files;
 2  Allocate/Connect to Shared Memory Blocks;
 3  Allocate Memory Structures;
 4  while burst number ≠ number of bursts do
 5      Poll on Raw Data Shared Integer Counter
        (signals availability of new raw data);
 6      Process Burst;
 7      Increment burst number;
 8      Increment Processed Data Shared Integer Counter;
 9  Write results in database;
10  Free Memory;
11  Return;

• Data Filtering with a FIR filter.

• Fast Fourier Transforms - 24 × 4 = 96 1 kB FP FFTs.

• Neural Network - 24 × 8 × 10 in size.

• Kalman Filter.

Figure 2: Density Profile Calculation of Shot #27214 ([15])

The plasma's density profile was calculated after performing the first four steps. Combining the plasma's density profile with the plasma's average density information provided by the DCS yielded R′aus and R′in, the estimated separatrix positions. A Kalman Filter was also implemented in order to increase the reliability of the estimations [12].

3. Diagnostic's Server Upgrade

The upgraded diagnostic's server is built around a Supermicro MBD-X9DRH-iTF motherboard [13], featuring two NUMA nodes. Each NUMA node

[Figure 3 diagram: RCRorg reads the HFS and LFS raw data from shared memory and runs the parallel zone (Linearization, Filtering, FFT, Neural Network) to produce the density profiles; these are combined with the average density received from the DCS over Ethernet to estimate Raus and Rin, which are then refined by the Kalman Filter.]

Figure 3: Data Burst processing in RCRorg

[Figure 4 diagram: two Xeon E5-2670 CPUs at 2.60 GHz, each with quad-channel DDR3-1600 (> 80-100 GB/s), linked by two QPI links at 8 GT/s (16 GB/s each, 32 GB/s aggregate inter-CPU bandwidth). Each CPU hosts an 8-channel DAS board on PCIe x8 (2 GB/s), carrying the LFS K, Ka, Q, V and LFS W, Qx, Vx plus HFS K, Ka, Q, V channels and markers/freq_cal; the UTDC cPCIe crate (PCIe x1, 250 MB/s) is fed by the central timer; an Intel X540 dual 10 GigE adapter sits on PCIe 3.0 x8 (8 GB/s), and the PCH C602 connects via DMI x4 2.0 (2 GB/s), serving PCI and VGA.]

Figure 4: Upgraded Data Processing System's Hardware [13]

consists of a 2.60 GHz Intel Xeon E5-2670 directly connected to 16 GB of registered DDR3 RAM clocked at 1600 MHz. The NUMA node interconnection consists of two 8 GT/s QPI links.

This server, based on the "Ivy Bridge" architecture, eliminated the main bottleneck of the original "Harpertown" (Intel Xeon X5450) based diagnostic's server, as Ivy Bridge CPUs, contrary to the Harpertown CPUs, interface directly with their own segregated PCIe bus/devices and possess internal memory management units, interfacing directly with their own memory banks. This arrangement allows for direct DMA transfers between each of the acquisition boards, either HFS or LFS, and the memory nodes in which the data will be processed by the corresponding CPU/attributed cores [13]. All hardware interconnections are now much faster and have higher bandwidth than the maximum effective bandwidth of the data acquisition DMA transfer. The system memory interface reaches 80-100 GB/s, ten times faster than the peak memory bandwidth of the original diagnostic's server, which was shared between the CPUs through the Northbridge. Furthermore, at the end of the processing chain, when results from both sides (HFS and LFS)


are merged and sent to the DCS by a single application, access to the non-local memory node will be made via dual QPI links that have an aggregated unidirectional bandwidth of 32 GB/s. As far as data transfer is concerned, no real/practical bottlenecks exist at the hardware architecture level.

Additionally, cache usage improved in the new CPU architecture. In place of the two 6 MB L2 banks shared between each pair of cores inside each CPU, all cores now feature an individual 256 kB L2 cache and share a single 20 MB L3 cache. This arrangement contributes greatly to a more symmetrical and scalable performance of the multithreaded code handling the real-time data processing tasks. Ivy Bridge CPUs also feature AVX 256-bit SIMD instructions that, together with their more optimized architecture implementation, compensate for the lower clock speed of the chosen CPU configuration (E5-2670 at 2.6 GHz vs. X5450 at 3.0 GHz). Figure 4 shows a diagram of the new RT data processing system hardware, built using a Supermicro barebone server case and incorporating the custom-built, COTS-based DAQ system.

3.1. Diagnostic’s Server Software Upgrade

The operating system of the upgraded diagnostic's server was updated to version 13.1 of openSUSE, the at-the-time newest and most stable version of the openSUSE Linux distribution [6]. The kernel of this operating system was also changed from the vanilla version to the 3.12 kernel with enabled support for NUMA architectures [8].

The operating system was patched to provide hard RT capabilities [16] [7].

4. Upgrade Approach

[Figure 5 diagram: in the original diagnostic, the Level-1 RCR application performed both the DCS-independent profile calculation and the DCS-dependent separatrix estimation; in the upgraded diagnostic, Level-1 is split into RTL (profile calculation, independent from the DCS) and the new RCR (separatrix estimation, DCS-dependent).]

Figure 5: Splitting RCR

Separating the HFS and LFS into segregated parallel data flows was crucial to achieving the target of the upgrade process.

The data processing software was modified to implement a pipelined design paradigm that allows easy and simple modifications to the pipeline stages without the need for global modifications.

The data acquisition software was already separated and is currently able to have fully separated parallel data flows.

4.1. Isolating the Plasma Density Profile Calculation

As the plasma's density profile calculation can be performed with total independence from the DCS, the software code that performed that calculation could be separated from the rest. On the other hand, the estimation of the separatrix position requires external RT information provided by the DCS. Therefore, RCRorg was split task-wise into two independent applications, as shown in Figure 5. A new application was developed, named RTL, that is completely independent from the new RCR.

Three different pipeline levels were implemented,one within the data acquisition stage and two withinthe data processing stage:

• Data Acquisition (RTR) - Control of the data acquiring phase and raw data gathering.

• Data Processing (RTL) - Calculation of the plasma's density profiles in RT.

• Data Processing (RCR) - Real-time estimation of the separatrix position for plasma position/shape feedback control, using the plasma's average density provided by the DCS.

RTL performs the bulk of the computations, producing the density profiles. All the data parallelism of RCRorg was ported into RTL. Hence, RTL became the most computationally heavy application in the DS, making its optimization the most crucial step in the upgrade process.

On the other hand, RCR became a much lighter application. Within every iteration of its data-processing loop, it receives the density profiles from each of the data flows, estimates the separatrix positions using the plasma's average density information (provided by the DCS) and improves these estimations with a Kalman Filter. Subsequently, it feeds these estimations to the DCS. These steps cannot be performed in parallel and therefore no OpenMP parallel zone is employed within RCR.

4.2. Data Processing Flow Separation

Separating the different data flows (one for HFS and the other for LFS) within the data processing stage meant splitting RTL into two separate parallel entities, one for processing each board's raw data flow (RTL1 and RTL2). This mimics what was already implemented in the data acquisition software with the separation of RTR into RTR1 and RTR2. This separation allowed the data flows to be completely separated up to the RCR stage. The latter remained a single application that then merges the data flows, as only one application can communicate with the DCS.

Splitting RTL into RTL1 and RTL2 presented four independent challenges:


[Figure 6 diagram: RTL reads the raw data from shared memory and runs the parallel zone (Linearization, Filtering, FFT, Neural Network) to produce the HFS and LFS density profiles; RCR reads the profiles from shared memory, combines them with the average density received from the DCS over Ethernet to estimate Raus and Rin, and applies the Kalman Filter.]

Figure 6: Data burst processing with RTL

[Figure 7 diagram: the DS pipeline with separated data flows, from the L0 cpuset through the Level-1 applications to the ASDEX Upgrade DCS, which receives the separatrix position estimate.]

Figure 7: DS' Pipeline with data flow separation [13]

1. Implementing core isolation policies through the cpuset file system [9].

2. Ensuring correct process and thread affinity [3].

3. Optimizing NUMA memory management through the use of the libnuma API [4].

4. Choosing the best concurrency approach for RTL.

[Figure 8 diagram: core-to-cpuset mapping on the two NUMA nodes. NUMA node 1 (core IDs 0-7) and NUMA node 2 (core IDs 8-15) each host a cpuset for one RTL instance (cpuset RTL1 and cpuset RTL2), whose four RTL OpenMP threads (Thread 0-3) are pinned to dedicated relative cores, alongside the RTR, RCR and System cpusets.]

Figure 8: CPUSET and Thread Affinity in the upgraded diagnostic's server

The multiple-process concurrency approach chosen consisted of having two separate processes to calculate the two density profiles, one per data flow. These processes could either have been completely separate from the beginning, which required running two different applications when starting RTL, or one process could have been created through a fork of the original process. The former was simpler to implement, conceptually easier to understand and more scalable than the latter, as new data flows would only require a third RTL application to be executed. Accordingly, RTL1 and RTL2 were implemented as separate single-process applications.

Each of these new applications processes only one of the data flows. Thus, this approach required executing the single-process application twice (once for each data flow). The instances are completely independent except for two synchronization points: one when initializing the applications and one when terminating.

4.3. IPC

[Figure 9 diagram: each NUMA node holds the shared memory blocks of one data flow - RAW SHM 1, PROCESSED SHM 1, HANDSHAKING SHM 1 and AUXILIARY SHM 1 on node 1, and their counterparts (suffix 2) on node 2.]

Figure 9: Shared Memory Blocks

Communication through shared memory was used for various reasons, the foremost of which was storing the bursts of raw data acquired by the data acquisition software along with their timestamps, as well as storing the processed data of the data processing stage. By storing this information in shared memory, other applications could access it at any time with very low latency. Additionally, shared memories were also used for other purposes:

1. Storage of configuration parameters common to more than one application.

2. Synchronization through a busy-waiting mechanism. This mechanism was used for inter-application synchronization within runtime-critical stages of the diagnostic's server by way of a series of handshaking protocols that allow safe access to shared resources.

3. Enabling diagnostic's server benchmarking. Auxiliary shared memory blocks were introduced for benchmarking purposes. Every application uses a specific block within these auxiliary blocks to store the timestamps of when every burst starts and ends.


Each data flow uses isolated shared memory blocks, as can be seen in Figure 9. As such, each shared memory block referred to in this section is mirrored on both data flows. As it stands currently, each data flow is allowed to have up to one hundred distinct shared memory blocks.

All shared memory blocks used in the upgraded diagnostic's server are now allocated and locked in RAM at boot, contrary to the original diagnostic's server, where the shared memories had to be allocated and deallocated at run time for every experimental session. The reason behind this boot allocation is that at boot there is always enough RAM available on the system. On the other hand, if the shared memory were to be allocated during every shot, a situation could arise where so much memory was already allocated to other applications that not enough remained available for the data acquisition and data processing software.

Two distinct shared memory blocks were used to handle raw and processed data:

• Raw-data shared memory block - The memory block where the acquired raw data is stored. It is currently written into by the data acquisition software, and it is from this block that RTL reads the bursts of raw data in order to calculate the plasma's density profile. The burst timestamps are also kept within this block.

• Processed-data shared memory block - The memory block where processed data is stored. RTL writes a burst's plasma density profile into this block. RCR also interacts with this block to get the density profiles and to store the separatrix position estimations.

Two blocks of shared memory were required for each data flow to perform synchronization between applications and to share common configuration values:

• Handshake shared memory - Multi-purpose shared memory holding a collection of variables used for synchronization.

• Configuration shared memory - Shared memory used for propagating various types of cross-module configurations between applications.

These two blocks are represented collectively as the "Handshaking SHM" in Figure 9.

Availability of new raw data or of a newly calculated density profile is signalled by incrementing a shared integer counter, in the same manner as in the original diagnostic's server, with one difference: an additional shared counter (the Density Profile Shared Counter) is now used between RTL and RCR to signal the availability of new density profiles.

As such, within time-critical zones the upgraded data processing stage uses a modified busy-waiting approach based on polling this shared integer counter, with a sleeping cycle of 1 microsecond. This sleep interval was reached after significant testing on how best to reduce the strain on the hardware without significantly increasing the response time of the applications.

This approach does not produce race conditions, as no application both writes to and reads from the same shared counter.

4.4. Synchronization Through Semaphores

When not within runtime-critical areas of execution, synchronization between applications was performed through named counting semaphores. These semaphores are used during the applications' initialization and termination stages.

5. Results

[Figure 10 diagram: the timestamps measured along the pipeline - RTR1/RTR2 (Data Acquisition) end times in the RTR stage, RTL1/RTL2 (Burst Density Profile Calculation) start and end times in the DPC stage, and RCR (Separatrix Position Estimation) start and end times in the SPE stage.]

Figure 10: Benchmark Procedure - Profile Calculation Time

This section presents the results of the benchmarking performed on the upgraded system. Unless otherwise stated, all benchmarks and statistical results presented below were measured over one hundred consecutive normal acquisitions (10000 bursts per acquisition) to remove outliers, resulting in a total population of one million timing samples (each corresponding to the acquisition and processing of a full density profile) per benchmark. All timing results were obtained through the use of the UTDC [11] board. Reading the actual time from the UTDC required 4 µs on average.

Every histogram and cumulative distribution function presented in this section has a bin size of 1 µs. Figure 10 shows all the times measured during the benchmarking phase.

By measuring these times, all stages of the pipeline can be adequately benchmarked.

5.1. Normal Shot Execution

The graph in Figure 11 presents the measured times for a normal shot (10 s / 10000 bursts of raw data, 1 ms burst profile acquisition cycle). These measurements are relative to their respective


[Figure 11 plot: per-stage timings relative to the burst timestamp - the acquisition time, the end times of the two RTR instances, the start and end times of the two RTL instances, and the RCR start and end times.]

Figure 11: Benchmark - Shot Timings

burst's timestamp (a burst's timestamp is assumed to be the timestamp of the first sweep of the burst).

It can be seen that there is a substantial delay between data flows (≈ 40 µs to ≈ 50 µs). This delay originates in RTR, and it is due to the time required to read each burst sweep's timestamps from the UTDC: this operation is performed asymmetrically by only one of the RTR applications, and it requires ≈ 40 µs on average.

This delay is propagated through the pipeline until the data flows are merged in RCR, where both data flows are forcefully synchronized.

5.2. RTL Benchmarking

Timing the density profile calculation procedure was done by measuring the start time and end time of each of the RTL applications.

We arrive at the profile calculation times (referred to as DPC) by subtracting the two values.

RTL's first benchmark was performed offline, with a replay application substituting RTR as the source of raw data. In this case, the DPC times are calculated using the replay application's end time instead of RTR's end time.

The respective statistical results are presented in Tables 1 and 2.

Mean                106.74 µs
Median              113.4 µs
Variance            11.50 µs²
Standard Deviation  3.40 µs
Maximum             185.25 µs
Skewness            0.78

Table 1: Benchmarking - Offline DPC1 Statistical Results

The histograms of these results are shown in Figure 12. Figure 13 shows the cumulative distribution function of those same results.

These results show that the offline DPC times have means of 106.72 µs and 109.54 µs, and maximums of 185.25 µs and 195.50 µs, respectively.

Mean                109.54 µs
Median              109.00 µs
Variance            17.76 µs²
Standard Deviation  4.21 µs
Maximum             195.50 µs
Skewness            1.21

Table 2: Benchmarking - Offline DPC2 Statistical Results

[Figure 12 plot: histograms of the offline density profile calculation times for RTL1 and RTL2 (time in µs).]

Figure 12: Benchmarking - Offline Density Profile Calculation - Histogram

Analysis of these results shows that 99.96% of the DPC1 time samples are ≤ 136 µs and that 99.96% of the DPC2 time samples are ≤ 144 µs. The profile calculation times are highly concentrated around the mean, with a small standard deviation (≈ 4 µs). The distribution of these results is bell-like with positive skewness.

There is a difference between the behaviour of RTL1 and RTL2, with RTL2 having a bigger standard deviation and a wider spread of results. This, as we will see, is something that always happens. One possible explanation for this variation is the fact that there is only one UTDC board and that this board is connected to NUMA node 1, the same node as RTL1. As such, RTL2, on NUMA node 2, is always slightly slower in accessing and reading times from the UTDC board.

Figure 13: Benchmarking - Offline Density Profile Calculation - CDF (x axis: times in µs; y axis: percent)

On the other hand, the second benchmark was performed using raw data acquired in real time using RTR. The respective statistical results are presented in Tables 3 and 4.

The histograms of these results are shown in Figure 14. Additionally, the cumulative distribution function of these results is shown in Figure 15.

Mean                117.22 µs
Median              113.40 µs
Variance            63.35 µs²
Standard Deviation  7.95 µs
Maximum             187.00 µs
Skewness            1.61

Table 3: Benchmarking - RT DPC1 Statistical Results

Mean                115.81 µs
Median              111.84 µs
Variance            93.69 µs²
Standard Deviation  9.67 µs
Maximum             215.20 µs
Skewness            1.49

Table 4: Benchmarking - RT DPC2 Statistical Results

Figure 14: Benchmarking - RT Density Profile Calculation - Histogram

These results show that the DPC times have means of 117.22 µs and 115.81 µs respectively, and maximums of 187 µs and 215.2 µs respectively. Analysis of these results shows that 99.96% of the DPC1 samples are ≤ 141 µs and that 99.98% of the DPC2 samples are ≤ 150 µs. The profile calculation times are also highly concentrated around their respective means, but less so than in the offline benchmark. Additionally, this benchmark showed a bigger standard deviation for both RTL1 and RTL2, due to a second and smaller histogram peak between 130 µs and 140 µs.

5.3. RCR Benchmarking

Timing the process of estimating the separatrix position was done by measuring the start time and end time of the RCR application. We arrive at the estimation times (referred to as SPE) by subtracting the two values. The respective statistical results are presented in Table 5.

Figure 15: Benchmarking - RT Density Profile Calculation - CDF

SPE Statistics
Mean                4.30 µs
Median              4.18 µs
Variance            0.17 µs²
Standard Deviation  0.41 µs
Maximum             15.74 µs
Skewness            4.94

Table 5: Benchmarking - Separatrix Position Estimation Statistical Results

These results are very close to the time it takes to read the UTDC, which means they are within the margin of error. This is to be expected, as there is no computationally heavy data-processing code within RCR.

Figure 16 presents a time diagram of the upgraded pipeline. Each stage's or application's time in the figure corresponds to the maximum observed time of that stage/application. The upgraded Level-1 applications always deliver both the density profiles and the estimate of the separatrix position before RTR is ready to deliver a new raw burst for processing.

Figure 16: Benchmarking - Pipeline Worst Case Scenario (legend: acquisition of burst reflectometry data; acquisition data DMA upload; RTR1/RTR2; RTL1/RTL2; RCR; time axis 0-700 µs)

The system is also able to deliver a separatrix estimation every 430 µs (counting from the timestamp of the burst's first sweep of the plasma), less than half of the 1 ms DCS control cycle [14].

Figure 16 also shows the delay that exists between data flows due to the reading of the burst's timestamps in RTR. If this delay can be mitigated or removed, the diagnostic gains even more spare time and moves a step closer to the ITER requirement of a 100 µs acquisition rate.


5.4. Comparison with the Original Version

Additionally, if we consider only the density profile calculation procedure (the busy waiting and the SPE are negligible when compared with the DPC times), the data processing in the Level-1 software (RTL and RCR) achieved a considerable speed-up when comparing the results presented above with the results of the original diagnostic available in [14], both on average and in the worst case, as can be seen in Table 6.

        Orig. DS     Upg. DS      Speed-up
Mean    359.50 µs    116.51 µs    3.08
Max     420.00 µs    215.20 µs    1.95

Table 6: Benchmarking - Comparison with the original DS

In conclusion, theory states that when doubling the number of processor cores the maximum possible speed-up is 2 [2]. Amdahl's Law [1] also states that the speed-up should be at most asymptotic and that the overall speed-up depends on the serial time of the program. In this case, however, the data processing cycle has no serial time, and other improvement factors are involved:

• Faster hardware components.

• The move to a NUMA architecture.

• Better isolation and affinity policies.

• Parallel code optimization.

• Hardware-specific compiler optimizations.

All these improvement factors resulted in the practical speed-up presented in Table 6.

6. Conclusions

The initial objective and main motivation of the present work was to adapt and fine-tune the data processing stages of the upgraded RT reflectometry diagnostic, to guarantee a 250 µs profile data acquisition rate and a RT density profile and separatrix position calculation rate of 1 ms [14]. As demonstrated in the previous chapter, the initial objective was not only met but substantially surpassed: the system is now capable of performing a full data acquisition and RT calculation cycle in a 250 µs period. It was shown that all the techniques herein described contributed to a successful and reliable implementation of the codes to be used in the plasma position feedback control demonstration, using both HFS and LFS profile data, planned for the ASDEX Upgrade 2016 experimental campaign.

The current state of the diagnostic software implementation is as follows:

• The code takes maximum advantage of the upgraded hardware, with a NUMA-aware memory allocation and management policy and with an improved isolation and affinity policy.

• The code succeeds in separating the HFS and LFS density profile calculations at both the hardware and software levels by way of optimal memory management and affinity policies.

• The upgraded DS implements a pipelined and modular approach, isolating the several stages of the DS (acquisition, density profile calculation and separatrix position estimation).

The upgraded system can now calculate the HFS and LFS density profiles in real time in 110 µs on average, with a maximum time of 216 µs. The separatrix position is estimated in an almost negligible time (≤ 10 µs). Hence, the pipelined software implementation of the various acquisition and data processing stages allows the system to be operated with a cycle smaller than 250 µs.

The upgraded code is therefore expected to be able to handle a continuous 250 µs acquisition/data-processing rate, which means having not only the Level-1 software but the entire pipeline functioning at that speed.

As the full communication with the DCS could not be tested during the experimental operation period of the tokamak, a fully connected test is still to be made. However, the communication between RT diagnostics and the DCS and the corresponding control actuation have been timed in the past at less than 200 µs. For this reason, it is expected that a fully connected last stage in the pipeline - the RCR stage - should allow the same 250 µs cycle to be met (as it now uses only 6 µs).

Although such tight timings are not required by the ASDEX Upgrade control system, reliable operation in such a configuration is an invaluable case study for the design of the ITER PPR, currently under way.

Acknowledgements

I would like to thank my advisors, in particular Dr. Jorge Santos, for without his help and guidance over the two years I worked at the Instituto de Plasmas e Fusão Nuclear (IPFN) this work would not have been possible. I would also like to thank my colleagues at IPFN.

I want to thank and acknowledge the importance of the ASDEX Upgrade Project. The future of energy is in nuclear fusion, and I was privileged to be able to make a contribution to this project.

I want to thank my University and all my teachers for all the knowledge, work ethics and values that were instilled in me throughout all these years, especially Dra. Maria Emília Manso, who introduced me to IPFN.

Finally, I thank my family, especially my father, for the ever-present encouragement and support.

References

[1] G. S. Almasi and A. Gottlieb. Highly Parallel Computing. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1989.

[2] B. Barney et al. Introduction to parallel computing. Lawrence Livermore National Laboratory, 6(13):10, 2010.

[3] F. Gebali. Algorithms and Parallel Computing, volume 84. John Wiley & Sons, 2011.

[4] A. Kleen, C. Wickman, C. Lameter, and the SUSE labs. libnuma and numactl - A NUMA API for Linux. http://oss.sgi.com/projects/libnuma/. [Online; accessed 22-September-2015].

[5] M. Mikolajczak. Designing and building parallel programs: Concepts and tools for parallel software engineering. IEEE Concurrency, 5(2):88-90, 1997.

[6] openSUSE Project. openSUSE Linux Distribution. https://www.opensuse.org/en/. [Online; accessed 22-September-2015].

[7] The Linux Kernel Organization. Real-Time Linux Patch. https://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO/. [Online; accessed 22-September-2015].

[8] The Linux Kernel Organization. The Linux Kernel Archives. kernel.org/. [Online; accessed 22-September-2015].

[9] Linux Man Pages. cpuset(7) Man Page. http://man7.org/linux/man-pages/man7/cpuset.7.html. [Online; accessed 22-September-2015].

[10] M. J. Quinn. Parallel Programming, volume 526. TMH CSE, 2003.

[11] G. Raupp, R. Cole, K. Behler, M. Fitzek, P. Heimann, A. Lohs, K. Lüddecke, G. Neu, J. Schacht, W. Treutterer, D. Zasche, T. Zehetbauer, and M. Zilker. A universal time system for ASDEX Upgrade. Fusion Engineering and Design, 66-68:947-951, 2003. 22nd Symposium on Fusion Technology.

[12] J. Santos. Fast reconstruction of reflectometry density profiles on ASDEX Upgrade for plasma position feedback purposes. 2008.

[13] J. Santos, G. Santos, M. Zilker, L. Guimarais, W. Treutterer, C. Rapson, M. Manso, et al. Enhancement of the ASDEX Upgrade real-time plasma position reflectometry diagnostic for feedback purposes. 2014.

[14] J. Santos, M. Zilker, L. Guimarais, W. Treutterer, C. Amador, and M. Manso. COTS-based high-data-throughput acquisition system for a real-time reflectometry diagnostic. IEEE Transactions on Nuclear Science, 58(4):1751-1758, 2011.

[15] J. Santos, M. Zilker, and W. Treutterer. Microwaves used for the first time in the position control of a fusion machine. In 6º Congresso do Comité Português da URSI. ICP-Autoridade Nacional de Comunicações, 2012.

[16] K. G. Shin and P. Ramanathan. Real-time computing: A new discipline of computer science and engineering. Proceedings of the IEEE, 82(1):6-24, 1994.

[17] W. Treutterer, L. Giannone, A. Kallenbach, C. Rapson, G. Raupp, and M. Reich. Real-time control of plasma performance on ASDEX Upgrade and its implications for ITER. In Fusion Engineering (SOFE), 2013 IEEE 25th Symposium on, pages 1-7, June 2013.
