Session 1: Introduction to Concurrent Programming


Upload: eric-verhulst

Post on 18-Dec-2014


DESCRIPTION

A short introduction to concurrent/parallel programming, some formalisation of systems engineering, and the formal development of OpenComRTOS.

TRANSCRIPT

Page 1: Session 1 introduction  concurrent programming

From Deep Space to Deep Sea

Introduction Concurrent Programming

Eric Verhulst

[email protected], http://www.altreonic.com

Content

• The von Neumann legacy

• Using more than one processor

• On concurrent multi-tasking

• How to program these things?

• From MP to VSP: the Virtual Single Processor

• Embedded Systems Engineering

• The formal development of OpenComRTOS

• Conclusion

MC4ES workshop, 11 January 2012

Page 2: Session 1 introduction  concurrent programming

The von Neumann ALU vs. an embedded processor

• The sequential programming paradigm is based on the von Neumann architecture

• But this was only meant for a single ALU

• A real processor in an embedded system :

• Inputs data, processes the data (only this part is covered by von Neumann), outputs the result

• In other words :

• at least two communications, often one computation

• => Communication/Computation ratio must be > 1 (in optimal case)

• Standard programming languages (C, Java, …) only cover the computation and sometimes limited runtime multitasking

• We have an imbalance, and have been living with it for decades

• Reason ? : history

• Computer scientists use workstations (the eye's Nyquist frequency is about 50 Hz)

• Only embedded systems must process data in real-time

• Embedded systems were first developed by hardware engineers

Multi-tasking: doing more than one thing at a time

• Origin:

• A software solution to a hardware limitation

• von Neumann processors are sequential; the real world is “parallel” by nature, and software is just modeling it

• Developed out of industrial needs

• How?

• A function is a [callable] sequential stream of instructions

• Uses resources [mainly registers] => defines “context”

• Non-sequential processing =

• switching between ownership of processor(s)

• reducing overhead by using idle time or avoiding active waiting :

• each function has its own workspace

• a task = function with proper context and workspace

• Scheduling to achieve real-time behavior for each task

• preemption capability : switch asynchronously
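The task model above (a function plus its own context and workspace) can be illustrated with threads. A minimal sketch in Python, where each thread's local variables play the role of the task's private workspace; the names here are illustrative only:

```python
import threading

# Shared result area; each "task" only writes its own key.
results = {}

def task(name, workspace_size):
    # Private workspace: lives on this thread's own stack/frame,
    # never shared with the other task.
    workspace = [0] * workspace_size
    for i in range(workspace_size):
        workspace[i] = i
    results[name] = sum(workspace)   # t1 -> 45, t2 -> 10

# Two tasks = same function, but each with its own context and workspace.
t1 = threading.Thread(target=task, args=("t1", 10))
t2 = threading.Thread(target=task, args=("t2", 5))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)
```

The scheduler (here the OS thread scheduler) switches ownership of the processor between the two contexts; the program logic does not depend on the interleaving.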


Page 3: Session 1 introduction  concurrent programming

Scheduling approaches (1)

• Dominant real-time/scheduling system view paradigms:

• Control flow:

• Event driven - asynchronous : latency is the issue

• Traverse the state machine

• Uncovered states generate complexity

• Data-flow:

• Data-driven : throughput is the issue

• Multi-rate processing generates complexity

• Time-triggered:

• Play safe: allocate timeslots beforehand

• Reliable if system is predictable and stationary

• Static scheduling

• Play safe: single pre-calculated thread

• Reliable if no SW errors, if no HW failures


Scheduling approaches (2)

• REAL SYSTEMS :

• combination of above

• distinction is mainly implementation and style issue, not conceptual

• SCHEDULING IS AN ORTHOGONAL ISSUE TO MULTI-TASKING

• multi-tasking gives logical behaviour

• scheduling gives timely behaviour


Page 4: Session 1 introduction  concurrent programming

Classical (RT) scheduling (1)

• Superloop :

• everything runs in a single endless loop : [test, call function]

• Round-robin, FIFO :

• first in, first served

• cooperative multi-tasking, run till descheduling point, no preemption

• Priority based, pre-emptive

• tasks run when executable in order of priority

• can be descheduled at any moment

• Rate Monotonic Analysis (Liu & Layland) : CPU load < 70 %

• simplified model : tasks are independent

• graceful degradation by design
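The ~70% figure above is the Liu & Layland utilization bound: n independent periodic tasks are schedulable under rate-monotonic priorities if total CPU load stays below U(n) = n(2^(1/n) - 1). A quick check:

```python
# Liu & Layland schedulability bound for Rate Monotonic Analysis:
# n independent periodic tasks are guaranteed schedulable if
# total utilization <= n * (2**(1/n) - 1).
def rma_bound(n):
    return n * (2 ** (1.0 / n) - 1)

for n in (1, 2, 3, 10):
    print(n, round(rma_bound(n), 3))
# The bound falls from 1.0 (n=1) towards ln(2) ~ 0.693 for large n,
# which is where the "CPU load < 70 %" rule of thumb comes from.
```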


Classical (RT) scheduling (2)

• Earliest deadline first (EDF):

• most dynamic form : deadlines are time based

• but complex to use and implement

• time granularity is an issue (hardware lacking support)

• catastrophic failure mode

• For MP systems:

• No real generic solution for embedded

• IT solutions focus on QoS and soft real-time + high overhead

• Higher need for error recovery

• Graceful degradation : guarantee QoS even if HW resources fail

• RMA (priority, preemptive) still best approach
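The EDF rule described above reduces to one decision: at every scheduling point, run the ready task with the nearest absolute deadline. A minimal sketch with a hypothetical task list:

```python
# Earliest Deadline First: dynamic priorities derived from time.
# Task records here are illustrative, not a real RTOS structure.
def edf_pick(ready):
    """Return the ready task with the earliest absolute deadline."""
    return min(ready, key=lambda t: t["deadline"])

ready = [
    {"name": "A", "deadline": 30},
    {"name": "B", "deadline": 12},   # nearest deadline -> runs first
    {"name": "C", "deadline": 50},
]
print(edf_pick(ready)["name"])   # B
```

The catch noted above: this needs a fine-grained time base at every scheduling point, and when the system is overloaded the misses cascade (the catastrophic failure mode), whereas fixed priorities degrade more gracefully.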


Page 5: Session 1 introduction  concurrent programming

Scheduling visualized: tracer


Using more than one processor


Page 6: Session 1 introduction  concurrent programming

Why Multi-Processing?

• Laws of diminishing returns :

• Power consumption increases more than linearly with speed (F**2, V)

• Highest (peak) speed achieved by micro-parallel tricks :

• Pipelining, VLIW, out of order execution, branch prediction, …

• Efficiency depends on application code

• Requires higher frequencies and many more gates

• Creates new bottlenecks :

• I/O and communication become bottlenecks

• Memory access speed slower than ALU processing speed

• Result :

• 2 CPU@1F Hz can be better than 1 CPU@2F Hz if communication support (HW and SW) is adequate

• The catch :

• Not supported by von Neumann model

• Scheduling, task partitioning and communication are inter-dependent

• BUT SCHEDULING IS NOT ORTHOGONAL TO PROCESSOR MAPPING AND INTERPROCESSOR COMMUNICATION
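The "2 CPU@1F can beat 1 CPU@2F" claim follows from the power law mentioned above: dynamic power scales roughly with V²·F, and the supply voltage must rise roughly with frequency, so power grows close to F³. A back-of-the-envelope check (normalised units, assuming V scales linearly with F):

```python
# Dynamic CMOS power ~ C * V^2 * F  (capacitance C normalised to 1).
def dynamic_power(f, v):
    return v * v * f

# Assumption for this sketch: voltage must scale ~linearly with frequency.
one_fast = dynamic_power(2.0, 2.0)       # one CPU at 2F needs ~2x voltage
two_slow = 2 * dynamic_power(1.0, 1.0)   # two CPUs at F, nominal voltage

print(one_fast, two_slow, one_fast / two_slow)   # 8.0 2.0 4.0
```

Under these assumptions the single fast core burns about 4x the power of two slow ones for the same nominal throughput, which is exactly the argument for multi-processing, provided the communication support (HW and SW) is adequate.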


Generic MP system

• There is no conceptual difference between:

• Multicore (often assumed homogeneous)

• Manycore (often assumed heterogeneous)

• Parallel (closer to reality)

• Distributed

• Networked

• Difference is implementation architecture:

• Shared vs. distributed memory

• SMP vs. SIMD vs. MIMD

• Note: PCIe (RapidIO, …) emulate a shared bus using very fast point-to-point bit-serial lanes


Page 7: Session 1 introduction  concurrent programming

Example: A Rack in a (MP)-SoC (1)


Example: A Rack in a (MP)-SoC (2)


Page 8: Session 1 introduction  concurrent programming

Why MP-SoC is now standard

• High NRE, high frequency signals

• Conclusion :

• multi-core, coarse-grain asynchronous SoC design

• cores as proven components -> well defined interfaces

• keep critical circuits inside

• simplify I/O, high speed serial links, no buses

• NRE dictates high volume -> more reprogrammability

• system is now a component

• below minimum thresholds of power and cost

• it becomes cheap to “burn” gates

• software becomes the differentiating factor


On concurrent multi-tasking


Page 9: Session 1 introduction  concurrent programming

On concurrent multi-tasking (1)

• Tasks need to interact

• synchronize

• pass data = communicate

• share resources

• A task = a virtual single processor or unit of abstraction

• A (SW) multi-tasking system can emulate a (HW) real system

• Multi-tasking needs communication services


On concurrent multi-tasking (2)

• Theoretical model :

• CSP : Communicating Sequential Processes (and its variations)

• C.A.R. Hoare

• CSP := sequential processes + channels (cfr. occam)

• Channels := synchronised (blocked) communication, no protocol

• Formal, but doesn’t match complexity of real world

• Generic model : module based, multi-tasking based, process oriented, …

• Generic model matches reality of MP-SoC

• Very powerful for breaking the von Neumann constriction


Page 10: Session 1 introduction  concurrent programming

There are only programs

• Simplest form of computation is assignment :

a:= b

• Semi-Formal :

BEFORE : a = UNDEF; b = VALUE(b)

AFTER : a = VALUE(b); b = VALUE(b)

• Implementation in typical von Neumann machine :

Load b, register X

Store X, a


CSP explained in occam

PROC P1, P2 :
CHAN OF INT32 c1, c2 :
PAR
  P1(c1, c2)
  P2(c1, c2)

/* c1 ? a : read from channel c1 into variable a */
/* c1 ! a : write variable a into channel c1 */
/* order of execution not defined by clock but by */
/* channel communication : execute when data is ready */

[Diagram: processes P1 and P2 connected by channels C1 and C2, carrying variables a and b. Needed: context and communication.]

20MC4ES workshop11 January 2012
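The channel semantics above (`c1 ! a` / `c1 ? b`, with execution order fixed by communication rather than by a clock) can be sketched in Python using a bounded queue as the channel. Note the hedge: a size-1 queue gives buffered rather than true rendezvous communication, but the ordering guarantee is the same:

```python
import threading, queue

# The channel: a bounded queue standing in for an occam channel.
# (occam channels are unbuffered rendezvous; Python's Queue has no
# zero-capacity mode, so maxsize=1 is an approximation.)
c1 = queue.Queue(maxsize=1)

out = []

def p1():
    a = 42
    c1.put(a)        # c1 ! a  : write a into the channel

def p2():
    b = c1.get()     # c1 ? b  : read from the channel into b (blocks until data ready)
    out.append(b)

t1 = threading.Thread(target=p1)
t2 = threading.Thread(target=p2)
t1.start(); t2.start()
t1.join(); t2.join()
print(out)   # [42]
```

Whichever thread starts first, P2 cannot proceed past the read until P1 has written: the communication itself is the synchronisation point, exactly as in the PAR example.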

Page 11: Session 1 introduction  concurrent programming

A small parallel program

No assumption in the PAR case about the order of execution => self-synchronising

P1:
  INT32 a :
  SEQ
    a := ANY
    c1 ! a

P2:
  INT32 b :
  SEQ
    b := ANY
    c1 ? b

Equivalent sequential version:
  INT32 a, b :
  SEQ
    a := ANY
    b := ANY
    b := a


The PAR version at von Neumann machine level

PROC_1:
  Load a, register X
  Store X, output register
  (hidden : start channel transfer)
  (hidden : transfer control to PROC_2)

PROC_2:
  (hidden : detect channel transfer)
  (hidden : transfer control to PROC_2)
  Load input register, X
  Store X, b

In between :

• Data moves from the output register to the input register

• The sequential case is an optimization of the parallel case

• But there are hidden assumptions about data validity!


Page 12: Session 1 introduction  concurrent programming

The same program for hardware with Handel-C

void main(void)
{
  chan chan_between;
  int a, b;
  par /* WILL GENERATE PARALLEL HW (1 clock cycle) */
  {
    chan_between ! a;
    chan_between ? b;
  }
}

But :

void main(void)
{
  chan chan_between;
  int a, b;
  seq /* WILL GENERATE SEQUENTIAL HW (2 clock cycles) */
  {
    chan_between ! a;
    chan_between ? b;
  }
}

Consequences

• Data is protected inside the scope of a process.

• Interaction is through explicit communication

• In order to safeguard abstract equivalence :

• Communication backbone needed

• Automatic routing needed (but deadlock free)

• Process scheduler if on same processor

• In order to safeguard real-time behavior

• Prioritisation of communication for dynamic applications (or schedule)

• In order to handle multi-byte communication :

• Buffering at communication layer, Packetisation, DMA in background

• Result :

• prioritized packet switching : header, priority, payload

• Communication not fundamentally different from data I/O
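The resulting prioritized packet switching (header, priority, payload) can be sketched with a priority queue. This is an illustrative model, not the OpenComRTOS implementation:

```python
import heapq

# Minimal prioritized packet switch: packets carry (priority, header,
# payload); the heap always delivers the highest-priority packet first.
class PacketSwitch:
    def __init__(self):
        self._q = []
        self._seq = 0   # tie-breaker: preserves FIFO order within a priority

    def send(self, priority, header, payload):
        # Lower number = higher priority, as is common in RTOS practice.
        heapq.heappush(self._q, (priority, self._seq, header, payload))
        self._seq += 1

    def deliver(self):
        _, _, header, payload = heapq.heappop(self._q)
        return header, payload

sw = PacketSwitch()
sw.send(5, "to:task2", b"bulk data")    # low-priority bulk transfer
sw.send(1, "to:kernel", b"urgent")      # high-priority packet jumps the queue
print(sw.deliver()[0])   # to:kernel
```

With packetisation and buffering at the communication layer, multi-byte transfers and task-to-task messages go through the same mechanism, which is why communication ends up not fundamentally different from data I/O.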


Page 13: Session 1 introduction  concurrent programming

The (next generation) SoC

Xilinx Zynq dual core ARM + FPGA


The (next generation) SoC

Altera dual core ARM + FPGA


Page 14: Session 1 introduction  concurrent programming

How to program these things?


Where’s the (new) programming model?

• Issue: what’s wrong with the “good old” software?

• => von Neumann => shared-memory syndrome

• Issue is not access to memory but integrity of memory

• Issue is not bandwidth to memory, but latency

• Sequential programs have lost the information of the inherent parallelism in the problem domain

• Most attempts (MPI, ...) just add a heavy comm layer

• Issue: underlying hardware still visible

• Difficult for:

• Porting to another target

• Scalability (from small to large AND vice versa)

• Application domain specifics

• Performance doesn’t scale

Page 15: Session 1 introduction  concurrent programming

Beyond concurrent multi-tasking

• Multi-tasking = Process Oriented Programming

• A Task =

• Unit of execution

• Encapsulated functional behavior

• Modular programming

• High Level [Programming] Language :

• common specification :

• for SW: compile to asm

• for HW: compile to VHDL or Verilog

• E.g. program PPC with ANSI C (and RTOS), FPGA with Handel-C

• C level design is enabler for SoC “co-design”

• More abstraction gives higher productivity

• But interfaces should be better standardized for better re-use

• Interfaces can be “compiled” for higher volume applications

Multi-tasking API: an orthogonal set

• Events : binary data, one to one, local -> interface to HW

• [counting] semaphore : events with a memory, many to many, distributed

• FIFO queue : simple datacomm between tasks, many to many, distributed

• Mailbox/Port : variable size datacomm between tasks, rendez-vous, one/many to one/many, distributed

• Resources : ownership protection, priority inheritance

• Memory maps/pools

• semantic issues : distributed operation, blocking, non-blocking, time-out, asynchronous operation, group operators

• L1_memcpy_[A][WT] (not just memcpy!)

• L1_SendPacketToPort_[NW][W][T][A]
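The counting semaphore above ("events with a memory") and the blocking/non-blocking semantics implied by the _W/_NW suffixes can be sketched with a standard semaphore. The mapping of suffixes to behaviours is an assumption for illustration, not the documented OpenComRTOS API:

```python
import threading

# Counting semaphore: signals accumulate in the counter, so a signal
# is never lost even if no task is waiting yet ("event with a memory").
sem = threading.Semaphore(0)

sem.release()                         # signal: count 0 -> 1
sem.release()                         # signal again: count 1 -> 2

# _NW-style (non-waiting) test: returns immediately instead of blocking.
print(sem.acquire(blocking=False))    # True  (count 2 -> 1)
print(sem.acquire(blocking=False))    # True  (count 1 -> 0)
print(sem.acquire(blocking=False))    # False (empty: would block in the _W variant)
```

A _W variant would simply call `sem.acquire()` and block until signalled, and a _WT variant would add a timeout; the point of the orthogonal API set is that these semantic options combine freely with every entity type.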


Page 16: Session 1 introduction  concurrent programming

From MP to VSP: the Virtual Single Processor


Virtual Single Processor (VSP) model

• Transparent parallel programming

• Cross development on any platform + portability

• Scalability, even on heterogeneous targets

• Distributed semantics

• Program logic neutral to topology and object mapping

• Clean API gives fewer programming errors

• Prioritized packet switching communication layer

• Based on “CSP” (C.A.R. Hoare): Communicating Sequential Processes: VSP is pragmatic superset

• Implemented first in Virtuoso RTOS (now WRS/Intel) -> OpenComRTOS

• Multitasking and message passing

• Process-oriented programming

• Interfacing using communication protocols

• The application doesn’t need to know the physical layer

Page 17: Session 1 introduction  concurrent programming

[Slides 33 and 34: diagram-only slides]

Page 18: Session 1 introduction  concurrent programming

Full application : Matlab/Simulink type design

Embedded DSP app with GUI front-end


[Diagram: Virtuoso tasks and communication channels on a specific four-DSP card. An ADC driver task feeds a ReadAudioData task, which splits the L-R channels; ProcessAudioData stages 1 and 2 run ahead of separate L and R pipelines for stages 3 and 4, a channel joiner recombines them for stages 5 and 6, and a PlayAudioData task drives the DAC driver task, with the stages mapped across DSP 1 to DSP 4. A GUI front-end (parameter knobs, monitor windows, etc.) can be written in any language and run remotely, connected via a parameter settings & control task and a monitor task.]

Homogeneous OpenComRTOS


Block diagram at top level, executable spec in e.g. C


Page 19: Session 1 introduction  concurrent programming

Heterogeneous OpenComRTOS with VSP and host OS


Next: Heterogeneous VSP with reprogrammable HW


Page 20: Session 1 introduction  concurrent programming

Conclusion

• RTOS is much more than real-time

• General purpose “process oriented” design and programming

• Hide complexity inside chip for hardware (in SoC chip)

• Hide complexity inside task for software (with RTOS)

• Hide complexity of communication in system level support

• CSP provides unified theoretical base for hardware and software, RTOS makes it pragmatic for real world :

• “DESIGN PARALLEL, OPTIMIZE SEQUENTIALLY”

• Software meets hardware with same development paradigm :

• Handel-C for FPGA, “Parallel” C for SW

• FPGA with macro-blocks is next generation SW defined SoC :

• Time for asynchronous HW design ?


Embedded Systems Engineering


Page 21: Session 1 introduction  concurrent programming

Engineering methodology


[Diagram: the tool chain and process flow.

• OpenVE ©: formalized modelling, simulation, code generation
• OpenComRTOS ©: formally developed; runtime support for concurrency and communication
• StarFish Controller ©: control & processing platform, SIL enabled with support for fault tolerance
• GoedelWorks ©: formalised requirements & specifications capturing; project repository

A unified architectural paradigm (Interacting Entities) and unified semantics tie user applications, a test harness, modeling and meta-models together over a unifying repository.

The automated process (phases 1 to 4) runs from requirements capturing through specifications capturing, system architecting, system development, system integration, system validation and system maintenance. It covers system and safety specifications and analysis (FMEA, HARA), hardware, software and packaging specs, domain-specific specifications and architectural design, test cases, fault cases, test procedures and test results, software architectural design, implementation, verification and test, released design code, released source code & distribution, validation results, user manual and system updates.

An ASIL process flow identified 2800 process requirements; the cost of issues grows steeply with each later phase.]

Page 22: Session 1 introduction  concurrent programming

The OpenComRTOS approach

• Derived from a unified systems engineering methodology

• Two keywords:

• Unified Semantics

• use of a common “systems grammar”

• covers requirements, specifications, architecture, runtime, ...

• Interacting Entities (models almost any system)

• RTOS and embedded systems:

• Map very well on “interacting entities”

• Time and architecture mostly orthogonal

• Logical model is not communication but “interaction”

The OpenComRTOS project

• Target systems:

• Multi/many-core, parallel processors, networked systems, including “legacy” processing nodes running an old (RT)OS

• Methodology:

• Formal modeling and formal verification

• Architecture:

• Target is multi-node, hence communication is system-level issue, not a programmer’s concern

• Scheduling is orthogonal issue

• An application function = a “task” or a set of “tasks”

• Composed of sequential “segments”

• In between, tasks synchronise and pass data (“interaction”)

Page 23: Session 1 introduction  concurrent programming

TLA+ used for formal modelling

• TLA (the Temporal Logic of Actions) is a logic for specifying and reasoning about concurrent and reactive systems.

• Also UPPAAL for time-out services


Graphical view of RTOS “Hubs”


[Diagram: the Generic Hub (N-N). Waiting lists (W, L) on both sides feed a synchronising predicate and a predicate action. Specialisations: a count and threshold for synchronisation (events, semaphores); priority inheritance for semaphores; an owner task and ceiling priority with waiting lists for resources; a buffer list when data needs to be buffered. Similar to guarded actions, or a pragmatic superset of CSP.]
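The Hub idea (waiting lists plus a synchronising predicate plus an action) can be sketched as follows. All names here are illustrative, not the OpenComRTOS API; a real Hub also handles priorities, timeouts and distribution:

```python
# Minimal Hub sketch: requests accumulate on waiting lists until the
# synchronisation predicate holds, then the predicate action fires.
class Hub:
    def __init__(self, predicate, action):
        self.waiting_put = []    # senders waiting to synchronise
        self.waiting_get = []    # receivers waiting to synchronise
        self.predicate = predicate
        self.action = action

    def put(self, packet):
        self.waiting_put.append(packet)
        return self._try_sync()

    def get(self, consumer):
        self.waiting_get.append(consumer)
        return self._try_sync()

    def _try_sync(self):
        if self.predicate(self):          # e.g. both sides present
            return self.action(self)      # e.g. move the packet across
        return None                       # caller stays on the waiting list

# A Port-like hub: synchronise when a sender and a receiver both wait;
# the action removes one of each and delivers the packet.
port = Hub(
    predicate=lambda h: h.waiting_put and h.waiting_get,
    action=lambda h: (h.waiting_get.pop(0), h.waiting_put.pop(0))[1],
)

print(port.put("data"))     # None: no receiver yet, sender waits
print(port.get("task2"))    # predicate now holds, packet delivered
```

Changing only the predicate and action turns the same skeleton into an event, a semaphore, a FIFO or a resource, which is how one generic entity can cover all the RTOS services.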

Page 24: Session 1 introduction  concurrent programming

All RTOS entities are “HUBs”


OpenComRTOS application view: any entity can be mapped onto any node


Page 25: Session 1 introduction  concurrent programming

Unexpected: RTOS 10x smaller

• Reference is Virtuoso RTOS (ex-Eonic Systems)

• The new architecture’s benefits:

• Much easier to port

• Same functionality (and more) in 10x less code

• Smallest size SP: 1 KByte program, 200 bytes of RAM

• Smallest size MP: 2 KBytes

• Full version MP: 5 Kbytes, grows to 8 KB for complex SoC

• Why is small better ?

• Much better performance (fewer instructions)

• Frees up more fast internal memory

• Easier to verify and modify

• Architecture allows new services without changing the RTOS kernel task!


Clean architecture gives small code

OpenComRTOS L1 code size figures (MLX16), in bytes:

Service                     MP FULL      SP SMALL
L0 Port                         162           132
L1 Hub shared                   574           400
L1 Port                           4             4
L1 Event                         68            70
L1 Semaphore                     54            54
L1 Resource                     104           104
L1 FIFO                         232           232
L1 Resource List                184           184
Total L1 services              1220          1048
Grand Total (L0, L1)     3150, 4532     996, 2104

Smallest application: 1048 bytes program code and 198 bytes RAM (data)
(SP, 2 tasks with 2 Ports sending/receiving Packets in a loop, ANSI-C)
Number of instructions: 605 instructions for one loop
(= 2 x context switches, 2 x L0_SendPacket_W, 2 x L0_ReceivePacket_W)


Page 26: Session 1 introduction  concurrent programming

Probably the smallest MP-demo in the world

Item                                        Code size   Data size
Platform firmware                                 520           0
2 application tasks                               230
2 UART driver tasks                               338
Kernel task, Idle task,
OpenComRTOS full MP (_NW, _W, _WT, _A)           3500        1002
  of which: kernel stack 100, task stacks 4*64,
  ISR stack 64, idle stack 50
                                                              568
Total                                      4138 + 520  1002 + 568

Conclusions

• Don’t be afraid of multi-many-parallel-distributed-network-centric programming.

• It’s more natural than sequential programming

• The benefits are: scalability, faster development, reuse, …


Page 27: Session 1 introduction  concurrent programming

The book


Formal Development of a Network-Centric RTOS: Software Engineering for Reliable Embedded Systems

Verhulst, E., Boute, R.T., Faria, J.M.S., Sputh, B.H.C., Mezhuyev, V.

Springer

Documents the project (IWT co-funded) and the design, formal modeling and implementation of OpenComRTOS
