Yaron Doweck Yael Einziger Supervisor: Mike Sumszyk Spring 2011 Semester Project Efficient Real-Time Multicore Image Processing on TI C66x midterm presentation


Yaron Doweck Yael Einziger

Supervisor: Mike Sumszyk

Spring 2011 Semester Project

Efficient Real-Time Multicore Image Processing on TI C66x

Midterm presentation


Contents

1. Project Goals

2. Development Tools

3. Learning Steps

4. What’s next


Project Goals

* Learn to use the new TI C66x platform and to exploit its capabilities and advantages.

* Implement a real-time computer vision algorithm using multi-core programming.



Development tools

*Hardware:

TMS320C6678 Multicore Fixed and Floating-Point Digital Signal Processor

*Software:

Code Composer Studio v5 with BIOS MCSDK 2.0


TMS320C6678

*8 C66x CorePac DSPs

*Based on TI’s KeyStone multicore architecture

*320 GMAC/160 GFLOP @ 1.25GHz

*32KB L1P, 32KB L1D, 512KB L2 Per Core

*4MB Shared L2

*64-Bit DDR3 Interface (DDR3-1600)


Learning steps

1. CCS Simulator and Profiler

2. Cache configuration

3. DMA data transfer

4. Interrupts

5. Fixed- and floating-point libraries (DSPLib, IMGLib, VLib, …)

6. SYS/BIOS

7. Multi-core programming


Step 1: CCS Simulator and Profiler

*CCS v5 can simulate the C6678 processor and some of its peripherals.

*The profiler reports execution time and other statistics for functions and for individual lines of code.


Step 1: CCS Simulator and Profiler

*Graph viewer – displays data from memory in the time or frequency domain.

*Image Analyzer – displays an image stored in memory or in a file. Supports grayscale, RGB and YUV color formats.


Step 2: Cache

*32 KB L1P cache. L1P is read-allocate and direct mapped.

*32 KB L1D cache. L1D is read-allocate, write-back and 2-way set associative.

*Each L1 memory can be configured as 0, 4, 8, 16 or 32 KB of cache.

*512 KB L2 cache. L2 is read- and write-allocate and 4-way set associative.

*L2 can be configured as 0, 32, 64, 128, 256 or 512 KB of cache.

*All of these configurations can be changed at run time.


Step 2: Cache

Achievements:

*Configuring different L1 and L2 cache sizes during or before run time.

*Using L1 and L2 as SRAM memory (fully SRAM or part SRAM and part cache).

*Controlling where variables are placed (L1, L2 or DDR3 memory).


Step 3: DMA

*C66xx processors have 3 EDMA3 controllers, each with 64 DMA channels + 8 QDMA channels.

*EDMA3 supports data transfers to/from cache, shared memory or external memory.

*EDMA3 supports the use of hardware interrupts.

*In addition, each core has a faster IDMA controller for internal transfers.


Step 3: DMA


Achievements:

*Using IDMA to transfer data inside a core (L2↔L1).

*Using EDMA3 to transfer data to/from L1, L2 and DDR3.


Step 4: Interrupts

The interrupt controller supports up to 128 system events. They consist of both internally generated events (within the C66x CorePac) and chip-level events.


Step 4: Interrupts

The interrupt controller outputs 15 signals to the core from the event inputs:

*One maskable hardware exception

*12 maskable hardware interrupts

*One non-maskable signal

*One reset signal


Step 4: Interrupts


Achievements:

*Configuring manually triggered events.

*Configuring an EDMA transfer-completion routine that is triggered by an EDMA system event.


Step 5: Libraries

*DSPLib – an optimized DSP function library that includes general-purpose signal-processing routines for real-time applications.

[Figure: LPF example]


Step 5: Libraries

*IMGLib – an optimized image/video processing function library that includes general-purpose image/video processing routines for real-time applications.

[Figure: IMGLib examples (histogram, edge detection, derivative)]


Step 5: Libraries

Some more libraries

*VLib – a collection of computer vision algorithms that are optimized for TI DSPs.

*IQMath – a collection of highly optimized fixed-point arithmetic, trigonometric and other mathematical functions, typically used in real-time applications.

*fastMath – optimized arithmetic and trigonometric functions for floating-point devices.


Step 5: Libraries


Achievements:

*Using DSPLib for a simple signal-processing application with floating point arrays.

*Using IMGLib for a simple image-processing application.

Still left:

*Studying the VLib, IQMath and fastMath libraries.

*Comparing measured running times with the running times specified in the User Guide.


Step 6: SYS/BIOS

*SYS/BIOS is a real-time operating system designed for applications that require real-time scheduling and synchronization.

*SYS/BIOS provides preemptive multi-threading, hardware abstraction, real-time analysis, and configuration tools.

*SYS/BIOS is designed to minimize memory and CPU requirements on the target.


Step 6: SYS/BIOS


Achievements:

*Using SYS/BIOS modules to configure the DSP’s memory (cache sizes, memory sections, heap and stack size).

*Running a multi-threaded program with protected shared variables.

Still left:

*Using SYS/BIOS modules to configure DSP peripherals (LAN, SRIO, PCIe).


Learning steps

1. CCS Simulator and Profiler - done

2. Cache configuration - done

3. DMA data transfer - done

4. Interrupts - done

5. Fixed- and floating-point libraries (DSPLib, IMGLib, VLib, …) – In Progress

6. SYS/BIOS – In Progress

7. Multi-core programming


What’s next

1. Implementation of a bidirectional data flow between DDR3 and L1, possibly through L2. (3 weeks)

2. Performance analysis (throughput, latency and accuracy) when using floating-point versus fixed-point libraries. (2 weeks)

3. Usage of hardware semaphores for parallel data access, and of Multicore Navigator for message passing between cores. (4 weeks)
