Towards a Java Multiprocessor

Christof Pitter, Martin Schoeberl

Vienna University of Technology, Austria

27 September 2007

Motivation

• Chip multiprocessing (CMP)

• Current trend in server & desktop systems

• Embedded systems

• Challenge for hard real-time systems

• RT-Java promising topic for future

LEON3 by

ARM11 MPCore by


Our Goal

• Chip-multiprocessor (CMP)
  – global shared memory

• Java Optimized Processor (JOP) = Java VM in hardware

• Time predictable

• Still good performance

• Implementation in FPGA


Our Goal II


Agenda

• CMP Architecture
  – Memory Model
  – Cache Memory
  – Synchronization

• FPGA Implementation

• Benchmark Results

• Conclusion

• Future Work


Memory Model

[Figure: Shared Memory vs. Distributed Shared Memory]


Why Shared Memory?

JVM memory areas: 2 shared data areas (Heap, Method area)

Shared Memory                | Distr. Shared Memory
Physically centralized       | Physically distributed
Symmetric access time (UMA)  | Access time varies with location (NUMA)
Arbiter                      | Interconnection network + Message passing
Low bandwidth, high # CPUs   | Higher bandwidth


Why no NoC?

• No use for a network

• Multiple masters to a slave

• Masters communicate through memory

• May introduce long latencies

• Hardware Overhead

SoC bus


Cache Memory

• Cache coherence conflicts avoided by architecture

• Stack cache: private data for each thread

• Method cache: read-only memory

• Heap not cached


Synchronization

• Protect parallel access to shared objects
  – JVM: associates a lock with each object
  – JOP: activation & deactivation of interrupts
  – CMP:
    • Use of one global lock for the heap
    • Future work: multiple locks

• Avoidance of priority inversion
  – Priority inheritance locks
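The "one global lock for the heap" idea can be sketched in plain Java. This is only a software stand-in, assuming that every synchronized section on any object contends for the same lock; the class name `GlobalHeapLock`, the helper `synchronizedOn`, and the use of `ReentrantLock` are illustrative assumptions, not the JOP hardware mechanism, and priority inheritance is not modelled.

```java
import java.util.concurrent.locks.ReentrantLock;

// Software stand-in for a single global lock shared by all cores (illustrative only).
final class GlobalHeapLock {
    // One lock for the whole heap: every critical section contends for it,
    // regardless of which object it protects.
    private static final ReentrantLock GLOBAL = new ReentrantLock();

    /** Runs the critical section while holding the global lock; 'target' is deliberately ignored. */
    static void synchronizedOn(Object target, Runnable criticalSection) {
        GLOBAL.lock();
        try {
            criticalSection.run();
        } finally {
            GLOBAL.unlock();
        }
    }

    /** Demo: several threads (standing in for cores) increment one shared counter. */
    static int runCounters(int cores, int incrementsPerCore) {
        final int[] counter = {0};
        Thread[] workers = new Thread[cores];
        for (int i = 0; i < cores; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < incrementsPerCore; n++) {
                    synchronizedOn(counter, () -> counter[0]++);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(runCounters(3, 10_000)); // prints 30000
    }
}
```

Because the lock ignores which object is protected, two cores working on unrelated objects still serialize, which is the cost that multiple locks would aim to remove.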


Proposed Architecture


FPGA Implementation

• Up to 3 JOPs

• Memory arbiter

• SoC bus (SimpCon)

• External shared memory

• Development board:
  – Altera Cyclone EP1C12
  – 1 MByte SRAM


Simple SoC Interconnect (SimpCon)

• Synchronous SoC bus

• Point-to-point communication

• Master-Slave interconnection

• Signals only valid for 1 cycle
  – Master can continue execution

• Signal rdy_cnt:
  – Informs master of available data
  – Fast data transfer due to pipelining
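A minimal behavioural sketch of how a master might use rdy_cnt, written as a cycle-stepped Java simulation rather than HDL. It assumes rdy_cnt is a small saturating counter giving the number of cycles until read data is valid, with 0 meaning the data is available; the class name `SimpConSlaveModel`, the saturation value of 3, the latency, and the dummy memory content are illustrative assumptions not stated on the slides.

```java
// Cycle-stepped model of a SimpCon-style slave (illustrative, not the real VHDL).
final class SimpConSlaveModel {
    static final int RDY_CNT_MAX = 3;   // assumed saturation value of rdy_cnt

    private final int latency;          // cycles from request to valid data
    private int remaining = 0;          // cycles left for the pending read
    private int pendingData;
    private int readData;

    SimpConSlaveModel(int latency) {
        this.latency = latency;
    }

    /** Read request: the address is sampled in this single cycle only. */
    void read(int address) {
        pendingData = address * 2;      // dummy memory content for the demo
        remaining = latency;
        if (remaining == 0) {
            readData = pendingData;     // zero-latency slave answers immediately
        }
    }

    /** Advances the model by one clock cycle. */
    void tick() {
        if (remaining > 0 && --remaining == 0) {
            readData = pendingData;
        }
    }

    /** Cycles until data is valid, saturated at RDY_CNT_MAX. */
    int rdyCnt() {
        return Math.min(remaining, RDY_CNT_MAX);
    }

    int readData() {
        return readData;
    }

    /** Master side: issue a read, then wait until rdy_cnt reaches 0. Returns cycles waited. */
    static int readAndWait(SimpConSlaveModel slave, int address) {
        slave.read(address);
        int cycles = 0;
        while (slave.rdyCnt() != 0) {
            slave.tick();
            cycles++;
        }
        return cycles;
    }

    public static void main(String[] args) {
        SimpConSlaveModel sram = new SimpConSlaveModel(2);
        int waited = readAndWait(sram, 21);
        System.out.println("waited " + waited + " cycles, data = " + sram.readData());
    }
}
```

Since the counter announces in advance when data will arrive, a master does not have to poll a plain ready flag and could, in principle, prepare its next request during the remaining cycles.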


Memory Arbiter I

• Resolves conflicts of competing memory requests

• SimpCon interface:
  – Masters with arbiter
  – Arbiter with slave

• Scalable for variable # of CPUs

[Figure: each CPU (SimpCon master) connects to the arbiter (SimpCon slave side); the arbiter (SimpCon master side) connects to the shared memory (SimpCon slave)]


Memory Arbiter II

• Fixed priority arbitration scheme

• Priority established by unique CPU ID
  – Lowest ID is top priority

• Zero-cycle arbitration:
  – Arbitration process happens in same cycle
  – No bus request phase (unlike AMBA)
  – Increases memory bandwidth
  – Will it scale? Reduces fmax
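The fixed-priority rule can be sketched as a pure function of the request lines in one cycle. This is a behavioural Java model under the assumption that each CPU raises one request flag per cycle; the class name `FixedPriorityArbiter` and the `NONE` constant are hypothetical, and the actual arbiter is hardware logic, not Java.

```java
// Behavioural model of fixed-priority arbitration: lowest CPU ID wins (illustrative only).
final class FixedPriorityArbiter {
    static final int NONE = -1;   // no CPU requested memory this cycle

    /**
     * Returns the ID of the CPU granted memory access in this cycle.
     * requests[i] is true when the CPU with ID i has a pending request.
     */
    static int grant(boolean[] requests) {
        for (int cpuId = 0; cpuId < requests.length; cpuId++) {
            if (requests[cpuId]) {
                return cpuId;     // first (lowest) requesting ID has top priority
            }
        }
        return NONE;
    }

    public static void main(String[] args) {
        System.out.println(grant(new boolean[] {false, true, true}));   // prints 1
        System.out.println(grant(new boolean[] {true, true, true}));    // prints 0
        System.out.println(grant(new boolean[] {false, false, false})); // prints -1
    }
}
```

In this model a CPU with a high ID is served only when no lower ID requests in the same cycle, so its waiting time depends on the behaviour of the other CPUs.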


Experiments

• Performance measurements on real hardware

• Benchmark JavaBenchEmbedded

• Real-world application tasks:
  – Lift (elevation controller in automation factory)
  – Kfl (node of distributed motor control system)

• One task per CPU

• Performance measured in iterations/s


Benchmark Results I

• Comparison of dual JOP against single JOP
  – Same frequency (80 MHz)

• Single JOP result:
  – Lift 13138 iterations/s

• Dual JOP result:

Processor JOP0 JOP1

Lift 12951 12951

$$\text{Speedup}_{\text{dual JOP}} = \frac{12951 + 12951}{13138} = 1.97$$


Benchmark Results II

• Comparison of triple JOP against single JOP
  – Maximum frequencies

• Single JOP result at 100 MHz:
  – Lift 16425 iterations/s

• Triple JOP result at 75 MHz:

Processor JOP0 JOP1 JOP2

Lift 11736 11538 11260

$$\text{Speedup}_{\text{triple JOP}} = \frac{11736 + 11538 + 11260}{16425} = 2.10$$
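The speedup figures follow from dividing the summed iterations/s of all cores by the single-JOP result. The sketch below recomputes them from the measured values on the slides; the class and method names are illustrative, and the third line is a derived comparison (dual JOP at 80 MHz against single JOP at 100 MHz) built from the same measurements.

```java
import java.util.Locale;

// Recomputes the Lift benchmark speedups from the measured iterations/s.
final class SpeedupCalc {
    /** Speedup = sum of per-core iterations/s divided by the single-core iterations/s. */
    static double speedup(int singleCoreIterations, int... perCoreIterations) {
        long total = 0;
        for (int iterations : perCoreIterations) {
            total += iterations;
        }
        return (double) total / singleCoreIterations;
    }

    static String format(double value) {
        return String.format(Locale.ROOT, "%.2f", value);
    }

    public static void main(String[] args) {
        // Dual JOP vs. single JOP, both at 80 MHz
        System.out.println(format(speedup(13138, 12951, 12951)));        // prints 1.97
        // Triple JOP at 75 MHz vs. single JOP at 100 MHz
        System.out.println(format(speedup(16425, 11736, 11538, 11260))); // prints 2.10
        // Dual JOP at 80 MHz vs. single JOP at 100 MHz
        System.out.println(format(speedup(16425, 12951, 12951)));        // prints 1.58
    }
}
```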


Speedup

[Figure: Speedup vs. number of JOPs. x-axis: JOP (number), 1 to 3; y-axis: Speedup, 0.0 to 2.5]


Resource Consumption

• Cyclone EP1C12Q240 by Altera (12060 LE, 29.25 KB memory)

Processor     Resources (LE)   Memory (KB)   fmax (MHz)
JOP           2815             7.63          100
Dual JOP      5540             15.62         80
Triple JOP    8219             23.42         75
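To relate these numbers to the device capacity, the share of logic elements and on-chip memory used by each configuration can be computed directly. This is a small sketch using only the figures above (12060 LE and 29.25 KB available); the class and method names are illustrative.

```java
import java.util.Locale;

// Device utilisation of each JOP configuration on the Cyclone EP1C12.
final class ResourceUsage {
    static final int DEVICE_LE = 12060;         // logic elements available
    static final double DEVICE_MEM_KB = 29.25;  // on-chip memory available

    /** Percentage of an available resource that is used. */
    static double percent(double used, double available) {
        return 100.0 * used / available;
    }

    public static void main(String[] args) {
        String[] names = {"JOP", "Dual JOP", "Triple JOP"};
        int[] logicElements = {2815, 5540, 8219};
        double[] memoryKb = {7.63, 15.62, 23.42};
        for (int i = 0; i < names.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%-10s  LE: %5.1f %%  memory: %5.1f %%",
                    names[i],
                    percent(logicElements[i], DEVICE_LE),
                    percent(memoryKb[i], DEVICE_MEM_KB)));
        }
    }
}
```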


Maximum Frequency

[Figure: Max. frequency vs. number of JOPs. x-axis: JOP (number), 1 to 3; y-axis: fmax (MHz), 0 to 100]


Conclusion

• Proposed Java CMP with shared memory

• Verification of CMP architecture
  – Dual JOP & Triple JOP prototypes running in real hardware

• Performance measurements:
  – Dual JOP 1.58 times better perf. @ fmax
  – Triple JOP 2.1 times better perf. @ fmax


Future Work

• Synchronization: multiple locks

• Improvement of memory arbiter:
  – Different arbitration schemes for time predictability
  – Zero-cycle latency?

• Experiments with more cores on FPGA

• RT-Scheduling for CMP

Thank You!

Questions & Comments
