processors used in system on chip

PROCESSORS

Mr. A. B. Shinde

Assistant Professor,

Electronics Engineering,

PVPIT, Budhgaon.

[email protected]


CISC

CISC stands for Complex Instruction Set Computer

CISC is an instruction set architecture (ISA) in which each instruction

can execute several low-level operations, such as a load from memory,

an arithmetic operation, and a memory store, all in a single instruction.

CISC chips are easy to program and make efficient use of

memory.

Examples of CISC processor families are

System/360,

PDP-11, VAX,

68000, and

x86.

CISC History

The first PC microprocessors developed were CISC chips, because

all the instructions the processor could execute were built into the

chip.

Memory was expensive in the early days of PCs, and CISC chips

saved memory because their programming could be fed directly into

the processor.

CISC was developed to make compiler development simpler. It shifts

most of the burden of generating machine instructions to the

processor.

For example, instead of having to make a compiler write long

machine instructions to calculate a square-root, a CISC processor

would have a built-in ability to do this.


CISC Philosophy

Three decisions led to the CISC philosophy, which drove all

computer designs until the late 1980s and is still in major use today:

use microcode,

build rich instruction sets, and

build high-level instruction sets.



Use microcode:

Microcode uses simple logic to control the data paths between the various elements

of the processor.

In a microprogrammed system, the main processor has some built-in

memory (typically ROM) that contains groups of microcode

instructions which correspond with each machine-language

instruction.

Since the microcode memory can be much faster than main

memory, an instruction set can be implemented in microcode

without losing much speed over a purely hard-wired

implementation.



Build rich instruction sets:

By using a microprogrammed design, designers could build more

functionality into each instruction.

This design cut down on the total number of instructions required to

implement a program, so it made more efficient use of a slow main

memory.

This also made the assembly-language programmer's job simpler.

The enhancements included string manipulation operations, special

looping constructs, and special addressing modes for indexing

through tables in memory.



Build high-level instruction sets:

After the programmer-friendly instruction sets were built, designers

started to build instruction sets which map directly from high-level

languages.

Because microprogrammed instruction sets can be written to match the

constructs of high-level languages, the compiler does not have to

be as complicated.

This allows compilers to emit fewer instructions per line of source code.


CISC Characteristics

CISC processors mostly use the von Neumann architecture

(there are a few exceptions).

The same bus is used for program memory, data memory, I/O, registers, etc.

They are generally microcoded, with variable-length instructions.

Segmentation is possible with segment registers such as DS and ES and an

offset which can be common to all segments.

Many powerful instructions are supported, making the assembly

language programmer’s job much easier.

Physical Memory Extension Possible


Characteristics of CISC Design

Instruction sets: CISC instruction sets have some common

characteristics:

A 2-operand format, where instructions have a source and a destination.

Register to register, register to memory, and memory to register

commands.

Multiple addressing modes for memory, including specialized modes for

indexing through arrays

Variable length instructions where the length often varies according to the

addressing mode

Instructions which require multiple clock cycles to execute.


Characteristics of CISC Design

Hardware architectures: CISC hardware architectures have several

characteristics in common:

Complex instruction-decoding logic, driven by the need for a single

instruction to support multiple addressing modes.

A small number of general purpose registers. This is the direct result of

having instructions which can operate directly on memory and the limited

amount of chip space not dedicated to instruction decoding, execution, and

microcode storage.

Several special purpose registers. Many CISC designs set aside special

registers for the stack pointer, interrupt handling, and so on. This can

simplify the hardware design.

A "Condition code" register which is set as a side-effect of most

instructions.


Characteristics of CISC Design

CISC and the Classic Performance Equation

The equation for determining performance is

execution time = (number of instructions * number of cycles per instruction * instruction cycle time).

This allows you to speed up a processor in 3 different ways:

- use fewer instructions for a given task,

- reduce the number of cycles for some instructions, or

- speed up the clock (decrease the cycle time.)

CISC tries to reduce the number of instructions for a program
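
The performance equation can be tried out with a few lines of Python. This is only an illustrative sketch: the function name and all of the instruction counts, cycle counts, and cycle times below are made-up assumptions, not measurements of any real processor.

```python
def execution_time(instruction_count, cycles_per_instruction, cycle_time_s):
    """Classic performance equation: instructions * CPI * cycle time."""
    return instruction_count * cycles_per_instruction * cycle_time_s


# Hypothetical workloads, chosen only to show the trade-off.
# CISC-style: fewer instructions, but several cycles each.
cisc_time = execution_time(1_000, 4.0, 10e-9)
# RISC-style: more instructions, but about one cycle each.
risc_time = execution_time(3_000, 1.0, 10e-9)

print(f"CISC-style: {cisc_time * 1e6:.1f} microseconds")
print(f"RISC-style: {risc_time * 1e6:.1f} microseconds")
```

Changing any one of the three factors while holding the others fixed corresponds to one of the three ways of speeding up a processor listed above.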


The Advantages of CISC

Microprogramming is as easy as assembly language to implement,

and much less expensive than hardwiring a control unit.

The ease of micro-coding new instructions allowed designers to make

CISC machines upwardly compatible: a new computer could run the

same programs as earlier computers because the new computer

would contain a superset of the instructions of the earlier computers.

As each instruction became more capable, fewer instructions could

be used to implement a given task. This made more efficient use of

the relatively slow main memory.

Because micro-program instruction sets can be written to match the

constructs of high-level languages, the compiler does not have to be

as complicated.


The Disadvantages of CISC

So that as many instructions as possible could be stored in memory with the

least possible wasted space, individual instructions could be of almost

any length. This means that different instructions take different

amounts of clock time to execute, slowing down the overall

performance of the machine.

Many specialized instructions aren't used frequently enough to justify

their existence: only approximately 20% of the available instructions are

used in a typical program.

CISC instructions typically set the condition codes as a side effect of

the instruction. Setting the condition codes takes time, and

programmers have to remember to examine the condition code bits

before a subsequent instruction changes them.


Intel 8086 architecture, the first member of the x86 family


Addressing Modes

Register Addressing Mode

Memory Addressing Modes

Displacement Only Addressing Mode

Register Indirect Addressing Modes

Indexed Addressing Modes

Based Indexed Addressing Modes

Based Indexed Plus Displacement Addressing
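
As a rough sketch of how the 8086 memory addressing modes listed above form an address, the following Python combines a base register, an index register, and a displacement into a 16-bit offset, and then forms a 20-bit physical address from a segment register. The function names and the register contents are hypothetical values chosen for illustration.

```python
def effective_address(base=0, index=0, displacement=0):
    """16-bit offset: base + index + displacement, wrapping at 64 KB."""
    return (base + index + displacement) & 0xFFFF


def physical_address(segment, offset):
    """20-bit physical address: segment shifted left by 4 bits, plus offset."""
    return ((segment << 4) + offset) & 0xFFFFF


# Hypothetical register contents, for illustration only.
regs = {"BX": 0x1000, "SI": 0x0020, "DS": 0x2000}

modes = {
    "displacement only": effective_address(displacement=0x0100),
    "register indirect": effective_address(base=regs["BX"]),
    "indexed": effective_address(index=regs["SI"], displacement=0x0100),
    "based indexed": effective_address(base=regs["BX"], index=regs["SI"]),
    "based indexed plus displacement": effective_address(
        regs["BX"], regs["SI"], 0x0100
    ),
}

for name, ea in modes.items():
    pa = physical_address(regs["DS"], ea)
    print(f"{name}: offset={ea:#06x} physical={pa:#07x}")
```

Register addressing mode is not shown because the operand is in a register and no memory address is formed.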

RISC

RISC Stands for Reduced Instruction Set Computer

RISC is a type of microprocessor architecture that utilizes a small, highly-optimized set of instructions, rather than a more specialized set of instructions found in other types of architectures.

RISC represents a CPU design to make instructions execute very quickly.

Well known RISC families include Alpha,

ARC,

ARM,

AVR,

MIPS,

PA-RISC,

Power Architecture (including PowerPC),

SuperH and

SPARC.

CHARACTERISTICS OF RISC

A RISC chip typically has far fewer transistors dedicated to the core

logic, which originally allowed designers to increase the size of the

register set and increase internal parallelism.

Other features, which are typically found in RISC architectures

are:

Uniform instruction format

Using a single word with the opcode in the same bit positions in every

instruction, demanding less decoding;

Identical general purpose registers

Any register can be used in any context, simplifying compiler design

(there are separate floating point registers)

Simple addressing modes.

Complex addressing performed via sequences of arithmetic and/or load-store

operations;

Few data types in hardware (by contrast, some CISCs have byte-string instructions).

RISC

RISC designs are also more likely to feature a Harvard memory

model, where the instruction stream and the data stream are

conceptually separated;

this means that modifying the memory where code is held might

not have any effect on the instructions executed by the

processor.

On the upside, this allows both caches to be accessed simultaneously,

which can often improve performance.

Many early RISC designs also shared the characteristic of having

a branch delay slot. A branch delay slot is an instruction space

immediately following a jump or branch.

RISC

Key features

Large number of general purpose registers or use of compiler

technology to optimize register use

Limited and simple instruction set

Emphasis on optimizing the instruction pipeline

RISC

History

The first RISC projects came from IBM, Stanford, and UC Berkeley in the late

1970s and early 1980s.

The IBM 801, Stanford MIPS, and Berkeley RISC I and RISC II were all designed with

a similar philosophy which has become known as RISC.

Certain design features have been characteristic of most RISC processors:

One-cycle execution time

Pipelining

A large number of registers

CISC Vs RISC

CISC | RISC

Emphasis on hardware | Emphasis on software

Includes multi-clock complex instructions | Single-clock, reduced instructions only

Memory-to-memory: "LOAD" and "STORE" incorporated in instructions | Register-to-register: "LOAD" and "STORE" are independent instructions

Small code sizes | Large code sizes

Transistors used for storing complex instructions | Spends more transistors on memory registers

High cycles per second | Low cycles per second

Variable-length instructions | Equal-length instructions, which make pipelining possible

Primary goal is to complete a task in as few lines of assembly as possible | Primary goal is to speed up individual instructions

CISC Vs RISC

The CISC Approach

Instruction:

MULT 2:3, 5:2

Operations:

1. Loads the two operands into separate

registers

2. Multiplies the operands in the execution unit

3. Then stores the product in a

temporary register

4. Stores value back to memory location 2:3

The RISC Approach

Instructions :

LW A, 2:3

LW B, 5:2

MULT A, B

SW 2:3, A

Operations:

1. Load operand1 into register A

2. Load operand2 into register B

3. Multiply the operands in the execution unit

and store result in A

4. Store value of A back to memory location

2:3
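
The two approaches can be mimicked in a few lines of Python. This is a toy model only: memory is a dictionary keyed by the location names 2:3 and 5:2 used above, the function names are invented for illustration, and the operand values are arbitrary.

```python
def run_cisc(memory):
    """One complex instruction: MULT 2:3, 5:2 (load, multiply, store)."""
    memory["2:3"] = memory["2:3"] * memory["5:2"]
    return 1  # number of instructions executed


def run_risc(memory):
    """Four simple instructions operating through registers A and B."""
    regs = {}
    regs["A"] = memory["2:3"]          # LW A, 2:3
    regs["B"] = memory["5:2"]          # LW B, 5:2
    regs["A"] = regs["A"] * regs["B"]  # MULT A, B
    memory["2:3"] = regs["A"]          # SW 2:3, A
    return 4  # number of instructions executed


# Arbitrary example operands.
cisc_memory = {"2:3": 6, "5:2": 7}
risc_memory = dict(cisc_memory)

cisc_count = run_cisc(cisc_memory)
risc_count = run_risc(risc_memory)

print(cisc_count, cisc_memory)
print(risc_count, risc_memory)
```

Both versions leave the same product in location 2:3; they differ only in how many instructions the program contains and how much work each instruction does.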

CISC Vs RISC

VON NEUMANN

ARCHITECTURE

The von Neumann architecture is a design

model for a stored-program digital

computer that uses a processing unit and a

single separate storage structure to hold both

instructions and data.

It is named after the mathematician and early

computer scientist John Von Neumann.

VON NEUMANN BOTTLENECK

The separation between the CPU and memory leads to the von Neumann bottleneck, the limited throughput (data transfer rate) between the CPU and memory compared to the amount of memory.

In most modern computers, throughput is much smaller than the rate at which the CPU can work.

The performance problem can be alleviated (to some extent) by several mechanisms, such as providing a cache between the CPU and the main memory, or providing separate caches with separate access paths for data and instructions.

The problem can also be sidestepped somewhat by using parallel computing, for example with the NUMA architecture; this approach is commonly employed by supercomputers.

HARVARD ARCHITECTURE

The Harvard architecture is a computer architecture with physically

separate storage and signal pathways for instructions and data.

The term originated from the Harvard Mark I relay-based computer, which

stored instructions on punched tape (24 bits wide) and data in

electro-mechanical counters. These early machines had limited data storage,

entirely contained within the central processing unit, and provided no

access to the instruction storage as data.

Today, most processors implement such separate signal pathways for

performance reasons but actually implement a Modified Harvard

architecture, so they can support tasks like loading a program from

disk storage as data and then executing it

HARVARD ARCHITECTURE MEMORY

DETAILS

In a Harvard architecture, there is no need to make the two memories

share characteristics.

In particular, the word width, timing, implementation technology, and

memory address structure can differ.

In some systems, instructions can be stored in read-only memory while

data memory generally requires read-write memory.

In some systems, there is much more instruction memory than data

memory so instruction addresses are wider than data addresses.

CONTRAST WITH VON NEUMANN

ARCHITECTURES

In a computer with the contrasting von Neumann architecture, the CPU can be either reading an instruction or reading/writing data from/to the memory.

Both cannot occur at the same time since the instructions and data use the same bus system.

In a computer using the Harvard architecture, the CPU can both read an instruction and perform a data memory access at the same time, even without a cache.

A Harvard architecture computer can thus be faster for a given circuit complexity because instruction fetches and data access do not contend for a single memory pathway.

Also, a Harvard architecture machine has distinct code and data address spaces: instruction address zero is not the same as data address zero. Instruction address zero might identify a twenty-four bit value, while data address zero might indicate an eight bit byte that isn't part of that twenty-four bit value.
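
The contrast can be sketched with a toy Python model. It assumes, purely for illustration, that every memory access takes one cycle, that a shared bus serializes all accesses, and that with separate pathways the data access of one instruction overlaps with the fetch of the next. The memory contents and function names are hypothetical.

```python
# Separate address spaces: address 0 means something different in each.
instruction_memory = [0xABCDEF, 0x123456]  # 24-bit instruction words
data_memory = bytearray([0x7F, 0x00])      # 8-bit data bytes


def von_neumann_cycles(program):
    """Shared bus: each fetch and each data access needs its own cycle."""
    return sum(1 + (1 if needs_data else 0) for needs_data in program)


def harvard_cycles(program):
    """Separate pathways: a data access overlaps with the next fetch."""
    if not program:
        return 0
    # One cycle per fetch, plus a final cycle if the last
    # instruction still has a data access outstanding.
    return len(program) + (1 if program[-1] else 0)


# Each entry says whether that instruction performs a data memory access.
program = [True, False, True]

print("instruction address 0:", hex(instruction_memory[0]))
print("data address 0:", hex(data_memory[0]))
print("von Neumann cycles:", von_neumann_cycles(program))
print("Harvard cycles:", harvard_cycles(program))
```

Real processors add caches, pipelines, and wait states, so the cycle counts here only show the direction of the effect, not its size.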

Soft processors


A soft processor is an Intellectual Property (IP) core that is

implemented using the logic primitives of the FPGA. Being soft allows it

to have a high degree of flexibility and configurability.

A soft processor is a microprocessor core that can be wholly implemented

using logic synthesis.

It can be implemented via different semiconductor devices containing

programmable logic (e.g., ASIC, FPGA, CPLD).

Key benefits of using a soft processor include configurability to trade

between price and performance, faster time to market, easy integration

with the FPGA fabric, and avoiding obsolescence.

Soft processors

Most systems, if they use a soft processor at all, only use a single soft

processor. However, a few designers tile as many soft cores onto an

FPGA as will fit

While many people put exactly one soft microprocessor on an FPGA, a

sufficiently large FPGA can hold two or more soft microprocessors,

resulting in a multi-core processor. The number of soft processors on a

single FPGA is only limited by the size of the FPGA.

Some people have put dozens or hundreds of soft microprocessors on a

single FPGA

Soft processors

What are the key benefits of having a soft FPGA-based processing

system?

An FPGA-based soft processing system provides many key benefits.

IBM’s PowerPC

PowerPC is an acronym for Performance Optimization With Enhanced

RISC – Performance Computing.

PowerPC is sometimes abbreviated as PPC.

PPC405Fx Embedded Processor

The IBM 405Fx 32-bit reduced instruction set computer (RISC)

processor core, referred to as the PPC405Fx core, implements the

PowerPC Architecture with extensions for embedded applications.

PPC405Fx Features

The PPC405Fx core provides high performance and low power

consumption.

The PPC405Fx RISC CPU executes at sustained speeds

approaching one cycle per instruction.

On-chip instruction and data cache arrays can be implemented to

reduce chip count and design complexity in systems and improve

system throughput.


The PowerPC RISC fixed-point CPU features:

PowerPC User Instruction Set Architecture (UISA) and extensions for embedded applications

Thirty-two 32-bit general purpose registers (GPRs)

Five-stage pipeline with single-cycle execution of most instructions, including loads/stores

Unaligned load/store support to cache arrays, main memory, and on-chip memory (OCM)

Hardware multiply/divide for faster integer arithmetic (4-cycle multiply, 35-cycle divide)

Multiply-accumulate instructions

Enhanced string and multiple-word handling

True little endian operation

Parity detection and reporting for the instruction cache, data cache, and translation lookaside buffer (TLB)

Programmable Interval Timer (PIT), Fixed Interval Timer (FIT), and watchdog timer

Storage control:

Separate, configurable, two-way set-associative instruction and data cache

units; the instruction cache array is 16KB and

the data cache array is 16KB

Eight words (32 bytes) per cache line

Support for any combination of 0KB, 4KB, 8KB, 16KB, and 32KB instruction and data cache arrays, depending on model

Read and write line buffers

Instruction fetch hits are supplied from line buffer

Data load/store hits are supplied to line buffer

Programmable ICU prefetching of next sequential line into line buffer

Programmable ICU prefetching of non-cacheable instructions, full line (eight words) or half line (four words)

Write-back or write-through DCU write strategies

Programmable allocation on loads and stores

Operand forwarding during cache line fills
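
The cache parameters above (16KB array, two-way set-associative, 32-byte lines) determine how a 32-bit address divides into tag, set index, and byte offset. The Python sketch below works this out under the assumption of power-of-two sizes and a conventional tag/index/offset split; the function names are hypothetical, and the PowerPC bit-numbering convention is not modeled.

```python
def cache_geometry(size_bytes, ways, line_bytes, address_bits=32):
    """Return (sets, offset_bits, index_bits, tag_bits) for a cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    tag_bits = address_bits - index_bits - offset_bits
    return sets, offset_bits, index_bits, tag_bits


def split_address(address, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


sets, offset_bits, index_bits, tag_bits = cache_geometry(16 * 1024, 2, 32)
print(sets, offset_bits, index_bits, tag_bits)

# Arbitrary example address.
tag, index, offset = split_address(0x12345678, offset_bits, index_bits)
print(hex(tag), hex(index), hex(offset))
```

With two ways, each set holds two candidate lines, so a lookup compares the tag against both lines in the selected set.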



Memory Management

Translation of the 4GB logical address space into physical addresses

Independent enabling of instruction and data translation/protection

Page level access control using the translation mechanism

Software control of page replacement strategy

WIU0GE (write-through, cachability, compressed user-defined 0, guarded, endian) storage attribute control for each virtual memory region

WIU0GE storage attribute control for thirty-two real 128MB regions in real mode

Support for OCM that provides memory access performance identical to cache hits

Full PowerPC floating-point unit (FPU) support using the auxiliary processor unit (APU) interface

(the PPC405Fx does not include an FPU)



PowerPC timer facilities

64-bit time base

PIT, FIT, and watchdog timers

Synchronous external time base clock input

Debug Support

Enhanced debug support with logical operators

Four instruction address compares (IACs)

Two data address compares (DACs)

Two data value compares (DVCs)

JTAG instruction to write to ICU

Forward or backward instruction tracing

Minimized interrupt latency

Advanced power management support

PowerPC Architecture

The PowerPC Architecture comprises three levels of standards:

PowerPC User Instruction Set Architecture (UISA), including the base user-level

instruction set, user level registers, programming model, data types, and

addressing modes.

PowerPC Virtual Environment Architecture, describing the memory model, cache

model, cache-control instructions, address aliasing, and related issues. While

accessible from the user level, these features are intended to be accessed from

within library routines provided by the system software.

PowerPC Operating Environment Architecture, including the memory

management model, supervisor level registers, and the exception model. These

features are not accessible from the user level.

Processor Core Organization


The processor core consists of a 5-stage pipeline, separate instruction

and data cache units, virtual memory management unit (MMU), three

timers, debug, and interfaces to other functions.

Instruction and Data Cache Controllers

The instruction cache unit (ICU) and data cache unit (DCU) enable

concurrent accesses and minimize pipeline stalls.

The storage capacity of the cache units, which can range from 0KB to 32KB,

depends upon the implementation. Both cache units are two-way

set-associative and use a 32-byte line size.

The instruction set provides a rich assortment of cache control instructions,

including instructions to read tag information and data arrays.



Instruction Cache Unit

The ICU provides one or two instructions per cycle to the execution

unit (EXU) over a 64-bit bus. A line buffer enables the ICU to be

accessed only once for every four instructions, to reduce power

consumption by the array.

The ICU can forward any or all of the words of a line fill to the EXU to

minimize pipeline stalls caused by cache misses.

The ICU aborts speculative fetches abandoned by the EXU,

eliminating unnecessary line fills and enabling the ICU to handle the

next EXU fetch.

Aborting abandoned requests also eliminates unnecessary external

bus activity to increase external bus utilization.



Data Cache Unit

The DCU transfers 1, 2, 3, 4, or 8 bytes per cycle, depending on the

number of byte enables presented by the CPU.

The DCU contains a single-element command and store data queue

to reduce pipeline stalls; this queue enables the DCU to

independently process load/store and cache control instructions.

Dynamic PLB request prioritization reduces pipeline stalls even

further. When the DCU is busy with a low-priority request while a

subsequent storage operation requested by the CPU is stalled, the

DCU automatically increases the priority of the current request to the

PLB.


The DCU uses a two-line flush queue to minimize pipeline stalls caused by

cache misses. Line flushes are postponed until after a line fill is completed.

Registers comprise the first position of the flush queue; the line buffer built

into the output of the array for manufacturing test serves as the second

position of the flush queue.

Single queued flushes are non-blocking. When a flush operation is pending,

the DCU can continue to access the array to determine subsequent load or

store hits.

Requests abandoned by the CPU can also be aborted by the cache

controller.

Additional DCU features enable the programmer to tailor performance for a

given application. The DCU can function in write-back or write-through mode,

as controlled by the Data Cache Write-through Register (DCWR) or the

translation look-aside buffer (TLB).



Memory Management Unit

The 4GB address space of the PPC405Fx is presented as a flat address space.

The MMU provides address translation, protection functions, and storage attribute

control for embedded applications.

Working with appropriate system level software, the MMU provides the following

functions:

Translation of the 4GB logical address space into physical addresses

Independent enabling of instruction and data translation/protection

Page level access control using the translation mechanism

Software control of page replacement strategy

Additional control over protection using zones

Storage attributes for cache policy and speculative memory access control

The MMU can be disabled under software control. If the MMU is not used, the

PPC405Fx core provides other storage control mechanisms.



Timer Facilities

The processor core contains a time base and three timers:

Programmable Interval Timer (PIT)

Fixed Interval Timer (FIT)

Watchdog timer

The time base is a 64-bit counter incremented either by an internal signal equal

to the CPU clock rate or by a separate external timer clock signal.

The PIT is a 32-bit register that is decremented at the same rate as the time

base is incremented. The user loads the PIT register with a value to create the

desired delay. When a decrement occurs on a PIT count of 1, the timer stops

decrementing, a bit is set in the Timer Status Register (TSR), and a PIT interrupt

is generated. Optionally, the PIT can be programmed to automatically reload the

last value written to the PIT register, after which the PIT begins decrementing

again.
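
The PIT behaviour described above can be sketched as a small Python model. This is an illustration under stated assumptions rather than a description of the hardware: the class and attribute names are invented, one call to tick() stands for one time base increment, and a single flag stands in for the PIT bit of the Timer Status Register.

```python
class ProgrammableIntervalTimer:
    """Toy model of a 32-bit down-counter with optional auto-reload."""

    def __init__(self, auto_reload=False):
        self.count = 0
        self.reload_value = 0
        self.auto_reload = auto_reload
        self.status_bit = False  # stands in for the PIT bit in the TSR
        self.interrupts = 0      # number of PIT interrupts generated

    def load(self, value):
        """Write the PIT register; the value is also kept for reloading."""
        self.count = value & 0xFFFFFFFF
        self.reload_value = self.count

    def tick(self):
        """Advance by one time base increment."""
        if self.count == 0:
            return  # timer is stopped
        if self.count == 1:
            # A decrement on a count of 1 raises the interrupt.
            self.status_bit = True
            self.interrupts += 1
            self.count = self.reload_value if self.auto_reload else 0
        else:
            self.count -= 1


one_shot = ProgrammableIntervalTimer()
one_shot.load(3)
periodic = ProgrammableIntervalTimer(auto_reload=True)
periodic.load(3)

for _ in range(9):
    one_shot.tick()
    periodic.tick()

print(one_shot.interrupts, one_shot.count)
print(periodic.interrupts, periodic.count)
```

In one-shot mode the timer fires once and stops; with auto-reload it fires once every reload_value ticks, which is how a periodic tick interrupt would be produced.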

PowerPC 7xx


The PowerPC 7xx is a family of third generation 32-bit PowerPC

microprocessors designed and manufactured by IBM and Motorola.

The 7xx family is also widely used in embedded devices like printers,

routers, storage devices, spacecraft and video game consoles.

The 7xx family had its shortcomings, namely lack of SMP (Symmetric

multiprocessing) support and SIMD capabilities and a relatively weak

FPU (Floating-point unit).

IBM 750CL RISC Microprocessor

The IBM 750CL PowerPC® RISC microprocessor is an implementation

of the PowerPC Architecture with enhancements to improve the

floating-point performance and the data transfer capability.

IBM 750CL RISC Microprocessor Overview

The 750CL implements the 32-bit portion of the PowerPC Architecture, which provides 32-bit effective addresses, integer data types of 8, 16, and 32 bits, and floating-point data types of single and double precision.

The 750CL is a superscalar processor that can complete two instructions simultaneously. It incorporates the following six execution units:

Floating-point unit (FPU)

Branch processing unit (BPU)

System register unit (SRU)

Load/store unit (LSU)

Two integer units (IUs): IU1 executes all integer instructions. IU2 executes all integer instructions except multiply and divide instructions.

750CL Microprocessor Features

High-performance, superscalar microprocessor.

Six independent execution units and two register files.

Rename buffers.

Completion unit.

Separate on-chip L1 instruction and data caches (Harvard architecture).

On-chip 1:1 L2 cache.

DMA engine.

Write gather pipe.

ECC error correction for most single-bit errors, detection of double-bit errors.

Separate memory management units (MMUs) for instructions and data.

Bus interface features include the following:

Multiprocessing support features

Power and thermal management

Performance monitor can be used to help debug system designs

In-system testability and debugging features through JTAG boundary-scan capability.

PowerPC Instruction Set

Integer instructions: these include computational and logical instructions.

Floating-point instructions: these include floating-point computational instructions, as well as instructions that affect the FPSCR.

Load/store instructions: these include integer and floating-point load and store instructions.

Flow control instructions: these include branching instructions, condition register logical instructions, trap instructions, and other instructions that affect the instruction flow.

Processor control instructions: these instructions are used for synchronizing memory accesses and management of caches, TLBs, and the segment registers.

Memory control instructions: these provide control of caches, TLBs, and SRs.

IBM 750CL RISC: Block Diagram

Spartan-3 FPGA


Spartan-3 family of FPGA is specifically designed to meet the needs of

high volume, cost-sensitive consumer electronic applications.

The Spartan-3 family offers an increased amount of

logic resources,

capacity of internal RAM,

total number of I/Os, and

overall level of performance, as well as improved clock management functions.

Spartan-3 FPGA enhancements, combined with advanced process

technology, deliver more functionality and bandwidth than was previously

possible.

Spartan-3 FPGAs are ideally suited to a wide range of consumer

electronics applications, including broadband access, home networking,

display/projection and digital television equipment.

The Spartan-3 family is a superior alternative to mask programmed

ASICs.

Spartan-3 FPGA

Features

Low-cost, high-performance logic solution for high-volume, consumer-

oriented applications

SelectIO interface signaling

Up to 633 I/O pins

622+ Mb/s data transfer rate per I/O

DDR, DDR2 SDRAM support up to 333 Mb/s

Logic resources

logic cells with shift register capability

Wide, fast multiplexers

Dedicated 18 x 18 multipliers

JTAG logic compatible with IEEE 1149.1/1532

Spartan-3 FPGA

Features

SelectRAM hierarchical memory

Up to 1,872 Kbits of total block RAM

Up to 520 Kbits of total distributed RAM

Digital Clock Manager (up to four DCMs)

Clock skew elimination

Frequency synthesis

High resolution phase shifting

Eight global clock lines and abundant routing

Fully supported by Xilinx ISE and WebPACK Software development systems

MicroBlaze and PicoBlaze processor, PCI, PCI Express PIPE Endpoint, and

other IP cores.

Spartan-3 FPGA Attributes

CLB: Configurable Logic Block

DCM: Digital Clock Manager

I/O: Input Output

Spartan-3 Family Architecture


Architectural Overview

Configurable Logic Blocks (CLBs) contain flexible Look-Up Tables (LUTs) that

implement logic plus storage elements used as flip-flops or latches. CLBs perform a wide

variety of logical functions as well as store data.

Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the

internal logic of the device. IOBs support bidirectional data flow plus 3-state operation.

They support a variety of signal standards, including several high-performance differential

standards. Double Data-Rate (DDR) registers are included.

Block RAM provides data storage in the form of 18-Kbit dual-port blocks.

Multiplier Blocks accept two 18-bit binary numbers as inputs and calculate the product.

The Spartan-3A DSP platform includes special DSP multiply-accumulate blocks.

Digital Clock Manager (DCM) Blocks provide self-calibrating, fully digital solutions for

distributing, delaying, multiplying, dividing, and phase-shifting clock signals.

Digitally Controlled Impedance (DCI) feature provides automatic on-chip

terminations, simplifying board designs

Simplified IOB Diagram


IOB Overview

The Input/Output Block (IOB) provides a programmable, bidirectional interface

between an I/O pin and the FPGA’s internal logic.

There are three main signal paths within the IOB: the output path, input path,

and 3-state path. Each path has its own pair of storage elements that can act as

either registers or latches. The three main signal paths are as follows:

The input path carries data from the pad, which is bonded to a package pin, through an

optional programmable delay element directly to the I line. The IOB outputs I, IQ1, and IQ2

all lead to the FPGA’s internal logic.

The output path, starting with the O1 and O2 lines, carries data from the FPGA’s

internal logic through a multiplexer and then a three-state driver to the IOB pad.

The 3-state path determines when the output driver is high impedance. The T1 and T2

lines carry data from the FPGA’s internal logic through a multiplexer to the output

driver. The output driver is active-Low enabled.

All signal paths entering the IOB, including those associated with the storage elements,

have an inverter option.


Storage Element Functions

There are three pairs of storage elements in each IOB, one pair for each

of the three paths.

It is possible to configure each of these storage elements as an edge-

triggered D-type flip-flop (FD) or a level-sensitive latch (LD).

The storage-element-pair on either the Output path or the Three-State

path can be used together with a special multiplexer to produce Double-

Data-Rate (DDR) transmission. This is accomplished by taking data

synchronized to the clock signal’s rising edge and converting them to

bits synchronized on both the rising and the falling edge.

The combination of two registers and a multiplexer is referred to as a

Double-Data-Rate D-type flip-flop (FDDR).
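
The idea of DDR transmission can be illustrated with a short Python sketch. It models only the ordering of bits, not real clock timing: each clock cycle supplies one bit for the rising edge and one for the falling edge, two registers capture them, and a multiplexer alternates between the registers. The function and variable names are assumptions made for this example.

```python
def ddr_output(pairs):
    """Serialize (rising_bit, falling_bit) pairs into one bit per clock edge."""
    stream = []
    for rising_bit, falling_bit in pairs:
        register_1 = rising_bit    # register clocked on the rising edge
        stream.append(register_1)  # multiplexer selects register 1
        register_2 = falling_bit   # register clocked on the falling edge
        stream.append(register_2)  # multiplexer selects register 2
    return stream


# Three clock cycles of example data produce six output bits.
pairs = [(1, 0), (0, 1), (1, 1)]
stream = ddr_output(pairs)
print(stream)
```

The output carries two bits per clock cycle, which is why the same clock frequency yields twice the data rate of a single register.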

Arrangement of Slices within the

CLB


CLB

All slices have the following elements in common:

Two logic function generators,

Two storage elements,

Wide-function multiplexers,

Carry logic, and

Arithmetic gates.

The left-hand pair supports two additional functions:

Storing data using Distributed RAM and

Shifting data with 16-bit registers.

The RAM-based function generator, also known as a Look-Up Table or LUT, is the main resource for implementing logic functions.

The LUTs in each left-hand slice pair can be configured as Distributed RAM or a 16-bit shift register.

The function generators located in the upper and lower portions of the slice are referred to as "G" and "F", respectively.
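A rough software model may help show why a small RAM can implement arbitrary logic. The sketch below assumes a 4-input LUT, which is consistent with the 16-bit shift register mode mentioned above; the names `make_lut4` and `srl16_shift` are hypothetical.

```python
def make_lut4(truth_table):
    """Return a 4-input function generator backed by a 16-bit truth table.

    Bit i of truth_table is the output for the input pattern whose
    binary value (a4 a3 a2 a1) equals i.
    """
    if not 0 <= truth_table < (1 << 16):
        raise ValueError("a 4-input LUT holds exactly 16 bits")

    def lut(a1, a2, a3, a4):
        index = a1 | (a2 << 1) | (a3 << 2) | (a4 << 3)
        return (truth_table >> index) & 1

    return lut


def srl16_shift(contents, data_in):
    """Shift one bit into a 16-bit register, dropping the oldest bit."""
    return ((contents << 1) | (data_in & 1)) & 0xFFFF


and4 = make_lut4(0x8000)  # only index 15 (all inputs high) outputs 1
xor4 = make_lut4(0x6996)  # outputs 1 for an odd number of high inputs
```

Changing the 16-bit constant changes the logic function without changing the hardware, which is the essence of a look-up table.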


CLB

The storage elements in the upper and lower portions of the slice are

called FFY and FFX, respectively.

Wide-function multiplexers effectively combine LUTs in order to permit

more complex logic operations. Each slice has two of these multiplexers

with F5MUX in the lower portion of the slice and FiMUX in the upper

portion (where i is 6, 7, or 8, depending on the slice).

The carry chain, together with various dedicated arithmetic logic gates,

supports fast and efficient implementations of math operations.

Five multiplexers control the chain: CYINIT, CY0F, and CYMUXF in the

lower portion as well as CY0G and CYMUXG in the upper portion.

The dedicated arithmetic logic includes the exclusive-OR gates XORG

and XORF as well as the AND gates GAND and FAND.
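The following sketch illustrates how a carry multiplexer and a dedicated XOR gate can form one bit of an adder. The assignment of roles (LUT computes the propagate signal, the carry multiplexer chooses the carry out, the XOR gate forms the sum) is an assumption about typical usage, and the function names are hypothetical.

```python
def add_bit(a, b, carry_in):
    """One adder bit built from a LUT, a carry mux, and an XOR gate."""
    propagate = a ^ b                         # computed in the LUT
    sum_bit = propagate ^ carry_in            # dedicated XOR gate (XORF or XORG)
    carry_out = carry_in if propagate else a  # carry multiplexer (CYMUXF or CYMUXG)
    return sum_bit, carry_out


def ripple_add(x, y, width=8, carry_in=0):
    """Chain add_bit over `width` bits, least significant bit first."""
    carry = carry_in
    result = 0
    for i in range(width):
        s, carry = add_bit((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_add(200, 100))  # 8-bit sum plus the final carry
```

When the two input bits differ, the incoming carry is passed along; when they are equal, either input bit is the carry out, so no full-adder logic is needed in the chain itself.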

PicoBlaze


The PicoBlaze microcontroller is a compact, capable and cost-effective

fully embedded 8-bit RISC microcontroller core optimized for the

Spartan-3 family.

It also provides support for the Virtex-5, Spartan-6, and Virtex-6 FPGA

families.

The PicoBlaze microcontroller provides cost-efficient microcontroller-

based control and simple data processing.

The PicoBlaze microcontroller is optimized for efficiency and low

deployment cost.

It occupies just 96 FPGA slices, only 12.5% of an XC3S50 FPGA.

Typically a single FPGA block RAM stores up to 1024 program

instructions, which are automatically loaded during FPGA configuration.

The PicoBlaze microcontroller performs a respectable 44 to 100 million

instructions per second (MIPS) depending on the target FPGA family

and speed grade.
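The figures quoted above can be cross-checked with a little arithmetic. The sketch assumes the fixed rate of two clock cycles per instruction and the 18-bit instruction width stated elsewhere in these slides; the total of 768 slices for the XC3S50 is inferred from the 96-slice, 12.5% figure rather than quoted from a data sheet, and the function names are hypothetical.

```python
CYCLES_PER_INSTRUCTION = 2   # PicoBlaze always takes two clocks per instruction
INSTRUCTION_WIDTH_BITS = 18  # width of one PicoBlaze instruction
PROGRAM_DEPTH = 1024         # instructions held in one block RAM


def picoblaze_mips(clock_mhz):
    """Instruction rate in MIPS for a given clock frequency in MHz."""
    return clock_mhz / CYCLES_PER_INSTRUCTION


def slice_share(used_slices, total_slices):
    """Percentage of the device's slices consumed by the core."""
    return 100.0 * used_slices / total_slices


program_store_bits = PROGRAM_DEPTH * INSTRUCTION_WIDTH_BITS

print(picoblaze_mips(200))      # clock needed for the 100 MIPS upper figure
print(picoblaze_mips(88))       # clock implied by the 44 MIPS lower figure
print(slice_share(96, 768))     # 768 slices is an inferred total
print(program_store_bits)
```

The program store works out to 18,432 bits, which is why a single block RAM is sufficient.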


The PicoBlaze microcontroller core is totally embedded within the target

FPGA and requires no external resources.

The PicoBlaze microcontroller is extremely flexible.

The basic functionality is easily extended and enhanced by connecting

additional FPGA logic to the microcontroller’s input and output ports.

The PicoBlaze peripheral set can be customized to meet the specific

features, function, and cost requirements of the target application.

Because the PicoBlaze microcontroller is delivered as synthesizable VHDL source

code, the core is future-proof and can be migrated to future FPGA

architectures.

Being integrated within the FPGA, the PicoBlaze microcontroller reduces

board space, design cost, and inventory.

Why the PicoBlaze Microcontroller

There are literally dozens of 8-bit microcontroller architectures and

instruction sets.

The PicoBlaze microcontroller is specifically designed and optimized for the

Spartan-3 family, and it also supports the Spartan-6 and Virtex-6 FPGA

architectures.

Its compact yet capable architecture consumes considerably fewer FPGA

resources than comparable 8-bit microcontroller architectures within an

FPGA.

Furthermore, the PicoBlaze microcontroller is provided as a free, source-

level VHDL file with royalty-free re-use within Xilinx FPGAs.

Because it is delivered as VHDL source, the PicoBlaze microcontroller is

immune to product obsolescence as the microcontroller can be retargeted to

future generations of Xilinx FPGAs, exploiting future cost reductions and

feature enhancements.

Furthermore, the PicoBlaze microcontroller is expandable and extendable.

Before the advent of the PicoBlaze and MicroBlaze embedded processors,

the microcontroller resided externally to the FPGA, limiting the connectivity

to other FPGA functions and restricting overall interface performance.

By contrast, the PicoBlaze microcontroller is fully embedded in the FPGA

with flexible, extensive on-chip connectivity to other FPGA resources.

Signals remain within the FPGA, improving overall performance.

The PicoBlaze microcontroller reduces system cost because it is a single-

chip solution, integrated within the FPGA and sometimes only occupying

leftover FPGA resources.

The PicoBlaze microcontroller is resource efficient. Consequently, complex

applications are sometimes best partitioned across multiple PicoBlaze

microcontrollers with each controller implementing a particular function, for

example, keyboard and display control, or system management.

Why Use a Microcontroller within an FPGA?

Microcontrollers and FPGAs both successfully implement practically any

digital logic function. Each has unique advantages in cost, performance, and

ease of use.

Microcontrollers are well suited to control applications, especially with widely

changing requirements.

The FPGA resources required to implement the microcontroller are relatively

constant. The same FPGA logic is re-used by the various microcontroller

instructions, conserving resources.

The program memory requirements grow with increasing complexity.

Programming control sequences or state machines in assembly code is

often easier than creating similar structures in FPGA logic.

As an application increases in complexity, the number of instructions

required to implement the application grows and system performance

decreases accordingly.



An FPGA is more flexible than a microcontroller.

For example, an algorithm can be implemented sequentially or completely in

parallel, depending on the performance requirements.

A completely parallel implementation is faster but consumes more FPGA

resources.

A microcontroller embedded within the FPGA provides the best of both

worlds. The microcontroller implements complex control functions that are

not timing-critical, while timing-critical or data path functions are best

implemented using FPGA logic.

For example, a microcontroller cannot respond to events much faster than a few

microseconds. The FPGA logic can respond to multiple, simultaneous events in just a

few to tens of nanoseconds. Conversely, a microcontroller is cost-effective and simple

for performing format or protocol conversions.



PicoBlaze Microcontroller versus FPGA Logic

PicoBlaze Microcontroller, Strengths:

Easy to program, excellent for control and state machine applications

Resource requirements remain constant with increasing complexity

Re-uses logic resources, excellent for lower-performance functions

FPGA Logic, Strengths:

Significantly higher performance

Excellent at parallel operations

Sequential vs. parallel implementation tradeoffs optimize performance or cost

Fast response to multiple, simultaneous inputs

PicoBlaze Microcontroller, Weaknesses:

Executes sequentially

Performance degrades with increasing complexity

Program memory requirements increase with increasing complexity

Slower response to simultaneous inputs

FPGA Logic, Weaknesses:

Control and state machine applications more difficult to program

Logic resources grow with increasing complexity

PicoBlaze Microcontroller Features

16 byte-wide general-purpose data registers

1K instructions of programmable on-chip program store, automatically loaded during FPGA configuration

Byte-wide Arithmetic Logic Unit (ALU) with CARRY and ZERO indicator flags

64-byte internal scratchpad RAM

256 input and 256 output ports for easy expansion and enhancement

Automatic 31-location CALL/RETURN stack

Predictable performance, always two clock cycles per instruction, up to 200 MHz or 100 MIPS in a Virtex-II Pro FPGA

Fast interrupt response; worst-case 5 clock cycles

Optimized for Xilinx Spartan-3 architecture: just 96 slices and 0.5 to 1 block RAM

Support in Spartan-6 and Virtex-6 FPGA architectures

Assembler, instruction-set simulator support

PicoBlaze Microcontroller

PicoBlaze Microcontroller Functional Blocks

General-Purpose Register

The PicoBlaze microcontroller includes 16 byte-wide general-purpose registers, designated as registers s0 through sF. For better program clarity, registers can be renamed using an assembler directive. All register operations are completely interchangeable.

There is no dedicated accumulator; each result is computed in a specified register.

1,024-Instruction Program Store

The PicoBlaze microcontroller executes up to 1,024 instructions from memory within the FPGA. Each PicoBlaze instruction is 18 bits wide. The instructions are compiled within the FPGA design and automatically loaded during the FPGA configuration process.

Other memory organizations are possible to accommodate more PicoBlaze controllers within a single FPGA or to enable interactive code updates without recompiling the FPGA design.



Arithmetic Logic Unit (ALU)

The byte-wide Arithmetic Logic Unit (ALU) performs all microcontroller calculations, including:

basic arithmetic operations such as addition and subtraction

bitwise logic operations such as AND, OR, and XOR

arithmetic compare and bitwise test operations

comprehensive shift and rotate operations

All operations are performed using an operand provided by any specified register (sX). The result is returned to the same specified register (sX). If an instruction requires a second operand, then the second operand is either a second register (sY) or an 8-bit immediate constant (kk).

Flags

ALU operations affect the ZERO and CARRY flags.

The ZERO flag indicates when the last operation produced a result of zero.

The CARRY flag indicates various conditions, depending on the last instruction executed.

The INTERRUPT_ENABLE flag enables the INTERRUPT input.
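A minimal sketch of how a byte-wide ALU might update the ZERO and CARRY flags for addition and subtraction is given below. It is an illustrative model rather than the PicoBlaze implementation: the function names are hypothetical, and treating CARRY as a borrow indicator after subtraction is an assumption.

```python
def alu_add(sx, operand, carry_in=0):
    """8-bit addition: returns (result, zero_flag, carry_flag)."""
    total = (sx & 0xFF) + (operand & 0xFF) + carry_in
    result = total & 0xFF          # result is written back to register sX
    return result, result == 0, total > 0xFF


def alu_sub(sx, operand):
    """8-bit subtraction: returns (result, zero_flag, carry_flag)."""
    diff = (sx & 0xFF) - (operand & 0xFF)
    result = diff & 0xFF
    return result, result == 0, diff < 0  # assumed: CARRY signals a borrow


print(alu_add(0xFF, 0x01))  # wraps to zero and sets both flags
print(alu_sub(0x05, 0x07))  # underflows and sets CARRY
```

The second operand can be either another register's value (sY) or an immediate constant (kk); the model treats both the same way because only the 8-bit value matters.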



64-Byte Scratchpad RAM

The PicoBlaze microcontroller provides an internal general-purpose

64-byte scratchpad RAM, directly or indirectly addressable from the

register file using the STORE and FETCH instructions.

The STORE instruction writes the contents of any of the 16 registers

to any of the 64 RAM locations.

The complementary FETCH instruction reads any of the 64 memory

locations into any of the 16 registers.

The six-bit scratchpad RAM address is specified either directly (ss)

with an immediate constant, or indirectly using the contents of any of

the 16 registers (sY).

Only the lower six bits of the address are used; the address should

not exceed the 00 - 3F range of the available memory.
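The effect of using only the lower six address bits can be illustrated with a small model. The class `Scratchpad` and its method names are hypothetical; they mirror the STORE and FETCH instructions only loosely.

```python
class Scratchpad:
    """Behavioural model of a 64-byte scratchpad RAM."""

    SIZE = 64

    def __init__(self):
        self.ram = [0] * self.SIZE

    def store(self, value, address):
        # Only the lower six address bits are used, as in the text above.
        self.ram[address & 0x3F] = value & 0xFF

    def fetch(self, address):
        return self.ram[address & 0x3F]


sp = Scratchpad()
sp.store(0xAB, 0x3F)        # highest valid location
sp.store(0x12, 0x41)        # out-of-range address aliases to location 0x01
print(hex(sp.fetch(0x01)))
```

The aliasing in the last store shows why addresses should stay within 00 to 3F: an out-of-range address silently lands on an unintended location.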



Input/Output

The Input/Output ports extend the PicoBlaze microcontroller’s

capabilities and allow the microcontroller to connect to a custom

peripheral set or to other FPGA logic.

The PicoBlaze microcontroller supports up to 256 input ports and 256

output ports or a combination of input/output ports.

The PORT_ID output provides the port address.

During an INPUT operation, the PicoBlaze microcontroller reads data from the

IN_PORT port to a specified register, sX.

During an OUTPUT operation, the PicoBlaze microcontroller writes the

contents of a specified register, sX, to the OUT_PORT port.
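The role of PORT_ID as a port address can be sketched as a lookup from port number to peripheral. Everything in this example (the `PortBus` class, the callbacks, and the chosen port numbers) is hypothetical and stands in for address decoding that would be done in FPGA logic.

```python
class PortBus:
    """Maps 8-bit port addresses to peripheral read/write callbacks."""

    def __init__(self):
        self.readers = {}  # port_id -> function returning a byte
        self.writers = {}  # port_id -> function accepting a byte

    def attach_input(self, port_id, reader):
        self.readers[port_id & 0xFF] = reader

    def attach_output(self, port_id, writer):
        self.writers[port_id & 0xFF] = writer

    def input(self, port_id):
        """INPUT: read the selected port into a register value."""
        return self.readers[port_id & 0xFF]() & 0xFF

    def output(self, port_id, value):
        """OUTPUT: write a register value to the selected port."""
        self.writers[port_id & 0xFF](value & 0xFF)


bus = PortBus()
led_history = []
bus.attach_input(0x00, lambda: 0x5A)          # e.g. a switch bank
bus.attach_output(0x01, led_history.append)   # e.g. an LED register
bus.output(0x01, bus.input(0x00))
print(led_history)
```

With an 8-bit port address, up to 256 distinct input ports and 256 distinct output ports can be decoded.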



Program Counter (PC)

The Program Counter (PC) points to the next instruction to be executed. By default, the

PC automatically increments to the next instruction location when executing an

instruction.

Only the JUMP, CALL, RETURN instructions and the Interrupt and Reset Events

modify the default behavior. The PC cannot be directly modified by the application

code. The 10-bit PC supports a maximum code space of 1,024 instructions (000 to 3FF

hex). If the PC reaches the top of the memory at 3FF hex, it rolls over to location 000.

Program Flow Control

The default execution sequence of the program can be modified using conditional and

non-conditional program flow control instructions.

The JUMP instructions specify an absolute address anywhere in the 1,024-instruction

program space.

CALL and RETURN instructions provide subroutine facilities for commonly used

sections of code.

If the interrupt input is enabled, an Interrupt Event also preserves the address of the

preempted instruction on the CALL/RETURN stack while the PC is loaded with the

interrupt vector, 3FF hex.



CALL/RETURN Stack

The CALL/RETURN hardware stack stores up to 31 instruction addresses, enabling

nested CALL sequences up to 31 levels deep.

The stack is implemented as a separate cyclic buffer. When the stack is full, it

overwrites the oldest value. No program memory is required for the stack.
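The program counter rollover and the bounded CALL/RETURN stack can be modelled together. This is a simplified sketch: the `FlowControl` class is hypothetical, a `deque` with a maximum length stands in for the cyclic buffer, and returning to the saved address plus one is an assumption about how RETURN resumes after a CALL.

```python
from collections import deque

ADDRESS_MASK = 0x3FF      # 10-bit program counter
INTERRUPT_VECTOR = 0x3FF  # interrupt vector given in the text
STACK_DEPTH = 31


class FlowControl:
    def __init__(self):
        self.pc = 0
        # When full, appending drops the oldest entry, like the cyclic buffer.
        self.stack = deque(maxlen=STACK_DEPTH)

    def step(self):
        """Default behaviour: advance to the next location, wrapping at 3FF."""
        self.pc = (self.pc + 1) & ADDRESS_MASK

    def jump(self, target):
        self.pc = target & ADDRESS_MASK

    def call(self, target):
        self.stack.append(self.pc)
        self.jump(target)

    def ret(self):
        # Assumed: execution resumes just after the saved CALL address.
        self.pc = (self.stack.pop() + 1) & ADDRESS_MASK

    def interrupt(self):
        self.stack.append(self.pc)  # preserve the preempted address
        self.pc = INTERRUPT_VECTOR


fc = FlowControl()
fc.jump(0x3FF)
fc.step()
print(hex(fc.pc))  # the PC has rolled over to location 000
```

Unlike the hardware, this model raises an error if `ret` is called on an empty stack, which is a deliberate simplification.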

Interrupts

The optional INTERRUPT input allows the PicoBlaze microcontroller to handle

asynchronous external events. “Asynchronous” relates to interrupts occurring at any

time during an instruction cycle.

However, recommended design practice is to synchronize all inputs to the PicoBlaze

controller using the clock input.

The PicoBlaze microcontroller responds to interrupts quickly, within five clock cycles in the worst case.

Reset

The PicoBlaze microcontroller is automatically reset immediately after the FPGA

configuration process completes. After configuration, the RESET input forces the

processor into the initial state. The PC is reset to address 0, the flags are cleared,

interrupts are disabled, and the CALL/RETURN stack is reset.

PicoBlaze Architecture

MicroBlaze Processor


The MicroBlaze embedded processor soft core is a reduced instruction set computer (RISC) optimized for implementation in Xilinx Field Programmable Gate Arrays (FPGAs).

In terms of its instruction-set architecture, MicroBlaze is very similar to the RISC-based DLX architecture.

With few exceptions, the MicroBlaze can issue a new instruction every cycle, maintaining single-cycle throughput under most circumstances.

MicroBlaze's primary I/O bus, the CoreConnect PLB bus, is a traditional system-memory mapped transaction bus with master/slave capability.

For access to local-memory (FPGA BRAM), MicroBlaze uses a dedicated LMB bus, which reduces loading on the other buses.

User-defined coprocessors are supported through a dedicated FIFO-style connection called FSL (Fast Simplex Link). The coprocessor(s) interface can accelerate computationally intensive algorithms.


Many aspects of the MicroBlaze can be user configured:

cache size,

pipeline depth (3-stage or 5-stage),

embedded peripherals,

memory management unit, and

bus interfaces.

The area-optimized version of MicroBlaze, which uses a 3-stage pipeline,

sacrifices clock-frequency for reduced logic-area.

The performance-optimized version expands the execution pipeline to five

stages, allowing top speeds of 210 MHz.

Also, key processor instructions which are rarely used but more expensive

to implement in hardware can be selectively added or removed.

This customization enables a developer to make the appropriate design

tradeoffs for a specific set of host hardware and application software

requirements.


With the memory management unit, MicroBlaze is capable of hosting

operating systems requiring hardware-based paging and protection,

such as the Linux kernel.

Otherwise it is limited to operating systems with a simplified protection

and virtual memory model, such as FreeRTOS or Linux without MMU

support.

MicroBlaze's overall throughput is substantially less than that of a comparable

hardened CPU core (such as the PowerPC 440 in the Virtex-5).

MicroBlaze Features

The MicroBlaze soft core processor is highly configurable, allowing you

to select a specific set of features required by your design.

The fixed feature set of the processor includes:

Thirty-two 32-bit general purpose registers

32-bit instruction word with three operands and two addressing modes

32-bit address bus

Single issue pipeline

In addition to these fixed features, the MicroBlaze processor is

parameterized to allow selective enabling of additional functionality.

MicroBlaze Architecture

MicroBlaze Data Types and Endianness

MicroBlaze uses big-endian, bit-reversed format to represent data.

The hardware supported data types for MicroBlaze are word, half

word, and byte.

Word Data Type

Half Word Data Type

Byte Data Type
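The big-endian, bit-reversed convention can be made concrete with a short sketch. It assumes that "bit-reversed" means bit 0 is the most significant bit and bit 31 the least significant; the helper names are hypothetical.

```python
import struct


def to_big_endian_bytes(word):
    """Lay out a 32-bit word in memory order, most significant byte first."""
    return struct.pack(">I", word & 0xFFFFFFFF)


def get_bit(word, n):
    """Read bit n with bit-reversed numbering (bit 0 = MSB, bit 31 = LSB)."""
    return (word >> (31 - n)) & 1


print(to_big_endian_bytes(0x12345678).hex())  # byte at the lowest address comes first
print(get_bit(0x80000000, 0), get_bit(0x00000001, 31))
```

Under this numbering, the byte stored at the lowest address holds bits 0 through 7 of the word.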

MicroBlaze Instructions

All MicroBlaze instructions are 32 bits and are defined as either Type A or Type B.

Type A instructions have up to two source register operands and one destination register operand.

Type B instructions have one source register and a 16-bit immediate operand (which can be extended to 32 bits by preceding the Type B instruction with an imm instruction).

Type B instructions have a single destination register operand.
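The two instruction types can be illustrated by splitting a 32-bit word into fields. The field layout used here (a 6-bit opcode followed by 5-bit register fields, with the immediate in the low 16 bits) is an assumption chosen so that 32 registers and a 16-bit immediate fit in one word; the function names are hypothetical. Sign extension of a lone 16-bit immediate is also assumed.

```python
def decode(word):
    """Split a 32-bit instruction word into candidate fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rd": (word >> 21) & 0x1F,   # destination register
        "ra": (word >> 16) & 0x1F,   # first source register
        "rb": (word >> 11) & 0x1F,   # second source register (Type A only)
        "imm16": word & 0xFFFF,      # immediate operand (Type B only)
    }


def type_b_immediate(imm16, imm_prefix=None):
    """Form the 32-bit operand of a Type B instruction.

    imm_prefix models the upper 16 bits supplied by a preceding imm
    instruction; without it the 16-bit value is sign-extended (assumed).
    """
    imm16 &= 0xFFFF
    if imm_prefix is not None:
        return ((imm_prefix & 0xFFFF) << 16) | imm16
    return imm16 | 0xFFFF0000 if imm16 & 0x8000 else imm16


print(hex(type_b_immediate(0x5678, imm_prefix=0x1234)))
print(hex(type_b_immediate(0x8000)))
```

The imm prefix is what lets a fixed 32-bit instruction format carry a full 32-bit constant across two instructions.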

Instructions are provided in the following functional categories:

arithmetic,

logical,

branch,

load/store, and

special.

MicroBlaze Registers

MicroBlaze has an orthogonal instruction set architecture. It has thirty-

two 32-bit general purpose registers and up to eighteen 32-bit special

purpose registers, depending on configured options.

1. General Purpose Registers

The thirty-two 32-bit General Purpose Registers are numbered

R0 through R31. The register file is reset on bit stream download

(reset value is 0x00000000).


2. Special Purpose Registers

Program Counter (PC)

The Program Counter (PC) is the 32-bit address of the instruction being

executed.

When used with the MFS instruction the PC register is specified by

setting Sa = 0x0000.



Machine Status Register (MSR)

The Machine Status Register contains control and status bits for the

processor.

When reading the MSR, bit 29 is replicated in bit 0 as the carry copy.

When writing to the MSR, the Carry bit takes effect immediately

and the remaining bits take effect one clock cycle later.

The MSR is specified by setting Sx = 0x0001.
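The carry copy behaviour on an MSR read can be sketched as follows. The masks assume bit-reversed numbering, so bit 29 corresponds to the value 0x00000004 and bit 0 to 0x80000000; the function name `read_msr` is hypothetical.

```python
CARRY_MASK = 1 << (31 - 29)      # bit 29 -> 0x00000004
CARRY_COPY_MASK = 1 << (31 - 0)  # bit 0  -> 0x80000000


def read_msr(msr):
    """Return the value seen when reading the MSR: bit 0 mirrors bit 29."""
    value = msr & 0xFFFFFFFF & ~CARRY_COPY_MASK
    if value & CARRY_MASK:
        value |= CARRY_COPY_MASK
    return value


print(hex(read_msr(0x00000004)))  # carry set, so the copy appears in bit 0
```

Placing a copy of the carry in the most significant bit position lets software test it with a simple sign check.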



Exception Address Register (EAR)

The Exception Address Register stores the full load/store address that caused the

exception for the following:

- An unaligned access exception that specifies the unaligned access address

- A DPLB or DOPB exception that specifies the failing PLB or OPB data access address

- A data storage exception that specifies the (virtual) effective address accessed

- An instruction storage exception that specifies the (virtual) effective address read

- A data TLB miss exception that specifies the (virtual) effective address accessed

- An instruction TLB miss exception that specifies the (virtual) effective address read

The contents of this register are undefined for all other exceptions.

The EAR is specified by setting Sa = 0x0003.



Exception Status Register (ESR)

The Exception Status Register contains status bits for the processor.

The ESR is specified by setting Sa = 0x0005.

Branch Target Register (BTR)

The Branch Target Register only exists if the MicroBlaze processor is

configured to use exceptions.

The register stores the branch target address for all delay slot branch

instructions executed while MSR[EIP] = 0.

The BTR is specified by setting Sa = 0x000B.



Floating Point Status Register (FSR)

The Floating Point Status Register contains status bits for the floating

point unit.

The register is specified by setting Sa = 0x0007.

Exception Data Register (EDR)

The Exception Data Register stores data read on an FSL link that

caused an FSL exception.

The contents of this register are undefined for all other exceptions.

The EDR is specified by setting Sa = 0x000D.



Zone Protection Register (ZPR)

The Zone Protection Register is used to override MMU memory

protection defined in TLB entries.

Translation Look-Aside Buffer Low Register (TLBLO)

Translation Look-Aside Buffer High Register (TLBHI)

Translation Look-Aside Buffer Index Register (TLBX)

Translation Look-Aside Buffer Search Index Register (TLBSX)

Processor Version Register (PVR)

MicroBlaze Pipeline Architecture

MicroBlaze instruction execution is pipelined. For most instructions, each stage takes one clock cycle to complete.

Consequently, the number of clock cycles necessary for a specific instruction to complete is equal to the number of pipeline stages, and one instruction is completed on every cycle.

A few instructions require multiple clock cycles in the execute stage to complete.

When executing from slower memory, instruction fetches may take multiple cycles.

MicroBlaze implements an instruction prefetch buffer that reduces the impact of such multi-cycle instruction memory latency.

When the pipeline resumes execution, the fetch stage can load new instructions directly from the prefetch buffer instead of waiting for the instruction memory access to complete.
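The relationship between pipeline depth and total execution time can be worked through numerically. This sketch assumes an ideal pipeline in which every stage takes one cycle; the function name and the optional `stall_cycles` parameter are hypothetical additions for illustration.

```python
def pipeline_cycles(instructions, stages, stall_cycles=0):
    """Cycles needed to run a sequence on an ideal in-order pipeline.

    The first instruction needs `stages` cycles to pass through; each
    further instruction completes one cycle after its predecessor.
    """
    if instructions == 0:
        return 0
    return stages + (instructions - 1) + stall_cycles


for depth in (3, 5):
    print(depth, pipeline_cycles(100, depth))
```

For long instruction sequences the pipeline depth adds only a small fixed latency, so throughput approaches one instruction per cycle in both configurations.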


Three Stage Pipeline

Fetch, Decode, and Execute.

Five Stage Pipeline

Fetch (IF), Decode (OF), Execute (EX), Access Memory (MEM), and Writeback (WB).

MicroBlaze Memory Architecture

MicroBlaze is implemented with a Harvard memory architecture;

instruction and data accesses are done in separate address spaces.

Each address space has a 32-bit range (that is, each handles up to 4 GB of

instruction or data memory, respectively).

Both instruction and data interfaces of MicroBlaze are 32 bits wide and

use big endian, bit-reversed format.

MicroBlaze supports word, halfword, and byte accesses to data memory.

Data accesses must be aligned, unless the processor is configured to

support unaligned exceptions.

All instruction accesses must be word aligned.
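The alignment rule can be expressed as a simple check. The sketch assumes natural alignment, meaning an access must start at a multiple of its own size; the names `ACCESS_SIZE` and `is_aligned` are hypothetical.

```python
ACCESS_SIZE = {"byte": 1, "halfword": 2, "word": 4}  # sizes in bytes


def is_aligned(address, access):
    """True if `address` is a multiple of the access size."""
    return address % ACCESS_SIZE[access] == 0


print(is_aligned(0x1002, "halfword"))  # halfword boundary
print(is_aligned(0x1002, "word"))      # not a word boundary
```

Byte accesses are always aligned, while instruction fetches follow the word rule.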

Any ?’s