building a risc system in an fpga, circuit cellar issue 116-117, april 00

26 Issue 116 March 2000 CIRCUIT CELLAR ® www.circuitcellar.com

Building aRISCSystem inan FPGA

FEATUREARTICLE

Jan Gray

iTo kick off this three-part article, Jan’s go-ing to port a Ccompiler, design aninstruction set, writean assembler andsimulator, and designthe CPU datapath.Get reading, you’veonly got a month be-fore your connectingarticle arrives!

used to envyCPU designers—

the lucky engineerswith access to expensive

tools and fabs. But, field-program-mable gate arrays (FPGAs) have madecustom-processor and integrated-system design much more accessible.

20–50-MHz FPGA CPUs are per-fect for many embedded applications.They can support custom instructionsand function units, and can be recon-figured to enhance system-on-chip(SoC) development, testing, debug-ging, and tuning. Of course, FPGAsystems offer high integration, shorttime-to-market, low NRE costs, andeasy field updates of entire systems.

FPGA CPUs may also provide newanswers to old problems. Considerone system designed by Philip Freidin.During self-test, its FPGA is config-ured as a CPU and it runs the tests.Later the FPGA is reconfigured fornormal operation as a hardwired sig-nal processing datapath. The ephem-eral CPU is free and saves money byeliminating test interfaces.

THE PROJECTSeveral companies sell FPGA CPU

cores, but most are synthesized imple-mentations of existing instructionsets, filling huge, expensive FPGAs,and are too slow and too costly forproduction use. These cores are mar-keted as ASIC prototyping platforms.

In contrast, this article shows howa streamlined and thrifty CPU design,optimized for FPGAs, can achieve acost-effective integrated computersystem, even for low-volume productsthat can’t justify an ASIC run.

I’ll build an SoC, including a 16-bitRISC CPU, memory controller, videodisplay controller, and peripherals, ina small Xilinx 4005XL. I’ll apply freesoftware tools including a C compilerand assembler, and design the chipusing Xilinx Student Edition.

If you’re new to Xilinx FPGAs, youcan get started with the Student Edi-tion 1.5. This package includes thedevelopment tools and a textbookwith many lab exercises.[3]

The Xilinx university-programfolks confirm that Student Edition isnot just for students, but also for pro-fessionals continuing their education.Because it is discounted with respectto their commercial products, you donot receive telephone support, al-though there is web and fax-backsupport. You also do not receivemaintenance updates—if you need the

Part 1: Tools, Instruction Set, and Datapath

Table 1—The xr16 C language calling conventionsassign a fixed role to each register. To minimize the costof function calls, up to three arguments, the returnaddress, and the return value are passed in registers.

Register Use

r0 always zeror1 reserved for assemblerr2 function return valuer3–r5 function argumentsr6–r9 temporariesr10–r12 register variablesr13 stack pointer (sp)r14 interrupt return addressr15 return address

CIRCUIT CELLAR ® Issue 116 March 2000 27www.circuitcellar.com

next version of the software, you haveto buy it all over again. Nevertheless,Student Edition is a good deal and agreat way to learn about FPGA design.

My goal is to put together a simple,fast 16-bit processor that runs C code.Rather than implement a complexlegacy instruction set, I’ll design anew one streamlined for FPGA imple-mentation: a classic pipelined RISCwith 16-bit instructions and sixteen16-bit registers. To get things started,let’s get a C compiler.

C COMPILERFraser and Hanson’s book is the

literate source code of their lcc retar-getable C compiler.[1] I downloadedthe V.4.1 distribution and modified itto target the nascent RISC, xr16.

Most of lcc is machine indepen-dent; targets are defined using ma-chine description (md) files. Lcc shipswith ’x86, MIPS, and SPARC md files,and my job was to write xr16.md.

I copied xr16.md from mips.md,added it to the makefile, and added anxr16 target option. I designed xr16register conventions (see Table 1) andchanged my md to target them.

At this point, I had a C compiler fora 32-bit 16-register RISC, but neededto target a 16-bit machine withsizeof(int)=sizeof(void*)=2. lcc obtainstarget operand sizes from md tables, soI just changed some entries from 4 to 2:

Interface xr16IR = { 1, 1, 0, /* char */ 2, 2, 0, /* short */ 2, 2, 0, /* int */ 2, 2, 0, /* T* */

Next, lcc needs operators that loada 2-byte int into a register, add 2-byteint registers, dereference a 2-bytepointer, and so on. The lcc ops util-ity prints the required operator set. Imodified my tables and instructiontemplates accordingly. For example:

reg: CVUI2(INDIRU1(addr)) \�lb r%c,%0\n� 1

uses lb rd,addr to load an unsignedchar at addr and zero-extend it into a16-bit int register.stmt: EQI2(reg,con) \�cmpi r%0,%1\nbeq %a\n� 2

uses a cmpi, beq sequence to com-pare a register to a constant andbranch to this label if equal.

I removed any remaining 32-bitassumptions inherited from mips.md,and arranged to store long ints inregister pairs, and call helper routinesfor mul, div, rem, and some shifts.

My port was up and running in justone day, but I had already read the lccbook. Let’s see what she can do. List-ing 1 is the source for a binary treesearch routine, and Listing 2 is theassembly code lcc-xr16 emits.

INSTRUCTION SETNow, let’s refine the instruction

set and choose an instruction encod-ing. My goals and constraints include:cover C (integer) operator set, fixed-size 16-bit instructions, easily de-coded, easily pipelined, with three-operand instructions (dest = src1

op src2/imm), as encoding spaceallows. I also want it to be byte ad-dressable (load and store bytes and

words), and provide one addressingmode—disp(reg). To support longints we need add/subtract carry andshift left/right extended.

Which instructions merit the mostbits? Reviewing early compiler out-put from test applications shows thatthe most common instructions (staticfrequency) are lw (load word), 24%;sw (store word), 13%; mov (reg-regmove), 12%; lea (load effective ad-dress), 8%; call, 8%; br, 6%; andcmp, 6%. Mov, lea, and cmp can besynthesized from add or sub with r0.69% of loads/stores use disp(reg)addressing, 21% are absolute, and10% are register indirect.

Therefore we make these choices:

• add, sub, addi are 3-operand• less common operations (logical ops,

add/sub with carry, and shifts) are 2-operand to conserve opcode space

• r0 always reads as 0• 4-bit immediate fields• for 16-bit constants, an optional

immediate prefix imm establishes themost significant 12-bits of the in-struction that immediately follows

• no condition codes, rather use aninterlocked compare and condi-tional branch sequence

• jal (jump-and-link) jumps to aneffective address, saving the returnaddress in a register

• call func encodes jal r15,funcin one 16-bit instruction (providedthe function is 16-byte aligned)

• perform mul, div, rem, and variableand multiple bit shifts in software

The six instruction formats areshown in Table 2 and the 43 distinctinstructions are shown in Table 3.adds, subs, shifts, and imm areuninterruptible prefixes. Loads/storestake two cycles, jump and branch-taken take three cycles (no branch

Listing 1— This sample C code declares a binary search tree data structure and defines a binary searchfunction. Search returns a pointer to the tree node whose key compares equal to the argument key, orNULL if not found.

typedef struct TreeNode { int key; struct TreeNode *left, *right;} *Tree;

Tree search(int key, Tree p) { while (p && p->key != key) if (p->key < key) p = p->right; else p = p->left; return p;}

Table 2—The xr16 has six instruction formats, eachwith 4-bit opcode and register fields.

Format 15–12 11–8 7–4 3–0

rrr op rd ra rbrri op rd ra immrr op rd fn rbri op rd fn imm

i12 op imm12 … …br op cond disp8 …


delay slots). The four-bit imm fieldencodes either an int (-8–7): add/sub, logic, shifts; unsigned (0–15): lb,sb; or unsigned word displacement (0,2–30): lw, sw, jal, call.

Some assembly instructions areformed from other machine instruc-tions, as you can see in Table 4. Notethat only signed char data use lbs.

ASSEMBLERI wrote a little multipass assembler

to translate the lcc assembly outputinto an executable image.

The xr16 assembler reads one ormore assembly files andemits both image andlisting files. The lexicalanalyzer reads the sourcecharacters and recognizestokens like the identifier_main. The parser scanstokens on each line andrecognizes instructionsand operands, such asregister names and effec-tive address expressions.The symbol table remem-bers labels and their ad-dresses, and a fixup tableremembers symbolic refer-ences.

In pass one, the assem-bler parses each line. La-bels are added to thesymbol table. Each in-struction expands into oneor more machine instruc-tions. If an operand refersto a label, we record afixup to it.

In pass two, we checkall branch fixups. If abranch displacement ex-ceeds 128 words, we re-

write it using a jump. Because insert-ing a jump may make other branchesfar, we repeat until no far branchesremain.

Next, we evaluate fixups. For eachone, we look up the target address andapply that to the fixup subject word.Lastly, we emit the output files.

I also wrote a simple instruction setsimulator. It is useful for exercisingboth the compiler and the embeddedapplication in a friendly environment.

Well, by now you are probablywondering if there is any hardware tothis project. Indeed there is! First,let’s consider our target FPGA device.

THE FPGAThe Xilinx XC4005XL-PC84C-3 is

a 3.3-V FPGA in an 84-pin J-leadPLCC package. This SRAM-baseddevice must be configured by externalROM or host at power-up. It has a14 × 14 array of configurable logicblocks (CLBs) and 61 bonded-out I/Oblocks (IOBs) in a sea of program-mable interconnect.

Every CLB has two 4-input look-up

tables (4-LUTs) and two flip-flops.Each 4-LUT can implement any logicfunction of 4 inputs, or a 16 × 1-bitsynchronous static RAM, or ROM.Each CLB also has “carry logic” tobuild fast, compact ripple-carry adders.

Each IOB offers input and outputbuffers and flip-flops. The outputbuffer can be 3-stated for bidirectionalI/O. The programmable interconnectroutes CLB/IOB output signals to otherCLB/IOB inputs. It also provides wide-fanout low-skew clock lines, and hori-zontal long lines, which can be drivenby 3-state buffers at each CLB.[2]

The XC4000XL architecture wouldappear to have been designed withCPUs in mind. Just eight CLBs canbuild a single-port 16 × 16-bit registerfile (using LUTs as SRAM), a 16-bitadder/subtractor (using carry logic), ora four-function 16-bit logic unit. Be-cause each LUT has a flip-flop, thedevice is register rich, enabling apipelined implementation style; andas each flip-flop has a dedicated clockenable input, it’s easy to stall thepipeline when necessary. Long line

buses and 3-state driversform an efficient word-wide multiplexer of themany function unit re-sults, and even an on-chip 3-state peripheralbus.

THE PROCESSORINTERFACE

Figure 1 gives you agood look at the xr16processor macro symbol.The interface was de-signed to be easy to usewith an on-chip bus. Thekey signals are the sys-tem clock (CLK), nextmemory address (AN15:0),next access is a read(READN), next access is16-bit data (WORDN),address clock enable:above signals are valid,start next access (ACE),memory ready input: thecurrent access completesthis cycle (RDY), instruc-tion word input(INSN15:0), on-chip bidi-

Figure 1 —The xr16processing symbolports, which includeinstruction and databuses, next addressand memory con-trols, and buscontrols, constituteits interface to thesystem memorycontroller.

AN[15:0]ACE

WORDNREADNDBUSN

DMAIREQ

DMAREQZERODMA

RDYUDTLDT

INSN[15:0] UDLDTD[15:0]CLK

XR16

CLK

CLK

VCONZ

A15

INSN[15:0]RDYIREQDMAREQZERODMA

INSN[15:0]RDY

IREQDMAREQ

ZERODMA

READNWORDNDBUSN

DMAACE

READNWORDNDBUSNDMAACE

RNA[3:0]RNB[3:0]

RFWE

FWDIMM[11:0]

IMMOP[5:0]

PCEBCE15_4

ADDCI

LOGICOP[1:0]

SUMT

ZXT

LOGICT

SRISRTSLT

RETADT

BRDISP[7:0]BRANCH

SELPCZEROPCDMAPC

PCCERETCE

CLK

VCO

NZ

A15

RNA[3:0]RNB[3:0]RFWE

FWDIMM[11:0]IMMOP[5:0]

PCEBCE15_4

ADDCILOGICOP[1:0]

SUMT

ZXT

LOGICT

SRISRTSLTRETADT

BRDISP[7:0]BRANCHSELPCZEROPCDMAPCPCCERETCE

UDTLDT

UDLDT

UDTLDTUDLDT

Result[15:0]ADDR[15:0]

D[15:0]AN[15:0]

Z,N,V,A 15

Control unit Datapath

CTRL16 RLOC=R0C0 DP16 RLOC=R5C0

Figure 2 —The control unit receives instructions, decodes them, and drives both thememory control outputs and the datapath control signals.


Z

AREG[15:0]

BREG[15:0]

AMUX[15:0]

BMUX[15:0]

D[15:0] Q[15:0]

AREGS

REGFILERLOC=R1C0

A[3:0]

CLK

WE

RNA[3:0]

CLK

RFWE

FWD

FWD

A[15:0] O[15:0]

B[15:0]

SEL

M2_16RLOC=R1C2

A

D[15:0] Q[15:0]

CE

FD16ERLOC=R1C2

CLK

PCE

CLK

RNB[3:0]

CLK

RFWE

D[15:0] Q[15:0]

A[15:0]

CLK

WE

B[15:0] O[15:0]

IR[11:0]

IMM16RLOC=R1C3

OP[5:0]

IMM[11:0]

IMMOP[5:0]

BCE15_4

CLK

PCE

D[15:0] Q[15:0]CE15_4

CLK

CE3_0

REGFILERLOC=R1C1

RESULT[15:0]

CLK

PCE

D[15:0] Q[15:0]

CLK

CE

DOUTDOUT[15:0]

FD16ERLOC=R1C4

A[15:0] Q[15:0]B[15:0]

OP[1:0]

LOGIC16RLOC=R1C4

LOGICOP[1:0]

B[15:0]

CIA[15:0]

B[15:0]ADD

COOFL

S[15:0]

ADD

CI

VCO

A[15:0]

A15

ADDSUBADSU16

I[15:0]Z

Z

ZERODETRLOC=R1C6

SUM[15:0]

SUM15BUF

N

LOGIC[15:0]

SUMT TBUFT16X

SUMBUF

RLOC=R1C5

RESULT MUX

SUMT TBUFT16X

LOGICBUF

RLOC=R1C4

SRTT

BUFT16X

SRBUF

RLOC=R1C1

SLT TBUFT16X

SLBUF

RLOC=R1C0

SRI

SRI,A[15:1]

A[14:0],G

LDT T BUFT8XLDBUF

RLOC=R5C3

DOUT[7:0] RESULT[7:0]

UDT T BUFT8XUDBUF

RLOC=R1C3


UDLDT T BUFT8XUDLDBUF

RLOC=R5C2


ZXT T BUFT8XZHBUF

RLOC=R1C2

G,G,G,G,G,G,G,G RESULT[15:8]

RETADT T BUFT16XRETBUF

RLOC=R1C9

RETAD[15:0]D[15:0] Q[15:0]

CLK

FD16ERLOC=R1C9

CERETCE

CLK

D[15:0] O[15:0]

WCLK

RAM16X16SRLOC=R1C9

WEPCCE

CLK

PC[15:0]

A[3:0]

PC

DMAPC

G,G,G,DMAPC

ADDR[15:0]

A[15:0] O[15:0]

ZERO

M2 16ZRLOC=R1C7

SELSELPC

B[15:0]

ADDRMUX

ZEROPC

PCNEXT[15:0]

CIA[15:0]

B[15:0] COOFL

S[15:0]

GND

PCINCRADD16

BRDISP[15:0] PCDISP[15:0]

PCDISP16RLOC=R1C6

BRANCH

BRDISP[7:0]

PCDISP

BRANCH

PCDISP[15:0]

ADDRESS/PC UNIT

RET

RLOC=R-1C8

GND

BREGS

EXECUTION UNIT

IMMED B

FD12E4RLOC=R1C3

LOGIC

RLOC=R-1C5

To execute one instruction percycle you need a 16-entry 16-bit regis-ter file with two read ports (add r3, r1,r2) and one write port (add r3, r1,r2); an immediate operand multiplexer(mux) to select the immediate field asan operand (addi r3, r1, 2); an arith-metic/logic unit (ALU) (sub r3, r1,r2; xor r3, r1); a shifter (srai r3,1), and an effective address adder tocompute reg+offset (lw r3, 2(r1)).

You’ll also need a mux to select aresult from the adder, logic unit, leftor right shifter, return address, or loaddata; logic to check a result for zero,negative, carry-out, or overflow; aprogram counter (PC), PC incrementer,branch displacement adder (br L),and a mux to load the PC with a jumptarget address (call _foo); and amux to share the memory port for

instruction fetch (addr ← PC) andload/store (addr ← effective address).

Careful design and reuse will letyou minimize the datapath area be-cause the adder, with the immediatemux, can do the effective address add,and the PC incrementer can also addbranch displacements. The memoryaddress mux can help load the PCwith the jump target.

DATAPATH SCHEMATICFigure 3 is the culmination of these

ideas. There are three groups of re-sources. The execution unit is theheart of the processor. It fetches oper-ands from the register file and theimmediate fields of the instructionregister, presents them to the add/sub,logic, and (trivial) shift units, andwrites back the result to the register

rectional data bus to load/store data(D15:0).

The memory/bus controller (whichI’ll explain further in Part 3) decodesthe address and activates the selectedmemory or peripheral. Later it assertsRDY to signal that the memory accessis done.

As Figure 2 shows, the CPU issimply a datapath that is steered by acontrol unit. Next month, I’ll exam-ine the control unit in greater detail.The rest of this article explores thedesign and implementation of thedatapath.DATAPATH RESOURCES

The instruction set evolved withthe datapath implementation. Eachnew idea was first evaluated in termsof the additional logic required and itsimpact on the processor cycle time.

Figure 3 —The pipelined datapath has an execution unit, a result multiplexer, and an address/PC unit. Operands from the register file or immediate field are selected and latchedinto the A and B operand registers. Then the function units, including ADDSUB, operate upon A and B, and one of the results is driven onto RESULT15:0 and written back into theregister file. Meanwhile, the address/PC unit increments the PC to help fetch the next instruction.


file. The result multiplexerselects one result from thevarious function units. Theaddress/PC unit drives thenext memory address, andincludes the PC, PC adder,and address mux. Now, let’ssee how each resource isimplemented in our FPGA.

REGISTER FILEDuring each cycle, we

must read two register oper-ands and write back one re-sult. You get two read ports(AREG and BREG) by keepingtwo copies of the 16 × 16-bitregister file REGFILE, andreading one operand fromeach. On each cycle you mustwrite the same result valueinto both copies.

So, for each REGFILE andeach clock cycle you must do one readaccess and one write access. EachREGFILE is a 16 × 16 RAM. Recallthat each CLB has two 4-LUTs, eachof which can be a 16 × 1-bit RAM.Thus, a REGFILE is a column of eightCLBs. Each REGFILE also has an in-ternal 16-bit output register that cap-tures the RAM output on the CLKfalling edge.

To read and write the REGFILEeach clock, you double-cycle it. In thefirst half of each clock cycle, the con-trol unit presents a read-port sourceoperand register number to the RAMaddress inputs. The selected registeris read out and captured in theREGFILE output register as CLK falls.

In the second half cycle, the con-trol unit drives the write-port registernumber. As CLK rises, the RESULT15:0

is written to the destination register.

OPERAND SELECTIONWith the two source registers

AREG and BREG in hand, you nowselect the A and B operands, and latchthem in the A and B registers. Someexamples are shown in Table 5.

The A operand is AREG unless (aswith add2) the instruction depends onthe result of the previous instruction.Next month, you’ll see why this pipe-line data hazard is avoided by forward-ing the add1 result directly into the A

register, just in time for add2.FWD, a 16-bit mux of AREG or

RESULT, does this result forwarding.It consists of 16 1-bit muxes, each a 3-input function implemented in asingle 4-LUT, and arranged in a col-umn of eight CLBs. The FWD outputis captured in the A operand register,made from the 16 flip-flops in thesame CLBs. As for the B operand,select either the BREG register fileoutput port or an immediate constant.

For rri and ri format instruc-tions, B is the zero- or sign-extended4-bit imm field of the instruction reg-ister. But, if there’s an imm prefix, loadB15:4 with its 12-bit imm12 field, thenload B3:0 while decoding the rri or ri

format instruction whichfollows.

So, the B operand muxIMMED is a 16-bit-wideselection of either BREG,015:4||IR3:0, sign15:4||IR3:0, orIR11:0||03:0 (“||” means bitconcatenation).

I used an unusual 2-1mux with a fourth “forceconstant” input for thiszero/sign extension func-tion, primarily because itfits in a single 4-LUT.So, as with FWD,IMMED is an 8-CLBcolumn of muxes.

The B operand registeruses IMMED’s CLBs 16flip-flops. The registerhas separate clock en-ables for B15:4 and B3:0, topermit separate loading of

the upper and lower bits for an immprefix.

For sw or sb, read the register to bestored, via BREG, into DOUT15:0,another column of eight CLBs flip-flops.

ALUThe arithmetic/logic-unit consists

of a 16-bit adder/subtractor and a 16-bit logic unit, which concurrentlyoperate on the A and B registers.

LOGIC computes the 16-bit resultof A and B, A or B, A xor B, or Aandnot B, as selected by LOGICOP1:0.Each logic unit output bit is a func-tion of the four inputs Ai, Bi, andLOGICOP1:0, and fits in a single

Hex Fmt Assembler Semantics

0dab rrr add rd,ra,rb rd = ra + rb;1dab rrr sub rd,ra,rb rd = ra – rb;2dai rri addi rd,ra,imm rd = ra + imm;3d*b rr {and or xor andn adc rd = rd op rb;

sbc} rd,rb4d*i ri {andi ori xori andni rd = rd op imm;

adci sbci slli slxisrai srli srxi} rd,imm

5dai rri lw rd,imm(ra) rd = *(int*)(ra+imm);6dai rri lb rd,imm(ra) rd = *(byte*)(ra+imm);8dai rri sw rd,imm(ra) *(int*)(ra+imm) = rd;9dai rri sb rd,imm(ra) *(byte*)(ra+imm) = rd;Adai rri jal rd,imm(ra) rd = pc, pc = ra + imm;B*dd br {br brn beq bne bc bnc bv

bnv blt bge ble bgt bltubgeu bleu bgtu} label if (cond) pc += 2*disp8;

Ciii i12 call func r15 = pc, pc = imm12<<4;Diii i12 imm imm12 imm'next15:4 = imm12;7xxx – reservedExxx – reservedFxxx – reserved

Table 3—The xr16 needs only 43 different instructions to efficiently implement aninteger-only subset of the C programming language.

Listing 2— Here’s the xr16 assembly code (with comments added) that lcc generates from Listing 1. lcchas done a good job, although a few register-to-register moves are unnecessary.

_search: br L3 ; r3=k r4=pL2: lw r9,(r4)

cmp r9,r3 ; p->k < k?bge L5lw r4,4(r4) ; p = p->rightbr L6

L5: lw r4,2(r4) ; p = p->leftL6:L3: mov r9,r4

cmp r9,r0 ; p==0?beq L7lw r9,(r4)cmp r9,r3 ; p->k != k?bne L2

L7: mov r2,r4 ; retval = pL1: ret


serted to update PC with PCNEXT.When the next access is a load/store,SELPC and PCCE are false, andADDR ← SUM, without updating PC.

PCDISP is a 16-bit mux of +215:0

and 2×disp8, 5 CLBs tall. PCINCR isan instance of the ADD16 librarysymbol, 9 CLBs tall. ADDRMUX is a16-bit 2-1 mux with a fourth input,ZERO, to set PC to 0 on reset. It’s 16LUTs, 8 CLBs tall.

PC is not a simple register, butrather it is a 16-entry register file. PC0

is the CPU PC, and PC1 is the DMAaddress. PC is a 16 × 16 RAM, eightCLBs tall.

I used RLOC attributes to place thedatapath elements. Figure 4 is theresulting floorplan on the 14 × 14 CLBFPGA. Each column of CLBs provideslogic, flip-flops, and TBUF resources.

THE DATAPATH IN ACTIONNext, let’s see what happens when

we run 0008: addi r3,r1,2. As-suming that PC=6 and r1=10,PCINCR adds PCDISP=2 to PC=6,giving PCNEXT=8. Because SELPC istrue, ADDR ← PCNEXT=8, and thenext memory cycle reads the word at0008. Because PCCE is true, PC isupdated to 8.

Some time later, RDY is assertedand the control unit latches 0x2312(addi r3,r1,2) into its instructionregister. The control unit sets RNA=1,so AREG=r1. BREG is not used. FWDis false so A=AREG=r1=10. IMMOP isset to sign-extend the 4-bit imm field,and so B=2.

We add A+B=10+2 and as SUMT isasserted (low), we drive SUM=12 onto

the RESULT bus. The control unitasserts RFWE (register file write en-able), and sets RNA=RNB=3 to writethe result into both REGFILEs’ r3.

DEVELOPMENT TOOLSThis hardware was designed, simu-

lated, and compiled on a PC using theFoundation tools in Xilinx StudentEdition 1.5. I used schematics for thisproject because their 2-D layoutmakes it easier to understand the dataflow because they offer explicit con-trol and because they support theRLOC (relative location) placementattributes that are essential tofloorplanning (to achieve the smallest,fastest, cheapest design).

To compile my schematics into aconfiguration bitstream, Foundationruns these tools:

• map: technology mapping—mapschematic’s arbitrary logic struc-tures into the device’s LUTs andflip-flops

• par: place and route—place thelogic and flip-flops in specific CLBsand then route signals through theprogrammable interconnect

• trce: static timing analysis—enu-merate all possible signal paths inthe design and report the slowestones

• bitgen: generate a bit stream con-figuration file for the design

HIGH-PERFORMANCE DESIGNThe datapath implementation

showcases some good practices, suchas exploiting FPGA features (usingembedded SRAM, four input logic

Assembly Maps to

nop and r0,r0mov rd,ra add rd,ra,r0cmp ra,rb sub r0,ra,rbsubi rd,ra,imm addi rd,ra,-immcmpi ra,imm addi r0,ra,-immcom rd xori rd,-1lea rd,imm(ra) addi rd,ra,immlbs rd,imm(ra) lb rd,imm(ra) (load-byte, xori rd,0x80 sign-extending) subi rd,0x80j addr jal r0,addrret jal r0,0(r15)

Table 4—Many assembly pseudo-instructions arecomposed from the native instructions. Only raresigned char data use the rather expensive lbs.

Figure 4 —In the datapath floorplan, RLOC attributesapplied to the datapath schematic pin down thedatapath elements to specific CLB locations. TheRESULT15:0 bus runs horizontally across the bottomeight rows of CLBs.

AR

EG

S, A

RE

G, S

LBU

F

BR

EG

S, B

RE

G, S

RB

UF

FW

D, A

, UD

LDB

UF,

ZH

BU

F

IMM

ED

, B, L

DB

UF,

UD

BU

F

LOG

IC, D

OU

T, L

OG

ICB

UF

AD

DS

UB

, SU

MB

UF

PC

DIS

P, Z

AD

DR

MU

X

PC

INC

R

PC

, RE

T, R

ET

BU

F

4-LUT. Thus, the 16-bit logic unit is acolumn of eight CLBs.

ADDSUB adds B to A, or subtractsB from A, according to its ADD input.It reads carry-in (CI) and drives carry-out (CO), and overflow (V). ADDSUBis an instance of the ADSU16 librarysymbol, and is 10 CLBs high—one toanchor the ripple-carry adder, eight toadd/sub 16 bits, and one to computecarry-out and overflow.

Z, the zero detector, is a 2.5-CLBNOR-tree of the SUM15:0 output.

The shifter produces either A>>1 orA<<1. This requires no logic, so muxsimply selects either SRI || A15:1 orA14:0 || 0. SRI determines whether theshift is logical or arithmetic.

RESULT MULTIPLEXERThe result mux selects the instruc-

tion result from the adder, logic unit,A>>1, A<<1, load data, or return ad-dress. You build this 16-bit 7-1 muxfrom lots of 3-state buffers (TBUFs).In every cycle, the control unit assertssome resource’s output enable, driv-ing its output onto the RESULT15:0

long line bus that spans the FPGA.In the third article of this series,

I’ll share the CPU result bus as the16-bit on-chip data bus for load/storedata. During sw or sb, the CPU drivesDOUT7:0 and/or DOUT15:8 onto RE-SULT15:0. During lw or lb, the se-lected memory or peripheral drivesthe load data on RESULT15:0 or RE-SULT7:0.

ADDRESS/PC UNITThis unit generates memory ad-

dresses for instruction fetch, load/store, and DMA memory accesses. Foreach cycle, we add PC += 2 to fetchthe next instruction. For a takenbranch, we add PC += 2×disp8. Forjal and call, we load PC with theeffective address SUM from ADDSUB.

Refer to Figure 3 to see how thisarrangement works. PCINCR adds PCand the PCDISP mux output (either+2 or the branch displacement) givingPCNEXT. ADDRMUX then selectsPCNEXT or SUM as the next memoryaddress.

If the next memory access is aninstruction fetch, ADDR ← PCNEXT,and PCCE (PC clock enable) is as-


structures, TBUFs, and flip-flop clockenables), floorplanning (placing func-tions in columns, ordering columns toreduce interconnect requirements,and running the 3-state bus horizon-tally over the columns), iterativedesign (measuring the area and delayeffects of each potential feature), andusing timing-driven place-and-routeand iterative timing improvement.

I apply timing constraints, such asnet CLK period=28;, which causespar to find critical paths in the designand prioritize their placement androuting to best meet the constraints.Next, I run trce to find criticalpaths. Then I fix them, rebuild, andrepeat until performance is satisfac-tory.

I’ve built some tools, settled on aninstruction set, built a datapath toexecute it, and learned how to imple-ment it efficiently in an FPGA. Nextmonth, I’ll design the control unit. I

Instruction(s) A B

add rd,ra,rb AREG BREG

addi rd,ra,i4 AREG sign-ext imm

sb rd,i4(ra) AREG zero-ext imm

imm 0x123 ignored imm12 || 03:0

addi rd,ra,4 AREG B15:4 || imm

add1 r3,r1,r2 AREG BREGadd2 r5,r3,r4 RESULT BREG

Table 5—Depending on the instruction or instructionsequence, A is either AREG or the forwarded result,and B is either BREG or an immediate field of theinstruction register.

Jan Gray is a software developerwhose products include a leading C++compiler. He has been building FPGAprocessors and systems since 1994,and now he designs for Gray Re-search LLC. You may reach him [email protected].

SOFTWAREVisit the Circuit Cellar web sitefor more information, includingspecifications, source code, sche-matics, and links to related sites.

REFERENCES[1] C. Fraser and D. Hanson, A

Retargetable C Compiler: Designand Implementation, Benjamin/Cummings, Redwood City, CA,1995.

[2] T. Cantrell, “VolksArray,” Cir-cuit Cellar, April 1998, pp. 82-86.

[3] D. Van den Bout, The PracticalXilinx Designer Lab Book,Prentice Hall, 1998. (Availableseparately and included withXilinx Student Edition.)

SOURCEXilinx Student Edition 1.5Xilinx, Inc.(408) 559-7778Fax: (408) 559-7114www.xilinx.com

© Circuit Cellar, The Magazine for Computer Applications.Reprinted with permission. For subscription information call(860) 875-2199, email [email protected] or on ourweb site at www.circuitcellar.com.

CIRCUIT CELLAR ® Issue 117 April 2000 1www.circuitcellar.com

Building a RISC Systemin an FPGA

FEATUREARTICLE

Jan Gray

lIn Part 1, Jan intro-duced his plan tobuild a pipelined 16-bit RISC processorand System-on-a-Chip in an FPGA.This month, he ex-plores the CPU pipe-line and designs thecontrol unit. Listen up,because next month,he’ll tie it all together.

ast month, Idiscussed the

instruction set andthe datapath of an xr16

16-bit RISC processor. Now, I’llexplain how the control unit pushesthe datapath’s buttons.

Figure 2 in Part 1 (Circuit Cellar,116) showed the CTRL16 control unitschematic symbol in context. Inputsinclude the RDY signal from thememory controller, the next instruc-tion word INSN15:0 from memory, andthe zero, negative, carry, and overflowoutputs from the datapath.

The control unit outputs managethe datapath. These outputs includepipeline control clock enables,register and operand selectors, ALUcontrols, and result multiplexeroutput enables. Before designing thecontrol circuitry, first consider howthe pipeline behaves in both good andbad times.

PIPELINED EXECUTIONTo increase instruction through-

put, the xr16 has a three-stagepipeline—instruction fetch (IF),decode and operand fetch (DC), andexecute (EX).

In the IF stage, it reads memory atthe current PC address, captures theresulting instruction word in theinstruction register IR, and incre-ments PC for the next cycle. In theDC stage, the instruction is decoded,and its operands are read from theregister file or extracted from animmediate field in the IR. In the EXstage, the function units act upon theoperands. One result is driven throughthree-state buffers onto the result busand is written back into the registerfile as the cycle ends.

Consider executing a series ofinstructions, assume no memory waitstates. In every pipeline cycle, fetch anew instruction and write back itsresult two cycles later. Yousimultaneously prepare the nextinstruction address PC+2, fetch

Part 2: Pipeline and Control Unit Design

Table 1—Here the processor fetches instruction I1 attime t1 and computes its result in t3, while I2 starts in t2

and ends in t4. Memory accesses are in boldface.

t1 t2 t3 t4 t5

IF1 DC1 EX1

IF2 DC2 EX2

IF3 DC3 EX3

IF4 DC4

2 Issue 117 April 2000 CIRCUIT CELLAR ® www.circuitcellar.com

instruction IPC, decode instruction IPC-2,and execute instruction IPC-4.

Table 1 shows a normal pipelinedexecution of four instructions. That’sthe simple case, but there are severalpipeline complications to consider—data hazards, memory wait states,load/store instructions, jumps andbranches, interrupts, and directmemory access (DMA).

What happens when an instructionuses the result of the precedinginstruction?

I1: andi r1,7I2: addi r2,r1,1

Referring to time t3 of Table 1, EX1

computes r1=r1&7, while DC2 fetchesthe old value of r1. In t4, EX2

incorrectly adds 1 to this stale r1. This is a data hazard, and there are

several ways to address it. The assem-bler can reorder instructions or insertnops to avoid the problem. Or, thecontrol unit can detect the hazard andstall the pipeline one cycle, in orderto write-back the result to the registerfile before fetching it as a source regis-ter. However, these techniques hurtperformance.

Instead, you do result forwarding,also known as register file bypass.The datapath DC stage includes FWD,a 16-bit 2-1 multiplexer (mux) ofAREG (register file port A), and theresult bus. Most of the time, FWDpasses AREG to the A operand regis-ter, but when the control unit detectsthe hazard (DC source register equalsEX destination register), it asserts itsFWD output signal, and the A registerreceives the I1 result just in time forEX2 in t4.

Unlike most pipelined CPUs, thexr16 only forwards results to the Aoperand—a speed/area tradeoff. Theassembler handles any rare port B data

hazards by swapping A and B operands,if possible, or inserting nops if not.

MEMORY ACCESSESThe processor has a single memory

port for reading instructions andloading and storing data. Mostmemory accesses are for fetchinginstructions. The processor is also theDMA engine, and a video refreshDMA cycle occurs once every eightclocks or so. Therefore, in any givenclock cycle, the processor executeseither an instruction fetch memorycycle, a DMA memory cycle, or aload/store memory cycle.

Memory transactions are pipelined.In each memory cycle, the processordrives the next memory cycle’saddress and control signals and awaitsRDY, indicating the access has beencompleted. So, what happens whenmemory is not ready?

The simplest thing to do is to stopthe pipeline for that cycle. CTRLdeasserts all pipeline register clockenables PCE, ACE, and so forth. Thepipeline registers do not clock, andthis extends all pipeline stages by onecycle. In Table 2, memory is not readyduring the fetch of instruction I3 in t3,and so t4 repeats t3. (Repeated pipestages are italicized.)

IL in Listing 1 is a load word in-struction. Loads and stores need asecond memory access, causing pipe-line havoc (see Table 3). In t4 youmust run a load data access insteadof an instruction fetch. You muststall the pipeline to squeeze in thisaccess.

Then, although you fetched I3 in t3,you must not latch it into theinstruction register (IR) as t3 ends,

because neither EXL nor DC2 arefinished at this point. In particular,DC2 must await the load result inorder to forward it to A, because I2

uses r6—the result of IL!Finally, if (in t3) you don’t save the

just-fetched I3 somewhere, you’ll loseit, because in t4, the memory port isbusy with the load cycle. If you loseit, you’ll have to re-fetch it no soonerthan t5, with the result that even a no-wait load requires three cycles, whichis unacceptable.

To fix this problem, the controlunit has a 16-bit NEXTIR register andan IR source multiplexer (IRMUX). Int3, it captures I3 in NEXTIR, and thenin t4, IR is loaded from NEXTIRinstead of from the memory port(which is busy with the load).NEXTIR ensures a two-cycle load orstore, at a cost of eight CLBs.

As with instruction fetch accesses,load/store memory accesses mayhave to wait on slow memory. Forexample, had RDY not been assertedduring t4, the pipeline would havestalled another cycle to wait for EXL

access to complete.

BRANCHING OUTNext, consider the effect of jumps

(call and jal) and taken branches.By the time you execute the jump ortaken branch IJ during EXJ (updatingPC), you’ll have decoded IJ+1 andfetched IJ+2. These instructions in thebranch shadow (and their side effects)must be annulled.

Continuing the Table 3 examplefrom time t5, and assuming the branchis taken at t7, you must annul the EX5

stage of I5, and the DC6 and EX6 stagesof I6. (Annulled stages are struck

Listing 1— This C code produces assembly code that includes a load IL and a branch IB. Each causespipeline headaches.

Table 2—During t3, the instruction fetch memory accessof I3 is not RDY, so the pipeline registers do not clock,and the pipeline stalls until RDY is asserted in t4.Repeated pipeline stages are italicized.

t1 t2 t3 t4 t5

IF1 DC1 EX1 EX1

IF2 DC2 DC2 EX2

IF3 IF3 DC3

IF4

if ((p->flags & 7) == 1) p->x = p->y;

IL: lw r6,2(r10) ;load r6 with p->flags

I2: andi r6,7 ;is (p->flags & 7)

I3: addi r0,r6,-1 ;==1?

IB: bne T

I5: lw r6,6(r10) ;yes: load r6 with p->y

...


through). Execution continues at in-struction IT. T9 is not an EX5 loadcycle, because the I5 load is annulled.

Because you always annul the twobranch shadow instructions, jumpsand taken branches take three cycles.Jumps also save the return address inthe destination register. This returnaddress is obtained from the data-path’s RET register, which holds theaddress of the instruction in the DCpipeline stage.

INTERRUPTSWhen an interrupt request

occurs, you must jump to theinterrupt handler, preserve theinterrupt return address, retirethe current pipeline, executethe handler, and later return tothe interrupted instruction.

When INTREQ is asserted,you simply override thefetched instruction with int,that is, jal r14,10(r0) viathe IRMUX. This jumps to theinterrupt handler at 0x0010

and leaves the return address in r14,which is reserved for this purpose.When the handler has completed, itexecutes iret, (i.e, jal r0,0(r14))and exection resumes with theinterrupted instruction.

There are two pipeline issues here.First, you must not interrupt aninterlocked instruction sequence (anyadd, sub, shift, or imm followed byanother instruction). If an interlockedinstruction is in the DC stage, theinterrupt is deferred one cycle.

Secondly, the int must not beinserted in a branch or jump shadow,lest it be annulled. If a branch or jumpis in the DC stage, or if a takenbranch or jump is in the EX stage, theinterrupt is deferred.

The simplicity of the process paysoff once again. The time to take aninterrupt and then return from a nullinterrupt handler is only six cycles.

You might be wondering about theinterrupt priorities, non-maskableinterrupts, nested interrupts, andinterrupt vectors. These artifacts ofthe fixed-pinout era need not behardwired into our FPGA CPU. Theyare best done by collaboration with anon-chip interrupt controller and theinterrupt handler software.

The last pipeline issue is DMA.The PC/address unit doubles as aDMA engine. Using a 16 × 16 RAM asa PC register file, you can fetch eitheran instruction (AN ← PC0 += 2) or aDMA word (AN ← PC1 += 2) permemory cycle.

After an instruction is fetched, if

Table 3—Pipelined execution of the load instruction IL, I2, I3, thebranch IB, the annulled I5 and I6, and the branch target IT. Duringt4 you stall the pipeline for the IL load/store memory cycle. Thebranch IB executed in t7 causes I5 and I6 to be annulled in t8 andt9. Annulled instructions are struck through.

t1 t2 t3 t4 t5 t6 t7 t8 t9

IFL DCL EXL EXLIF2 DC2 DC2 EX2

IF3 IF3 DC3 EX3

IFB DCB EXB

IF5 DC5 EX5

IF6 DC6 EX6

IFT DCT

IFDMAP

LSP

DMALSP

IFDMA

Mem cycle state machine

LS

IFN PRE

IF FDPE

RDY

CLK

D

CE

C

Q

LSPEXLDSTEXANNUL

Annul state machine

RESET

BRANCH

JUMP

DCAN

PCE

CLK

^

C^

CE

DPRE

Q

FDPEDCANNUL

RESET

DCANNUL

BRANCH

JUMPINIT=S

DMAREQ JFJKCDMAP

KDMA

C^CLKCLR

Q DMAP

Pending requests

J

KC^

CLR

Q

FJKCINTP

CLK

IREQIFINTPCE

BRANCH

JUMPDCINTINH

INTP

FDPERESET

CEC^

INIT= S

RESETPRE

DGNDRDYCLK

Q

CLKPCE CE

C

D

CLR

Q DCINT

FDCEDCINT

IFINT

JK

C^CLK

DMA

CLR

ZERODMAQ

FJKCZEROP

ZEROP

DMAN ZERO

C^CLKINIT=S

PCE CE

DPRE

EXAN

FDPEEXANNUL

IF

DMAPDMAN

D

CEC^CLK

RDY

CLR

Q

FDCEDMA

DMAN

LSP

DMAPLSP

IF

LSN

Q EXANNUL

RDYBUF

ACE

RDYIFN

PCE

PCCEIFN RDY

DMANOR2

RDYIFN

DCINT

RETCE

WORDNLSN

EXLBSB

READNLSN

EXST

BUF

BUF

DBUSNLSN

DMAN DMAPC

IFNJUMP DMAN SELPC

ZEROPCZeroReset

FSM outputs

Figure 1— This control unit finite state machine schematic implementsthe symbol CTRLFSM in Figure 2. It consists of the memory cycle FSM(see Figure 4), plus instruction annulment and pending request registers.The FSM outputs are derived from the machines current and next states.

a) b)


DMAREQ has been asserted, youinsert one DMA memory cycle.

This PC register file costs eightCLBs for the RAM, but saves 16 CLBs(otherwise necessary for a separate 16-bit DMA address counter and a 16-bit2-1 address mux), and shaves a couple

of nanoseconds from the system’scritical path. It’s a nice example of aproblem-specific optimization youcan build with a customizableprocessor.

To recap, each instruction takesthree pipeline cycles to move throughthe instruction fetch, operand fetchand decode, and execute pipelinestages. Each pipeline cycle requires upto three memory access cycles(mandatory instruction fetch, optionalDMA, and optional EX stage load orstore). Each memory access cyclerequires one or more clock cycles.

CONTROL UNIT DESIGNNow that you understand the pipe-

line, you are ready to design the con-trol unit. (For more information onRISC pipelines, see Computer Orga-nization and Design: The Hardware/Software Interface, by Patterson andHennessy.) [1] First, some importantnaming conventions. Some controlunit signal names have prefixes andsuffixes to recognize their function orcontext (most signal names sans pre-

fix are DC stage signals):• Nsig: not signal—signal inverted• DCsig: a DC stage signal• EXsig: an EX stage signal• sigN: signal in “next cycle”—input

to a flip-flop whose output is sig• sigCE: flip-flop clock enable• sigT: active low 3-state buffer

output enable

Each instruction flows through thethree stages (IF, DC, and EX) of thecontrol unit (see Figure 2) pipeline. Inthe IF stage, when the instructionfetch read completes, the new instruc-tion at INSN15:0 is latched into IR.

In the DC stage, DECODE decodesIR to derive internal control signals.In the first half clock cycle, CTRLdrives RNA3:0 and RNB3:0 with thesource registers to read, and drivesFWD and IMM5:0 to select the A and Boperands. If the instruction is abranch, CTRL determines if it istaken. Then as the pipeline advances,the instruction passes into EXIR.

In the EX stage, CTRL drives ALUand result mux controls. If the in-

Table 4—RNA and RNB control the A and B ports ofthe register file. While CLK is high, they select whichregisters to read, based upon register fields of theinstruction in the DC stage. While CLK is low, theyselect which register to write, based upon the instruc-tion in the EX stage.

RNA When

RA DC: add sub addilw lb sw sb jal

RD DC: all rr, ri format0 DC: callEXRD EX: all but call15 EX: call

RNB When

RB DC: add sub, all rr fmtRD DC: sw sbEXRD EX: all but call15 EX: call

FD16CENEXTIR

D[15:0]CEC

Q[15:0]

CLR

^

CLKIF

A[15:0] O[15:0]B[15:0]SELINT

NIR[15:0]

INSN[15:0]

IRMUX

IRMUX

IFIFINT

IRMUX[15:0]D[15:0]CEC

^

PCECLK CLR

Q[15:0]

FD16CEIR EXIR

FD16CE

D[15:0]IR[15:0]

CEC

^

CLKPCE

CLR

Q[15:0]

EXIRB

I[15:0] O[15:0]EXIR[15:0]

I[15:0] O[15:0]

IRB

IMMB

I[15:0] O[15:0]

BUF16

OP[3:0],RD[3:0],RA[3:0],RB[3:0]

IR[11:0]BUF16

IMM[11:0]

BUF16

EXOP[3:0],EXRD[3:0],BRDISP[7:0]

BRDISP[7:0]

Instruction registers

FSM

CTRLFSM

PCEACEWORDNREADNDBUSNIFIFINTDMAEXANEXANNULSELPCZEROPCDMAPCPCCERETCE

PCEACE

WORDNREADNDBUSN

IFIFINTDMA

EXANEXANNUL

SELPCZEROPCDMAPC

PCCERETCE

IREQDCINTINH

EXLDSTEXLBSBEXSTBRANCHJUMP

ZERODMADMAREQRDY CLK

IREQDCINTINH

EXLDSTEXLBSB

EXSTBRANCH

JUMP

ZERODMADMAREQ

RDYCLK

RRRIIMM_12IMM_4SEXTIMM4WORDIMM4ADDSUBSUBSTCALLNSUMNLOGICNLWNLDNLBNSRNSLNJALBRADCSBCNSUBDCINTINH

EXNSUBEXFNSRAEXIMMEXLDSTEXLBSBEXRESULTSEXCALLEXJAL

RRRIIMM12

IMM4SEXTIMM4

WORDIMM4ADDSUB

SUBST

CALLNSUM

NLOGICNLWNLDNLBNSRNSL

NJALBR

ADCSBCNSUB

DCINTINH

EXNSUBEXFNSRA

EXIMMEXLDSTEXLBSB

EXRESULTSEXJALIEXJAL

OP[3:0]FN[3:0]

EXOP[3:0]

PCE CLK

EXOP[3:0]

OP[3:0] IR[7:4]

DECODEInstruction decoder

PCECLK

Control state machine

^

^

Figure 2— This control unit schematic implementshalf of the symbol CTRL16 in last month’s Figure 2,including the CPU finite state machine, instructionregister pipline, and instruction decoder. Instructionsenter on INSN15:0 and are latched in IR and decoded.


• RDY: memory cycle complete (inputfrom the memory controller)

• READN: next memory cycle is aread transaction—true except forstores

• WORDN: next cycle is 16-bit data—true except for byte loads/stores

• DBUSN: next cycle is a load/store,and it needs the on-chip data bus

• ACE (address clock enable): the nextaddress AN15:0 (a datapath output)and the above control outputs areall valid, so start a new memorytransaction in the next clock cycle.ACE equals RDY, because ifmemory is ready, the CPU isalways eager to start anothermemory transaction.There are no IF stage control out-

puts. Internal to the control unit,three signals control IF stage re-sources. Those three signals are:

• PCE: enable IR and EXIRclocking

• IF: asserted in an instructionfetch memory cycle

• IFINT: force the next instruction tobe int = jal r14,10(r0) =

Table 5—Here’s a look at the result multiplexer output enable controls.The instruction determines which enable is asserted and which functionunit drives RESULT15:0.

Enable Instruction Source

SUMT add sub addi SUM15:0

adc sbc adci sbciLOGICT and or xor andn LOGIC15:0

andi ori xori andniSLT slli A14:0 || 0SRT srli srai SRI || A15:1

ZXT lb 015:8

RETADT jal call RETAD15:0

none sw sb br* imm —

0xAE01

If a DMA or load/store accessis pending, IF enables NEXTIR tocapture the previously fetchedinstruction (take a look back attime t3 in Table 3). Otherwise,the instruction fetch is the onlymemory access in the pipe stage.So, IF is then asserted with PCE,and IRMUX selects the INSN15:0

input as the next instruction tocomplete.

DECODE STAGEThe greater part of the control unit

operates in the DC stage. It mustdecode the new instruction, controlthe register file, the A and B operandmultiplexers, and prepare most EXstage control signals.

The instruction register IR latchesthe new instruction word as the DCstage begins. The buffers IRB andIMMB break out the instruction fieldsOP, RD, and so forth—IR15:12 is re-named OP3:0 and so on (the tools opti-mize away these buffers).

The instruction decoder DECODEis simple. It is a set of 30 ROM 16x1s,gate expressions, and a handful of flip-flops. Each ROM inputs OP3:0 orEXOP3:0 and outputs some decodedsignal. The decoder is relativelycompact because xr16 has a simpleinstruction set, and its 4-bit opcodesare a good match for the FPGA’s 4LUTs.

The register file control signals,shared by both the DC and EX stages,are RNA3:0: port A register number;RNB3:0: port B register number; andRFWE: register file write enable.

struction is a load/store, it in-serts a memory access. In the lasthalf cycle, RNA and RNB bothdrive the destination registernumber to store the result intothe register file.

Let’s consider each part of thecontrol finite state machine (seeFigure 1). The control FSM hasthree states:

• IF: current memory access is aninstruction fetch cycle

• DMA: current access is a DMAcycle

• LS: current access is a load/store

Figure 4 shows the state transitiondiagram. The FSM clocks when onememory transaction completes andanother begins (on RDY). CTRLFSMalso has several other bits of state:

• DCANNUL: annul DC stage• EXANNUL: annul EX stage• DCINT: int in DC stage• DMAP: DMA transfer pending• INTP: interrupt pending

DCANNUL and EXANNUL are setafter executing a jump or takenbranch. They suppress any effects ofthe two instructions in the branchshadow, including register file write-back and load/store memory accesses.So, an annulled add still fetches andadds its operands, but its results arenot retired to the register file.

DCINT is set in the pipeline cyclefollowing the insertion of the intinstruction. It inhibits clocking ofRET for one cycle, so that the intpicks up the return address of theinterrupted instruction rather thanthe instruction after that.

The highest fan-out control signal isPCE, the pipeline clock enable. Mostdatapath registers are enabled by PCE.It indicates that all pipe stages areready and the pipeline can advance.PCE is asserted when RDY signalscompletion of the last memory cyclein the current pipeline cycle. If mem-ory isn’t ready, PCE isn’t asserted, andthe pipeline stalls for one cycle.

The control FSM also takes care ofmanaging the memory interface viathe following signals:

Table 6—Here’s a look at the result multiplexer output enable controls. The instruction determines which enable toassert and thus determines which function unit drives the RESULT bus.

Next cycle Next address Outputs

IF AN ← PC0 += 2 SELPC PCCEIF branch AN ← PC0 += 2×disp8 BRANCH SELPC PCCE

IF jal call AN ← PC0 = SUM PCCE

IFreset AN ← PC0 = 0 SELPC ZEROPC PCCE

LS load/store AN ← SUM —

DMA AN ← PC1 += 2 SELPC DMAPC PCCEDMA reset AN ← PC1 = 0 SELPC ZEROPC DMAPC PCCE


RNARA[3:0]RD[3:0]SELRDSELR0EXRD[3:0]SELR15SELSRC

RA[3:0]RD[3:0]

RRRICALL

EXRD[3:0]EXCALL

CLK

RN[3:0]FWD

RZERO

EXRESULTSEXANNULRZERO

RNA[3:0]

AND3B1

FWD

RNMUX4RLOC=R2C0

RA[3:0]RD[3:0]SELRDSELR0EXRD[3:0]SELR15SELSRC

RN[3:0]FWD

RZERO

RNB[3:0]"N.C."

"N.C."

RB[3:0]RD[3:0]

STGND

EXRD[3:0]EXCALL

CLK

RNMUX4RLOC=R2C1

RNB

IR3SEXTIMM4 IMM_12

IR0WORDIMM4

IMMOP[5:0]IMMOP0

IMMOP1

BUFBUF

BUF

IMMOP2IMMOP3IMMOP4

IMM_4IMM_4

IMM_12

IR0WORDMM4

PCE

IMMOP5

BCE15_4EXIMMEXANNUL

ZNCVCOND[3:0]

TRUE

IR[11:8]

ZN

COV

TRUEBR

EXAN

TRUTH

BRN

PCE

CLKCLR

CE

D

C

BRANCHFDCE

Q

TRUEDC:conditional branches

DMAPC BRANCH

EXANNULEXJAL

JUMP

D0 Q0D1 Q1D2 Q2D3 Q3CE

CLK

NLBNSRNSL

NJALPCECLK

ZXTSRTSLTRETARDT

FD4PEINIT= S

D0 Q0D1 Q1D2 Q2D3 Q3CE

CLK

NSUMNLOGIC

NLWNLDPCECLK

SUMTLOGICT"N.C.""N.C."

FD4PEINIT= S

T2

T1

SRI

BUF

BUFBUF

EXFNSRAA15

SRI

EXIR4EXIR5

LOGICOP0LOGICOP1

LOGICOP[1:0]

EXNSUB ADD

D QCEC

CLR

PCECLK

CI

FDCECI

COADCSBC

NSUB

^

EXRESULTSPCE

EXANNULRZERO

DC: operand select ionExecute stage

RFWE

^^

Figure 3— The remainder of the control unit schematic implements the DC stage operand selectionlogic including register file, immediate operand control, branch logic, EX stage ALU, and result muxcontrols.

With CLK high,CTRL drives RNAand RNB with theDC stageinstruction’s sourceregister numbers.With CLK low,CTRL drives RNAand RNB with theEX stage destinationregister number.

RFWE is assertedwith PCE whenthere is a result towrite back. It is falsefor instructions,which produces noresult (immediateprefix, branch, orstore) for annulledinstructions, and fordestination r0.

The muxes RNAand RNB produceRNA3:0 and RNB3:0, asshown in Table 4, asselected by decodeoutputs RRRI,CALL, ST, EXCALL,and CLK. Call isirregular. Itcomputes r15 = pc,pc = r0 + imm12<<4,and the registers r15and r0 are implicit.

The FWD signalcauses RESULT to beforwarded into A,overriding AREG.CTRL asserts FWD when the EX stagedestination register equals the DCstage source register A (detectedwithin RNA), unless the EX stageinstruction is annulled or itsdestination is r0.

Last month, I discussed IMMED,the BREG/immediate operand mux.IMMOP5:0 controls IMMED, basedupon the decoder outputsWORDIMM, SEXTIMM4, IMM_12,and IMM_4.

B3:0 is clock enabled on PCE, butB15:4 uses B15_4CE. B15_4CE is PCE,unless the EX stage instruction isimm. Thus, the imm prefix establishesB15:4, and the subsequent immediateoperand instruction provides B3:0

only.

Now, turning to conditionalbranches, if the DC stage instructionis a branch, then the EX stageinstruction must be add, sub, oraddi, which drives the control unit’scondition inputs Z (zero), N(negative), CO (carry-out), and V(overflow).

Late in the DC stage, the TRUEmacro evaluates whether or not thebranch condition COND is true withrespect to the condition inputs. If so,and if the branch instruction is notannulled, the BRANCH flip-flop isset. Therefore, as the pipelineadvances and the branch instructionenters the EX stage, the BRANCHcontrol output is asserted. Thisdirects PCINCR to take the branch

by adding 2×disp8 tothe PC.

THE EXECUTESTAGE

Now, let’s discussthe EX stage ALU,result mux, andaddress unit controls.The ALU and shiftcontrol outputs are:

• ADD: set unless theinstruction is sub orsbc• CI: carry-in. 0 foradd and 1 for sub,unless it’s adc or sbcwhere we XOR in theprevious carry-out• LOGICOP1:0: selectand, or, xor, or andn.LOGICOP1:0 is simplyEXIR5:4 (i.e., EX stagecopy of FN1:0)• SRI: shift rightinput—0 for srli andA15 for srai (shiftright arithmetic)

slxi and srxi (shiftextended left/right formulti-word shift sup-port) are not yet imple-mented. Be my guest!

The result muxcontrol outputs SUMT,LOGICT, SLT, SRT,SXT, and RETADT are

active low RESULT bus 3-state outputenables. Each cycle, all EX stagefunction units produce results. Oneasserted T enables its unit’s 3-statebuffers to drive the RESULT bus, asshown in Table 5.

ZXT zeroes RESULT15:8 during lb.As you’ll see next month, the systemdrives RESULT7:0 with the byte loadresult.

The following outputs control theaddress unit:

• BRANCH: if set, add 2×disp8 to PC,otherwise add +2

• SELPC: if set, next address isPCNEXT15:0, otherwise SUM15:0

• ZEROPC: if set, next address is 0• PCCE (PC clock enable): update PCi


Jan Gray is a software developerwhose products include a leading C++compiler. He has been building FPGAprocessors and systems since 1994,and he now designs for Gray Re-search LLC. You may reach him [email protected].

SOFTWAREVisit the Circuit Cellar web sitefor more information, includingspecifications, source code,schematics, and links to relatedsites.

REFERENCE[1] D. Patterson and J. Hennessy,

Computer Organization andDesign: The Hardware/SoftwareInterface, Morgan Kaufmann, SanMateo, CA, 1994.

Figure 4 —Each memory cycle is an instruction fetchunless there is a DMA transfer pending or the EX stageinstruction is a load or store. The FSM clocks when onememory transaction completes and another begins (onRDY).

IF

DMA

LS

LSP

*LSP

DMAP

*DM

A

P×LSP

*DM

AP×LSP

DMAP: DMA pendingLSP: load/store pending

• DMAPC: if set, fetch and updatePC1 (DMA address), otherwise PC0

(PC)

Depending on the next memorycycle and the current EX stageinstruction, the control unit selectsthe next address by asserting certaincombinations of control outputs (seeTable 6).

WRAP-UPThis month, we considered pipe-

lined processor design issues and ex-plored the detailed implementation ofour xr16 control unit—and lived! TheCPU design is complete. The finalarticle in this series tackles the designof this System-on-a-Chip. I


CIRCUIT CELLAR ® Issue 118 May 2000 1www.circuitcellar.com

Building a RISC Systemin an FPGA

FEATUREARTICLE

Jan Gray

tNow that the xr16RISC processor iscomplete, it’s time totie everything to-gether and wrap upthis series. In this fi-nal part, Jan designsa demo system thatincludes an on-chipbus, memory control-ler, video controller,and peripherals.

he xr16 RISCprocessor is de-

signed, now it’s timeto design the rest of the

System-on-a-Chip (SoC). Besides theCPU, the FPGA hosts an on-chip bus,bus controller, parallel port, RAM,video controller, and an externalSRAM controller.

This month, I’ll show how simpleinterfaces can make SoC design asstraightforward as classic CPU, gluelogic, memory, peripherals, and PCBdesign used to be.

XS40 BOARDThe project targets the XESS XS40-

005XL V.1.2 FPGA board in Photo 1,which includes a Xilinx XC4005XL,12-MHz oscillator (see Figure 1),32-KB SRAM, 8031 MCU,7-segment LED, voltageregulators, and parallelport and VGA port connec-tors. It’s simple, inexpen-sive, and is featured in ThePractical Xilinx DesignerLab Book included withXilinx Student Edition.

I chose this board be-cause it is well supportedwith documentation andtools, and because it canbe used for both the XSEexercises and this project.

A SYSTEM-ON-A-CHIPI’ll build an integrated system from

the resources at hand—the FPGA,RAM, the video and parallel ports,and the 12-MHz oscillator.

I used the RAM for program, data,and video memory. The byte-wide,asynchronous SRAM isn’t ideal, but itis fast enough for you to read andlatch a byte on each clock edge,thereby fetching a 16-bit instructionduring each cycle.

By displaying all 32 KB of RAM,you can fashion a bitmapped 576 ×455 monochrome video display atVGA-compatible sync frequencies.How quaint, to watch every bit onscreen!

Refer also to Figure 4, the FPGAtop-level schematic. It includes the

Part 3: System-on-a-Chip Design

Table 1—The system memory map includes eight decoded peripheralcontrol register address blocks.

Address Resource

0000-7FFF external 32-KB RAM, video frame buffer

0000 reset handler0010 interrupt handlerFF00-FFFF I/O control registers,

8 peripherals × 32 bytesFF00-FF1F 0: 16-word on-chip IRAMFF21 1: parallel port input byteFF41 2: parallel port output byteFF60-FF7F 3: unused… …FFE0-FFFF 7: unused

2 Issue 118 May 2000 CIRCUIT CELLAR ® www.circuitcellar.com

processor (P), the system memory/buscontroller (MEMCTRL), the on-chip16-bit data bus (D15:0), on-chip periph-erals (PARIN, PAROUT, and IRAM),the external SRAM interface, and theVGA video controller.

DECISIONS, DECISIONSBefore examining the design, let’s

briefly explore the on-chip bus designspace. (This is not the sort of thingyou worry about when designing tosomeone else’s microprocessor, but inan FPGA SoC, you have a little morefreedom.)

Bus design issues include howmany bus masters are permitted, howis the bus clocked and pipelined, howwide is it, does it provide byte ad-dressing, and is it split or unified withthe processor core RESULT bus.

For XSOC, the pipelined on-chip16-bit data bus D15:0 is single-mas-tered (but recall the CPU also per-forms DMA transfers), the bus clockis the CPU clock, and the on-chipdata bus is unified with the pro-cessor’s RESULT15:0 data bus. All ofthese design decisions help to keepthis project simple.

BUS CONTROLSMEMCTRL, the system bus/

memory controller, interfaces theprocessor to the on-chip and off-chipperipherals. It receives the pipelined“next transaction” memory requestsignals AN15:0, WORDN, READN,DBUSN, and ACE from the CPU.Then, it decodes the address, enablessome peripheral or memory, and laterasserts RDY in the clock cycle inwhich the memory cycle completes.I/O registers are memory mapped (seeTable 1).

There are eight transaction types:(external RAM or I/O) × (read orwrite) × (byte or word), all decodedfrom AN15:0, WORDN, and READN.

MEMCTRL manages transfers onthe on-chip data bus D15:0 and theexternal data bus XD7:0 by assertingvarious tri-state output enables (xT)and control register clock enables(xCE). These enable signals are as-serted according to the transactiontype (see Table 3).

For example, during sw r0,

0xFF00, MEMCTRL decodes an I/Owrite word request. It asserts LDTand UDT, driving the store data ontoD15:0, and asserts IRAM/LCE andIRAM/UCE, writing D15:0 into IRAM’sSRAMs:

IRAM/D15:0 := D15:0 ← DOUT15:0

Next, consider a store to externalRAM: sw r0,0x0100. Because theexternal data bus is only eight bitswide, first store the least significantbyte, then the most significant byte.First, MEMCTRL asserts LDT andXDOUTT:

XD7:0 ¬ D7:0 ¬ DOUT7:0

Later, it asserts UDLDT andXDOUTT:

XD7:0 ← D7:0 ← DOUT15:8

BUS INTERFACENow, let’s design an on-chip bus

peripheral interface to enable robustand easy reuse of peripheral cores andto prepare for an ecology of interoper-able cores to come.

It helps to distinguish betweencore users and core designers. Theformer are more numerous, while thelatter are more experienced. There-fore, I make ease-of-use tradeoffs infavor of core users.

Because FPGAs are malleable andFPGA SoC design is so new, I wantedan interface that can evolve to addressnew requirements without invalidat-ing existing designs.

With these two considerations inmind, I borrowed a few ideas from thesoftware world and defined an ab-stract control signal bus with all ofthe common control signals collected

into an opaque bus CTRL15:0.MEMCTRL drives CTRL and also

does I/O address decoding, driving theeight I/O selects SEL7:0.

Now, you need only instantiate thecore, attach CLK, CTRL, D, someSELi, any core-specific inputs andoutputs, and you’re done!

Contrast this with interfacing to atraditional peripheral IC. Each IC hasits own idiosyncratic set of controlsignals, I/O register addresses, chipselects, byte read and write strobes,ready, interrupt request, and such.They don’t call it glue logic for nothing.

Of course, we can’t just sweep allthe complexity under the rug. Eachcore must decode CTRL and recoverthe relevant control signals. This isdone with the DCTRL (CTRL de-coder) macro (see Figure 5). DCTRLinputs SELi, CTRL15:0, and CLK andoutputs local I/O register address,upper and lower byte output enables(read strobes), and clock enables(write strobes).

Within each DCTRL instance, youdo final address decoding for the spe-cific peripheral, combining its SELi

signal with the I/O select withinCTRL15:0. Here XIN8 only uses LDT(the LSB output enable). The otherDCTRL outputs are unloaded andautomatically eliminated by theFPGA implementation tools.

Using DCTRL and the on-chip tri-state bus, the typical overhead perperipheral is only one or two CLBs,and perhaps a column of TBUFs.

Control signal abstraction can alsomake bus interface evolution easy. Ifyou revise MEMCTRL and DCTRLtogether, arbitrary changes to CTRL15:0

can be made without invalidating any

Figure 1 —The system schematic depicts the subset ofthe XS40 needed for our project. The 8031 (not shown)is held in reset.

Table 2—There are a set of enables p/* within eachperipheral. DOUT15:0 is the CPU store data outputregister (see Part 1, Circuit Cellar 116).

Enable Effect

LDT D7:0 ← DOUT7:0UDT D15:8 ← DOUT15:8UDLDT D7:0 ← DOUT15:8XDOUTT XD7:0 ← D7:0LXDT D7:0 ← XDIN7:0UXDT D15:8

← XDIN15:8

p/LDT D7:0 ← p/D7:0p/UDT D15:8 ← p/D15:8p/LCE p/D7:0 := D7:0p/UCE p/D15:8 := D15:8


Table 3—Depending on the memory transaction, different bus outputenables and register clock enables are asserted.

Figure 3 —The rest of the device contains the auto-matically placed processor control unit and other logic.

existing designs. And, to add new busfeatures, simply design a new decoderDCTRL_v2, causing no changes toexisting DCTRL clients.

EXTERNAL I/O INTERFACE?There isn’t one. If it were necessary

to attach external peripherals, perhapsto the XD7:0 bus, you might designsome on-chip external peripheraladapter macros. Just like an on-chipperipheral, each adapter would takeCTRL and some SELi, but its jobwould be to use additional I/O pins tocontrol its peripheral IC’s chip selectsand so forth. Of course, as a CTRL15:0

client, it would be able to raise inter-rupts, insert wait states, and so forth.

EXTERNAL RAMThe external RAM is a classic

32-KB fast asynchronous SRAM witha 15-ns access time (tAA). Its pins in-

clude A14:0 (address), D7:0 (data in/out), /CS (chip select), /WE (writeenable), and /OE (output enable).

Refer to Figure 2 and the externalbus and SRAM interface block ofFigure 5.

XA14:1 is 14 IOBs configured asOFDXs (output flip-flops with clockenables). XA14:1 captures the next ad-dress AN14:1 at the start of each newmemory transaction. XA0 (XA_0) isthe least significant bit of the externaladdress. It is a logic output and canchange on either CLK edge.

XD7:0 is eight IOBs configured aseight sets of simultaneous OBUFTs(tri-state output buffers), IBUFs (input

buffers), and IFDs (input flip-flops).During a RAM write, XDOUTT is

asserted, RAMNOE is deasserted, andthe OBUFTs drive D7:0 out onto XD7:0.

During a RAM read, XDOUTT isdeasserted, RAMNOE is asserted, andthe RAM drives its output data ontoXD7:0. The data is input through theIBUFs and latched in the XDIN IFDs(on each falling CLK edge).

To keep the CPU busy with freshnew instructions, the system readsboth bytes of a 16-bit word in onecycle. In the first half cycle, it setsXA0=0, reading the MSB, and latchesit in XDIN. In the second half cycle,the system sets XA0=1, reading theLSB, and reads it through IBUFs. Thecatenation of these two bytes,XDIN15:0, feeds the CPU’s INSN port,the video controller’s PIX port, andD15:0 via the byte-wide tri-state buff-ers LXD and UXD.

Writes to asynchronous SRAMrequire careful design. Let’s see if wecan safely write one byte per clockcycle. The key constraints are:

• address must be valid before assert-ing /WE

• data must be valid before deassert-ing /WE

• /WE must be deasserted briefly• no adddress/data hold

time after /WE

I required a fully syn-chronous design to beable to slow or stop theclock and was unwillingto employ any asynchro-nous delay tricks.

Accomplishing thisrequires one half clock tosettle the write address,one half clock to assert /

WE, and one half clock to deassert it.Therefore, byte writes take two fullcycles, and word writes take three(e.g., a word write takes six halfcycles W1–W6):

• W1: assert XA14:1, data LSB, XA0=1• W2: assert /WE• W3: deassert /WE, hold XA and data• W4: assert data MSB, XA1=0• W5: assert /WE• W6: deassert /WE, hold XA and data

MEMCTRL DESIGNI’ve discussed the responsibilities

of MEMCTRL design: address decod-ing, on-chip bus control, and externalRAM control. Now, let’s review itsimplementation (see Figure 6).

In address decoding, if the nextaccess is a load/store to address FFxx,the access is to memory-mapped I/O,and SELIO is asserted. Otherwise, it’sa RAM access.

Within each peripheral’s DCTRLinstance, its SELi (decoded from AN7:5)and CTRLSELIO combine to develop thatperipheral’s output and clock enables.

For bus control, the current state ofthe memory transaction finite statemachine determines which controlsare asserted. The CPU asserts ACE(address clock enable) to request thenext transaction and awaits RDY.MEMCTRL decodes the request, andthe FSM enters the IO, RAMRD, orRAMWR state. The latter has threesub-states—W12, W34, and W56—corresponding to pairs of the W1–W6half-states described previously.

In the IO state, RDY is assertedunless the selected peripheraldeasserts CTRL0, the I/O ready line,thereby inserting a wait state.

In the RAMRD state, RDY is as-

Figure 2 —The RAM interface signals for three memorytransactions are: read 1234 from address 0010, writeABCD to address 0200, and read 5678 from address0012.

CLK

Read W1 W2 W3 W4 W5 W6 Read

XA[14:1] 0010 0200 0012

XA_0

12 34 CD AB 56 78XD[7:0]

/WE

/OE

Transaction Cycles Enables

RAM read byte 1 LXDTRAM read word 1 LXDT, UXDTRAM write byte 2 LDT, XDOUTTRAM write word 3 LDT or UDLDT, XDOUTTI/O read byte 1+ p/LDTI/O read word 1+ p/LDT, p/UDTI/O write byte 1+ LDT, p/LCEI/O write word 1+ LDT, UDT

p/LCE, p/UCE

PM

UX

, P

PIX

ELS

, LX

D, U

XD

AR

EG

S, A

RE

G, S

LUB

UF

BR

EG

S, B

RE

G, S

RB

UF

FW

D, A

, UD

LDB

UF,

ZH

BU

F

IMM

ED

, B, L

DB

UF,

UD

BU

FLO

GIC

, DO

UT

,LO

GIC

BU

FA

DD

SU

B, S

UM

BU

F

PC

DIS

P, Z

AD

DR

MU

X

PC

INC

R

PC

, RE

T, R

ET

BU

F

IRA

M

RN

A

CPU CTRL, SYSCTRL, MISC

RN

BR

NA


Figure 5— The XIN8 (PARIN) implementation shows the CTRLdecoder output LDT that enables the input byte to be driven onto thedata bus.

serted immediately because allRAM reads require only oneclock cycle. In the RAMWRstate, RDY is asserted on W34 forbyte stores and on W56 for wordstores.

The write controller uses flip-flops W23_45 and W45, which areclocked on CLK falling edges. So,W34 is true during W3 and W4, whileW45 is true during W4 and W5. Fromthe W* signals you derive glitch-freecontrol signals XA_0, /WE, /OE, and soon.

The rest of MEMCTRL is straight-forward. Note how E encodes (re-names) the various peripheral controlsignals to CTRL15:0.

I technology-mapped some logicusing FMAPs. Timing analysis hadrevealed poor automatic mapping ofthis logic. This change shaved a fewnanoseconds off the critical path.

Now that we’ve covered the imple-mentation of MEMCTRL, let’s turnour attention to peripherals.

PARALLEL PORT I/OI provided parallel port I/O to com-

municate with the host. The XS40board provides eight parallel port datainputs and five status outputs. Reserv-ing a few for debug I/Os, I used sixinputs and four outputs.

During lb rd,FF41, the PARINinput peripheral is selected, drivingthe inputs 00 || PAR_D5:0 onto D7:0 (seeFigure 5).

During sb r1,FF21, the PAROUToutput peripheral is selected, captur-ing the store data D3:0 in flip-flops,which drive the PC_S6:3 status outputs.

XOUT4 is as simple as XIN8. It

has a DCTRL decoder, of course, andclocks D3:0 on LCE (LSB clock enable).This parallel port requires only threeCLBs, eight TBUFs, and 10 IOBs!

ON-CHIP RAMXSOC also includes a 16 × 16-bit

RAM peripheral. It uses all of theDCTRL outputs: A4:1 to select theword to read or write, LCE and UCEas lower and upper byte write strobes,and LDT and UDT as lower and upperbyte output enables.

VIDEO CONTROLLERThe bit-mapped video controller,

based on ideas from [1], displays all32 KB of external SRAM at 576 × 455resolution, monochrome.

It runs autonomously from theCPU, and so is not a peripheral on theon-chip bus. It uses DMA to fetchvideo data, which consumes about10% of memory bandwidth.

A video signal is a series of frames;each frame is a series of lines, andeach line is a series of pixels. Thevideo controller fetches 16-pixel wordsof video memory, shifts the pixels outserially, and uses horizontal and verti-cal sync pulses to format the pixelsinto frames and lines for the monitor.

Generating VGA-compatible hori-zontal and vertical sync timings, VGA

shifts pixels out at 24MHz, twice the sys-tem clock rate, shift-ing one out when CLKis high and a secondwhen it is low. Thehorizontal and verticalsync pulses are ad-vanced a few clocks(lines) to center thedisplay in the frame(see Table 5).

The VGA ports aredescribed in Table 6.The first five ports

Photo 1 —Here’s the XS40 board, with the project design loaded into theFPGA and running a demo program that’s drawing graphics on the monitor.

request new pixel data via theDMA controller. The rest are theVGA video outputs. The red,green, and blue intensities R1,R0, G1, G0, B1, and B0 driveresistor-based 2-bit D/A convert-ers, providing up to 64 colors (4 ×4 × 4). However, at this resolu-tion, with 32 KB of RAM, you

can only support a monochrome (1-bit/pixel) display. So, each pixel bitdrives all six outputs, drawing blackor white pixels.

To generate horizontal and verticalsyncs and a video blanking signal, youneed a 9-bit horizontal cycle counterand a 10-bit vertical line counter.

After 288 clocks, it’s time to blankthe video. Assert horizontal sync after308 clocks, deassert it after 353, andreset the counter and re-enable videoafter 381 clocks (one line).

In the vertical direction, the VGAcontroller must blank video after 455lines, assert vertical sync after 486lines, deassert it after 488 lines, andreset the counter, re-enable video, andreset the video DMA address counterafter 528 lines.

The simplest way to build eachcounter is with a Xilinx library binarycounter, such as a CC16RE. But be-cause I had just about filled theFPGA, and because they’re cool, Idesigned a more compact 10-bit linearfeedback shift register (LFSR) counter.This uses a 10-bit serial shift registerwhich has an input that is the XOR ofcertain shift register output taps.

An n-bit LFSR repeats every 2n-1cycles, but you can make an arbitrarym-cycle counter by complementingthe LFSR input bit, thereby short-circuiting the full sequence when aparticular bit pattern is recognized.My LFSR counter design program canbe downloaded from the Circuit Cel-lar web site.

Referring to Figure 7, note thevideo controller contains two LFSRcounters, H and V. Each has four com-parators to compare the LFSR bitpatterns to the count patterns outputby my program.

Each of the J-K flip-flops HENN,NHSYNC, VEN, and NVSYNC are seton reaching one counter value andreset on reaching another.


design using the Xilinx tools andtested it on my XS40 board. Using aparallel port output for CLK, I wroteshell scripts to single-step the proces-sor and observe PC7:1 on the LEDs.Later, I ran the CPU at up to 20 MHz.

Starting from a core set of workinginstructions, it was easy to test therest, one at a time. If something wentawry, I could do a binary search forthe problem, insert a stop: gotostop; breakpoint into my test,recompile, and download. A real re-mote debugger would be nice!

Armed with a working CPU, it iseasy to add and test new features, oneby one. I added double-cycled reads

from externalRAM, thenMEMCTRL, thenLED output regis-ters. Writing textmessages to theseven-segment LEDwas a big mile-stone. RAM writeswere next. And,late in the project Iadded DMA, thevideo controller,and interrupts.

I want to em-phasize the impor-tance of thoroughtesting. You haveyour work cut outfor you when prop-erly testing apipelined processorand an SoC.

This has been aproof-of-conceptproject, and I have

focused on design issues. To shipsomething like this, you would needto budget as much or more time forvalidation as for the design and imple-mentation.

The final system floorplan, asplaced on our 14 × 14 CLB FPGA, isshown in Figure 3.

SERIES WRAP-UPIn this three-part series, I have

presented the complete design andimplementation of a real, full-fea-tured, pipelined microprocessor andan integrated System-on-a-Chip. Idesigned a new instruction set, porteda C compiler, and discussed how to

NHSYNC is asserted low duringclocks 308–353, and NVSYNC duringlines 486–488. HEN is the pipelinedhorizontal video enable, and VEN isthe vertical video enable. When bothare true, you fetch and shift out videodata.

In the video datapath, each clockshifts out two bits of video data. Ev-ery eight clocks, WORD goes true,and it requests a new 16-bit word ofvideo data from memory. REQ isasserted, registering a pending DMAtransfer with the CPU.

Five or fewer clocks later, the CPUperforms the DMA load, assertingACK. The video data word is latchedin the PIXELS staging register. On theeighth clock, this word is loaded intothe PMUX 8 × 2 parallel-load serial-out shift register.

Two bits shift out of PMUX duringeach clock, and feed a 2–1 mux thatdrives the 1-bit pixel each half clock.

SYSTEM BRING-UPAfter designing the CPU, I de-

signed a simple test-fixture using on-chip ROM and ran my test programsin the Foundation simulator.

After simulating test programs forhundreds of cycles, I compiled the

Figure 4— The processor (P) issues requests to MEMCTRL, accessing instruction and data via the on-chip bus D15:0 or external SRAM.Integrated peripherals provide parallel port I/O and on-chip RAM. The VGA controller fetches pixel data via DMA.

Tables 5 & 6 —The 12-MHz clock and 24-MHz pixel shift frequency determines the pixels per line and lines perframe, as well as the horizontal and vertical counter values for sync and blanking events.

Port Description

PIX15:0 next 16-bit pixel wordREQ request DMA of next wordRESET reset DMA address counterACK DMA acknowledge inputCLK system clockR1,R0 2-bit red intensityG1,G0 2-bit green intensityB1,B0 2-bit blue intensityNHSYNC active-low horizontal syncNVSYNC active-low vertical sync

Quantity Value

two-pixel clock 83.3 nsone-pixel half-clock 41.7 nsvisible pixels/line 576visible clocks/line 288horizontal sync “on” clock 308horizontal sync “off” clock 353line total clocks 381line time 31.8 msvisible lines/frame 455vertical sync “on” line 486vertical sync “off” line 488frame total lines 528frame time 16.8 ms


Figure 7— As you can see, the video controller contains two LFSR counters that each have four comparators for comparing the LFSR bit patterns to the count patterns that areoutput by the program that I wrote.

Figure 6— The memorycontroller consists of anaddress decoder, a memorytransaction state machine,and miscellaneous on-chipbus and external RAMcontrol logic.


SOFTWAREYou may download more informa-tion, including specifications,source code, schematics, and linksto related sites from the CircuitCellar web site.

REFERENCE[1] VGA Signal Generation with

the XS Board, XESS App Notewww.xess.com/fpga/vga.pdf

SOURCESXESS XS40-005XLwww.xess.com/fpga

FPGAs, Student Edition toolsXilinx, Inc.(408) 559-7778Fax: (408) 559-7114www.xilinx.com

Jan Gray is a software developerwhose products include a leading C++compiler. He has been building FPGAprocessors and systems since 1994,and he now designs for Gray Re-search LLC. You may reach him [email protected].

Please note that I do not warrantthat you have the right to buildsomething based upon the ideas dis-cussed in this series of articles underthe relevent intellectual propertylaws in your jurisdiction.


building a risc system in an fpga, circuit cellar issue 116-117, april 00

Documents

bit register

control unit

student edition

register file

bit instructions

rfwe clk

execution

return address